Data | Tech News, Tutorials & Expert Insights

article-image-crud-create-read-update-delete-operations-elasticsearch

19 Feb 2018

5 min read

CRUD (Create Read, Update and Delete) Operations with Elasticsearch

19 Feb 2018

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Pranav Shukla and Sharath Kumar M N titled Learning Elastic Stack 6.0. This book is for beginners who want to start performing distributed search analytics and visualization using core functionalities of Elasticsearch, Kibana and Logstash.[/box] In this tutorial, we will look at how to perform basic CRUD operations using Elasticsearch. Elasticsearch has a very well designed REST API, and the CRUD operations are targeted at documents. To understand how to perform CRUD operations, we will cover the following APIs. These APIs fall under the category of Document APIs that deal with documents: Index API Get API Update API Delete API Index API In Elasticsearch terminology, adding (or creating) a document into a type within an index of Elasticsearch is called an indexing operation. Essentially, it involves adding the document to the index by parsing all fields within the document and building the inverted index. This is why this operation is known as an indexing operation. There are two ways we can index a document: Indexing a document by providing an ID Indexing a document without providing an ID Indexing a document by providing an ID We have already seen this version of the indexing operation. The user can provide the ID of the document using the PUT method. The format of this request is PUT /<index>/<type>/<id>, with the JSON document as the body of the request: PUT /catalog/product/1 { "sku": "SP000001", "title": "Elasticsearch for Hadoop", "description": "Elasticsearch for Hadoop", "author": "Vishal Shukla", "ISBN": "1785288997", "price": 26.99 } Indexing a document without providing an ID If you don't want to control the ID generation for the documents, you can use the POST method. The format of this request is POST /<index>/<type>, with the JSON document as the body of the request: POST /catalog/product { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } The ID in this case will be generated by Elasticsearch. It is a hash string, as highlighted in the response: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true } As per pure REST conventions, POST is used for creating a new resource and PUT is used for updating an existing resource. Here, the usage of PUT is equivalent to saying I know the ID that I want to assign, so use this ID while indexing this document. Get API The Get API is useful for retrieving a document when you already know the ID of the document. It is essentially a get by primary key operation: GET /catalog/product/AVrASKqgaBGmnAMj1SBe The format of this request is GET /<index>/<type>/<id>. The response would be as Expected: { "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 1, "found": true, "_source": { "sku": "SP000003", "title": "Mastering Elasticsearch", "description": "Mastering Elasticsearch", "author": "Bharvi Dixit", "price": 54.99 } } Update API The Update API is useful for updating the existing document by ID. The format of an update request is POST <index>/<type>/<id>/_update with a JSON request as the body: POST /catalog/product/1/_update { "doc": { "price": "28.99" } } The properties specified under the "doc" element are merged into the existing document. The previous version of this document with ID 1 had price of 26.99. This update operation just updates the price and leaves the other fields of the document unchanged. This type of update means "doc" is specified and used as a partial document to merge with an existing document; there are other types of updates supported. The response of the update request is as follows: { "_index": "catalog", "_type": "product", "_id": "1", "_version": 2, "result": "updated", "_shards": { "total": 2, "successful": 1, "failed": 0 } } Internally, Elasticsearch maintains the version of each document. Whenever a document is updated, the version number is incremented. The partial update that we have seen above will work only if the document existed beforehand. If the document with the given id did not exist, Elasticsearch will return an error saying that document is missing. Let us understand how do we do an upsert operation using the Update API. The term upsert loosely means update or insert, i.e. update the document if it exists otherwise insert new document. The parameter doc_as_upsert checks if the document with the given id already exists and merges the provided doc with the existing document. If the document with the given id doesn't exist, it inserts a new document with the given document contents. The following example uses doc_as_upsert to merge into the document with id 3 or insert a new document if it doesn't exist. POST /catalog/product/3/_update { "doc": { "author": "Albert Paro", "title": "Elasticsearch 5.0 Cookbook", "description": "Elasticsearch 5.0 Cookbook Third Edition", "price": "54.99" }, "doc_as_upsert": true } We can update the value of a field based on the existing value of that field or another field in the document. The following update uses an inline script to increase the price by two for a specific product: POST /catalog/product/AVrASKqgaBGmnAMj1SBe/_update { "script": { "inline": "ctx._source.price += params.increment", "lang": "painless", "params": { "increment": 2 } } } Scripting support allows for the reading of the existing value, incrementing the value by a variable, and storing it back in a single operation. The inline script used here is Elasticsearch's own painless scripting language. The syntax for incrementing an existing variable is similar to most other programming languages. Delete API The Delete API lets you delete a document by ID: DELETE /catalog/product/AVrASKqgaBGmnAMj1SBe The response of the delete operations is as follows: { "found": true, "_index": "catalog", "_type": "product", "_id": "AVrASKqgaBGmnAMj1SBe", "_version": 4, "result": "deleted", "_shards": { "total": 2, "successful": 1, "failed": 0 } } This is how basic CRUD operations are performed with Elasticsearch using simple document APIs from any data source in any format securely and reliably. If you found this tutorial useful, do check out the book Learning Elastic Stack 6.0 and start building end-to-end real-time data processing solutions for your enterprise analytics applications.

0
0
31543

article-image-perform-predictive-forecasting-sap-analytics-cloud

Kunal Chaudhari

17 Feb 2018

7 min read

How to perform predictive forecasting in SAP Analytics Cloud

Kunal Chaudhari

17 Feb 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Riaz Ahmed titled Learning SAP Analytics Cloud. This book involves features of the SAP Analytics Cloud which will help you collaborate, predict and solve business intelligence problems with cloud computing.[/box] In this article we will learn how to use predictive forecasting with the help of a trend time series chart to see revenue trends in a range of a year. Time series forecasting is only supported for planning models in SAP Analytics Cloud. So, you need planning rights and a planning license to run a predictive time-series forecast. However, you can add predictive forecast by creating a trend time series chart based on an analytical model to estimate future values. In this article, you will use a trend time series chart to view net revenue trends throughout the range of a year. A predictive time-series forecast runs an algorithm on historical data to predict future values for specific measures. For this type of chart, you can forecast a maximum of three different measures, and you have to specify the time for the prediction and the past time periods to use as historical data. Add a blank chart from the Insert toolbar. Set Data Source to the BestRun_Demo model. Select the Time Series chart from the Trend category. In the Measures section, click on the Add Measure link, and select Net Revenue. Finally, click on the Add Dimension link in the Time section, and select Date as the chart’s dimension: The output of your selections is depicted in the first view in the following screenshot. Every chart you create on your story page has its own unique elements that let you navigate and drill into details. The trend time series chart also allows you to zoom in to different time periods and scroll across the entire timeline. For example, the first figure in the following illustration provides a one-year view (A) of net revenue trends, that is from January to December 2015. Click on the six months link (B) to see the corresponding output, as illustrated in the second view. Drag the rectangle box (C) to the left or right to scroll across the entire timeline: Adding a forecast Click on the last data point representing December 2015, and select Add Forecast from the More Actions menu (D) to add a forecast: You see the Predictive Forecast panel on the right side, which displays the maximum number of forecast periods. Using the slider (E) in this section, you can reduce the number of forecast periods. By default, you see the maximum number (in the current scenario, it is seven) in the slider, which is determined by the amount of historical data you have. In the Forecast On section, you see the measure (F) you selected for the chart. If required, you can forecast a maximum of three different measures in this type of chart that you can add in the Builder panel. For the time being, click on OK to accept the default values for the forecast, as illustrated in the following screenshot: The forecast will be added to the chart. It is indicated by a highlighted area (G) and a dotted line (H). Click on the 1 year link (I) to see an output similar to the one illustrated in the following screenshot under the Modifying forecast section. As you can see, there are several data points that represent forecast. The top and bottom of the highlighted area indicate the upper and lower bounds of the prediction range, and the data points fall in the middle (on the dotted line) of the forecast range for each time period. Select a data point to see the Upper Confidence Bound (J) and Lower Confidence Bound (K) values. Modifying forecast You can modify a forecast using the link provided in the Forecast section at the bottom of the Builder panel. Select the chart, and scroll to the bottom of the Builder panel. Click on the Edit icon (L) to see the Predictive Forecast panel again. Review your settings, and make the required changes in this panel. For example, drag the slider toward the left to set the Forecast Periods value to 3 (M). Click on OK to save your settings. The chart should now display the forecast for three months--January, February, and March 2016 (N): Adding a time calculation If you want to display values such as year-over-year sales trends or year-to-date totals in your chart, then you can utilize the time calculation feature of SAP Analytics Cloud. The time calculation feature provides you with several calculation options. In order to use this feature, your chart must contain a time dimension with the appropriate level of granularity. For example, if you want to see quarter-over-quarter results, the time dimension must include quarterly or even monthly results. The space constraint prevents us from going through all these options. However, we will utilize the year-over-year option to compare yearly results in this article to get an idea about this feature. Execute the following instructions to first create a bar chart that shows the sold quantities of the four product categories. Then, add a time calculation to the chart to reveal the year-over-year changes in quantity sold for each category. As usual, add a blank chart to the page using the chart option on the Insert toolbar. Select the Best Run model as Data Source for the chart. Select the Bar/Column chart from the Comparison category. In the Measures section, click on the Add Measure link, and select Quantity Sold. Click on the Add Dimension link in the Dimensions section, and select Product as the chart’s dimension, as shown here: The chart appears on the page. At this stage, if you click on the More icon representing Quantity sold, you will see that the Add Time Calculation option (A) is grayed out. This is because time calculations require a time dimension to the chart, which we will add next. Click on the Add Dimension link in the Dimensions section, and select Date to add this time dimension to the chart. The chart transforms, as illustrated in the following screenshot: To display the results in the chart at the year level, you need to apply a filter as follows: Click on the filter icon in the Date dimension, and select Filter by Member. In the Set Members for Date dialog box, expand the all node, and select 2014, 2015, and 2016, individually. Once again, the chart changes to reflect the application of filter, as illustrated in the following screenshot: Now that a time dimension has been added to the chart, we can add a time calculation to it as follows: Click on the More icon in the Quantity sold measure. Select Add Time Calculation from the menu. Choose Year Over Year. New bars (A) and a corresponding legend (B) will be added to the chart, which help you compare yearly results, as shown in the following screenshot: To summarize, we provided hands-on exposure on predictive forecasting in SAP Analytics Cloud, where you learned about how to use a trend time series chart to view net revenue trends throughout the range of a year. If you enjoyed this excerpt, check out the book Learning SAP Analytics Cloud, to get an understanding of SAP Analytics Cloud platform and how to create better BI solutions.

0
0
17375

article-image-implement-memory-oltp-sql-server-linux

Fatema Patrawala

17 Feb 2018

11 min read

How to implement In-Memory OLTP on SQL Server in Linux

Fatema Patrawala

17 Feb 2018

11 min read

0
0
26486

article-image-use-labview-data-acquisition

Fatema Patrawala

17 Feb 2018

14 min read

How to use LabVIEW for data acquisition

Fatema Patrawala

17 Feb 2018

14 min read

0
0
7491

article-image-4-ways-implement-feature-selection-python-machine-learning

Sugandha Lahoti

16 Feb 2018

13 min read

4 ways to implement feature selection in Python for machine learning

Sugandha Lahoti

16 Feb 2018

13 min read

[box type="note" align="" class="" width=""]This article is an excerpt from Ensemble Machine Learning. This book serves as a beginner's guide to combining powerful machine learning algorithms to build optimized models.[/box] In this article, we will look at different methods to select features from the dataset; and discuss types of feature selection algorithms with their implementation in Python using the Scikit-learn (sklearn) library: Univariate selection Recursive Feature Elimination (RFE) Principle Component Analysis (PCA) Choosing important features (feature importance) We have explained first three algorithms and their implementation in short. Further we will discuss Choosing important features (feature importance) part in detail as it is widely used technique in the data science community. Univariate selection Statistical tests can be used to select those features that have the strongest relationships with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. The following example uses the chi squared (chi^2) statistical test for non-negative features to select four of the best features from the Pima Indians onset of diabetes dataset: #Feature Extraction with Univariate Statistical Tests (Chi-squared for classification) #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's feature selection algorithm from sklearn.feature_selection import SelectKBest #Import chi2 for performing chi square test from sklearn.feature_selection import chi2 #URL for loading the dataset url ="https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] #Create pandas data frame by loading the data from URL dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #We will select the features using chi square test = SelectKBest(score_func=chi2, k=4) #Fit the function for ranking the features by score fit = test.fit(X, Y) #Summarize scores numpy.set_printoptions(precision=3) print(fit.scores_) #Apply the transformation on to dataset features = fit.transform(X) #Summarize selected features print(features[0:5,:]) You can see the scores for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age. Scores for each feature: [111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304] Selected Features: [[148. 0. 33.6 50. ] [85. 0. 26.6 31. ] [183. 0. 23.3 32. ] [89. 94. 28.1 21. ] [137. 168. 43.1 33. ]] Recursive Feature Elimination RFE works by recursively removing attributes and building a model on attributes that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation. The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter too much as long as it is skillful and consistent: #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's feature selection algorithm from sklearn.feature_selection import RFE #Import LogisticRegression for performing chi square test from sklearn.linear_model import LogisticRegression #URL for loading the dataset url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-dia betes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] #Create pandas data frame by loading the data from URL dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #Feature extraction model = LogisticRegression() rfe = RFE(model, 3) fit = rfe.fit(X, Y) print("Num Features: %d"% fit.n_features_) print("Selected Features: %s"% fit.support_) print("Feature Ranking: %s"% fit.ranking_) After execution, we will get: Num Features: 3 Selected Features: [ True False False False False True True False] Feature Ranking: [1 2 3 5 6 1 1 4] You can see that RFE chose the the top three features as preg, mass, and pedi. These are marked True in the support_ array and marked with a choice 1 in the ranking_ array. Principle Component Analysis PCA uses linear algebra to transform the dataset into a compressed form. Generally, it is considered a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. In the following example, we use PCA and select three principal components: #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's PCA algorithm from sklearn.decomposition import PCA #URL for loading the dataset url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #Feature extraction pca = PCA(n_components=3) fit = pca.fit(X) #Summarize components print("Explained Variance: %s") % fit.explained_variance_ratio_ print(fit.components_) You can see that the transformed dataset (three principal components) bears little resemblance to the source data: Explained Variance: [ 0.88854663 0.06159078 0.02579012] [[ -2.02176587e-03 9.78115765e-02 1.60930503e-02 6.07566861e-02 9.93110844e-01 1.40108085e-02 5.37167919e-04 -3.56474430e-03] [ -2.26488861e-02 -9.72210040e-01 -1.41909330e-01 5.78614699e-02 9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01 [ -2.24649003e-02 1.43428710e-01 -9.22467192e-01 -3.07013055e-01 2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]] Choosing important features (feature importance) Feature importance is the technique used to select features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's understand it in detail. Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection—mean decrease impurity and mean decrease accuracy. A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is known as impurity. For classification, it is typically either the Gini impurity or information gain/entropy, and for regression trees, it is the variance. Thus when training a tree, it can be computed by how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure. Let's see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset. This dataset is available for free from kaggle (you will need to sign up to kaggle to be able to download this dataset). You can download training dataset, train.csv.zip, from the https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory. This dataset describes 93 obfuscated details of more than 61,000 products grouped into 10 product categories (for example, fashion, electronics, and so on). Input attributes are the counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities for each of the 10 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy). We will start with importing all of the libraries: #Import the supporting libraries #Import pandas to load the dataset from csv file from pandas import read_csv #Import numpy for array based operations and calculations import numpy as np #Import Random Forest classifier class from sklearn from sklearn.ensemble import RandomForestClassifier #Import feature selector class select model of sklearn from sklearn.feature_selection import SelectFromModel np.random.seed(1) Let's define a method to split our dataset into training and testing data; we will train our dataset on the training part and the testing part will be used for evaluation of the trained model: #Function to create Train and Test set from the original dataset def getTrainTestData(dataset,split): np.random.seed(0) training = [] testing = [] np.random.shuffle(dataset) shape = np.shape(dataset) trainlength = np.uint16(np.floor(split*shape[0])) for i in range(trainlength): training.append(dataset[i]) for i in range(trainlength,shape[0]): testing.append(dataset[i]) training = np.array(training) testing = np.array(testing) return training,testing We also need to add a function to evaluate the accuracy of the model; it will take the predicted and actual output as input to calculate the percentage accuracy: #Function to evaluate model performance def getAccuracy(pre,ytest): count = 0 for i in range(len(ytest)): if ytest[i]==pre[i]: count+=1 acc = float(count)/len(ytest) return acc This is the time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances. We will use 50000 instances for our example, in which we will use 35,000 instances to train the classifier and 15,000 instances to test the performance of the classifier: #Load dataset as pandas data frame data = read_csv('train.csv') #Extract attribute names from the data frame feat = data.keys() feat_labels = feat.get_values() #Extract data values from the data frame dataset = data.values #Shuffle the dataset np.random.shuffle(dataset) #We will select 50000 instances to train the classifier inst = 50000 #Extract 50000 instances from the dataset dataset = dataset[0:inst,:] #Create Training and Testing data for performance evaluation train,test = getTrainTestData(dataset, 0.7) #Split data into input and output variable with selected features Xtrain = train[:,0:94] ytrain = train[:,94] shape = np.shape(Xtrain) print("Shape of the dataset ",shape) #Print the size of Data in MBs print("Size of Data set before feature selection: %.2f MB"%(Xtrain.nbytes/1e6)) Let's take note of the data size here; as our dataset contains about 35000 training instances with 94 attributes; the size of our dataset is quite large. Let's see: Shape of the dataset (35000, 94) Size of Data set before feature selection: 26.32 MB As you can see, we are having 35000 rows and 94 columns in our dataset, which is more than 26 MB data. In the next code block, we will configure our random forest classifier; we will use 250 trees with a maximum depth of 30 and the number of random features will be 7. Other hyperparameters will be the default of sklearn: #Lets select the test data for model evaluation purpose Xtest = test[:,0:94] ytest = test[:,94] #Create a random forest classifier with the following Parameters trees = 250 max_feat = 7 max_depth = 30 min_sample = 2 clf = RandomForestClassifier(n_estimators=trees, max_features=max_feat, max_depth=max_depth, min_samples_split= min_sample, random_state=0, n_jobs=-1) #Train the classifier and calculate the training time import time start = time.time() clf.fit(Xtrain, ytrain) end = time.time() #Lets Note down the model training time print("Execution time for building the Tree is: %f"%(float(end)- float(start))) pre = clf.predict(Xtest) Let's see how much time is required to train the model on the training dataset: Execution time for building the Tree is: 2.913641 #Evaluate the model performance for the test data acc = getAccuracy(pre, ytest) print("Accuracy of model before feature selection is %.2f"%(100*acc)) The accuracy of our model is: Accuracy of model before feature selection is 98.82 As you can see, we are getting very good accuracy as we are classifying almost 99% of the test data into the correct categories. This means we are classifying about 14,823 instances out of 15,000 in correct classes. So, now my question is: should we go for further improvement? Well, why not? We should definitely go for more improvements if we can; here, we will use feature importance to select features. As you know, in the tree building process, we use impurity measurement for node selection. The attribute value that has the lowest impurity is chosen as the node in the tree. We can use similar criteria for feature selection. We can give more importance to features that have less impurity, and this can be done using the feature_importances_ function of the sklearn library. Let's find out the importance of each feature: #Once we have trained the model we will rank all the features for feature in zip(feat_labels, clf.feature_importances_): print(feature) ('id', 0.33346650420175183) ('feat_1', 0.0036186958628801214) ('feat_2', 0.0037243050888530957) ('feat_3', 0.011579217472062748) ('feat_4', 0.010297382675187445) ('feat_5', 0.0010359139416194116) ('feat_6', 0.00038171336038056165) ('feat_7', 0.0024867672489765021) ('feat_8', 0.0096689721610546085) ('feat_9', 0.007906150362995093) ('feat_10', 0.0022342480802130366) As you can see here, each feature has a different importance based on its contribution to the final prediction. We will use these importance scores to rank our features; in the following part, we will select those features that have feature importance more than 0.01 for model training: #Select features which have higher contribution in the final prediction sfm = SelectFromModel(clf, threshold=0.01) sfm.fit(Xtrain,ytrain) Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset. Then, we will check the size and shape of the new dataset: #Transform input dataset Xtrain_1 = sfm.transform(Xtrain) Xtest_1 = sfm.transform(Xtest) #Let's see the size and shape of new dataset print("Size of Data set before feature selection: %.2f MB"%(Xtrain_1.nbytes/1e6)) shape = np.shape(Xtrain_1) print("Shape of the dataset ",shape) Size of Data set before feature selection: 5.60 MB Shape of the dataset (35000, 20) Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the database from 26 MB to 5.60 MB. That's about 80% reduction from the original dataset. In the next code block, we will train a new random forest classifier with the same hyperparameters as earlier and test it on the testing dataset. Let's see what accuracy we get after modifying the training set: #Model training time start = time.time() clf.fit(Xtrain_1, ytrain) end = time.time() print("Execution time for building the Tree is: %f"%(float(end)- float(start))) #Let's evaluate the model on test data pre = clf.predict(Xtest_1) count = 0 acc2 = getAccuracy(pre, ytest) print("Accuracy after feature selection %.2f"%(100*acc2)) Execution time for building the Tree is: 1.711518 Accuracy after feature selection 99.97 Can you see that!! We have got 99.97 percent accuracy with the modified dataset, which means we are classifying 14,996 instances in correct classes, while previously we were classifying only 14,823 instances correctly. This is a huge improvement we have got with the feature selection process; we can summarize all the results in the following table: Evaluation criteria Before feature selection After feature selection Number of features 94 20 Size of dataset 26.32 MB 5.60 MB Training time 2.91 seconds 1.71 seconds Accuracy 98.82 percent 99.97 percent The preceding table shows the practical advantages of feature selection. You can see that we have reduced the number of features significantly, which reduces the model complexity and dimensions of the dataset. We are getting less training time after the reduction in dimensions, and at the end, we have overcome the overfitting issue, getting higher accuracy than before. To summarize the article, we explored 4 ways of feature selection in machine learning. If you found this post is useful, do check out the book Ensemble Machine Learning to know more about stacking generalization among other techniques.

0
4
99736

article-image-install-elasticsearch-ubuntu-windows

Fatema Patrawala

16 Feb 2018

3 min read

How to install Elasticsearch in Ubuntu and Windows

Fatema Patrawala

16 Feb 2018

3 min read

[box type="note" align="" class="" width=""]This article is an extract from the book, Mastering Elastic Stack co-authored by Ravi Kumar Gupta and Yuvraj Gupta.This book will brush you up with basic knowledge on implementing the Elastic Stack and then dives deep into complex and advanced implementations. [/box] In today’s tutorial we aim to learn Elasticsearch v5.1.1 installation for Ubuntu and Windows. Installation of Elasticsearch on Ubuntu 14.04 In order to install Elasticsearch on Ubuntu, refer to the following steps: Download Elasticsearch 5.1.1 as a debian package using terminal: wget https://artifacts.elastic.co /downloads/elasticsearch/elasticsearch-5.1.1.deb 2. Install the debian package using following command: sudo dpkg -i elasticsearch-5.1.1.deb Elasticsearch will be installed in /usr/share/elasticsearch directory. The configuration files will be present at /etc/elasticsearch. The init script will be present at /etc/init.d/elasticsearch. The log files will be present within /var/log/elasticsearch directory. 3. Configure Elasticsearch to run automatically on bootup . If you are using SysV init distribution, then run the following command: sudo update-rc.d elasticsearch defaults 95 10 The preceding command will print on screen: Adding system startup for, /etc/init.d/elasticsearch Check status of Elasticsearch using following command: sudo service elasticsearch status Run Elasticsearch as a service using following command: sudo service elasticsearch start Elasticsearch may not start if you have any plugin installed which is not supported in ES-5.0.x version onwards. As plugins have been deprecated, it is required to uninstall any plugin if exists in prior version of ES. Remove a plugin after going to ES Home using following command: bin/elasticsearch-plugin remove head Usage of Elasticsearch command: sudo service elasticsearch {start|stop|restart|force- reload|status} If you are using systemd distribution, then run following command: sudo /bin/systemctl daemon-reload sudo /bin/systemctl enable elasticsearch.service To verify elasticsearch installation open open http://localhost:9200 in browser or run the following command from command line: curl -X GET http://localhost:9200 Installation of Elasticsearch on Windows In order to install Elasticsearch on Windows, refer to the following steps: Download Elasticsearch 5.1.1 version from its site using the following link: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch -5.1.1.zip Upon opening the link, click on it and it will download the ZIP package. 2. Extract the downloaded ZIP package by unzipping it using WinRAR, 7-Zip, and other such extracting softwares (if you don't have one of these then download it). This will extract the files and folders in the directory. 3. Then click on the extracted folder and navigate the folder to reach inside the bin folder. 4. Click on the elasticsearch.bat file to run Elasticsearch. If this window is closed Elasticsearch will stop running, as the node will shut down. 5. To verify Elasticsearch installation, open http://localhost:9200 in the browser: Installation of Elasticsearch as a service After installing Elasticsearch as previously mentioned, open Command Prompt after navigating to the bin folder and use the following command: elasticsearch-service.bat install Usage: elasticsearch-service.bat install | remove | start | stop | manager To summarize, we learnt installation of Elasticsearch on Ubuntu and Windows. If you are keen to know more about how to work with the Elastic Stack in a production environment, you can grab our comprehensive guide Mastering Elastic Stack.

0
0
56496

article-image-build-enable-jenkins-mesos-plugin

Vijin Boricha

16 Feb 2018

4 min read

How to build and enable the Jenkins Mesos plugin

Vijin Boricha

16 Feb 2018

4 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book by David Blomquist and Tomasz Janiszewski titled Apache Mesos Cookbook. From this book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box] In today’s tutorial, we will learn about building and enabling the Jenkins Mesos plugin. Building the Jenkins Mesos plugin By default, Jenkins uses statically created agents and runs jobs on them. We can extend this behavior with a plugin that will make Jenkins use Mesos as a resource manager. Jenkins will register as a Mesos framework and accept offers when it needs to run a job. How to do it The Jenkins Mesos plugin installation is a little bit harder than Marathon. There are no official binary packages for it, so it must be installed from sources: First of all, we need to download the source code: curl -L https://github.com/jenkinsci/mesos-plugin/archive/mesos-0.14.0.tar. gz | tar -zx cd jenkinsci-mesos-plugin-* The plugin is written in Java and to build it we need Maven (mvn): sudo apt install maven Finally, build the package: mvn package If everything goes smoothly, you should see information, that all tests passed and the plugin package will be placed in target/mesos.hpi. Jenkins is written in Java and presents an API for creating plugins. Plugins do not have to be written in Java, but must be compatible with those interfaces so most plugins are written in Java. The natural choice for building a Java application is Maven, although Gradle is getting more and more popular. The Jenkins Mesos plugin uses the Mesos native library to communicate with Mesos. This communication is now deprecated, so the plugin does not support all Mesos features that are available with the Mesos HTTP API. Enabling the Jenkins Mesos plugin Here you will learn how to enable the Mesos Jenkins plugin and configure a job to be run on Mesos. How to do it... The first step is to install the Mesos Jenkins plugin. To do so, navigate to the Plugin Manager by clicking Manage Jenkins | Manage Plugins, and select the Advanced tab. You should see the following screen: Click Choose file and select the previously built plugin to upload it. Once the plugin is installed, you have to configure it. To do so, go to the configuration (Manage Jenkins | Configure System). At the bottom of the page, the cloud section should appear. Fill in all the fields with the desired configuration values: This was the last step of the plugin installation. If you now disable Advanced On- demand framework registration, you should see the Jenkins Scheduler registered in the Mesos frameworks. Remember to configure Slave username to the existing system user on Mesos agents. It will be used to run your jobs. By default, it will be jenkins. You can create it on slaves with the following command: adduser jenkins Be careful when providing an IP or hostnames for Mesos and Jenkins. It must match the IP used later by the scheduler for communication. By default, the Mesos native library binds to the interface that the hostname resolves to. This could lead to problems in communication, especially when receiving messages from Mesos. If you see your Jenkins is connected, but jobs are stuck and agents do not start, check if Jenkins is registered with the proper IP. You can set the IP used by Jenkins by adding the following line in /etc/default/jenkins (in this example, we assume Jenkins should bind on 10.10.10.10): LIBPROCESS_IP=10.10.10.10 We learnt about building and enabling Jenkins Mesos plugin. You can know more about how to configure and maintain Apache Mesos from Apache Mesos Cookbook.

0
0
11204

article-image-manipulating-text-data-using-python-regular-expressions-regex

Sugandha Lahoti

16 Feb 2018

8 min read

Manipulating text data using Python Regular Expressions (regex)

Sugandha Lahoti

16 Feb 2018

8 min read

0
0
62366

article-image-auto-generate-texts-shakespeare-writing-using-deep-recurrent-neural-networks

Savia Lobo

16 Feb 2018

6 min read

How to auto-generate texts from Shakespeare writing using deep recurrent neural networks

Savia Lobo

16 Feb 2018

6 min read

[box type="note" align="" class="" width=""]Our article is an excerpt from a book co-authored by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled as Natural Language Processing with Python Cookbook. This book will give unique recipes to know various aspects of performing Natural Language Processing with NLTK—a leading Python platform for NLP.[/box] Today we will learn to use deep recurrent neural networks (RNN) to predict the next character based on the given length of a sentence. This way of training a model is able to generate automated text continuously, which can imitate the writing style of the original writer with enough training on the number of epochs and so on. Getting ready... The Project Gutenberg eBook of the complete works of William Shakespeare's dataset is used to train the network for automated text generation. Data can be downloaded from http:// www.gutenberg.org/ for the raw file used for training: >>> from future import print_function >>> import numpy as np >>> import random >>> import sys The following code is used to create a dictionary of characters to indices and vice-versa mapping, which we will be using to convert text into indices at later stages. This is because deep learning models cannot understand English and everything needs to be mapped into indices to train these models: >>> path = 'C:UsersprataDocumentsbook_codes NLP_DL shakespeare_final.txt' >>> text = open(path).read().lower() >>> characters = sorted(list(set(text))) >>> print('corpus length:', len(text)) >>> print('total chars:', len(characters)) >>> char2indices = dict((c, i) for i, c in enumerate(characters)) >>> indices2char = dict((i, c) for i, c in enumerate(characters)) How to do it… Before training the model, various preprocessing steps are involved to make it work. The following are the major steps involved: Preprocessing: Prepare X and Y data from the given entire story text file and converting them into indices vectorized format. Deep learning model training and validation: Train and validate the deep learning model. Text generation: Generate the text with the trained model. How it works... The following lines of code describe the entire modeling process of generating text from Shakespeare's writings. Here we have chosen character length. This needs to be considered as 40 to determine the next best single character, which seems to be very fair to consider. Also, this extraction process jumps by three steps to avoid any overlapping between two consecutive extractions, to create a dataset more fairly: # cut the text in semi-redundant sequences of maxlen characters >>> maxlen = 40 >>> step = 3 >>> sentences = [] >>> next_chars = [] >>> for i in range(0, len(text) - maxlen, step): ... sentences.append(text[i: i + maxlen]) ... next_chars.append(text[i + maxlen]) ... print('nb sequences:', len(sentences)) The following screenshot depicts the total number of sentences considered, 193798, which is enough data for text generation: The next code block is used to convert the data into a vectorized format for feeding into deep learning models, as the models cannot understand anything about text, words, sentences and so on. Initially, total dimensions are created with all zeros in the NumPy array and filled with relevant places with dictionary mappings: # Converting indices into vectorized format >>> X = np.zeros((len(sentences), maxlen, len(characters)), dtype=np.bool) >>> y = np.zeros((len(sentences), len(characters)), dtype=np.bool) >>> for i, sentence in enumerate(sentences): ... for t, char in enumerate(sentence): ... X[i, t, char2indices[char]] = 1 ... y[i, char2indices[next_chars[i]]] = 1 >>> from keras.models import Sequential >>> from keras.layers import Dense, LSTM,Activation,Dropout >>> from keras.optimizers import RMSprop The deep learning model is created with RNN, more specifically Long Short-Term Memory networks with 128 hidden neurons, and the output is in the dimensions of the characters. The number of columns in the array is the number of characters. Finally, the softmax function is used with the RMSprop optimizer. We encourage readers to try with other various parameters to check out how results vary: #Model Building >>> model = Sequential() >>> model.add(LSTM(128, input_shape=(maxlen, len(characters)))) >>> model.add(Dense(len(characters))) >>> model.add(Activation('softmax')) >>> model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01)) >>> print (model.summary()) As mentioned earlier, deep learning models train on number indices to map input to output (given a length of 40 characters, the model will predict the next best character). The following code is used to convert the predicted indices back to the relevant character by determining the maximum index of the character: # Function to convert prediction into index >>> def pred_indices(preds, metric=1.0): ... preds = np.asarray(preds).astype('float64') ... preds = np.log(preds) / metric ... exp_preds = np.exp(preds) ... preds = exp_preds/np.sum(exp_preds) ... probs = np.random.multinomial(1, preds, 1) ... return np.argmax(probs) The model will be trained over 30 iterations with a batch size of 128. And also, the diversity has been changed to see the impact on the predictions: # Train and Evaluate the Model >>> for iteration in range(1, 30): ... print('-' * 40) ... print('Iteration', iteration) ... model.fit(X, y,batch_size=128,epochs=1).. ... start_index = random.randint(0, len(text) - maxlen - 1) ... for diversity in [0.2, 0.7,1.2]: ... print('n----- diversity:', diversity) ... generated = '' ... sentence = text[start_index: start_index + maxlen] ... generated += sentence ... print('----- Generating with seed: "' + sentence + '"') ... sys.stdout.write(generated) ... for i in range(400): ... x = np.zeros((1, maxlen, len(characters))) ... for t, char in enumerate(sentence): ... x[0, t, char2indices[char]] = 1. ... preds = model.predict(x, verbose=0)[0] ... next_index = pred_indices(preds, diversity) ... pred_char = indices2char[next_index] ... generated += pred_char ... sentence = sentence[1:] + pred_char ... sys.stdout.write(pred_char) ... sys.stdout.flush() ... print("nOne combination completed n") The results are shown in the next screenshot to compare the first iteration (Iteration 1) and final iteration (Iteration 29). It is apparent that with enough training, the text generation seems to be much better than with Iteration 1: Text generation after Iteration 29 is shown in this image: Though the text generation seems to be magical, we have generated text using Shakespeare's writings, proving that with the right training and handling, we can imitate any style of writing of a particular writer. If you found this post useful, you may check out this book Natural Language Processing with Python Cookbook to analyze sentence structure and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and other NLP techniques.

0
0
33693

article-image-6-popular-regression-techniques-must-know

Amey Varangaonkar

15 Feb 2018

8 min read

6 Popular Regression Techniques you must know

Amey Varangaonkar

15 Feb 2018

8 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Statistics for Data Science, authored by IBM expert James D. Miller. This book gives a statistical view of building smart data models to help you get unique insights from the data.[/box] In this article, we introduce you to the concept of regression analysis, one of the most popular machine learning algorithms. -You will learn what is regression analysis, the different types of regression, and how to choose the right regression technique to build your data model. What is Regression Analysis? For starters, regression analysis or statistical regression is a process for estimating the relationships among variables. This process encompasses numerous techniques for modeling and analyzing variables, focusing on the relationship between a dependent variable and one (or more) independent variable (or predictors). Regression analysis is the work done to identify and understand how the (best representative) value of a dependent variable (a variable that depends on other factors) changes when any one of the independent variables (a variable that stands alone and isn't changed by the other variables) is changed while the other independent variables stay the same. A simple example might be how the total dollars spent on marketing (an independent variable example) impacts the total sales dollars (a dependent variable example) over a period of time (is it really as simple as more marketing equates to higher sales?), or perhaps there is a correlation between the total marketing dollars spent (independent variable), discounting a products price (another independent variable), and the amount of sales (a dependent variable)? [box type="info" align="" class="" width=""]Keep in mind this key point that regression analysis is used to understand which among the independent variables are related to the dependent variable(s), not just the relationship of these variables. Also, the inference of causal relationships (between the independent and dependent variables) is an important objective. However, this can lead to illusions or false relationships, so caution is recommended![/box] Overall, regression analysis can be thought of as estimating the conditional expectations of the value of the dependent variable, given the independent variables being observed, that is, endeavoring to predict the average value of the dependent variable when the independent variables are set to certain values. I call this the lever affect—meaning when one increases or decreases a value of one component, it directly affects the value at least one other (variable). An alternate objective of the process of regression analysis is the establishment of location parameters or the quantile of a distribution. In other words, this idea is to determine values that may be a cutoff, dividing a range of a probability distribution values. You'll find that regression analysis can be a great tool for prediction and forecasting (not just complex machine learning applications). We'll explore some real-world examples later, but for now, let's us look at some techniques for the process. Popular regression techniques and approaches You'll find that various techniques for carrying out regression analysis have been developed and accepted.These are: Linear Logistic Polynomial Stepwise Ridge Lasso Linear regression Linear regression is the most basic type of regression and is commonly used for predictive analysis projects. In fact, when you are working with a single predictor (variable), we call it simple linear regression, and if there are multiple predictor variables, we call it multiple linear regression. Simply put, linear regression uses linear predictor functions whose values are estimated from the data in the model. Logistic regression Logistic regression is a regression model where the dependent variable is a categorical variable. This means that the variable only has two possible values, for example, pass/fail, win/lose, alive/dead, or healthy/sick. If the dependent variable has more than two possible values, one can use various modified logistic regression techniques, such as multinomial logistic regression, ordinal logistic regression, and so on. Polynomial regression When we speak of polynomial regression, the focus of this technique is on modeling the relationship between the independent variable and the dependent variable as an nth degree polynomial. Polynomial regression is considered to be a special case of multiple linear regressions. The predictors resulting from the polynomial expansion of the baseline predictors are known as interactive features. Stepwise regression Stepwise regression is a technique that uses some kind of automated procedure to continually execute a step of logic, that is, during each step, a variable is considered for addition to or subtraction from the set of independent variables based on some prespecified criterion. Ridge regression Often predictor variables are identified as being interrelated. When this occurs, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which ones are left out. Ridge regression is a technique where a small bias factor is added to the selected variables in order to improve this situation. Therefore, ridge regression is actually considered a remedial measure to alleviate multicollinearity amongst predictor variables. Lasso regression Lasso (Least Absolute Shrinkage Selector Operator) regression is a technique where both predictor variable selection and regularization are performed in order to improve the prediction accuracy and interpretability of the result it produces. Which technique should I choose? In addition to the aforementioned regression techniques, there are numerous others to consider with, most likely, more to come. With so many options, it's important to choose the technique that is right for your data and your project. Rather than selecting the right regression approach, it is more about selecting the most effective regression approach. Typically, you use the data to identify the regression approach you'll use. You start by establishing statistics or a profile for your data. With this effort, you need to identify and understand the importance of the different variables, their relationships, coefficient signs, and their effect. Overall, here's some generally good advice for choosing the right regression approach from your project: Copy what others have done and had success with. Do the research. Incorporate the results of other projects into yours. Don't reinvent the wheel. Also, even if an observed approach doesn't quite fit as it was used, perhaps some simple adjustments would make it a good choice. Keep your approach as simple as possible. Many studies show that simpler models generally produce better predictions. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers. Check your work. As you evaluate methods, check the residual plots (more on this in the next section of this chapter) because they can help you avoid inadequate models and adjust your model for better results. Use your subject matter expertise. No statistical method can understand the underlying process or subject area the way you do. Your knowledge is a crucial part and, most likely, the most reliable way of determining the best regression approach for your project. Does it fit? After selecting a model that you feel is appropriate for use with your data (also known as determining that the approach is the best fit), you need to validate your selection, that is, determine its fit. A well-fitting regression model results in predicted values close to the observed data values. The mean model (which uses the mean for every predicted value) would generally be used if there were no informative predictor variables. The fit of a proposed regression model should, therefore, be better than the fit of the mean model. As a data scientist, you will need to scrutinize the coefficients of determination, measure the standard error of estimate, analyze the significance of regression parameters and confidence intervals. [box type="info" align="" class="" width=""]Remember that the better the fit of a regression model, most likely the better the precision in, or just better, the results.[/box] Finally, it has been proven that simple models produce more accurate results! Keep this in mind always when selecting an approach or a technique, and even when the problem might be complex, it is not always obligatory to adopt a complex regression approach. Choosing the right technique, though, goes a long way in developing an accurate model. If you found this excerpt useful, make sure to check out this book Statistics for Data Science for tips on building effective data models by leveraging the power of the statistical tools and techniques.

0
0
32430

article-image-how-to-use-r-to-boost-your-data-model

Amey Varangaonkar

15 Feb 2018

8 min read

How to use R to boost your Data Model

Amey Varangaonkar

15 Feb 2018

8 min read

[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Statistics for Data Science, written by James D. Miller. This book is a comprehensive primer on the basic concepts of statistics and their application in different data science tasks.[/box] In this article, we explain the implementation of boosting - a popular technique used to improve the performance of a data model - using the popular R programming language. We will take a high-level look at a thought-provoking prediction problem drawn from Mastering Predictive Analytics with R, Second Edition, by James D. Miller and Rui Miguel Forte. Here, an original example of patterns made by radiation on a telescope camera are analyzed in an attempt to predict whether a certain pattern came from gamma rays leaking into the atmosphere or from regular background radiation. Gamma rays leave distinctive elliptical patterns and so we can create a set of features to describe these. The dataset used is the MAGIC Gamma Telescope Data Set, hosted by the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope This data consists of 19,020 observations, holding the following list of attributes: Prepping the data First, various steps need to be performed on our example data. The data is first loaded into an R data frame object named magic, recoding the CLASS output variable to use classes 1 and -1 for gamma rays and background radiation respectively: > magic <- read.csv("magic04.data", header = FALSE) > names(magic) <- c("FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1", "FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST", "CLASS") > magic$CLASS <- as.factor(ifelse(magic$CLASS =='g', 1, -1)) Next, the data is split into two files: a training data and a test data frame using an 80-20 split: > library(caret) > set.seed(33711209) > magic_sampling_vector <- createDataPartition(magic$CLASS, p = 0.80, list = FALSE) > magic_train <- magic[magic_sampling_vector, 1:10] > magic_train_output <- magic[magic_sampling_vector, 11] > magic_test <- magic[-magic_sampling_vector, 1:10] > magic_test_output <- magic[-magic_sampling_vector, 11] The model used for boosting is a simple multilayer perceptron with a single hidden layer leveraging R's nnet package. Neural networks, often produce higher accuracy when inputs are normalized, so, in this example, before training any models, this preprocessing is performed: > magic_pp <- preProcess(magic_train, method = c("center", "scale")) > magic_train_pp <- predict(magic_pp, magic_train) > magic_train_df_pp <- cbind(magic_train_pp, CLASS = magic_train_output) > magic_test_pp <- predict(magic_pp, magic_test) Training Boosting is designed to work best with weak learners, so a very small number of hidden neurons in the model's hidden layer are used. Concretely, we will begin with the simplest possible multilayer perceptron that uses a single hidden neuron. To understand the effect of using boosting, a baseline performance is established by training a single neural network (and measuring its performance). This is to accomplish the following: > library(nnet) > n_model <- nnet(CLASS ~ ., data = magic_train_df_pp, size = 1) > n_test_predictions <- predict(n_model, magic_test_pp, type = "class") > (n_test_accuracy <- mean(n_test_predictions == magic_test_output)) [1] 0.7948988 This establishes that we have a baseline accuracy of around 79.5 percent. Not too bad, but can boost to improve upon this score? To that end, the function AdaBoostNN(), which is shown as follows, is used. This function will take input from a data frame, the name of the output variable, the number of single hidden layer neural network models to be built, and finally, the number of hidden units these neural networks will have. The function will then implement the AdaBoost algorithm and return a list of models with their corresponding weights. Here is the function: AdaBoostNN <- function(training_data, output_column, M, hidden_units) { require("nnet") models <- list() alphas <- list() n <- nrow(training_data) model_formula <- as.formula(paste(output_column, '~ .', sep = '')) w <- rep((1/n), n) for (m in 1:M) { model <- nnet(model_formula, data = training_data, size = hidden_units, weights = w) models[[m]] <- model predictions <- as.numeric(predict(model, training_data[, -which(names(training_data) == output_column)], type = "class")) errors <- predictions != training_data[, output_column] error_rate <- sum(w * as.numeric(errors)) / sum(w) alpha <- 0.5 * log((1 - error_rate) / error_rate) alphas[[m]] <- alpha temp_w <- mapply(function(x, y) if (y) { x * exp(alpha) } else { x * exp(-alpha)}, w, errors) w <- temp_w / sum(temp_w) } return(list(models = models, alphas = unlist(alphas))) } The preceding function uses the following logic: First, initialize empty lists of models and model weights (alphas). Compute the number of observations in the training data, storing this in the variable n. The name of the output column provided is then used to create a formula that describes the neural network that will be built. In the dataset used, this formula will be CLASS ~ ., meaning that the neural network will compute CLASS as a function of all the other columns as input features. Next, initialize the weights vector and define a loop that will run for M iterations in order to build M models. In every iteration, the first step is to use the current setting of the weights vector to train a neural network using as many hidden units as specified in the input, hidden_units. Then, compute a vector of predictions that the model generates on the training data using the predict() function. By comparing these predictions to the output column of the training data, calculate the errors that the current model makes on the training data. This then allows the computation of the error rate. This error rate is set as the weight of the current model and, finally, the observation weights to be used in the next iteration of the loop are updated according to whether each observation was correctly classified. The weight vector is then normalized and we are ready to begin the next iteration! After completing M iterations, output a list of models and their corresponding model weights. Ready for boosting There is now a function able to train our ensemble classifier using AdaBoost, but we also need a function to make the actual predictions. This function will take in the output list produced by our training function, AdaBoostNN(), along with a test dataset. This function is AdaBoostNN.predict() and it is shown as follows: AdaBoostNN.predict <- function(ada_model, test_data) { models <- ada_model$models alphas <- ada_model$alphas prediction_matrix <- sapply(models, function (x) as.numeric(predict(x, test_data, type = "class"))) weighted_predictions <- t(apply(prediction_matrix, 1, function(x) mapply(function(y, z) y * z, x, alphas))) final_predictions <- apply(weighted_predictions, 1, function(x) sign(sum(x))) return(final_predictions) } This function first extracts the models and the model weights (from the list produced by the previous function). A matrix of predictions is created, where each column corresponds to the vector of predictions made by a particular model. Thus, there will be as many columns in this matrix as the models that we used for boosting. We then multiply the predictions produced by each model with their corresponding model weight. For example, every prediction from the first model is in the first column of the prediction matrix and will have its value multiplied by the first model weight α1. .Lastly, the matrix of weighted observations is reduced into a single vector of observations by summing the weighted predictions for each observation and taking the sign of the result. This vector of predictions is then returned by the function. As an experiment, we will train ten neural network models with a single hidden unit and see if boosting improves accuracy: > ada_model <- AdaBoostNN(magic_train_df_pp, 'CLASS', 10, 1) > predictions <- AdaBoostNN.predict(ada_model, magic_test_pp, 'CLASS') > mean(predictions == magic_test_output) [1] 0.804365 We see in this example, boosting ten models shows a marginal improvement in accuracy, but perhaps training more models might make more of a difference. What we learned From the preceding example, you may conclude that, for the neural networks with one hidden unit, as the number of boosting models increases, we see an improvement in accuracy, but after 100 models, this tapers off and is actually slightly less for 200 models. The improvement over the baseline of a single model is substantial for these networks. When we increase the complexity of our learner by having a hidden layer with three hidden neurons, we get a much smaller improvement in performance. At 200 models, both ensembles perform at a similar level, indicating that, at this point, our accuracy is being limited by the type of model trained. If you found this article useful, make sure to check out the book Statistics for Data Science for interesting statistical techniques and their implementation in R.

0
0
15020

article-image-choose-r-data-mining-project

Fatema Patrawala

15 Feb 2018

9 min read

Why choose R for your data mining project

Fatema Patrawala

15 Feb 2018

9 min read

[box type="note" align="" class="" width=""]Our article is an excerpt taken from the book R Data Mining, written by Andrea Cirillo. If you are a budding data scientist or a data analyst with basic knowledge of R, and you want to get into the intricacies of data mining in a practical manner, be sure to check out this book.[/box] In today’s post, we will analyze R's strengths, and understand why it is a savvy idea to learn this programming language for data mining. R's strengths You know that R is really popular, but why? R is not the only data analysis language out there, and neither is it the oldest one; so why is it so popular? If looking at the root causes of R's popularity, we definitely have to mention these three: Open source inside Plugin ready Data visualization friendly Open source inside One of the main reasons the adoption of R is spreading is its open source nature. R binary code is available for everyone to download, modify, and share back again (only in an open source way). Technically, R is released with a GNU general public license, meaning that you can take it and use it for whatever purpose; but you have to share every derivative with a GNU general public license as well. These attributes fit well for almost every target user of a statistical analysis language: Academic user: Knowledge sharing is a must for an academic environment, and having the ability to share work without the worry of copyright and license questions makes R very practical for academic research purposes Business user: Companies are always worried about budget constraints; having professional statistical analysis software at their disposal for free sounds like a dream come true Private user: This user merges together both of the benefits already mentioned, because they will find it great to have a free instrument with which to learn and share their own statistical analyses Plugin ready You could imagine the R language as an expandable board game. You know, games like 7 Wonders or Carcassonne, with a base set of characters and places and further optional places and characters, increasing the choices at your disposal and maximizing the fun. The R language can be compared to this kind of game. There is a base version of R, containing a group of default packages that are delivered along with the standard version of the software (you can skip to the Installing R and writing R code section for more on how to obtain and install it). The functionalities available through the base version are mainly related to file system manipulation, statistical analysis, and data Visualization. While this base version is regularly maintained and updated by the R core team, virtually every R user can add further new functionalities to those available within the package, developing and sharing custom packages. This is basically how the package development and sharing flow works: The R user develops a new package, for example a package introducing a new machine learning algorithm exposed within a freshly published academic paper. The user submits the package to the CRAN repository or a similar repository. The Comprehensive R Archive Network (CRAN) is the official repository for R related documents and packages. Every R user can gain access to the additional features introduced with any given package, installing and loading them into their R environment. If the package has been submitted to CRAN, installing and loading the package will result in running just the two following lines of R code (similar commands are available for alternative repositories such as Bioconductor): install.packages("ggplot2") library(ggplot2) As you can see, this is a really convenient and effective way to expand R functionalities, and you will soon see how wide the range of functionalities added through additional packages developed by R users is. More than 9,000 packages are available on CRAN, and this number is sure to increase further, making more and more additional features available to the R community. Data visualization friendly As a discipline data visualization encompasses all of the principles and techniques employable to effectively display the information and messages contained within a set of data. Since we are living in an information-heavy age, the ability to effectively and concisely communicate articulated and complex messages through data visualization is a core asset for any professional. This is exactly why R is experiencing a great response in academic and professional fields: the data visualization capabilities of R place it at the cutting edge of these fields. R has been noticed for its amazing data visualization features right from its beginning; when some of its peers still showed x axes-built aggregating + signs, R was already able to produce astonishing 3D plots. Nevertheless, a major improvement of R as a data visualization technique came when Auckland's Hadley Wickham developed the highly famous ggplot2 package based on The Grammar of Graphics, introducing into the R world an organic framework for data visualization tasks: This package alone introduced the R community to a highly flexible way of producing and visualizing almost every kind of data visualization, having also been designed as an expandable tool, in order to add the possibility of incorporating new data visualization techniques as soon as they emerge. Finally, ggplot2 gives you the ability to highly customize your plot, adding every kind of graphical or textual annotation to it. Nowadays, R is being used by the biggest tech companies, such as Facebook and Google, and by widely circulated publications such as the Economist and the New York Times to visualize their data and convey their information to their stakeholders and readers. To sum all this up—should you invest your precious time learning R? If you are a professional or a student who could gain advantages from knowing effective and cutting-edge techniques to manipulate, model, and present data, I can only give you a positive opinion: yes. You should definitely learn R, and consider it a long-term investment, since the points of strength we have seen place it in a great position to further expand its influence in the coming years in every industry and academic field. Engaging with the community to learn R Now that we are aware of R’s popularity we need to engage with the community to take advantage of it. We will look at alternative and non-exclusive ways of engaging with the community: Employing community-driven learning material Asking for help from the community Staying ahead of language developments Employing community-driven learning material: There are two main kinds of R learning materials developed by the community: Papers, manuals, and books Online interactive courses Papers, manuals, and books: The first one is for sure the more traditional one, but you shouldn't neglect it, since those kinds of learning materials are always able to give you a more organic and systematic understanding of the topics they treat. You can find a lot of free material online in the form of papers, manuals, and books. Let me point out to you the more useful ones: Advanced R R for Data Science Introduction to Statistical Learning OpenIntro Statistics The R Journal Online interactive courses: This is probably the most common learning material nowadays. You can find different platforms delivering good content on the R language, the most famous of which are probably DataCamp, Udemy, and Packt itself. What all of them share is a practical and interactive approach that lets you learn the topic directly, applying it through exercises rather than passively looking at someone explaining theoretical stuff. Asking for help from the community: As soon as you start writing your first lines of R code, and perhaps before you even actually start writing it, you will come up with some questions related to your work. The best thing you can do when this happens is to resort to the community to solve those questions. You will probably not be the first one to come up with that question, and you should therefore first of all look online for previous answers to your question. Where should you look for answers? You can look everywhere, but most of the time you will find the answer you are looking for on one of the following (listed by the probability of finding the answer there): Stack Overflow R-help mailing list R packages documentation I wouldn't suggest you look for answers on Twitter, G+, and similar networks, since they were not conceived to handle these kinds of processes and you will expose yourself to the peril of reading answers that are out of date, or simply incorrect, because no review system is considered. If it is the case that you are asking an innovative question never previously asked by anyone, first of all, congratulations! That said, in that happy circumstance, you can ask your question in the same places that you previously looked for answers. Staying ahead of language developments: The R language landscape is constantly changing, thanks to the contributions of many enthusiastic users who take it a step further every day. How can you stay ahead of those changes? This is where social networks come in handy. Following the #rstats hashtag on Twitter, Google+ groups, and similar places, will give you the pulse of the language. Moreover, you will find the R-bloggers aggregator, which delivers a daily newsletter comprised of the R-related blog posts that were published the previous day really useful. Finally, annual R conferences and similar occasions constitute a great opportunity to get in touch with the most notorious R experts, gaining from them useful insights and inspiring speeches about the future of the language. To summarize, we looked why to choose R as your programming language for data mining and how we can engage with the R community. If you think this post is useful, you may further check out this book R Data Mining, to leverage data mining techniques across many different industries, including finance, medicine, scientific research, and more.

0
0
6057

article-image-make-efficient-data-driven-decisions

Aaron Lazar

15 Feb 2018

7 min read

How to make efficient data-driven decisions

Aaron Lazar

15 Feb 2018

7 min read

[box type="note" align="" class="" width=""]This article is an extract from the book Predictive Analytics with TensorFlow, authored by Md. Rezaul Karim. The book will help you build, tune, and deploy predictive data models with TensorFlow.[/box] Today we’ll learn to take decisions driven by data with the help of few examples. The growing demand for data is a key challenge. Decision support teams such as institutional research and business intelligence often cannot take the right decisions on how to expand their business and research outcomes from a huge collection of data. Although data plays an important role in driving the decision, however, in reality, taking the right decision at right time is the goal. In other words, the goal is the decision support, not the data support. This can be achieved through an advanced use of data management and analytics. Data value chain for making decisions The following diagram in figure 1 (source: H. Gilbert Miller and Peter Mork, From Data to Decisions: A Value Chain for Big Data, Proc. Of IT Professional, Volume: 15, Issue: 1, Jan.-Feb. 2013, DOI: 10.1109/MITP.2013.11) shows the data chain towards taking actual decisions–that is, the goal. The value chains start through the data discovery stage consisting of several steps such as data collection and annotating data preparation, and then organizing them in a logical order having the desired flow. Then comes the data integration for establishing a common data representation of the data. Since the target is to take the right decision, for future reference having the appropriate provenance of the data–that is, where it comes from, is important: Well, now your data is somehow integrated into a presentable format, it's time for the data exploration stage, which consists of several steps such as analyzing the integrated data and visualization before taking the actions to take on the basis of the interpreted results. However, is this enough before taking the right decision? Probably not! The reason is that it lacks enough analytics, which eventually helps to take the decision with an actionable insight. Predictive analytics comes in here to fill the gap between. Now let's see an example of how in the following section. From disaster to decision – Titanic survival example Here is the challenge, Titanic–Machine Learning from Disaster from Kaggle (https://www.kaggle.com/c/titanic): "The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy" But going into this deeper, we need to know about the data of passengers travelling in the Titanic during the disaster so that we can develop a predictive model that can be used for survival analysis. The dataset can be downloaded from the preceding URL. Table 1 here shows the metadata about the Titanic survival dataset: A snapshot of the dataset can be seen as follows: The ultimate target of using this dataset is to predict what kind of people survived the Titanic disaster. However, a bit of exploratory analysis of the dataset is a mandate. At first, we need to import necessary packages and libraries: import pandas as pd import matplotlib.pyplot as plt import numpy as np Now read the dataset and create a panda's DataFrame: df = pd.read_csv('/home/asif/titanic_data.csv') Before drawing the distribution of the dataset, let's specify the parameters for the graph: fig = plt.figure(figsize=(18,6), dpi=1600) alpha=alpha_scatterplot = 0.2 alpha_bar_chart = 0.55 fig = plt.figure() ax = fig.add_subplot(111) Draw a bar diagram for showing who survived versus who did not: ax1 = plt.subplot2grid((2,3),(0,0)) ax1.set_xlim(-1, 2) df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart) plt.title("Survival distribution: 1 = survived") Plot a graph showing survival by Age: plt.subplot2grid((2,3),(0,1)) plt.scatter(df.Survived, df.Age, alpha=alpha_scatterplot) plt.ylabel("Age") plt.grid(b=True, which='major', axis='y') plt.title("Survival by Age: 1 = survived") Plot a graph showing distribution of the passengers classes: ax3 = plt.subplot2grid((2,3),(0,2)) df.Pclass.value_counts().plot(kind="barh", alpha=alpha_bar_chart) ax3.set_ylim(-1, len(df.Pclass.value_counts())) plt.title("Class dist. of the passengers") Plot a kernel density estimate of the subset of the 1st class passengers' age: plt.subplot2grid((2,3),(1,0), colspan=2) df.Age[df.Pclass == 1].plot(kind='kde') df.Age[df.Pclass == 2].plot(kind='kde') df.Age[df.Pclass == 3].plot(kind='kde') plt.xlabel("Age") plt.title("Age dist. within class") plt.legend(('1st Class', '2nd Class','3rd Class'),loc='best') Plot a graph showing passengers per boarding location: ax5 = plt.subplot2grid((2,3),(1,2)) df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart) ax5.set_xlim(-1, len(df.Embarked.value_counts())) plt.title("Passengers per boarding location") Finally, we show all the subplots together: plt.show() >>> The figure shows the survival distribution, survival by age, age distribution, and the passengers per boarding location: However, to execute the preceding code, you need to install several packages such as matplotlib, pandas, and scipy. They are listed as follows: Installing pandas: Pandas is a Python package for data manipulation. It can be installed as follows: $ sudo pip3 install pandas #For Python 2.7, use the following: $ sudo pip install pandas Installing matplotlib: In the preceding code, matplotlib is a plotting library for mathematical objects. It can be installed as follows: $ sudo apt-get install python-matplotlib # for Python 2.7 $ sudo apt-get install python3-matplotlib # for Python 3.x Installing scipy: Scipy is a Python package for scientific computing. Installing blas and lapack and gfortran are a prerequisite for this one. Now just execute the following command on your terminal: $ sudo apt-get install libblas-dev liblapack-dev $ sudo apt-get install gfortran $ sudo pip3 install scipy # for Python 3.x $ sudo pip install scipy # for Python 2.7 For Mac, use the following command to install the above modules: $ sudo easy_install pip $ sudo pip install matplotlib $ sudo pip install libblas-dev liblapack-dev $ sudo pip install gfortran $ sudo pip install scipy For windows, I am assuming that Python 2.7 is already installed at C:Python27. Then open the command prompt and type the following command: C:Usersadmin-karim>cd C:/Python27 C:Python27> python -m pip install <package_name> # provide package name accordingly. For Python3, issue the following commands: C:Usersadmin-karim>cd C:Usersadmin-karimAppDataLocalPrograms PythonPython35Scripts C:Usersadmin-karimAppDataLocalProgramsPythonPython35 Scripts>python3 -m pip install <package_name> Well, we have seen the data. Now it's your turn to do some analytics on top of the data. Say predicting what kinds of people survived from that disaster. Don't you agree that we have enough information about the passengers, but how could we do the predictive modeling so that we can draw some fairly straightforward conclusions from this data? For example, say being a woman, being in 1st class, and being a child were all factors that could boost passenger chances of survival during this disaster. In a brute-force approach–for example, using if/else statements with some sort of weighted scoring system, you could write a program to predict whether a given passenger would survive the disaster. However, does writing such a program in Python make much sense? Naturally, it would be very tedious to write, difficult to generalize, and would require extensive fine tuning for each variable and samples (that is, passenger). This is where predictive analytics with machine learning algorithms and emerging tools comes in so that you could build a program that learns from sample data to predict whether a given passenger would survive. If you found this post useful and would like to explore more, head over to grab the book, Predictive Analytics with TensorFlow written by Md. Rezaul Karim.

0
0
2129

article-image-build-generative-chatbot-using-recurrent-neural-networks-lstm-rnns

Savia Lobo

15 Feb 2018

8 min read

Build a generative chatbot using recurrent neural networks (LSTM RNNs)

Savia Lobo

15 Feb 2018

8 min read

In today’s tutorial we will learn to build generative chatbot using recurrent neural networks. The RNN used here is Long Short Term Memory(LSTM). Generative chatbots are very difficult to build and operate. Even today, most workable chatbots are retrieving in nature; they retrieve the best response for the given question based on semantic similarity, intent, and so on. For further reading, refer to the paper Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation by Kyunghyun Cho et. al. (https://arxiv.org/pdf/1406.1078.pdf). [box type="note" align="" class="" width=""]This article is an excerpt from a book written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled Natural Language Processing with Python Cookbook. In this book you will come across various recipes covering natural language understanding, Natural Language Processing, and syntactic analysis.[/box] Getting ready... The A.L.I.C.E Artificial Intelligence Foundation dataset bot.aiml Artificial Intelligence Markup Language (AIML), which is customized syntax such as XML file has been used to train the model. In this file, questions and answers are mapped. For each question, there is a particular answer. Complete .aiml files are available at aiml-en-us-foundation-alice.v1-9 from https://code.google.com/archive/p/aiml-en-us-foundation-alice/downloads. Unzip the folder to see the bot.aiml file and open it using Notepad. Save as bot.txt to read in Python: >>> import os """ First change the following directory link to where all input files do exist """ >>> os.chdir("C:UsersprataDocumentsbook_codesNLP_DL") >>> import numpy as np >>> import pandas as pd # File reading >>> with open('bot.txt', 'r') as content_file: ... botdata = content_file.read() >>> Questions = [] >>> Answers = [] AIML files have unique syntax, similar to XML. The pattern word is used to represent the question and the template word for the answer. Hence, we are extracting respectively: >>> for line in botdata.split("</pattern>"): ... if "<pattern>" in line: ... Quesn = line[line.find("<pattern>")+len("<pattern>"):] ... Questions.append(Quesn.lower()) >>> for line in botdata.split("</template>"): ... if "<template>" in line: ... Ans = line[line.find("<template>")+len("<template>"):] ... Ans = Ans.lower() ... Answers.append(Ans.lower()) >>> QnAdata = pd.DataFrame(np.column_stack([Questions,Answers]),columns = ["Questions","Answers"]) >>> QnAdata["QnAcomb"] = QnAdata["Questions"]+" "+QnAdata["Answers"] >>> print(QnAdata.head()) The question and answers are joined to extract the total vocabulary used in the modeling, as we need to convert all words/characters into numeric representation. The reason is the same as mentioned before—deep learning models can't read English and everything is in numbers for the model. How to do it... After extracting the question-and-answer pairs, the following steps are needed to process the data and produce the results: Preprocessing: Convert the question-and-answer pairs into vectorized format, which will be utilized in model training. Model building and validation: Develop deep learning models and validate the data. Prediction of answers from trained model: The trained model will be used to predict answers for given questions. How it works... The question and answers are utilized to create the vocabulary of words to index mapping, which will be utilized for converting words into vector mappings: # Creating Vocabulary >>> import nltk >>> import collections >>> counter = collections.Counter() >>> for i in range(len(QnAdata)): ... for word in nltk.word_tokenize(QnAdata.iloc[i][2]): ... counter[word]+=1 >>> word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())} >>> idx2word = {v:k for k,v in word2idx.items()} >>> idx2word[0] = "PAD" >>> vocab_size = len(word2idx)+1 >>> print (vocab_size) Encoding and decoding functions are used to convert text to indices and indices to text respectively. As we know, Deep learning models work on numeric values rather than text or character data: >>> def encode(sentence, maxlen,vocab_size): ... indices = np.zeros((maxlen, vocab_size)) ... for i, w in enumerate(nltk.word_tokenize(sentence)): ... if i == maxlen: break ... indices[i, word2idx[w]] = 1 ... return indices >>> def decode(indices, calc_argmax=True): ... if calc_argmax: ... indices = np.argmax(indices, axis=-1) ... return ' '.join(idx2word[x] for x in indices) The following code is used to vectorize the question and answers with the given maximum length for both questions and answers. Both might be different lengths. In some pieces of data, the question length is greater than answer length, and in a few cases, it's length is less than answer length. Ideally, the question length is good to catch the right answers. Unfortunately in this case, question length is much less than the answer length, which is a very bad example to develop generative models: >>> question_maxlen = 10 >>> answer_maxlen = 20 >>> def create_questions(question_maxlen,vocab_size): ... question_idx = np.zeros(shape=(len(Questions),question_maxlen, vocab_size)) ... for q in range(len(Questions)): ... question = encode(Questions[q],question_maxlen,vocab_size) ... question_idx[q] = question ... return question_idx >>> quesns_train = create_questions(question_maxlen=question_maxlen, vocab_size=vocab_size) >>> def create_answers(answer_maxlen,vocab_size): ... answer_idx = np.zeros(shape=(len(Answers),answer_maxlen, vocab_size)) ... for q in range(len(Answers)): ... answer = encode(Answers[q],answer_maxlen,vocab_size) ... answer_idx[q] = answer ... return answer_idx >>> answs_train = create_answers(answer_maxlen=answer_maxlen,vocab_size= vocab_size) >>> from keras.layers import Input,Dense,Dropout,Activation >>> from keras.models import Model >>> from keras.layers.recurrent import LSTM >>> from keras.layers.wrappers import Bidirectional >>> from keras.layers import RepeatVector, TimeDistributed, ActivityRegularization The following code is an important part of the chatbot. Here we have used recurrent networks, repeat vector, and time-distributed networks. The repeat vector used to match dimensions of input to output values. Whereas time-distributed networks are used to change the column vector to the output dimension's vocabulary size: >>> n_hidden = 128 >>> question_layer = Input(shape=(question_maxlen,vocab_size)) >>> encoder_rnn = LSTM(n_hidden,dropout=0.2,recurrent_dropout=0.2) (question_layer) >>> repeat_encode = RepeatVector(answer_maxlen)(encoder_rnn) >>> dense_layer = TimeDistributed(Dense(vocab_size))(repeat_encode) >>> regularized_layer = ActivityRegularization(l2=1)(dense_layer) >>> softmax_layer = Activation('softmax')(regularized_layer) >>> model = Model([question_layer],[softmax_layer]) >>> model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) >>> print (model.summary()) The following model summary describes the change in flow of model size across the model. The input layer matches the question's dimension and the output matches the answer's dimension: # Model Training >>> quesns_train_2 = quesns_train.astype('float32') >>> answs_train_2 = answs_train.astype('float32') >>> model.fit(quesns_train_2, answs_train_2,batch_size=32,epochs=30, validation_split=0.05) The results are a bit tricky in the following screenshot even though the accuracy is significantly higher. The chatbot model might produce complete nonsense, as most of the words are padding here. The reason? The number of words in this data is less: # Model prediction >>> ans_pred = model.predict(quesns_train_2[0:3]) >>> print (decode(ans_pred[0])) >>> print (decode(ans_pred[1])) The following screenshot depicts the sample output on test data. The output does not seem to make sense, which is an issue with generative models: Our model did not work well in this case, but still some areas of improvement are possible going forward with generative chatbot models. Readers can give it a try: Have a dataset with lengthy questions and answers to catch signals well Create a larger architecture of deep learning models and train over longer iterations Make question-and-answer pairs more generic rather than factoid-based, such as retrieving knowledge and so on, where generative models fail miserably. Here, you saw how to build chatbots using LSTM. You can go ahead and try building one of your own generative chatbots using the example above. If you found this post useful, do check out this book Natural Language Processing with Python Cookbook to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more.

0
4
58183

article-image-deploy-rethinkdb-using-docker

Vijin Boricha

14 Feb 2018

7 min read

How to deploy RethinkDB using Docker

Vijin Boricha

14 Feb 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will help you develop efficient and real-time applications in RethinkDB with ease.[/box] In today’s tutorial, we will learn to install Docker, create an Docker image and deploy RethinkDB using Docker. Your code is not working in Production? But it's working on the QA (quality analysis server)! I am sure you have heard statements like these in your team during the deployment phase. Well no more of that, Docker everything and forget about the infrastructure of different environments, say, QA, Staging and Production, because your code is going to run Docker container not in those machines, hence write once, run everywhere. In this section, we will learn how to use Docker to deploy a RethinkDB Server or PaaS services. I am going to cover a few docker basics too; if you are already aware of them, please skip to the next section. Installing Docker Docker is available for all major platforms, such as, Linux-based distributions, Mac, and Windows. Visit the official website at h t t p s ://w w w . d o c k e r . c o m / and download the package suitable for your platform. We are installing Docker in our machine to create a new Docker image. Docker images are independent of platform and should not be confused with Docker for Mac or Docker for Windows. It's referred to as a Docker client too. Once you have installed the Docker, you need to start the Daemon process first; I am using a Mac so I can view this in the launchpad, as shown here: Upon clicking that, it will open up a nice console showing the Docker official logo and an indication that Docker is successfully booted, as shown in the following screenshot: Now we can begin creating our Docker image that in turn will run RethinkDB. Creating a Docker image For installing our RethinkDB on Ubuntu inside the Docker, we need to install the Ubuntu operating system. Run the following command to install a Ubuntu image from the official Docker hub repository: docker pull ubuntu This will download and install the Ubuntu image in our system. We will later use this Ubuntu image and install our RethinkDB instance; you can choose different operating systems as well. Before going to the Docker configuration code, I would like to point out the steps we require to install RethinkDB on a fresh Ubuntu installation: Update the system Add the RethinkDB repository to the known repository list Install RethinkDB Set the data folder Expose the port We are going to do this using Docker. To create a Docker image, we require Dockerfile. Create a file called Dockerfile with no extension and apply the code shown here: FROM ubuntu:latest # Install RethinkDB. RUN apt-get update && echo "deb http://download.rethinkdb.com/apt `lsb_release -cs` main" > /etc/apt/sources.list.d/rethinkdb.list && apt-get install -y wget && wget -O- http://download.rethinkdb.com/apt/pubkey.gpg | apt-key add - && apt-get update && apt-get install -y rethinkdb python-pip && rm -rf /var/lib/apt/lists/* # Install python driver for rethinkdb RUN pip install rethinkdb # Define mountable directories. VOLUME ["/data"] # Define working directory. WORKDIR /data # Define default command. CMD ["rethinkdb", "--bind", "all"] # Expose ports. # - 8080: web UI # - 28015: process # - 29015: cluster EXPOSE 8080 EXPOSE 28015 EXPOSE 29015 The first line is our entry point to the Ubuntu operating system, then we are performing an update of the system and using the installation commands recommended by RethinkDB here: h t t p s ://w w w . r e t h i n k d b . c o m /d o c s /i n s t a l l /u b u n t u /. Once the installation is complete, we install the rethinkdb python driver to perform the import/export operation. The next two commands mount a new volume in Ubuntu and telling RethinkDB to use that volume. The next command runs rethinkdb by binding all the ports and exposing the ports to be used by the client driver and web console. In order to make this a docker image, save the file and run the following command within the project directory: docker build -t docker-rethinkdb. Here, we are building our docker image and giving it a name docker-rethinkdb; upon running this command, Docker will execute the Dockerfile and you're on. The representation of the previous steps is shown here: Once everything works, and I am sure it will, you will see a success message in the console, as shown here: Congratulations! You have successfully created a docker image for RethinkDB. If you want to see your image and its properties, run the following command: docker images And this will list all the images of Docker, as shown in the following screenshot: Awesome! Now let's run it. To access the web portal, we need to run our docker image and bind port 8080 of the docker image to some port of our machine; here is the command to do so: docker run -p 3000:8080 -d docker-rethinkdb As per the command above, -p is used to specify port binding, the first is the target and second port is source, that is, Docker port and -d is used to run it in the background or Daemon. This will run the docker image in the background; to extract more information about this process, we need to run the following command: docker ps This will list all the running images called as a container, along with the information, as shown in the following screenshot: You can also check the logs of specific containers using the following command: docker logs <container id> Now, in order to access the RethinkDB web console from our machine, we need to find out the IP address on which the Docker machine is running. To get that, we need to run the following command: docker-machine ip default This will print out the IP. Copy the IP and hit IP:3000 from the browser to view the RethinkDB web console, as shown here: So we have docker running and accessible from the browser. In order to import and export the data, we need to log in to our Docker image. To do that, run the following command: docker exec -i -t <container-id> /bin/bash This will log in to the docker image running Ubuntu; refer to the following screenshot: You can now run the rethinkdb command to perform the data import to the existing RethinkDB cluster. Deploying the Docker image Almost every PaaS service we have covered in earlier sections provides support for Docker. You can submit your Dockerfile to git and clone it anywhere if you want to create Docker image. You can submit the whole docker image (not Dockerfile) to Dockerhub and pull your docker image directly using the docker pull command, which is no doubt an easy way because you will be directly working on the image running on the server. We covered RethinkDB deployment using Docker and learned how to create our own RethinkDB image. You can learn more about RethinkDB Query Language and Performance Tuning in RethinkDB from this book Mastering RethinkDB.

0
0
57987

How-To Tutorials - Data

CRUD (Create Read, Update and Delete) Operations with Elasticsearch

How to perform predictive forecasting in SAP Analytics Cloud

How to implement In-Memory OLTP on SQL Server in Linux

How to use LabVIEW for data acquisition

4 ways to implement feature selection in Python for machine learning

How to install Elasticsearch in Ubuntu and Windows

How to build and enable the Jenkins Mesos plugin

Manipulating text data using Python Regular Expressions (regex)

How to auto-generate texts from Shakespeare writing using deep recurrent neural networks

6 Popular Regression Techniques you must know

Trending Topics

How to use R to boost your Data Model

Why choose R for your data mining project

How to make efficient data-driven decisions

Build a generative chatbot using recurrent neural networks (LSTM RNNs)

How to deploy RethinkDB using Docker

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access