
How-To Tutorials - Data

1210 Articles

How to install Elasticsearch in Ubuntu and Windows

Fatema Patrawala
16 Feb 2018
3 min read
[box type="note" align="" class="" width=""]This article is an extract from the book, Mastering Elastic Stack  co-authored by Ravi Kumar Gupta and Yuvraj Gupta.This book will brush you up with basic knowledge on implementing the Elastic Stack and then dives deep into complex and advanced implementations. [/box] In today’s tutorial we aim to learn Elasticsearch v5.1.1 installation for Ubuntu and Windows. Installation of Elasticsearch on Ubuntu 14.04 In order to install Elasticsearch on Ubuntu, refer to the following steps: Download Elasticsearch 5.1.1 as a debian package using terminal: wget https://artifacts.elastic.co /downloads/elasticsearch/elasticsearch-5.1.1.deb 2. Install the debian package using following command: sudo dpkg -i elasticsearch-5.1.1.deb Elasticsearch will be installed in /usr/share/elasticsearch directory. The configuration files will be present at /etc/elasticsearch. The init script will be present at /etc/init.d/elasticsearch. The log files will be present within /var/log/elasticsearch directory. 3. Configure Elasticsearch to run automatically on bootup . If you are using SysV init distribution, then run the following command: sudo update-rc.d elasticsearch defaults 95 10 The preceding command will print on screen: Adding system startup for, /etc/init.d/elasticsearch Check status of Elasticsearch using following command: sudo service elasticsearch status Run Elasticsearch as a service using following command: sudo service elasticsearch start Elasticsearch may not start if you have any plugin installed which is not supported in ES-5.0.x version onwards. As plugins have been deprecated, it is required to uninstall any plugin if exists in prior version of ES. Remove a plugin after going to ES Home using following command: bin/elasticsearch-plugin remove head Usage of Elasticsearch command: sudo service elasticsearch {start|stop|restart|force- reload|status} If you are using systemd distribution, then run following command: sudo /bin/systemctl daemon-reload sudo /bin/systemctl enable elasticsearch.service To verify elasticsearch installation open open http://localhost:9200 in browser or run the following command from command line: curl -X GET http://localhost:9200 Installation of Elasticsearch on Windows In order to install Elasticsearch on Windows, refer to the following steps: Download Elasticsearch 5.1.1 version from its site using the following link: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch -5.1.1.zip Upon opening the link, click on it and it will download the ZIP package. 2. Extract the downloaded ZIP package by unzipping it using WinRAR, 7-Zip, and other such extracting softwares (if you don't have one of these then download it). This will extract the files and folders in the directory. 3. Then click on the extracted folder and navigate the folder to reach inside the bin folder. 4. Click on the elasticsearch.bat file to run Elasticsearch. If this window is closed Elasticsearch will stop running, as the node will shut down. 5. To verify Elasticsearch installation, open http://localhost:9200 in the browser: Installation of Elasticsearch as a service After installing Elasticsearch as previously mentioned, open Command Prompt after navigating to the bin folder and use the following command: elasticsearch-service.bat install Usage: elasticsearch-service.bat install | remove | start | stop | manager To summarize, we learnt installation of Elasticsearch on Ubuntu and Windows. 
If you are keen to know more about how to work with the Elastic Stack in a production environment, you can grab our comprehensive guide Mastering Elastic Stack.  


Manipulating text data using Python Regular Expressions (regex)

Sugandha Lahoti
16 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Allan Visochek titled Practical Data Wrangling. This book covers practical data wrangling techniques in Python and R to turn your noisy data into relevant, insight-ready information.[/box] In today’s tutorial, we will learn how to manipulate text data using regular expressions in Python. What is a Regular expression A regular expression, or regex for short, is simply a sequence of characters that specifies a certain search pattern. Regular expressions have been around for quite a while and are a field of computer science in and of themselves. In Python, regular expression operations are handled using Python's built in re module. In this section, I will walk through the basics of creating regular expressions and using them to  You can implement a regular expression with the following steps: Specify a pattern string. Compile the pattern string to a regular expression object. Use the regular expression object to search a string for the pattern. Optional: Extract the matched pattern from the string. Writing and using a regular expression The first step to creating a regular expression in Python is to import the re module: import re Python regular expressions are expressed using pattern strings, which are strings that specify the desired search pattern. In its simplest form, a pattern string can consist only of letters, numbers, and spaces. The following pattern string expresses a search query for an exact sequence of characters. You can think of each character as an individual pattern. In later examples, I will discuss more sophisticated patterns: import re pattern_string = "this is the pattern" The next step is to process the pattern string into an object that Python can use in order to search for the pattern. This is done using the compile() method of the re module. The compile() method takes the pattern string as an argument and returns a regex object: import re pattern_string = "this is the pattern" regex = re.compile(pattern_string) Once you have a regex object, you can use it to search within a search string for the pattern specified in the pattern string. A search string is just the name for the string in which you are looking for a pattern. To search for the pattern, you can use the search() method of the regex object as follows: import re pattern_string = "this is the pattern" regex = re.compile(pattern_string) match = regex.search("this is the pattern") If the pattern specified in the pattern string is in the search string, the search() method will return a match object. Otherwise, it returns the None data type, which is an empty value. Since Python interprets True and False values rather loosely, the result of the search function can be used like a Boolean value in an if statement, which can be rather convenient: .... match = regex.search("this is the pattern") if match: print("this was a match!") The search string this is the pattern should produce a match, because it matches exactly the pattern specified in the pattern string. The search function will produce a match if the pattern is found at any point in the search string as the following demonstrates: .... 
....
match = regex.search("this is the pattern")
if match:
    print("this was a match!")

if regex.search("*** this is the pattern ***"):
    print("this was a match!")

if not regex.search("this is not the pattern"):
    print("this was not a match!")

Special Characters

Regular expressions depend on the use of certain special characters in order to express patterns. Due to this, the following characters should not be used directly unless they are used for their intended purpose:

. ^ $ * + ? {} () [] |

If you do need to use any of the previously mentioned characters in a pattern string to search for that character, you can write the character preceded by a backslash character. This is called escaping characters. Here's an example:

pattern_string = "c\*b"  ## matches "c*b"

If you need to search for the backslash character itself, you use two backslash characters, as follows:

pattern_string = "c\\b"  ## matches "c\b"

Matching whitespace

Using \s at any point in the pattern string matches a whitespace character. This is more general than the space character, as it applies to tabs and newline characters:

....
a_space_b = re.compile("a\sb")

if a_space_b.search("a b"):
    print("'a b' is a match!")

if a_space_b.search("1234 a b 1234"):
    print("'1234 a b 1234' is a match")

if a_space_b.search("ab"):
    print("'ab' is a match")

Matching the start of a string

If the ^ character is used at the beginning of the pattern string, the regular expression will only produce a match if the pattern is found at the beginning of the search string:

....
a_at_start = re.compile("^a")

if a_at_start.search("a"):
    print("'a' is a match")

if a_at_start.search("a 1234"):
    print("'a 1234' is a match")

if a_at_start.search("1234 a"):
    print("'1234 a' is a match")

Matching the end of a string

Similarly, if the $ symbol is used at the end of the pattern string, the regular expression will only produce a match if the pattern appears at the end of the search string:

....
a_at_end = re.compile("a$")

if a_at_end.search("a"):
    print("'a' is a match")

if a_at_end.search("a 1234"):
    print("'a 1234' is a match")

if a_at_end.search("1234 a"):
    print("'1234 a' is a match")

Matching a range of characters

It is possible to match a range of characters instead of just one. This can add some flexibility to the pattern:

[A-Z] matches all capital letters
[a-z] matches all lowercase letters
[0-9] matches all digits

....
lower_case_letter = re.compile("[a-z]")

if lower_case_letter.search("a"):
    print("'a' is a match")

if lower_case_letter.search("B"):
    print("'B' is a match")

if lower_case_letter.search("123 A B 2"):
    print("'123 A B 2' is a match")

digit = re.compile("[0-9]")

if digit.search("1"):
    print("'1' is a match")

if digit.search("342"):
    print("'342' is a match")

if digit.search("asdf abcd"):
    print("'asdf abcd' is a match")

Matching any one of several patterns

If there is a fixed number of patterns that would constitute a match, they can be combined using the following syntax:

(<pattern1>|<pattern2>|<pattern3>)

The following a_or_b regular expression will match any string where there is either an a character or a b character:

....
a_or_b = re.compile("(a|b)")

if a_or_b.search("a"):
    print("'a' is a match")

if a_or_b.search("b"):
    print("'b' is a match")

if a_or_b.search("c"):
    print("'c' is a match")

Matching a sequence instead of just one character

If the + character comes after another character or pattern, the regular expression will match an arbitrarily long sequence of that pattern.
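For instance (a small example that is not in the original excerpt), applying + to a character class matches a run of one or more such characters:

```python
# [0-9]+ matches one or more consecutive digits, however long the run is.
import re

digits = re.compile("[0-9]+")
print(digits.search("7").group())             # prints "7"
print(digits.search("abc 2018 xyz").group())  # prints "2018"
```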
This quantifier is quite useful, because it makes it easy to express something like a word or number that can be of arbitrary length.

Putting patterns together

More sophisticated patterns can be produced by combining pattern strings one after the other. In the following example, I've created a regular expression that searches for a number strictly followed by a word. The pattern string that generates the regular expression is composed of the following:

A pattern string that matches a sequence of digits: [0-9]+
A pattern string that matches a whitespace character: \s
A pattern string that matches a sequence of letters: [a-z]+
A pattern string that matches either the end of the string or a whitespace character: (\s|$)

....
number_then_word = re.compile("[0-9]+\s[a-z]+(\s|$)")

The regex split() function

Regex objects in Python also have a split() method. The split method splits the search string into an array of substrings. The splits occur at each location along the string where the pattern is identified. The result is an array of strings that occur between instances of the pattern. If the pattern occurs at the beginning or end of the search string, an empty string is included at the beginning or end of the resulting array, respectively:

....
print(a_or_b.split("123a456b789"))
print(a_or_b.split("a1b"))

If you are interested, the Python documentation has more complete coverage of regular expressions. It can be found at https://docs.python.org/3.6/library/re.html.

We saw various ways of using regular expressions in Python. To know more about data wrangling techniques using simple and real-world datasets, you may check out this book Practical Data Wrangling.


How to build and enable the Jenkins Mesos plugin

Vijin Boricha
16 Feb 2018
4 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by David Blomquist and Tomasz Janiszewski titled Apache Mesos Cookbook. From this book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box] In today’s tutorial, we will learn about building and enabling the Jenkins Mesos plugin. Building the Jenkins Mesos plugin By default, Jenkins uses statically created agents and runs jobs on them. We can extend this behavior with a plugin that will make Jenkins use Mesos as a resource manager. Jenkins will register as a Mesos framework and accept offers when it needs to run a job. How to do it The Jenkins Mesos plugin installation is a little bit harder than Marathon. There are no official binary packages for it, so it must be installed from sources:  First of all, we need to download the source code: curl -L https://github.com/jenkinsci/mesos-plugin/archive/mesos-0.14.0.tar. gz | tar -zx cd jenkinsci-mesos-plugin-*  The plugin is written in Java and to build it we need Maven (mvn): sudo apt install maven  Finally, build the package: mvn package If everything goes smoothly, you should see information, that all tests passed and the plugin package will be placed in target/mesos.hpi. Jenkins is written in Java and presents an API for creating plugins. Plugins do not have to be written in Java, but must be compatible with those interfaces so most plugins are written in Java. The natural choice for building a Java application is Maven, although Gradle is getting more and more popular. The Jenkins Mesos plugin uses the Mesos native library to communicate with Mesos. This communication is now deprecated, so the plugin does not support all Mesos features that are available with the Mesos HTTP API. Enabling the Jenkins Mesos plugin Here  you will learn how to enable the Mesos Jenkins plugin and configure a job to be run on Mesos. How to do it... The first step is to install the Mesos Jenkins plugin. To do so, navigate to the Plugin Manager by clicking Manage Jenkins | Manage Plugins, and select the Advanced tab. You should see the following screen: Click Choose  file and select the previously built plugin to upload it. Once the plugin is installed, you have to configure it. To do so, go to the configuration (Manage Jenkins | Configure  System).  At the bottom of the page, the cloud section  should appear. Fill in all the fields with  the desired configuration values: This was the last step of the plugin installation. If you now disable Advanced On- demand framework registration, you should see the Jenkins Scheduler registered in the Mesos frameworks. Remember to configure Slave  username to the existing system user on Mesos agents. It will be used to run your jobs. By default, it will be jenkins. You can create it on slaves with the following command: adduser jenkins Be careful when providing an IP or hostnames for Mesos and Jenkins. It must match the IP used later by the scheduler for communication. By default, the Mesos native library binds to the interface that the hostname resolves to. This could lead to problems in communication, especially when receiving messages from Mesos. If you see your Jenkins is connected, but jobs are stuck and agents do not start, check if Jenkins is registered with the proper IP. 
You can set the IP used by Jenkins by adding the following line to /etc/default/jenkins (in this example, we assume Jenkins should bind to 10.10.10.10):

LIBPROCESS_IP=10.10.10.10

We learnt about building and enabling the Jenkins Mesos plugin. You can learn more about how to configure and maintain Apache Mesos from Apache Mesos Cookbook.


How to auto-generate texts from Shakespeare writing using deep recurrent neural networks

Savia Lobo
16 Feb 2018
6 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from a book co-authored by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled as Natural Language Processing with Python Cookbook. This book will give unique recipes to know various aspects of performing Natural Language Processing with NLTK—a leading Python platform for NLP.[/box] Today we will learn to use deep recurrent neural networks (RNN) to predict the next character based on the given length of a sentence. This way of training a model is able to generate automated text continuously, which can imitate the writing style of the original writer with enough training on the number of epochs and so on. Getting ready... The Project Gutenberg eBook of the complete works of William Shakespeare's dataset is used to train the network for automated text generation. Data can be downloaded from http:// www.gutenberg.org/ for the raw file used for training: >>> from  future import print_function >>> import numpy as np >>> import random >>> import sys The following code is used to create a dictionary of characters to indices and vice-versa mapping, which we will be using to convert text into indices at later stages. This is because deep learning models cannot understand English and everything needs to be mapped into indices to train these models: >>> path = 'C:UsersprataDocumentsbook_codes NLP_DL shakespeare_final.txt' >>> text = open(path).read().lower() >>> characters = sorted(list(set(text))) >>> print('corpus length:', len(text)) >>> print('total chars:', len(characters)) >>> char2indices = dict((c, i) for i, c in enumerate(characters)) >>> indices2char = dict((i, c) for i, c in enumerate(characters)) How to do it… Before training the model, various preprocessing steps are involved to make it work. The following are the major steps involved: Preprocessing: Prepare X and Y data from the given entire story text file and converting them into indices vectorized format. Deep learning model training and validation: Train and validate the deep learning model. Text generation: Generate the text with the trained model. How it works... The following lines of code describe the entire modeling process of generating text from Shakespeare's writings. Here we have chosen character length. This needs to be considered as 40 to determine the next best single character, which seems to be very fair to consider. Also, this extraction process jumps by three steps to avoid any overlapping between two consecutive extractions, to create a dataset more fairly: # cut the text in semi-redundant sequences of maxlen characters >>> maxlen = 40 >>> step = 3 >>> sentences = [] >>> next_chars = [] >>> for i in range(0, len(text) - maxlen, step): ... sentences.append(text[i: i + maxlen]) ... next_chars.append(text[i + maxlen]) ... print('nb sequences:', len(sentences)) The following screenshot depicts the total number of sentences considered, 193798, which is enough data for text generation: The next code block is used to convert the data into a vectorized format for feeding into deep learning models, as the models cannot understand anything about text, words, sentences and so on. Initially, total dimensions are created with all zeros in the NumPy array and filled with relevant places with dictionary mappings: # Converting indices into vectorized format >>> X = np.zeros((len(sentences), maxlen, len(characters)), dtype=np.bool) >>> y = np.zeros((len(sentences), len(characters)), dtype=np.bool) >>> for i, sentence in enumerate(sentences): ... 
...     for t, char in enumerate(sentence):
...         X[i, t, char2indices[char]] = 1
...     y[i, char2indices[next_chars[i]]] = 1

>>> from keras.models import Sequential
>>> from keras.layers import Dense, LSTM, Activation, Dropout
>>> from keras.optimizers import RMSprop

The deep learning model is created with an RNN, more specifically a Long Short-Term Memory network with 128 hidden neurons, and the output layer has the dimension of the characters: the number of columns in the output array is the number of characters. Finally, the softmax function is used with the RMSprop optimizer. We encourage readers to try other parameters to check how the results vary:

# Model Building
>>> model = Sequential()
>>> model.add(LSTM(128, input_shape=(maxlen, len(characters))))
>>> model.add(Dense(len(characters)))
>>> model.add(Activation('softmax'))
>>> model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01))
>>> print (model.summary())

As mentioned earlier, deep learning models train on number indices to map input to output (given a length of 40 characters, the model will predict the next best character). The following code is used to convert the predicted indices back to the relevant character by determining the maximum index of the character:

# Function to convert prediction into index
>>> def pred_indices(preds, metric=1.0):
...     preds = np.asarray(preds).astype('float64')
...     preds = np.log(preds) / metric
...     exp_preds = np.exp(preds)
...     preds = exp_preds/np.sum(exp_preds)
...     probs = np.random.multinomial(1, preds, 1)
...     return np.argmax(probs)

The model will be trained over 30 iterations with a batch size of 128. The diversity is also varied to see its impact on the predictions:

# Train and Evaluate the Model
>>> for iteration in range(1, 30):
...     print('-' * 40)
...     print('Iteration', iteration)
...     model.fit(X, y, batch_size=128, epochs=1)
...     start_index = random.randint(0, len(text) - maxlen - 1)
...     for diversity in [0.2, 0.7, 1.2]:
...         print('\n----- diversity:', diversity)
...         generated = ''
...         sentence = text[start_index: start_index + maxlen]
...         generated += sentence
...         print('----- Generating with seed: "' + sentence + '"')
...         sys.stdout.write(generated)
...         for i in range(400):
...             x = np.zeros((1, maxlen, len(characters)))
...             for t, char in enumerate(sentence):
...                 x[0, t, char2indices[char]] = 1.
...             preds = model.predict(x, verbose=0)[0]
...             next_index = pred_indices(preds, diversity)
...             pred_char = indices2char[next_index]
...             generated += pred_char
...             sentence = sentence[1:] + pred_char
...             sys.stdout.write(pred_char)
...             sys.stdout.flush()
...         print("\nOne combination completed\n")

Comparing the results of the first iteration (Iteration 1) with the final iteration (Iteration 29), it is apparent that with enough training the text generation becomes much better than at Iteration 1.

Though the text generation seems to be magical, we have generated text using Shakespeare's writings, proving that with the right training and handling, we can imitate any style of writing of a particular writer. If you found this post useful, you may check out this book Natural Language Processing with Python Cookbook to analyze sentence structure and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and other NLP techniques.
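As an aside on the diversity values [0.2, 0.7, 1.2] used in the loop above: diversity acts as a sampling temperature. The standalone sketch below (not from the book) applies the same arithmetic as pred_indices() to a small probability vector to show how the value reshapes the distribution:

```python
# Rescale a probability vector with a temperature, as pred_indices() does.
import numpy as np

def rescale(preds, temperature):
    preds = np.asarray(preds).astype("float64")
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    return exp_preds / np.sum(exp_preds)

probs = [0.6, 0.3, 0.1]
print(rescale(probs, 0.2))  # sharpened: heavily favours the likeliest character
print(rescale(probs, 1.2))  # flattened: closer to the original probabilities
```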


4 ways to implement feature selection in Python for machine learning

Sugandha Lahoti
16 Feb 2018
13 min read
[box type="note" align="" class="" width=""]This article is an excerpt from Ensemble Machine Learning. This book serves as a beginner's guide to combining powerful machine learning algorithms to build optimized models.[/box] In this article, we will look at different methods to select features from the dataset; and discuss types of feature selection algorithms with their implementation in Python using the Scikit-learn (sklearn) library: Univariate selection Recursive Feature Elimination (RFE) Principle Component Analysis (PCA) Choosing important features (feature importance) We have explained first three algorithms and their implementation in short. Further we will discuss Choosing important features (feature importance) part in detail as it is widely used technique in the data science community. Univariate selection Statistical tests can be used to select those features that have the strongest relationships with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. The following example uses the chi squared (chi^2) statistical test for non-negative features to select four of the best features from the Pima Indians onset of diabetes dataset: #Feature Extraction with Univariate Statistical Tests (Chi-squared for classification) #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's feature selection algorithm from sklearn.feature_selection import SelectKBest #Import chi2 for performing chi square test from sklearn.feature_selection import chi2 #URL for loading the dataset url ="https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] #Create pandas data frame by loading the data from URL dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #We will select the features using chi square test = SelectKBest(score_func=chi2, k=4) #Fit the function for ranking the features by score fit = test.fit(X, Y) #Summarize scores numpy.set_printoptions(precision=3) print(fit.scores_) #Apply the transformation on to dataset features = fit.transform(X) #Summarize selected features print(features[0:5,:]) You can see the scores for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age. Scores for each feature: [111.52   1411.887 17.605 53.108  2175.565   127.669 5.393 181.304] Selected Features: [[148. 0. 33.6 50. ] [85. 0. 26.6 31. ] [183. 0. 23.3 32. ] [89. 94. 28.1 21. ] [137. 168. 43.1 33. ]] Recursive Feature Elimination RFE works by recursively removing attributes and building a model on attributes that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation. The following example uses RFE with the logistic regression algorithm to select the top three features. 
The choice of algorithm does not matter too much as long as it is skillful and consistent:

#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's feature selection algorithm
from sklearn.feature_selection import RFE
#Import LogisticRegression as the model used by RFE
from sklearn.linear_model import LogisticRegression

#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#Create pandas data frame by loading the data from URL
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#Feature extraction
model = LogisticRegression()
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
print("Num Features: %d"% fit.n_features_)
print("Selected Features: %s"% fit.support_)
print("Feature Ranking: %s"% fit.ranking_)

After execution, we will get:

Num Features: 3
Selected Features: [ True False False False False   True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]

You can see that RFE chose the top three features as preg, mass, and pedi. These are marked True in the support_ array and marked with a choice 1 in the ranking_ array.

Principal Component Analysis

PCA uses linear algebra to transform the dataset into a compressed form. Generally, it is considered a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. In the following example, we use PCA and select three principal components:

#Import the required packages
#Import pandas to read csv
import pandas
#Import numpy for array related operations
import numpy
#Import sklearn's PCA algorithm
from sklearn.decomposition import PCA

#URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
#Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
#Create array from data values
array = dataframe.values
#Split the data into input and target
X = array[:,0:8]
Y = array[:,8]
#Feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
#Summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

You can see that the transformed dataset (three principal components) bears little resemblance to the source data:

Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [ -2.26488861e-02  -9.72210040e-01  -1.41909330e-01   5.78614699e-02
    9.46266913e-02  -4.69729766e-02  -8.16804621e-04  -1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]

Choosing important features (feature importance)

Feature importance is the technique used to select features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's understand it in detail.
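As background (this formula is a summary, not text from the book), the impurity measure most commonly used for classification trees is the Gini impurity of a node whose samples have class proportions $p_1, \dots, p_K$:

$$G = 1 - \sum_{k=1}^{K} p_k^2$$

A feature's importance in a single tree is the total impurity decrease it produces, weighted by the fraction of samples reaching each node where it is used to split; averaging this over all trees in a forest gives the mean decrease impurity score that scikit-learn exposes as feature_importances_.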
Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is known as impurity. For classification, it is typically either the Gini impurity or information gain/entropy, and for regression trees, it is the variance. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged and the features ranked according to this measure.

Let's see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset. This dataset is available for free from Kaggle (you will need to sign up to Kaggle to be able to download it). You can download the training dataset, train.csv.zip, from https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into 10 product categories (for example, fashion, electronics, and so on). Input attributes are the counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities for each of the 10 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).

We will start by importing all of the libraries:

#Import the supporting libraries
#Import pandas to load the dataset from csv file
from pandas import read_csv
#Import numpy for array based operations and calculations
import numpy as np
#Import Random Forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier
#Import feature selector class select model of sklearn
from sklearn.feature_selection import SelectFromModel

np.random.seed(1)

Let's define a method to split our dataset into training and testing data; we will train our model on the training part and use the testing part for evaluation of the trained model:

#Function to create Train and Test set from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)
    training = []
    testing = []
    np.random.shuffle(dataset)
    shape = np.shape(dataset)
    trainlength = np.uint16(np.floor(split*shape[0]))
    for i in range(trainlength):
        training.append(dataset[i])
    for i in range(trainlength, shape[0]):
        testing.append(dataset[i])
    training = np.array(training)
    testing = np.array(testing)
    return training, testing

We also need to add a function to evaluate the accuracy of the model; it will take the predicted and actual output as input to calculate the percentage accuracy:

#Function to evaluate model performance
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    acc = float(count)/len(ytest)
    return acc

This is the time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances.
We will use 50,000 instances for our example, of which 35,000 will be used to train the classifier and 15,000 to test its performance:

#Load dataset as pandas data frame
data = read_csv('train.csv')
#Extract attribute names from the data frame
feat = data.keys()
feat_labels = feat.get_values()
#Extract data values from the data frame
dataset = data.values
#Shuffle the dataset
np.random.shuffle(dataset)
#We will select 50000 instances to train the classifier
inst = 50000
#Extract 50000 instances from the dataset
dataset = dataset[0:inst,:]
#Create Training and Testing data for performance evaluation
train,test = getTrainTestData(dataset, 0.7)
#Split data into input and output variable with selected features
Xtrain = train[:,0:94]
ytrain = train[:,94]
shape = np.shape(Xtrain)
print("Shape of the dataset ",shape)
#Print the size of Data in MBs
print("Size of Data set before feature selection: %.2f MB"%(Xtrain.nbytes/1e6))

Let's take note of the data size here: our training set contains about 35,000 instances with 94 attributes, so it is quite large. Let's see:

Shape of the dataset (35000, 94)
Size of Data set before feature selection: 26.32 MB

As you can see, we have 35,000 rows and 94 columns in our dataset, which is more than 26 MB of data.

In the next code block, we will configure our random forest classifier; we will use 250 trees with a maximum depth of 30, and the number of random features will be 7. The other hyperparameters will be sklearn's defaults:

#Let's select the test data for model evaluation purpose
Xtest = test[:,0:94]
ytest = test[:,94]
#Create a random forest classifier with the following Parameters
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2
clf = RandomForestClassifier(n_estimators=trees,
                             max_features=max_feat,
                             max_depth=max_depth,
                             min_samples_split=min_sample,
                             random_state=0,
                             n_jobs=-1)
#Train the classifier and calculate the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()
#Let's note down the model training time
print("Execution time for building the Tree is: %f"%(float(end)-float(start)))
pre = clf.predict(Xtest)

Let's see how much time is required to train the model on the training dataset:

Execution time for building the Tree is: 2.913641

#Evaluate the model performance for the test data
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is %.2f"%(100*acc))

The accuracy of our model is:

Accuracy of model before feature selection is 98.82

As you can see, we are getting very good accuracy, as we are classifying almost 99% of the test data into the correct categories. This means we are classifying about 14,823 instances out of 15,000 in the correct classes.

So, now my question is: should we go for further improvement? Well, why not? We should definitely go for more improvement if we can; here, we will use feature importance to select features. As you know, in the tree building process, we use an impurity measure for node selection, and the attribute value that has the lowest impurity is chosen as the node in the tree. We can use similar criteria for feature selection: we can give more importance to features that have less impurity, and this can be done using the feature_importances_ attribute of the sklearn classifier.
Let's find out the importance of each feature:

#Once we have trained the model we will rank all the features
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

('id', 0.33346650420175183)
('feat_1', 0.0036186958628801214)
('feat_2', 0.0037243050888530957)
('feat_3', 0.011579217472062748)
('feat_4', 0.010297382675187445)
('feat_5', 0.0010359139416194116)
('feat_6', 0.00038171336038056165)
('feat_7', 0.0024867672489765021)
('feat_8', 0.0096689721610546085)
('feat_9', 0.007906150362995093)
('feat_10', 0.0022342480802130366)

As you can see here, each feature has a different importance based on its contribution to the final prediction. We will use these importance scores to rank our features; in the following part, we will select those features that have a feature importance of more than 0.01 for model training:

#Select features which have higher contribution in the final prediction
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)

Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset. Then, we will check the size and shape of the new dataset:

#Transform input dataset
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)
#Let's see the size and shape of new dataset
print("Size of Data set after feature selection: %.2f MB"%(Xtrain_1.nbytes/1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ",shape)

Size of Data set after feature selection: 5.60 MB
Shape of the dataset (35000, 20)

Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the dataset from 26 MB to 5.60 MB: about an 80% reduction from the original dataset.

In the next code block, we will train a new random forest classifier with the same hyperparameters as earlier and test it on the testing dataset. Let's see what accuracy we get after modifying the training set:

#Model training time
start = time.time()
clf.fit(Xtrain_1, ytrain)
end = time.time()
print("Execution time for building the Tree is: %f"%(float(end)-float(start)))
#Let's evaluate the model on test data
pre = clf.predict(Xtest_1)
count = 0
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection %.2f"%(100*acc2))

Execution time for building the Tree is: 1.711518
Accuracy after feature selection 99.97

We have got 99.97 percent accuracy with the modified dataset, which means we are classifying 14,996 instances in the correct classes, while previously we were classifying only 14,823 instances correctly. This is a huge improvement from the feature selection process; we can summarize all the results in the following table:

Evaluation criteria  | Before feature selection | After feature selection
Number of features   | 94                       | 20
Size of dataset      | 26.32 MB                 | 5.60 MB
Training time        | 2.91 seconds             | 1.71 seconds
Accuracy             | 98.82 percent            | 99.97 percent

The preceding table shows the practical advantages of feature selection. You can see that we have reduced the number of features significantly, which reduces the model complexity and the dimensions of the dataset. We get less training time after the reduction in dimensions, and, at the end, we have overcome the overfitting issue, getting higher accuracy than before.

To summarize the article, we explored 4 ways of feature selection in machine learning.
If you found this post useful, do check out the book Ensemble Machine Learning to know more about stacked generalization, among other techniques.


6 Popular Regression Techniques you must know

Amey Varangaonkar
15 Feb 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Statistics for Data Science, authored by IBM expert James D. Miller. This book gives a statistical view of building smart data models to help you get unique insights from the data.[/box] In this article, we introduce you to the concept of regression analysis, one of the most popular machine learning algorithms. -You will learn what is regression analysis, the different types of regression, and how to choose the right regression technique to build your data model. What is Regression Analysis? For starters, regression analysis or statistical regression is a process for estimating the relationships among variables. This process encompasses numerous techniques for modeling and analyzing variables, focusing on the relationship between a dependent variable and one (or more) independent variable (or predictors). Regression analysis is the work done to identify and understand how the (best representative) value of a dependent variable (a variable that depends on other factors) changes when any one of the independent variables (a variable that stands alone and isn't changed by the other variables) is changed while the other independent variables stay the same. A simple example might be how the total dollars spent on marketing (an independent variable example) impacts the total sales dollars (a dependent variable example) over a period of time (is it really as simple as more marketing equates to higher sales?), or perhaps there is a correlation between the total marketing dollars spent (independent variable), discounting a products price (another independent variable), and the amount of sales (a dependent variable)? [box type="info" align="" class="" width=""]Keep in mind this key point that regression analysis is used to understand which among the independent variables are related to the dependent variable(s), not just the relationship of these variables. Also, the inference of causal relationships (between the independent and dependent variables) is an important objective. However, this can lead to illusions or false relationships, so caution is recommended![/box] Overall, regression analysis can be thought of as estimating the conditional expectations of the value of the dependent variable, given the independent variables being observed, that is, endeavoring to predict the average value of the dependent variable when the independent variables are set to certain values. I call this the lever affect—meaning when one increases or decreases a value of one component, it directly affects the value at least one other (variable). An alternate objective of the process of regression analysis is the establishment of location parameters or the quantile of a distribution. In other words, this idea is to determine values that may be a cutoff, dividing a range of a probability distribution values. You'll find that regression analysis can be a great tool for prediction and forecasting (not just complex machine learning applications). We'll explore some real-world examples later, but for now, let's us look at some techniques for the process. Popular regression techniques and approaches You'll find that various techniques for carrying out regression analysis have been developed and accepted.These are: Linear Logistic Polynomial Stepwise Ridge Lasso Linear regression Linear regression is the most basic type of regression and is commonly used for predictive analysis projects. 
In fact, when you are working with a single predictor (variable), we call it simple linear regression, and if there are multiple predictor variables, we call it multiple linear regression. Simply put, linear regression uses linear predictor functions whose values are estimated from the data in the model.

Logistic regression

Logistic regression is a regression model where the dependent variable is a categorical variable. This means that the variable has only two possible values, for example, pass/fail, win/lose, alive/dead, or healthy/sick. If the dependent variable has more than two possible values, one can use various modified logistic regression techniques, such as multinomial logistic regression, ordinal logistic regression, and so on.

Polynomial regression

When we speak of polynomial regression, the focus of this technique is on modeling the relationship between the independent variable and the dependent variable as an nth degree polynomial. Polynomial regression is considered to be a special case of multiple linear regression. The predictors resulting from the polynomial expansion of the baseline predictors are known as interactive features.

Stepwise regression

Stepwise regression is a technique that uses some kind of automated procedure to continually execute a step of logic, that is, during each step, a variable is considered for addition to or subtraction from the set of independent variables based on some prespecified criterion.

Ridge regression

Often, predictor variables are identified as being interrelated. When this occurs, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which ones are left out. Ridge regression is a technique where a small bias factor is added to the selected variables in order to improve this situation. Therefore, ridge regression is actually considered a remedial measure to alleviate multicollinearity amongst predictor variables.

Lasso regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression is a technique where both predictor variable selection and regularization are performed in order to improve the prediction accuracy and interpretability of the result it produces.

Which technique should I choose?

In addition to the aforementioned regression techniques, there are numerous others to consider, with, most likely, more to come. With so many options, it's important to choose the technique that is right for your data and your project. Rather than selecting the right regression approach, it is more about selecting the most effective regression approach.

Typically, you use the data to identify the regression approach you'll use. You start by establishing statistics or a profile for your data. With this effort, you need to identify and understand the importance of the different variables, their relationships, coefficient signs, and their effects. Overall, here's some generally good advice for choosing the right regression approach for your project:

Copy what others have done and had success with. Do the research. Incorporate the results of other projects into yours. Don't reinvent the wheel. Also, even if an observed approach doesn't quite fit as it was used, perhaps some simple adjustments would make it a good choice.

Keep your approach as simple as possible. Many studies show that simpler models generally produce better predictions. Start simple, and only make the model more complex as needed.
The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers.

Check your work. As you evaluate methods, check the residual plots (more on this in the next section of this chapter) because they can help you avoid inadequate models and adjust your model for better results.

Use your subject matter expertise. No statistical method can understand the underlying process or subject area the way you do. Your knowledge is a crucial part and, most likely, the most reliable way of determining the best regression approach for your project.

Does it fit?

After selecting a model that you feel is appropriate for use with your data (also known as determining that the approach is the best fit), you need to validate your selection, that is, determine its fit. A well-fitting regression model results in predicted values close to the observed data values. The mean model (which uses the mean for every predicted value) would generally be used if there were no informative predictor variables. The fit of a proposed regression model should, therefore, be better than the fit of the mean model. As a data scientist, you will need to scrutinize the coefficients of determination, measure the standard error of estimate, and analyze the significance of the regression parameters and confidence intervals.

[Note: Remember that the better the fit of a regression model, most likely the better the precision in, or just better, the results.]

Finally, it has been proven that simple models produce more accurate results! Keep this in mind always when selecting an approach or a technique; even when the problem might be complex, it is not always obligatory to adopt a complex regression approach. Choosing the right technique, though, goes a long way in developing an accurate model.

If you found this excerpt useful, make sure to check out this book Statistics for Data Science for tips on building effective data models by leveraging the power of statistical tools and techniques.
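To make a few of the techniques above concrete, here is a minimal, purely illustrative scikit-learn sketch (in Python; it is not from the book) that fits ordinary linear, ridge, and lasso regression to the same synthetic data and compares the coefficients each one learns:

```python
# Fit three of the regression flavours discussed above and compare their
# coefficients; ridge and lasso shrink them toward zero in different ways.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))                      # three predictors
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
```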

How to use R to boost your Data Model

Amey Varangaonkar
15 Feb 2018
8 min read
[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Statistics for Data Science, written by James D. Miller. This book is a comprehensive primer on the basic concepts of statistics and their application in different data science tasks.[/box] In this article, we explain the implementation of boosting - a popular technique used to improve the performance of a data model - using the popular R programming language. We will take a high-level look at a thought-provoking prediction problem drawn from Mastering Predictive Analytics with R, Second Edition, by James D. Miller and Rui Miguel Forte. Here, an original example of patterns made by radiation on a telescope camera are analyzed in an attempt to predict whether a certain pattern came from gamma rays leaking into the atmosphere or from regular background radiation. Gamma rays leave distinctive elliptical patterns and so we can create a set of features to describe these. The dataset used is the MAGIC Gamma Telescope Data Set, hosted by the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope This data consists of 19,020 observations, holding the following list of attributes: Prepping the data First, various steps need to be performed on our example data. The data is first loaded into an R data frame object named magic, recoding the CLASS output variable to use classes 1 and -1 for gamma rays and background radiation respectively: > magic <- read.csv("magic04.data", header = FALSE) > names(magic) <- c("FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1",  "FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST", "CLASS") > magic$CLASS <- as.factor(ifelse(magic$CLASS =='g', 1, -1)) Next, the data is split into two files: a training data and a test data frame using an 80-20 split: > library(caret) > set.seed(33711209) > magic_sampling_vector <- createDataPartition(magic$CLASS,  p = 0.80, list = FALSE) > magic_train <- magic[magic_sampling_vector, 1:10] > magic_train_output <- magic[magic_sampling_vector, 11] > magic_test <- magic[-magic_sampling_vector, 1:10] > magic_test_output <- magic[-magic_sampling_vector, 11] The model used for boosting is a simple multilayer perceptron with a single hidden layer leveraging R's nnet package. Neural networks, often produce higher accuracy when inputs are normalized, so, in this example, before training any models, this preprocessing is performed: > magic_pp <- preProcess(magic_train, method = c("center",  "scale")) > magic_train_pp <- predict(magic_pp, magic_train) > magic_train_df_pp <- cbind(magic_train_pp,  CLASS = magic_train_output) > magic_test_pp <- predict(magic_pp, magic_test) Training Boosting is designed to work best with weak learners, so a very small number of hidden neurons in the model's hidden layer are used. Concretely, we will begin with the simplest possible multilayer perceptron that uses a single hidden neuron. To understand the effect of using boosting, a baseline performance is established by training a single neural network (and measuring its performance). This is to accomplish the following: > library(nnet) > n_model <- nnet(CLASS ~ ., data = magic_train_df_pp, size = 1) > n_test_predictions <- predict(n_model, magic_test_pp,  type = "class") > (n_test_accuracy <- mean(n_test_predictions ==  magic_test_output))  [1] 0.7948988 This establishes that we have a baseline accuracy of around 79.5 percent. Not too bad, but can boost to improve upon this score? 
To that end, the function AdaBoostNN(), shown as follows, is used. This function takes as input a training data frame, the name of the output column, the number M of single-hidden-layer neural network models to build, and the number of hidden units those networks will have. The function then applies the AdaBoost algorithm and returns a list of models with their corresponding weights. Here is the function:

AdaBoostNN <- function(training_data, output_column, M, hidden_units) {
  require("nnet")
  models <- list()
  alphas <- list()
  n <- nrow(training_data)
  model_formula <- as.formula(paste(output_column, '~ .', sep = ''))
  w <- rep((1/n), n)
  for (m in 1:M) {
    model <- nnet(model_formula, data = training_data,
                  size = hidden_units, weights = w)
    models[[m]] <- model
    predictions <- as.numeric(predict(
      model, training_data[, -which(names(training_data) == output_column)],
      type = "class"))
    errors <- predictions != training_data[, output_column]
    error_rate <- sum(w * as.numeric(errors)) / sum(w)
    alpha <- 0.5 * log((1 - error_rate) / error_rate)
    alphas[[m]] <- alpha
    temp_w <- mapply(function(x, y)
      if (y) { x * exp(alpha) } else { x * exp(-alpha) }, w, errors)
    w <- temp_w / sum(temp_w)
  }
  return(list(models = models, alphas = unlist(alphas)))
}

The preceding function uses the following logic. First, initialize empty lists of models and model weights (alphas). Compute the number of observations in the training data, storing this in the variable n. The name of the output column provided is then used to create a formula that describes the neural network that will be built. In the dataset used, this formula will be CLASS ~ ., meaning that the neural network will compute CLASS as a function of all the other columns as input features.

Next, initialize the weights vector and define a loop that will run for M iterations in order to build M models. In every iteration, the first step is to use the current setting of the weights vector to train a neural network using as many hidden units as specified in the input, hidden_units. Then, compute a vector of predictions that the model generates on the training data using the predict() function. By comparing these predictions to the output column of the training data, calculate the errors that the current model makes on the training data, which then allows the computation of the error rate. This error rate is used to compute the weight (alpha) of the current model and, finally, the observation weights to be used in the next iteration of the loop are updated according to whether each observation was correctly classified. The weight vector is then normalized and we are ready to begin the next iteration! After completing M iterations, the function outputs a list of models and their corresponding model weights.

Ready for boosting

We now have a function able to train our ensemble classifier using AdaBoost, but we also need a function to make the actual predictions. This function will take in the output list produced by our training function, AdaBoostNN(), along with a test dataset.
This function is AdaBoostNN.predict() and it is shown as follows:
AdaBoostNN.predict <- function(ada_model, test_data) {
  models <- ada_model$models
  alphas <- ada_model$alphas
  prediction_matrix <- sapply(models, function(x)
    as.numeric(predict(x, test_data, type = "class")))
  weighted_predictions <- t(apply(prediction_matrix, 1,
    function(x) mapply(function(y, z) y * z, x, alphas)))
  final_predictions <- apply(weighted_predictions, 1,
    function(x) sign(sum(x)))
  return(final_predictions)
}
This function first extracts the models and the model weights from the list produced by the previous function. A matrix of predictions is created, where each column corresponds to the vector of predictions made by a particular model. Thus, there will be as many columns in this matrix as the models that we used for boosting. We then multiply the predictions produced by each model with their corresponding model weight. For example, every prediction from the first model is in the first column of the prediction matrix and will have its value multiplied by the first model weight α1. Lastly, the matrix of weighted predictions is reduced into a single vector of predictions by summing the weighted predictions for each observation and taking the sign of the result. This vector of predictions is then returned by the function. As an experiment, we will train ten neural network models with a single hidden unit and see if boosting improves accuracy:
> ada_model <- AdaBoostNN(magic_train_df_pp, 'CLASS', 10, 1)
> predictions <- AdaBoostNN.predict(ada_model, magic_test_pp)
> mean(predictions == magic_test_output)
[1] 0.804365
We see that, in this example, boosting ten models yields a marginal improvement in accuracy, but perhaps training more models might make more of a difference.
What we learned
From the preceding example, you may conclude that, for the neural networks with one hidden unit, as the number of boosting models increases, we see an improvement in accuracy, but after 100 models this tapers off and is actually slightly lower for 200 models. The improvement over the baseline of a single model is substantial for these networks. When we increase the complexity of our learner by having a hidden layer with three hidden neurons, we get a much smaller improvement in performance. At 200 models, both ensembles perform at a similar level, indicating that, at this point, our accuracy is being limited by the type of model trained. If you found this article useful, make sure to check out the book Statistics for Data Science for interesting statistical techniques and their implementation in R.
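As a side note for readers who work in Python rather than R, the same experiment can be sketched with scikit-learn's built-in AdaBoost implementation. This is only a rough equivalent, not the book's code: it assumes the magic04.data file is available locally and uses scikit-learn's default weak learner (a decision stump) instead of a single-neuron network.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Load the MAGIC dataset with the same column names used above
cols = ["FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1",
        "FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST", "CLASS"]
magic = pd.read_csv("magic04.data", header=None, names=cols)
magic["CLASS"] = magic["CLASS"].map({"g": 1, "h": -1})  # gamma = 1, background = -1

X_train, X_test, y_train, y_test = train_test_split(
    magic[cols[:-1]], magic["CLASS"], test_size=0.2, random_state=42)

# Boost ten weak learners, mirroring the AdaBoostNN(..., M = 10, ...) call
model = AdaBoostClassifier(n_estimators=10).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))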

Why choose R for your data mining project
Fatema Patrawala
15 Feb 2018
9 min read
[box type="note" align="" class="" width=""]Our article is an excerpt taken from the book R Data Mining, written by Andrea Cirillo. If you are a budding data scientist or a data analyst with basic knowledge of R, and you want to get into the intricacies of data mining in a practical manner, be sure to check out this book.[/box] In today’s post, we will analyze R's strengths, and understand why it is a savvy idea to learn this programming language for data mining. R's strengths You know that R is really popular, but why? R is not the only data analysis language out there, and neither is it the oldest one; so why is it so popular? If looking at the root causes of R's popularity, we definitely have to mention these three: Open source inside Plugin ready Data visualization friendly Open source inside One of the main reasons the adoption of R is spreading is its open source nature. R binary code is available for everyone to download, modify, and share back again (only in an open source way). Technically, R is released with a GNU general public license, meaning that you can take it and use it for whatever purpose; but you have to share every derivative with a GNU general public license as well. These attributes fit well for almost every target user of a statistical analysis language: Academic user: Knowledge sharing is a must for an academic environment, and having the ability to share work without the worry of copyright and license questions makes R very practical for academic research purposes Business user: Companies are always worried about budget constraints; having professional statistical analysis software at their disposal for free sounds like a dream come true Private user: This user merges together both of the benefits already mentioned, because they will find it great to have a free instrument with which to learn and share their own statistical analyses Plugin ready You could imagine the R language as an expandable board game. You know, games like 7 Wonders or Carcassonne, with a base set of characters and places and further optional places and characters, increasing the choices at your disposal and maximizing the fun. The R language can be compared to this kind of game. There is a base version of R, containing a group of default packages that are delivered along with the standard version of the software (you can skip to the Installing R and writing R code section for more on how to obtain and install it). The functionalities available through the base version are mainly related to file system manipulation, statistical analysis, and data Visualization. While this base version is regularly maintained and updated by the R core team, virtually every R user can add further new functionalities to those available within the package, developing and sharing custom packages. This is basically how the package development and sharing flow works: The R user develops a new package, for example a package introducing a new machine learning algorithm exposed within a freshly published academic paper. The user submits the package to the CRAN repository or a similar repository. The Comprehensive R Archive Network (CRAN) is the official repository for R related documents and packages. Every R user can gain access to the additional features introduced with any given package, installing and loading them into their R environment. 
If the package has been submitted to CRAN, installing and loading the package will result in running just the two following lines of R code (similar commands are available for alternative repositories such as Bioconductor): install.packages("ggplot2") library(ggplot2) As you can see, this is a really convenient and effective way to expand R functionalities, and you will soon see how wide the range of functionalities added through additional packages developed by R users is. More than 9,000 packages are available on CRAN, and this number is sure to increase further, making more and more additional features available to the R community. Data visualization friendly As a discipline data visualization encompasses all of the principles and techniques employable to effectively display the information and messages contained within a set of data. Since we are living in an information-heavy age, the ability to effectively and concisely communicate articulated and complex messages through data visualization is a core asset for any professional. This is exactly why R is experiencing a great response in academic and professional fields: the data visualization capabilities of R place it at the cutting edge of these fields. R has been noticed for its amazing data visualization features right from its beginning; when some of its peers still showed x axes-built aggregating + signs, R was already able to produce astonishing 3D plots. Nevertheless, a major improvement of R as a data visualization technique came when Auckland's Hadley Wickham developed the highly famous ggplot2 package based on The Grammar of Graphics, introducing into the R world an organic framework for data visualization tasks: This package alone introduced the R community to a highly flexible way of producing and visualizing almost every kind of data visualization, having also been designed as an expandable tool, in order to add the possibility of incorporating new data visualization techniques as soon as they emerge. Finally, ggplot2 gives you the ability to highly customize your plot, adding every kind of graphical or textual annotation to it. Nowadays, R is being used by the biggest tech companies, such as Facebook and Google, and by widely circulated publications such as the Economist and the New York Times to visualize their data and convey their information to their stakeholders and readers. To sum all this up—should you invest your precious time learning R? If you are a professional or a student who could gain advantages from knowing effective and cutting-edge techniques to manipulate, model, and present data, I can only give you a positive opinion: yes. You should definitely learn R, and consider it a long-term investment, since the points of strength we have seen place it in a great position to further expand its influence in the coming years in every industry and academic field. Engaging with the community to learn R Now that we are aware of R’s popularity we need to engage with the community to take advantage of it. 
We will look at alternative and non-exclusive ways of engaging with the community: Employing community-driven learning material Asking for help from the community Staying ahead of language developments Employing community-driven learning material: There are two main kinds of R learning materials developed by the community: Papers, manuals, and books Online interactive courses Papers, manuals, and books: The first one is for sure the more traditional one, but you shouldn't neglect it, since those kinds of learning materials are always able to give you a more organic and systematic understanding of the topics they treat. You can find a lot of free material online in the form of papers, manuals, and books. Let me point out to you the more useful ones: Advanced R R for Data Science Introduction to Statistical Learning OpenIntro Statistics The R Journal Online interactive courses: This is probably the most common learning material nowadays. You can find different platforms delivering good content on the R language, the most famous of which are probably DataCamp, Udemy, and Packt itself. What all of them share is a practical and interactive approach that lets you learn the topic directly, applying it through exercises rather than passively looking at someone explaining theoretical stuff. Asking for help from the community: As soon as you start writing your first lines of R code, and perhaps before you even actually start writing it, you will come up with some questions related to your work. The best thing you can do when this happens is to resort to the community to solve those questions. You will probably not be the first one to come up with that question, and you should therefore first of all look online for previous answers to your question. Where should you look for answers? You can look everywhere, but most of the time you will find the answer you are looking for on one of the following (listed by the probability of finding the answer there): Stack Overflow R-help mailing list R packages documentation I wouldn't suggest you look for answers on Twitter, G+, and similar networks, since they were not conceived to handle these kinds of processes and you will expose yourself to the peril of reading answers that are out of date, or simply incorrect, because no review system is considered. If it is the case that you are asking an innovative question never previously asked by anyone, first of all, congratulations! That said, in that happy circumstance, you can ask your question in the same places that you previously looked for answers. Staying ahead of language developments: The R language landscape is constantly changing, thanks to the contributions of many enthusiastic users who take it a step further every day. How can you stay ahead of those changes? This is where social networks come in handy. Following the #rstats hashtag on Twitter, Google+ groups, and similar places, will give you the pulse of the language. Moreover, you will find the R-bloggers aggregator, which delivers a daily newsletter comprised of the R-related blog posts that were published the previous day really useful. Finally, annual R conferences and similar occasions constitute a great opportunity to get in touch with the most notorious R experts, gaining from them useful insights and inspiring speeches about the future of the language. To summarize, we looked why to choose R as your programming language for data mining and how we can engage with the R community. 
If you think this post is useful, you may further check out this book R Data Mining, to leverage data mining techniques across many different industries, including finance, medicine, scientific research, and more.    

How to make efficient data-driven decisions
Aaron Lazar
15 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an extract from the book Predictive Analytics with TensorFlow, authored by Md. Rezaul Karim. The book will help you build, tune, and deploy predictive data models with TensorFlow.[/box] Today we’ll learn to take decisions driven by data with the help of few examples. The growing demand for data is a key challenge. Decision support teams such as institutional research and business intelligence often cannot take the right decisions on how to expand their business and research outcomes from a huge collection of data. Although data plays an important role in driving the decision, however, in reality, taking the right decision at right time is the goal. In other words, the goal is the decision support, not the data support. This can be achieved through an advanced use of data management and analytics. Data value chain for making decisions The following diagram in figure 1 (source: H. Gilbert Miller and Peter Mork, From Data to Decisions: A Value Chain for Big Data, Proc. Of IT Professional, Volume: 15, Issue: 1, Jan.-Feb. 2013, DOI: 10.1109/MITP.2013.11) shows the data chain towards taking actual decisions–that is, the goal. The value chains start through the data discovery stage consisting of several steps such as data collection and annotating data preparation, and then organizing them in a logical order having the desired flow. Then comes the data integration for establishing a common data representation of the data. Since the target is to take the right decision, for future reference having the appropriate provenance of the data–that is, where it comes from, is important: Well, now your data is somehow integrated into a presentable format, it's time for the data exploration stage, which consists of several steps such as analyzing the integrated data and visualization before taking the actions to take on the basis of the interpreted results. However, is this enough before taking the right decision? Probably not! The reason is that it lacks enough analytics, which eventually helps to take the decision with an actionable insight. Predictive analytics comes in here to fill the gap between. Now let's see an example of how in the following section. From disaster to decision – Titanic survival example Here is the challenge, Titanic–Machine Learning from Disaster from Kaggle (https://www.kaggle.com/c/titanic): "The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy" But going into this deeper, we need to know about the data of passengers travelling in the Titanic during the disaster so that we can develop a predictive model that can be used for survival analysis. The dataset can be downloaded from the preceding URL. 
Table 1 here shows the metadata about the Titanic survival dataset: A snapshot of the dataset can be seen as follows: The ultimate target of using this dataset is to predict what kind of people survived the Titanic disaster. However, a bit of exploratory analysis of the dataset is a mandate. At first, we need to import necessary packages and libraries: import pandas as pd import matplotlib.pyplot as plt import numpy as np Now read the dataset and create a panda's DataFrame: df = pd.read_csv('/home/asif/titanic_data.csv') Before drawing the distribution of the dataset, let's specify the parameters for the graph: fig = plt.figure(figsize=(18,6), dpi=1600) alpha=alpha_scatterplot = 0.2 alpha_bar_chart = 0.55 fig = plt.figure() ax = fig.add_subplot(111) Draw a bar diagram for showing who survived versus who did not: ax1 = plt.subplot2grid((2,3),(0,0)) ax1.set_xlim(-1, 2) df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart) plt.title("Survival distribution: 1 = survived") Plot a graph showing survival by Age: plt.subplot2grid((2,3),(0,1)) plt.scatter(df.Survived, df.Age, alpha=alpha_scatterplot) plt.ylabel("Age") plt.grid(b=True, which='major', axis='y') plt.title("Survival by Age: 1 = survived") Plot a graph showing distribution of the passengers classes: ax3 = plt.subplot2grid((2,3),(0,2)) df.Pclass.value_counts().plot(kind="barh", alpha=alpha_bar_chart) ax3.set_ylim(-1, len(df.Pclass.value_counts())) plt.title("Class dist. of the passengers") Plot a kernel density estimate of the subset of the 1st class passengers' age: plt.subplot2grid((2,3),(1,0), colspan=2) df.Age[df.Pclass == 1].plot(kind='kde') df.Age[df.Pclass == 2].plot(kind='kde') df.Age[df.Pclass == 3].plot(kind='kde') plt.xlabel("Age") plt.title("Age dist. within class") plt.legend(('1st Class', '2nd Class','3rd Class'),loc='best') Plot a graph showing passengers per boarding location: ax5 = plt.subplot2grid((2,3),(1,2)) df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart) ax5.set_xlim(-1, len(df.Embarked.value_counts())) plt.title("Passengers per boarding location") Finally, we show all the subplots together: plt.show() >>> The figure shows the survival distribution, survival by age, age distribution, and the passengers per boarding location: However, to execute the preceding code, you need to install several packages such as matplotlib, pandas, and scipy. They are listed as follows: Installing pandas: Pandas is a Python package for data manipulation. It can be installed as follows: $ sudo pip3 install pandas #For Python 2.7, use the following: $ sudo pip install pandas Installing matplotlib: In the preceding code, matplotlib is a plotting library for mathematical objects. It can be installed as follows: $ sudo apt-get install python-matplotlib # for Python 2.7 $ sudo apt-get install python3-matplotlib # for Python 3.x Installing scipy: Scipy is a Python package for scientific computing. Installing blas and lapack and gfortran are a prerequisite for this one. Now just execute the following command on your terminal: $ sudo apt-get install libblas-dev liblapack-dev $ sudo apt-get install gfortran $ sudo pip3 install scipy # for Python 3.x $ sudo pip install scipy # for Python 2.7 For Mac, use the following command to install the above modules: $ sudo easy_install pip $ sudo pip install matplotlib $ sudo pip install libblas-dev liblapack-dev $ sudo pip install gfortran $ sudo pip install scipy For windows, I am assuming that Python 2.7 is already installed at C:Python27. 
Then open the command prompt and type the following commands:
C:\Users\admin-karim>cd C:\Python27
C:\Python27>python -m pip install <package_name> # provide the package name accordingly
For Python 3, issue the following commands:
C:\Users\admin-karim>cd C:\Users\admin-karim\AppData\Local\Programs\Python\Python35\Scripts
C:\Users\admin-karim\AppData\Local\Programs\Python\Python35\Scripts>python3 -m pip install <package_name>
Well, we have seen the data. Now it's your turn to do some analytics on top of it, say, predicting what kind of people survived the disaster. We have enough information about the passengers, but how could we do the predictive modeling so that we can draw some fairly straightforward conclusions from this data? For example, being a woman, being in 1st class, and being a child were all factors that could boost a passenger's chances of survival during this disaster. In a brute-force approach, for example using if/else statements with some sort of weighted scoring system, you could write a program to predict whether a given passenger would survive. However, does writing such a program in Python make much sense? Naturally, it would be very tedious to write, difficult to generalize, and would require extensive fine-tuning for each variable and sample (that is, passenger). This is where predictive analytics with machine learning algorithms and emerging tools comes in, so that you can build a program that learns from sample data to predict whether a given passenger would survive. If you found this post useful and would like to explore more, head over to grab the book Predictive Analytics with TensorFlow, written by Md. Rezaul Karim.
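To make the contrast with hand-written rules concrete, here is a minimal scikit-learn sketch (not the book's TensorFlow model) that learns survival from a few passenger attributes. It assumes the same titanic_data.csv loaded earlier, with the Survived, Pclass, Sex, and Age columns present.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic_data.csv")
df = df[["Survived", "Pclass", "Sex", "Age"]].dropna()
df["Sex"] = (df["Sex"] == "female").astype(int)  # 1 = female, 0 = male

X_train, X_test, y_train, y_test = train_test_split(
    df[["Pclass", "Sex", "Age"]], df["Survived"],
    test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
# The learned coefficients roughly encode "female, higher class, younger"
# as factors increasing the predicted chance of survival.
print(dict(zip(["Pclass", "Sex", "Age"], model.coef_[0].round(2))))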

Build a generative chatbot using recurrent neural networks (LSTM RNNs)
Savia Lobo
15 Feb 2018
8 min read
In today’s tutorial we will learn to build generative chatbot using recurrent neural networks. The RNN used here is Long Short Term Memory(LSTM). Generative chatbots are very difficult to build and operate. Even today, most workable chatbots are retrieving in nature; they retrieve the best response for the given question based on semantic similarity, intent, and so on. For further reading, refer to the paper Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation by Kyunghyun Cho et. al. (https://arxiv.org/pdf/1406.1078.pdf). [box type="note" align="" class="" width=""]This article is an excerpt from a book written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled Natural Language Processing with Python Cookbook. In this book you will come across various recipes covering natural language understanding, Natural Language Processing, and syntactic analysis.[/box] Getting ready... The A.L.I.C.E Artificial Intelligence Foundation dataset bot.aiml Artificial Intelligence Markup Language (AIML), which is customized syntax such as XML file has been used to train the model. In this file, questions and answers are mapped. For each question, there is a particular answer. Complete .aiml files are available at aiml-en-us-foundation-alice.v1-9 from https://code.google.com/archive/p/aiml-en-us-foundation-alice/downloads. Unzip the folder to see the bot.aiml file and open it using Notepad. Save as bot.txt to read in Python: >>> import os """ First change the following directory link to where all input files do exist """ >>> os.chdir("C:UsersprataDocumentsbook_codesNLP_DL") >>> import numpy as np >>> import pandas as pd # File reading >>> with open('bot.txt', 'r') as content_file: ... botdata = content_file.read() >>> Questions = [] >>> Answers = [] AIML files have unique syntax, similar to XML. The pattern word is used to represent the question and the template word for the answer. Hence, we are extracting respectively: >>> for line in botdata.split("</pattern>"): ... if "<pattern>" in line: ... Quesn = line[line.find("<pattern>")+len("<pattern>"):] ... Questions.append(Quesn.lower()) >>> for line in botdata.split("</template>"): ... if "<template>" in line: ... Ans = line[line.find("<template>")+len("<template>"):] ... Ans = Ans.lower() ... Answers.append(Ans.lower()) >>> QnAdata = pd.DataFrame(np.column_stack([Questions,Answers]),columns = ["Questions","Answers"]) >>> QnAdata["QnAcomb"] = QnAdata["Questions"]+" "+QnAdata["Answers"] >>> print(QnAdata.head()) The question and answers are joined to extract the total vocabulary used in the modeling, as we need to convert all words/characters into numeric representation. The reason is the same as mentioned before—deep learning models can't read English and everything is in numbers for the model. How to do it... After extracting the question-and-answer pairs, the following steps are needed to process the data and produce the results: Preprocessing: Convert the question-and-answer pairs into vectorized format, which will be utilized in model training. Model building and validation: Develop deep learning models and validate the data. Prediction of answers from trained model: The trained model will be used to predict answers for given questions. How it works... 
The question and answers are utilized to create the vocabulary of words to index mapping, which will be utilized for converting words into vector mappings: # Creating Vocabulary >>> import nltk >>> import collections >>> counter = collections.Counter() >>> for i in range(len(QnAdata)): ... for word in nltk.word_tokenize(QnAdata.iloc[i][2]): ... counter[word]+=1 >>> word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())} >>> idx2word = {v:k for k,v in word2idx.items()} >>> idx2word[0] = "PAD" >>> vocab_size = len(word2idx)+1 >>> print (vocab_size) Encoding and decoding functions are used to convert text to indices and indices to text respectively. As we know, Deep learning models work on numeric values rather than text or character data: >>> def encode(sentence, maxlen,vocab_size): ... indices = np.zeros((maxlen, vocab_size)) ... for i, w in enumerate(nltk.word_tokenize(sentence)): ... if i == maxlen: break ... indices[i, word2idx[w]] = 1 ... return indices >>> def decode(indices, calc_argmax=True): ... if calc_argmax: ... indices = np.argmax(indices, axis=-1) ... return ' '.join(idx2word[x] for x in indices) The following code is used to vectorize the question and answers with the given maximum length for both questions and answers. Both might be different lengths. In some pieces of data, the question length is greater than answer length, and in a few cases, it's length is less than answer length. Ideally, the question length is good to catch the right answers. Unfortunately in this case, question length is much less than the answer length, which is a very bad example to develop generative models: >>> question_maxlen = 10 >>> answer_maxlen = 20 >>> def create_questions(question_maxlen,vocab_size): ... question_idx = np.zeros(shape=(len(Questions),question_maxlen, vocab_size)) ... for q in range(len(Questions)): ... question = encode(Questions[q],question_maxlen,vocab_size) ... question_idx[q] = question ... return question_idx >>> quesns_train = create_questions(question_maxlen=question_maxlen, vocab_size=vocab_size) >>> def create_answers(answer_maxlen,vocab_size): ... answer_idx = np.zeros(shape=(len(Answers),answer_maxlen, vocab_size)) ... for q in range(len(Answers)): ... answer = encode(Answers[q],answer_maxlen,vocab_size) ... answer_idx[q] = answer ... return answer_idx >>> answs_train = create_answers(answer_maxlen=answer_maxlen,vocab_size= vocab_size) >>> from keras.layers import Input,Dense,Dropout,Activation >>> from keras.models import Model >>> from keras.layers.recurrent import LSTM >>> from keras.layers.wrappers import Bidirectional >>> from keras.layers import RepeatVector, TimeDistributed, ActivityRegularization The following code is an important part of the chatbot. Here we have used recurrent networks, repeat vector, and time-distributed networks. The repeat vector used to match dimensions of input to output values. 
Whereas time-distributed networks are used to change the column vector to the output dimension's vocabulary size: >>> n_hidden = 128 >>> question_layer = Input(shape=(question_maxlen,vocab_size)) >>> encoder_rnn = LSTM(n_hidden,dropout=0.2,recurrent_dropout=0.2) (question_layer) >>> repeat_encode = RepeatVector(answer_maxlen)(encoder_rnn) >>> dense_layer = TimeDistributed(Dense(vocab_size))(repeat_encode) >>> regularized_layer = ActivityRegularization(l2=1)(dense_layer) >>> softmax_layer = Activation('softmax')(regularized_layer) >>> model = Model([question_layer],[softmax_layer]) >>> model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) >>> print (model.summary()) The following model summary describes the change in flow of model size across the model. The input layer matches the question's dimension and the output matches the answer's dimension: # Model Training >>> quesns_train_2 = quesns_train.astype('float32') >>> answs_train_2 = answs_train.astype('float32') >>> model.fit(quesns_train_2, answs_train_2,batch_size=32,epochs=30, validation_split=0.05) The results are a bit tricky in the following screenshot even though the accuracy is significantly higher. The chatbot model might produce complete nonsense, as most of the words are padding here. The reason? The number of words in this data is less: # Model prediction >>> ans_pred = model.predict(quesns_train_2[0:3]) >>> print (decode(ans_pred[0])) >>> print (decode(ans_pred[1])) The following screenshot depicts the sample output on test data. The output does not seem to make sense, which is an issue with generative models: Our model did not work well in this case, but still some areas of improvement are possible going forward with generative chatbot models. Readers can give it a try: Have a dataset with lengthy questions and answers to catch signals well Create a larger architecture of deep learning models and train over longer iterations Make question-and-answer pairs more generic rather than factoid-based, such as retrieving knowledge and so on, where generative models fail miserably. Here, you saw how to build chatbots using LSTM. You can go ahead and try building one of your own generative chatbots using the example above. If you found this post useful, do check out this book Natural Language Processing with Python Cookbook to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more.  
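As a quick usage sketch, assuming the encode() and decode() helpers and the trained model from the recipe above are still in scope, you can ask the bot a new question as follows. Note that every word in the question must already exist in the training vocabulary, otherwise encode() will fail with a key lookup error:
>>> import numpy as np
>>> new_question = "what is your name"
>>> q = encode(new_question, question_maxlen, vocab_size)
>>> pred = model.predict(np.expand_dims(q, axis=0))  # shape (1, answer_maxlen, vocab_size)
>>> print(decode(pred[0]))  # argmax over the vocabulary, then map indices back to words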

Hands on Table Calculation Techniques with Tableau
Amarabha Banerjee
14 Feb 2018
4 min read
[box type="note" align="" class="" width=""]This article is a book excerpt from the title Mastering Tableau written by David Baldwin. This book will help you master the intricacies of Tableau to create effective data visualizations.[/box] Today, we shall explore Table Calculation Techniques with Tableau and explore a real world example of using these techniques. In this article, we provide a simple schema for understanding table calculations. This schema is communicated via two questions: What is the function? How is the function applied? These two questions are inexorably connected. You cannot reliably apply something until you know what it is. And you cannot get useful results from something until you correctly apply it. The sections below will help you to get a head start into Table calculation techniques and how to use Tableau functions effectively for implementing Table calculation. Basics of Table Calculation Calculated fields can be categorized as: row level, aggregate level, and table level. For row- and aggregate-level calculations, the underlying data source engine does most (if not all) of the computational work, and Tableau merely visualizes the results. For table calculations, Tableau also relies on the underlying data source engine to execute computational tasks; however, after that work is completed and a dataset is returned, Tableau performs additional processing before rendering the results. This can be seen within the following process flow diagram, in the circled part titled Tableau performs additional processing. This is where table calculations are processed. We will continue with a definition: A table calculation is a function performed on a dataset in cache that has been generated as a result of a query from Tableau to the data source. Let's consider a couple of points regarding the dataset in cache mentioned in the preceding definition. First, it is important to understand that this cache is not simply the returned results of a query. Tableau may adjust the returned results. For example,, Tableau may expand the cache via data densification. Secondly, it's important to consider how the cache is structured. Basically, the dataset in cache is a table and, like all tables, is made up of rows and columns. This is particularly important for table calculations since a table calculation may be computed as it moves along the cache. Such a table calculation is directional. Alternatively, a table calculation may be computed based on the entire cache with no directional consideration. Table calculations such as this are non-directional. Directional and nondirectional table calculations will be considered more fully in the following section. Directional and non-directional table calculation functions As of Tableau 10, there are 32 table calculation functions in Tableau. However, many of these are simply variations of a theme; for example, there are five Running functions, including RUNNING_SUM and RUNNING_AVG. If we narrow our consideration to unique groups of table calculations functions, we will discover that there are only 11. The following table shows these 11 functions organized in two categories: As mentioned previously, non-directional table calculation functions operate on the entire cache and thus are not computed based on movement through the cache. For example, the SIZE function doesn't change based on the value of a previous row in the cache. On the other hand, RUNNING_SUM does change based on previous rows in the cache and is therefore considered directional. 
Practical Example
In this example, we'll see directional and non-directional table calculation functions in action:
1. Navigate to https://public.tableau.com/profile/david.baldwin#!/ to locate and download the workbook associated with this chapter.
2. Navigate to the worksheet entitled Directional/Non-Directional.
3. Create the calculated fields used in the view: Lookup, Size, Window Sum, Window Sum w/Start&End, and Running Sum.
4. Place Category and Ship Mode on the Rows shelf.
5. Double-click on Sales, Lookup, Size, Window Sum, Window Sum w/Start&End, and Running Sum to populate the view.
Compare the resulting view with the notes in step 3 of this exercise for better understanding. We discussed techniques for implementing table calculations with Tableau. If you liked our post, be sure to check out Mastering Tableau, which consists of other useful data visualization and data analysis techniques.

Sentiment Analysis of the 2017 US elections on Twitter
Vijin Boricha
14 Feb 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Andrew Morgan and Antoine Amend titled Mastering Spark for Data Science. In this book, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.[/box] In this article, we will learn to extract and analyse large number of tweets related to the 2017 US elections on Twitter. Following the US elections on Twitter On November 8, 2016, American citizens went in millions to polling stations to cast their votes for the next President of the United States. Counting began almost immediately and, although not officially confirmed until sometime later, the forecasted result was well known by the next morning. Let's start our investigation a couple of days before the major event itself, on November 6, 2016, so that we can preserve some context in the run-up. Although we do not exactly know what we will find in advance, we know that Twitter will play an oversized role in the political commentary given its influence in the build-up, and it makes sense to start collecting data as soon as possible. In fact, data scientists may sometimes experience this as a gut feeling - a strange and often exciting notion that compels us to commence working on something without a clear plan or absolute justification, just a sense that it will pay off. And actually, this approach can be vital since, given the normal time required to formulate and realize such a plan and the transient nature of events, a major news event may occur , a new product may have been released, or the stock market may be trending differently; by this time, the original dataset may no longer be available. Acquiring data in stream The first action is to start acquiring Twitter data. As we plan to download more than 48 hours worth of tweets, the code should be robust enough to not fail somewhere in the middle of the process; there is nothing more frustrating than a fatal NullPointerException occurring after many hours of intense processing. We know we will be working on sentiment analysis at some point down the line, but for now we do not wish to over-complicate our code with large dependencies as this can decrease stability and lead to more unchecked exceptions. Instead, we will start by collecting and storing the data and subsequent processing will be done offline on the collected data, rather than applying this logic to the live stream. We create a new Streaming context reading from Twitter 1% firehose using the utility methods. We also use the excellent GSON library to serialize Java class Status (Java class embedding Twitter4J records) to JSON objects. <dependency> <groupId>com.google.code.gson</groupId> <artifactId>gson</artifactId> <version>2.3.1</version> </dependency> We read Twitter data every 5 minutes and have a choice to optionally supply Twitter filters as command line arguments. Filters can be keywords such as Trump, Clinton or #MAGA, #StrongerTogether. However, we must bear in mind that by doing this we may not capture all relevant tweets as we can never be fully up to date with the latest hashtag trends (such as #DumpTrump, #DrainTheSwamp, #LockHerUp, or #LoveTrumpsHate) and many tweets will be overlooked with an inadequate filter, so we will use an empty filter list to ensure that we catch everything. 
val sparkConf = new SparkConf().setAppName("Twitter Extractor") val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Minutes(5)) val filter = args val twitterStream = createTwitterStream(ssc, filter) .mapPartitions { it => val gson = new GsonBuilder().create() it.map { s: Status => Try(gson.toJson(s)).toOption } } We serialize our Status class using the GSON library and persist our JSON objects in HDFS. Note that the serialization occurs within a Try clause to ensure that unwanted exceptions are not thrown. Instead, we return JSON as an optional String: twitterStream .filter(_.isSuccess) .map(_.get) .saveAsTextFiles("/path/to/twitter") Finally, we run our Spark Streaming context and keep it alive until a new president has been elected, no matter what happens! ssc.start() ssc.awaitTermination() Acquiring data in batch Only 1% of tweets are retrieved through the Spark Streaming API, meaning that 99% of records will be discarded. Although able to download around 10 million tweets, we can potentially download more data, but this time only for a selected hashtag and within a small period of time. For example, we can download all tweets related to the #LockHerUp or #BuildTheWall hashtags. The search API For that purpose, we consume Twitter historical data through the twitter4j Java API. This library comes as a transitive dependency of spark-streaming-twitter_2.11. To use it outside of a Spark project, the following maven dependency should be used: <dependency> <groupId>org.twitter4j</groupId> <artifactId>twitter4j-core</artifactId> <version>4.0.4</version> </dependency> We create a Twitter4J client as follows: ConfigurationBuilder builder = new ConfigurationBuilder(); builder.setOAuthConsumerKey(apiKey); builder.setOAuthConsumerSecret(apiSecret); Configuration configuration = builder.build(); AccessToken token = new AccessToken( accessToken, accessTokenSecret ); Twitter twitter = new TwitterFactory(configuration) .getInstance(token); Then, we consume the /search/tweets service through the Query object: Query q = new Query(filter); q.setSince(fromDate); q.setUntil(toDate); q.setCount(400); QueryResult r = twitter.search(q); List<Status> tweets = r.getTweets(); Finally, we get a list of Status objects that can easily be serialized using the GSON library introduced earlier. Rate limit Twitter is a fantastic resource for data science, but it is far from a non-profit organization, and as such, they know how to value and price data. Without any special agreement, the search API is limited to a few days retrospective, a maximum of 180 queries per 15 minute window and 450 records per query. This limit can be confirmed on both the Twitter DEV website (https://dev.twitter.com/rest/public/rate-limits) and from the API itself using the RateLimitStatus class: Map<String, RateLimitStatus> rls = twitter.getRateLimitStatus("search"); System.out.println(rls.get("/search/tweets")); /* RateLimitStatusJSONImpl{remaining=179, limit=180, resetTimeInSeconds=1482102697, secondsUntilReset=873} */ Unsurprisingly, any queries on popular terms, such as #MAGA on November 9, 2016, hit this threshold. To avoid a rate limit exception, we have to page and throttle our download requests by keeping track of the maximum number of tweet IDs processed and monitor our status limit after each search request. 
RateLimitStatus strl = rls.get("/search/tweets"); int totalTweets = 0; long maxID = -1; for (int i = 0; i < 400; i++) { // throttling if (strl.getRemaining() == 0) Thread.sleep(strl.getSecondsUntilReset() * 1000L); Query q = new Query(filter); q.setSince(fromDate); q.setUntil(toDate); q.setCount(100); // paging if (maxID != -1) q.setMaxId(maxID - 1); QueryResult r = twitter.search(q); for (Status s: r.getTweets()) { totalTweets++; if (maxID == -1 || s.getId() < maxID) maxID = s.getId(); writer.println(gson.toJson(s)); } strl = r.getRateLimitStatus(); } With around half a billion tweets a day, it will be optimistic, if not Naive, to gather all USrelated data. Instead, the simple ingest process detailed earlier should be used to intercept tweets matching specific queries only. Packaged as main class in an assembly jar, it can be executed as follows: java -Dtwitter.properties=twitter.properties / -jar trump-1.0.jar #maga 2016-11-08 2016-11-09 / /path/to/twitter-maga.json Here, the twitter.properties file contains your Twitter API keys: twitter.token = XXXXXXXXXXXXXX twitter.token.secret = XXXXXXXXXXXXXX twitter.api.key = XXXXXXXXXXXXXX twitter.api.secret = XXXXXXXXXXXXXX Analysing sentiment After 4 days of intense processing, we extracted around 10 million tweets;representing approximately 30 GB worth of JSON data. Massaging Twitter data One of the key reasons Twitter became so popular is that any message has to fit into a maximum of 140 characters. The drawback is also that every message has to fit into a maximum of 140 characters! Hence, the result is massive increase in the use of abbreviations, acronyms, slang words, emoticons, and hashtags. In this case, the main emotion may no longer come from the text itself, but rather from the emoticons used (http://dl.acm.org/citation.cfm?id=1628969), though some studies showed that the emoticons may sometimes lead to inadequate predictions in sentiments (https://arxiv.org/pdf/1511.02556.pdf). Emojis are even broader than emoticons as they include pictures of animals, transportation, business icons, and so on. Also, while emoticons can easily be retrieved through simple regular expressions, emojis are usually encoded in Unicode and are more difficult to extract without a dedicated library. <dependency> <groupId>com.kcthota</groupId> <artifactId>emoji4j</artifactId> <version>5.0</version> </dependency> The Emoji4J library is easy to use (although computationally expensive) and given some text with emojis/emoticons, we can either codify – replace Unicode values with actual code names – or clean – simply remove any emojis. So firstly, let's clean our text from any junk (special characters, emojis, accents, URLs, and so on) to access plain English content: import emoji4j.EmojiUtils def clean = { var text = tweet.toLowerCase() text = text.replaceAll("https?://S+", "") text = StringUtils.stripAccents(text) EmojiUtils.removeAllEmojis(text) .trim .toLowerCase() .replaceAll("rts+", "") .replaceAll("@[wd-_]+", "") .replaceAll("[^w#[]:'.!?,]+", " ") .replaceAll("s+([:'.!?,])1", "$1") .replaceAll("[st]+", " ") .replaceAll("[rn]+", ". 
") .replaceAll("(w)1{2,}", "$1$1") // avoid looooool .replaceAll("#W", "") .replaceAll("[#':,;.]$", "") .trim } Let's also codify and extract all emojis and emoticons and keep them aside as a list: val eR = "(:w+:)".r def emojis = { var text = tweet.toLowerCase() text = text.replaceAll("https?://S+", "") eR.findAllMatchIn(EmojiUtils.shortCodify(text)) .map(_.group(1)) .filter { emoji => EmojiUtils.isEmoji(emoji) }.map(_.replaceAll("W", "")) .toArray } Writing these methods inside an implicit class means that they can be applied directly a String through a simple import statement. Using the Stanford NLP Our next step is to pass our cleaned text through a Sentiment Annotator. We use the Stanford NLP library for that purpose: <dependency> <groupId>edu.stanford.nlp</groupId> <artifactId>stanford-corenlp</artifactId> <version>3.5.0</version> <classifier>models</classifier> </dependency> <dependency> <groupId>edu.stanford.nlp</groupId> <artifactId>stanford-corenlp</artifactId> <version>3.5.0</version> </dependency> We create a Stanford annotator that tokenizes content into sentences (tokenize), splits sentences (ssplit), tags elements (pos), and lemmatizes each word (lemma) before analyzing the overall sentiment: def getAnnotator: StanfordCoreNLP = { val p = new Properties() p.setProperty( "annotators", "tokenize, ssplit, pos, lemma, parse, sentiment" ) new StanfordCoreNLP(pipelineProps) } def lemmatize(text: String, annotator: StanfordCoreNLP = getAnnotator) = { val annotation = annotator.process(text.clean) val sentences = annotation.get(classOf[SentencesAnnotation]) sentences.flatMap { sentence => sentence.get(classOf[TokensAnnotation]) .map { token => token.get(classOf[LemmaAnnotation]) } .mkString(" ") } val text = "If you're bashing Trump and his voters and calling them a variety of hateful names, aren't you doing exactly what you accuse them?" println(lemmatize(text)) /* if you be bash trump and he voter and call they a variety of hateful name, be not you do exactly what you accuse they */ Any word is replaced by its most basic form, that is, you're is replaced with you be and aren't you doing replaced with be not you do. def sentiment(coreMap: CoreMap) = { coreMap.get(classOf[SentimentCoreAnnotations.ClassName].match { case "Very negative" => 0 case "Negative" => 1 case "Neutral" => 2 case "Positive" => 3 case "Very positive" => 4 case _ => throw new IllegalArgumentException( s"Could not get sentiment for [${coreMap.toString}]" ) } } def extractSentiment(text: String, annotator: StanfordCoreNLP = getSentimentAnnotator) = { val annotation = annotator.process(text) val sentences = annotation.get(classOf[SentencesAnnotation]) val totalScore = sentences map sentiment if (sentences.nonEmpty) { totalScore.sum / sentences.size() } else { 2.0f } } extractSentiment("God bless America. Thank you Donald Trump!") // 2.5 extractSentiment("This is the most horrible day ever") // 1.0 A sentiment spans from Very Negative (0.0) to Very Positive (4.0) and is averaged per sentence. As we do not get more than 1 or 2 sentences per tweet, we expect a very small variance; most of the tweets should be Neutral (around 2.0), with only extremes to be scored (below ~1.5 or above ~2.5). 
Building the Pipeline
For each of our Twitter records (stored as JSON objects), we do the following things: parse the JSON object using the json4s library, extract the date, extract the text, extract the location and map it to a US state, clean the text, extract the emojis, lemmatize the text, and analyze the sentiment. We then wrap all these values into the following Tweet case class:
case class Tweet(
  date: Long,
  body: String,
  sentiment: Float,
  state: Option[String],
  geoHash: Option[String],
  emojis: Array[String]
)
As mentioned in previous chapters, creating a new NLP instance wouldn't scale for each record out of our dataset of 10 million records. Instead, we create only one annotator per Iterator (which means one per partition):
val analyzeJson = (it: Iterator[String]) => {
  implicit val format = DefaultFormats
  val annotator = getAnnotator
  val sdf = new SimpleDateFormat("MMM d, yyyy hh:mm:ss a")
  it.map { tweet =>
    val json = parse(tweet)
    val dateStr = (json \ "createdAt").extract[String]
    val date = Try(
      sdf.parse(dateStr).getTime
    ).getOrElse(0L)
    val text = (json \ "text").extract[String]
    val location = Try(
      (json \ "user" \ "location").extract[String]
    ).getOrElse("").toLowerCase()
    val state = Try {
      location.split("\\s")
        .map(_.toUpperCase())
        .filter { s => states.contains(s) }
        .head
    }.toOption
    val cleaned = text.clean
    Tweet(
      date,
      cleaned.lemmatize(annotator),
      cleaned.sentiment(annotator),
      state,
      text.emojis
    )
  }
}
val tweetJsonRDD = sc.textFile("/path/to/twitter")
val tweetRDD = tweetJsonRDD mapPartitions analyzeJson
tweetRDD.toDF().show(5)
/*
+-------------+---------------+---------+--------+----------+
|         date|           body|sentiment|   state|    emojis|
+-------------+---------------+---------+--------+----------+
|1478557859000|happy halloween|      2.0|    None|   [ghost]|
|1478557860000|slave to the gr|      2.5|    None|        []|
|1478557862000|why be he so pe|      3.0|Some(MD)|        []|
|1478557862000|marcador sentim|      2.0|    None|        []|
|1478557868000|you mindset tow|      2.0|    None|[sparkles]|
+-------------+---------------+---------+--------+----------+
*/
We learnt how to gather and analyze sentiment data from Twitter; to go further, you can read about data acquisition and scraping link-based external data in Mastering Spark for Data Science.

How to deploy RethinkDB using Docker
Vijin Boricha
14 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will help you develop efficient and real-time applications in RethinkDB with ease.[/box] In today’s tutorial, we will learn to install Docker, create an Docker image and deploy RethinkDB using Docker. Your code is not working in Production? But it's working on the QA (quality analysis server)! I am sure you have heard statements like these in your team during the deployment phase. Well no more of that, Docker everything and forget about the infrastructure of different environments, say, QA, Staging and Production, because your code is going to run Docker container not in those machines, hence write once, run everywhere. In this section, we will learn how to use Docker to deploy a RethinkDB Server or PaaS services. I am going to cover a few docker basics too; if you are already aware of them, please skip to the next section. Installing Docker Docker is available for all major platforms, such as, Linux-based distributions, Mac, and Windows. Visit the official website at h t t p s ://w w w . d o c k e r . c o m / and download the package suitable for your platform. We are installing Docker in our machine to create a new Docker image. Docker images are independent of platform and should not be confused with Docker for Mac or Docker for Windows. It's referred to as a Docker client too. Once you have installed the Docker, you need to start the Daemon process first; I am using a Mac so I can view this in the launchpad, as shown here: Upon clicking that, it will open up a nice console showing the Docker official logo and an indication that Docker is successfully booted, as shown in the following screenshot: Now we can begin creating our Docker image that in turn will run RethinkDB. Creating a Docker image For installing our RethinkDB on Ubuntu inside the Docker, we need to install the Ubuntu operating system. Run the following command to install a Ubuntu image from the official Docker hub repository: docker pull ubuntu This will download and install the Ubuntu image in our system. We will later use this Ubuntu image and install our RethinkDB instance; you can choose different operating systems as well. Before going to the Docker configuration code, I would like to point out the steps we require to install RethinkDB on a fresh Ubuntu installation: Update the system Add the RethinkDB repository to the known repository list Install RethinkDB Set the data folder Expose the port We are going to do this using Docker. To create a Docker image, we require Dockerfile. Create a file called Dockerfile with no extension and apply the code shown here: FROM ubuntu:latest # Install RethinkDB. RUN apt-get update && echo "deb http://download.rethinkdb.com/apt `lsb_release -cs` main" > /etc/apt/sources.list.d/rethinkdb.list && apt-get install -y wget && wget -O- http://download.rethinkdb.com/apt/pubkey.gpg | apt-key add - && apt-get update && apt-get install -y rethinkdb python-pip && rm -rf /var/lib/apt/lists/* # Install python driver for rethinkdb RUN pip install rethinkdb # Define mountable directories. VOLUME ["/data"] # Define working directory. WORKDIR /data # Define default command. CMD ["rethinkdb", "--bind", "all"] # Expose ports. 
# - 8080: web UI # - 28015: process # - 29015: cluster EXPOSE 8080 EXPOSE 28015 EXPOSE 29015 The first line is our entry point to the Ubuntu operating system, then we are performing an update of the system and using the installation commands recommended by RethinkDB here: h t t p s ://w w w . r e t h i n k d b . c o m /d o c s /i n s t a l l /u b u n t u /. Once the installation is complete, we install the rethinkdb python driver to perform the import/export operation. The next two commands mount a new volume in Ubuntu and telling RethinkDB to use that volume. The next command runs rethinkdb by binding all the ports and exposing the ports to be used by the client driver and web console. In order to make this a docker image, save the file and run the following command within the project directory: docker build -t docker-rethinkdb. Here, we are building our docker image and giving it a name docker-rethinkdb; upon running this command, Docker will execute the Dockerfile and you're on. The representation of the previous steps is shown here: Once everything works, and I am sure it will, you will see a success message in the console, as shown here: Congratulations! You have successfully created a docker image for RethinkDB. If you want to see your image and its properties, run the following command: docker images And this will list all the images of Docker, as shown in the following screenshot: Awesome! Now let's run it. To access the web portal, we need to run our docker image and bind port 8080 of the docker image to some port of our machine; here is the command to do so: docker run -p 3000:8080 -d docker-rethinkdb As per the command above, -p is used to specify port binding, the first is the target and second port is source, that is, Docker port and -d is used to run it in the background or Daemon. This will run the docker image in the background; to extract more information about this process, we need to run the following command: docker ps This will list all the running images called as a container, along with the information, as shown in the following screenshot: You can also check the logs of specific containers using the following command: docker logs <container id> Now, in order to access the RethinkDB web console from our machine, we need to find out the IP address on which the Docker machine is running. To get that, we need to run the following command: docker-machine ip default This will print out the IP. Copy the IP and hit IP:3000 from the browser to view the RethinkDB web console, as shown here: So we have docker running and accessible from the browser. In order to import and export the data, we need to log in to our Docker image. To do that, run the following command: docker exec -i -t <container-id> /bin/bash This will log in to the docker image running Ubuntu; refer to the following screenshot: You can now run the rethinkdb command to perform the data import to the existing RethinkDB cluster. Deploying the Docker image Almost every PaaS service we have covered in earlier sections provides support for Docker. You can submit your Dockerfile to git and clone it anywhere if you want to create Docker image. You can submit the whole docker image (not Dockerfile) to Dockerhub and pull your docker image directly using the docker pull command, which is no doubt an easy way because you will be directly working on the image running on the server. We covered RethinkDB deployment using Docker and learned how to create our own RethinkDB image. 
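To check that the containerized server is reachable from a client, here is a minimal sketch using the RethinkDB Python driver, which the image already installs. It assumes you also published the driver port when starting the container, for example docker run -p 3000:8080 -p 28015:28015 -d docker-rethinkdb, and it uses the classic import rethinkdb as r API; newer driver releases expose the same calls through a RethinkDB() object instead.
import rethinkdb as r

# Use the address printed by `docker-machine ip default`,
# or localhost if Docker runs natively on your host.
conn = r.connect(host="localhost", port=28015)
print(r.db_list().run(conn))      # list databases on the containerized server
r.db_create("company").run(conn)  # create a database to import data into
conn.close()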
You can learn more about RethinkDB Query Language and Performance Tuning in RethinkDB from this book, Mastering RethinkDB.

How to perform Data exploration with RethinkDB

Vijin Boricha
14 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will let you master the capabilities of RethinkDB and implement them to develop efficient real-time web applications.[/box] In this article, we will learn to do data exploration in RethinkDB with the help of few use case. Executing data exploration use cases We have imported our database i.e. our mock data into our RethinkDB instance. Now it's time to run a use case query and make use of it. But before we do so, we need to figure out one data alteration. We have made a mistake while generating mock data (on purpose actually) we have a $ sign before ctc. Hence, it becomes tough to perform salary-level queries. Before we move ahead, we need to figure out this problem, and basically get rid of the $ sign and update the ctc value to an integer instead of a string. In order to do this, we need to perform the following operation: Traverse through each document in the database Split the ctc string into two parts, containing $ and the other value Update the ctc value in the document with a new data type and value Since we require the chaining of queries, I have written a small snippet in Node.js to achieve the previous scenario as follows: var rethinkdb = require('rethinkdb'); var connection = null; rethinkdb.connect({host : 'localhost', port : 28015},function(err,conn) { if(err) { throw new Error('Connection error'); } connection = conn; rethinkdb.db("company").table("employees") .run(connection,function(err,cursor) { if(err) { throw new Error(err); } cursor.each(function(err,data) { data.ctc = parseInt(data.ctc.split("$")[1]); rethinkdb.db("company").table("employees") .get(data.id) .update({ctc : data.ctc}) .run(connection,function(err,response) { if(err) { throw new Error(err); } console.log(response); }); }); }); }); As you can see in the preceding code, we first fetch all the documents and traverse them using cursor, one document at a time. We use the split() method as a $ separator and convert the outcome, which is salary, into an integer using the parseInt() method. We update each document at a time using the id value of the document: After selecting all the documents again, we can see an updated ctc value as an integer, as shown in the following figure: This is one of the practical examples where we perform some data manipulation before moving ahead with complex queries. Similarly, you can look for errors such as blank spaces in a specific field or duplicate elements in your record. Finding duplicate elements We can use distinct() to find out whether there is any duplicate element present in the table. Say you have 1,000 rows and there are 10 duplicates. In order to determine that, we just need to find out the unique rows (of course excluding the ID key, as that's unique by nature). Here is the query for the same: r.db("company").table('employees').without('id').distinct().count() As shown in the following screenshot, this query returns the count of unique rows, which should be 1,000 if there are no duplicates: This implies that our records contain no duplicate documents. Finding the list of countries We can write a query to find all the countries we have in our record and also use distinct again by just selecting the country field. 
Finding the list of countries

We can write a query to find all the countries present in our records, again using distinct, this time selecting only the country field. Here is the query:

r.db("company").table('employees')("country").distinct()

As shown in this image, we have 124 countries in our records:

Finding the top 10 employees with the highest salary

In this use case, we need to evaluate all the records and find the top 10 employees ordered from highest to lowest pay. Here is the query for the same:

r.db("company").table("employees").orderBy(r.desc("ctc")).limit(10)

Here we are using orderBy, which by default orders the records in ascending order. To get the highest pay in the first document, we need descending ordering, which we achieve using the desc() ReQL command. As shown in the following image, the query returns 10 rows:

You can modify the same query by limiting the number of results to one to get the highest-paid employee.

Displaying employee records with a specific name and location

To extract such records from our table, we need to filter on the "first_name" and "country" fields. Here is the query to return those records:

r.db("company").table('employees').filter({"first_name" : "John", "country" : "Sweden"})

We are just performing a basic filter comparing both fields. ReQL makes such queries easy to express thanks to its chaining feature. The output of the preceding query is shown here:

To summarize, we looked at a few use cases where we had to perform alteration and filtering of records in order to meet an exploration task, such as stripping the $ sign from ctc, or converting base-256 IP addresses into base-10 values and then querying them. We also covered a few general use cases in order to get a practical feel for ReQL.

If you are interested in learning about the RethinkDB Query Language, extending RethinkDB, and more, you may check out this book, Mastering RethinkDB.
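As a closing aside, the client-side Node.js clean-up loop shown earlier can also be expressed as a single server-side update. Here is a minimal sketch with the RethinkDB Python driver, under the assumption that ctc is still stored as a '$'-prefixed string; this is an alternative I am adding for illustration, not part of the original walkthrough:

import rethinkdb as r

conn = r.connect(host='localhost', port=28015)

# One update query: split each ctc on '$', take the numeric part,
# and store it back as a number -- no client-side cursor loop needed.
result = (r.db('company').table('employees')
           .update(lambda emp: {
               'ctc': emp['ctc'].split('$')[1].coerce_to('number')
           })
           .run(conn))

print(result)   # reports how many documents were replaced
conn.close()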

Using Logistic regression to predict market direction in algorithmic trading

Richa Tripathi
14 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Dr. Param Jeet and Prashant Vats titled Learning Quantitative Finance with R. This book will help you learn about various algorithmic trading techniques and ways to optimize them using the tools available in R.[/box] In this tutorial we will learn how logistic regression is used to forecast market direction. Market direction is very important for investors or traders. Predicting market direction is quite a challenging task as market data involves lots of noise. The market moves either upward or downward and the nature of market movement is binary. A logistic regression model help us to fit a model using binary behavior and forecast market direction. Logistic regression is one of the probabilistic models which assigns probability to each event. We are going to use the quantmod package. The next three commands are used for loading the package into the workspace, importing data into R from the yahoo repository and extracting only the closing price from the data: >library("quantmod") >getSymbols("^DJI",src="yahoo") >dji<- DJI[,"DJI.Close"] The input data to the logistic regression is constructed using different indicators, such as moving average, standard deviation, RSI, MACD, Bollinger Bands, and so on, which has some predictive power in market direction, that is, Up or Down. These indicators can be constructed using the following commands: >avg10<- rollapply(dji,10,mean) >avg20<- rollapply(dji,20,mean) >std10<- rollapply(dji,10,sd) >std20<- rollapply(dji,20,sd) >rsi5<- RSI(dji,5,"SMA") >rsi14<- RSI(dji,14,"SMA") >macd12269<- MACD(dji,12,26,9,"SMA") >macd7205<- MACD(dji,7,20,5,"SMA") >bbands<- BBands(dji,20,"SMA",2) The following commands are to create variable direction with either Up direction (1) or Down direction (0). Up direction is created when the current price is greater than the 20 days previous price and Down direction is created when the current price is less than the 20 days previous price: >direction<- NULL >direction[dji> Lag(dji,20)] <- 1 >direction[dji< Lag(dji,20)] <- 0 Now we have to bind all columns consisting of price and indicators, which is shown in the following command: >dji<- cbind(dji,avg10,avg20,std10,std20,rsi5,rsi14,macd12269,macd7205,bbands,dire ction) The dimension of the dji object can be calculated using dim(). I used dim() over dji and saved the output in dm(). dm() has two values stored: the first value is the number of rows and the second value is the number of columns in dji. Column names can be extracted using colnames(). The third command is used to extract the name for the last column. Next I replaced the column name with a particular name, Direction: >dm<- dim(dji) >dm [1] 2493   16 >colnames(dji)[dm[2]] [1] "..11" >colnames(dji)[dm[2]] <- "Direction" >colnames(dji)[dm[2]] [1] "Direction" We have extracted the Dow Jones Index (DJI) data into the R workspace. Now, to implement logistic regression, we should divide the data into two parts. The first part is in- sample data and the second part is out-sample data. In-sample data is used for the model building process and out-sample data is used for evaluation purposes. This process also helps to control the variance and bias in the model. 
The next four lines set the in-sample start, in-sample end, out-sample start, and out-sample end dates:

>issd<- "2010-01-01"
>ised<- "2014-12-31"
>ossd<- "2015-01-01"
>osed<- "2015-12-31"

The following two commands get the row numbers for these dates; that is, the variable isrow holds the row numbers for the in-sample date range and osrow holds the row numbers for the out-sample date range:

>isrow<- which(index(dji) >= issd & index(dji) <= ised)
>osrow<- which(index(dji) >= ossd & index(dji) <= osed)

The variables isdji and osdji are the in-sample and out-sample datasets respectively:

>isdji<- dji[isrow,]
>osdji<- dji[osrow,]

If you look at the in-sample data, that is, isdji, you will realize that the scaling of each column is different: a few columns are on a scale of 100, a few others are on a scale of 10,000, and a few others are on a scale of 1. Differences in scaling can put your results in trouble, as higher weights get assigned to variables on larger scales. So before moving ahead, you should standardize the dataset. I will use the following formula:

standardized data = (data - mean) / standard deviation        (6.1)

The mean and standard deviation of each column, computed using apply(), can be seen here:

>isme<- apply(isdji,2,mean)
>isstd<- apply(isdji,2,sd)

A matrix of ones, with dimensions equal to the in-sample data, is generated using the following command and will be used for the normalization:

>isidn<- matrix(1,dim(isdji)[1],dim(isdji)[2])

Use formula (6.1) to standardize the data:

>norm_isdji<- (isdji - t(isme*t(isidn))) / t(isstd*t(isidn))

The preceding line also standardizes the direction column, that is, the last column. We don't want direction to be standardized, so I replace the last column again with the variable direction over the in-sample data range:

>dm<- dim(isdji)
>norm_isdji[,dm[2]] <- direction[isrow]

Now we have created all the data required for model building. Next we build a logistic regression model, which will help us predict market direction based on the in-sample data. First, I created a formula which has Direction as the dependent variable and all other columns as independent variables. Then I used a generalized linear model, that is, glm(), to fit a model given the formula, family, and dataset:

>formula<- paste("Direction ~ .",sep="")
>model<- glm(formula,family="binomial",norm_isdji)

A summary of the model can be viewed using the following command:

>summary(model)

Next, use predict() on the same dataset to estimate the fitted values:

>pred<- predict(model,norm_isdji)

Once you have the fitted values, convert them to probabilities using the following command. This transforms the output into probabilistic form, in the range [0,1]:

>prob<- 1 / (1+exp(-(pred)))

The figure shown below is plotted using the following commands.
The first line of the code divides the figure into two rows and one column, where the first plot shows the model's predictions and the second shows the probabilities:

>par(mfrow=c(2,1))
>plot(pred,type="l")
>plot(prob,type="l")

head() can be used to look at the first few values of the variable:

>head(prob)
2010-01-04 2010-01-05 2010-01-06 2010-01-07
 0.8019197  0.4610468  0.7397603  0.9821293

The following figure shows the variable pred defined above, which is a real number, and its conversion into a value between 0 and 1 representing the probability, that is, prob, using the preceding transformation:

Figure 6.1: Prediction and probability distribution of DJI

As probabilities are in the range (0,1), so is our vector prob. Now, to classify them into one of the two classes, I assign Up direction (1) when prob is greater than 0.5 and Down direction (0) when prob is less than or equal to 0.5. This assignment is done using the following commands. prob > 0.5 generates TRUE for the points where it holds, and pred_direction[prob > 0.5] assigns 1 to all such points. Similarly, the next statement assigns 0 when the probability is less than or equal to 0.5:

>pred_direction<- NULL
>pred_direction[prob> 0.5] <- 1
>pred_direction[prob<= 0.5] <- 0

Once we have the predicted direction, we should check model accuracy: how often our model predicted Up as Up and Down as Down. There will be some cases where the prediction is the opposite of the actual direction, such as predicting Down when it is actually Up, and vice versa. We can use the caret package and confusionMatrix(), which returns a matrix as output. All diagonal elements are correct predictions and off-diagonal elements are errors, or wrong predictions. One should aim to reduce the off-diagonal elements of the confusion matrix:

>install.packages('caret')
>library(caret)
>matrix<- confusionMatrix(pred_direction,norm_isdji$Direction)
>matrix
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 362  35
         1  42 819

               Accuracy : 0.9388
                 95% CI : (0.9241, 0.9514)
    No Information Rate : 0.6789
    P-Value [Acc > NIR] : <2e-16
                  Kappa : 0.859
 Mcnemar's Test P-Value : 0.4941
            Sensitivity : 0.8960
            Specificity : 0.9590
         Pos Pred Value : 0.9118
         Neg Pred Value : 0.9512
             Prevalence : 0.3211
         Detection Rate : 0.2878
   Detection Prevalence : 0.3156
      Balanced Accuracy : 0.9275

The preceding table shows that we got roughly 94% correct predictions, as 362 + 819 = 1181 predictions are correct out of 1258 (the sum of all four values). A prediction rate above 80% on in-sample data is generally considered good; however, 80% is not a fixed threshold, and you have to judge this value based on the dataset and the industry. Now that you have implemented the logistic regression model, which predicted about 94% of directions correctly, you need to test its generalization power. You should test this model on the out-sample data and check its accuracy. The first step is to standardize the out-sample data using formula (6.1).
Here the mean and standard deviation should be the same as those used for the in-sample normalization:

>osidn<- matrix(1,dim(osdji)[1],dim(osdji)[2])
>norm_osdji<- (osdji - t(isme*t(osidn))) / t(isstd*t(osidn))
>norm_osdji[,dm[2]] <- direction[osrow]

Next we use predict() on the out-sample data and use this value to calculate the probabilities:

>ospred<- predict(model,norm_osdji)
>osprob<- 1 / (1+exp(-(ospred)))

Once the probabilities for the out-sample data are determined, assign them to the Up or Down class using the following commands. confusionMatrix() here generates a matrix for the out-sample data:

>ospred_direction<- NULL
>ospred_direction[osprob> 0.5] <- 1
>ospred_direction[osprob<= 0.5] <- 0
>osmatrix<- confusionMatrix(ospred_direction,norm_osdji$Direction)
>osmatrix
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 115  26
         1  12  99

               Accuracy : 0.8492
                 95% CI : (0.7989, 0.891)

This shows about 85% accuracy on the out-sample data. A realistic trading model would also account for trading costs and market slippage, which decrease the winning odds significantly.

We presented advanced techniques implemented in capital markets and also learned how a logistic regression model uses binary behavior to forecast market direction.

If you enjoyed this excerpt, check out the book Learning Quantitative Finance with R to dive deep into the vast world of algorithmic and machine-learning-based trading.
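As a closing aside (my own addition, not part of the original R workflow), the probability cut-off and the confusion-matrix arithmetic used above are easy to reproduce in any language. Here is a minimal Python sketch on made-up numbers, just to make the 0.5 threshold and the accuracy calculation explicit:

import numpy as np

# Hypothetical fitted linear predictors and actual directions, for illustration only.
pred = np.array([1.4, -0.3, 2.1, -1.7, 0.2, -0.9])
actual = np.array([1, 0, 1, 0, 1, 0])

# Same transformation as the R code: the logistic function maps predictions to (0, 1).
prob = 1.0 / (1.0 + np.exp(-pred))

# Classify with the 0.5 cut-off used in the chapter.
predicted = (prob > 0.5).astype(int)

# Confusion-matrix cells: diagonal = correct, off-diagonal = wrong.
tp = int(np.sum((predicted == 1) & (actual == 1)))
tn = int(np.sum((predicted == 0) & (actual == 0)))
fp = int(np.sum((predicted == 1) & (actual == 0)))
fn = int(np.sum((predicted == 0) & (actual == 1)))

accuracy = (tp + tn) / len(actual)
print(tp, tn, fp, fn, accuracy)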