How-To Tutorials


How to use R to boost your Data Model

Amey Varangaonkar
15 Feb 2018
8 min read
[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Statistics for Data Science, written by James D. Miller. This book is a comprehensive primer on the basic concepts of statistics and their application in different data science tasks.[/box] In this article, we explain the implementation of boosting - a popular technique used to improve the performance of a data model - using the popular R programming language. We will take a high-level look at a thought-provoking prediction problem drawn from Mastering Predictive Analytics with R, Second Edition, by James D. Miller and Rui Miguel Forte. Here, an original example of patterns made by radiation on a telescope camera are analyzed in an attempt to predict whether a certain pattern came from gamma rays leaking into the atmosphere or from regular background radiation. Gamma rays leave distinctive elliptical patterns and so we can create a set of features to describe these. The dataset used is the MAGIC Gamma Telescope Data Set, hosted by the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope This data consists of 19,020 observations, holding the following list of attributes: Prepping the data First, various steps need to be performed on our example data. The data is first loaded into an R data frame object named magic, recoding the CLASS output variable to use classes 1 and -1 for gamma rays and background radiation respectively: > magic <- read.csv("magic04.data", header = FALSE) > names(magic) <- c("FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1",  "FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST", "CLASS") > magic$CLASS <- as.factor(ifelse(magic$CLASS =='g', 1, -1)) Next, the data is split into two files: a training data and a test data frame using an 80-20 split: > library(caret) > set.seed(33711209) > magic_sampling_vector <- createDataPartition(magic$CLASS,  p = 0.80, list = FALSE) > magic_train <- magic[magic_sampling_vector, 1:10] > magic_train_output <- magic[magic_sampling_vector, 11] > magic_test <- magic[-magic_sampling_vector, 1:10] > magic_test_output <- magic[-magic_sampling_vector, 11] The model used for boosting is a simple multilayer perceptron with a single hidden layer leveraging R's nnet package. Neural networks, often produce higher accuracy when inputs are normalized, so, in this example, before training any models, this preprocessing is performed: > magic_pp <- preProcess(magic_train, method = c("center",  "scale")) > magic_train_pp <- predict(magic_pp, magic_train) > magic_train_df_pp <- cbind(magic_train_pp,  CLASS = magic_train_output) > magic_test_pp <- predict(magic_pp, magic_test) Training Boosting is designed to work best with weak learners, so a very small number of hidden neurons in the model's hidden layer are used. Concretely, we will begin with the simplest possible multilayer perceptron that uses a single hidden neuron. To understand the effect of using boosting, a baseline performance is established by training a single neural network (and measuring its performance). This is to accomplish the following: > library(nnet) > n_model <- nnet(CLASS ~ ., data = magic_train_df_pp, size = 1) > n_test_predictions <- predict(n_model, magic_test_pp,  type = "class") > (n_test_accuracy <- mean(n_test_predictions ==  magic_test_output))  [1] 0.7948988 This establishes that we have a baseline accuracy of around 79.5 percent. Not too bad, but can boost to improve upon this score? 
To that end, the function AdaBoostNN(), shown as follows, is used. This function takes a data frame, the name of the output variable, the number of single-hidden-layer neural network models to be built, and finally the number of hidden units these neural networks will have. The function then implements the AdaBoost algorithm and returns a list of models together with their corresponding weights. Here is the function:

AdaBoostNN <- function(training_data, output_column, M, hidden_units) {
  require("nnet")
  models <- list()
  alphas <- list()
  n <- nrow(training_data)
  model_formula <- as.formula(paste(output_column, '~ .', sep = ''))
  w <- rep((1/n), n)
  for (m in 1:M) {
    model <- nnet(model_formula, data = training_data,
                  size = hidden_units, weights = w)
    models[[m]] <- model
    predictions <- as.numeric(predict(
      model, training_data[, -which(names(training_data) == output_column)],
      type = "class"))
    errors <- predictions != training_data[, output_column]
    error_rate <- sum(w * as.numeric(errors)) / sum(w)
    alpha <- 0.5 * log((1 - error_rate) / error_rate)
    alphas[[m]] <- alpha
    temp_w <- mapply(function(x, y)
      if (y) { x * exp(alpha) } else { x * exp(-alpha) },
      w, errors)
    w <- temp_w / sum(temp_w)
  }
  return(list(models = models, alphas = unlist(alphas)))
}

The preceding function uses the following logic:

1. First, initialize empty lists of models and model weights (alphas). Compute the number of observations in the training data, storing this in the variable n.
2. The name of the output column provided is then used to create a formula that describes the neural network to be built. In the dataset used here, this formula will be CLASS ~ ., meaning that the neural network will compute CLASS as a function of all the other columns, used as input features.
3. Next, initialize the weights vector and define a loop that will run for M iterations in order to build M models.
4. In every iteration, the first step is to use the current setting of the weights vector to train a neural network with as many hidden units as specified in the input, hidden_units.
5. Then, compute a vector of predictions that the model generates on the training data using the predict() function. By comparing these predictions to the output column of the training data, calculate the errors that the current model makes on the training data.
6. This then allows the computation of the error rate, which in turn is used to compute the weight (alpha) of the current model. Finally, the observation weights to be used in the next iteration of the loop are updated according to whether each observation was correctly classified. The weight vector is then normalized and we are ready to begin the next iteration!
7. After completing M iterations, output a list of models and their corresponding model weights.
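In equation form, the quantities computed inside this loop are the standard AdaBoost updates. Writing $h_m$ for the $m$-th network's prediction and $y_i \in \{-1, 1\}$ for the true class:

$$\epsilon_m = \frac{\sum_i w_i\,\mathbf{1}[h_m(x_i) \neq y_i]}{\sum_i w_i}, \qquad \alpha_m = \frac{1}{2}\log\frac{1 - \epsilon_m}{\epsilon_m}$$

Each misclassified observation then has its weight multiplied by $e^{\alpha_m}$ and each correctly classified one by $e^{-\alpha_m}$, after which the weights are renormalized to sum to one. The final ensemble prediction used later is $\hat{y}(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)$.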
Ready for boosting

There is now a function able to train our ensemble classifier using AdaBoost, but we also need a function to make the actual predictions. This function will take in the output list produced by our training function, AdaBoostNN(), along with a test dataset. This function is AdaBoostNN.predict() and it is shown as follows:

AdaBoostNN.predict <- function(ada_model, test_data) {
  models <- ada_model$models
  alphas <- ada_model$alphas
  prediction_matrix <- sapply(models, function(x)
    as.numeric(predict(x, test_data, type = "class")))
  weighted_predictions <- t(apply(prediction_matrix, 1, function(x)
    mapply(function(y, z) y * z, x, alphas)))
  final_predictions <- apply(weighted_predictions, 1,
                             function(x) sign(sum(x)))
  return(final_predictions)
}

This function first extracts the models and the model weights from the list produced by the training function. A matrix of predictions is then created, where each column corresponds to the vector of predictions made by a particular model. Thus, there will be as many columns in this matrix as models used for boosting.

We then multiply the predictions produced by each model by their corresponding model weight. For example, every prediction from the first model is in the first column of the prediction matrix and has its value multiplied by the first model weight α1. Lastly, the matrix of weighted predictions is reduced into a single vector of predictions by summing the weighted predictions for each observation and taking the sign of the result. This vector of predictions is then returned by the function.

As an experiment, we will train ten neural network models with a single hidden unit and see if boosting improves accuracy:

> ada_model <- AdaBoostNN(magic_train_df_pp, 'CLASS', 10, 1)
> predictions <- AdaBoostNN.predict(ada_model, magic_test_pp)
> mean(predictions == magic_test_output)
[1] 0.804365

We see that in this example, boosting ten models yields only a marginal improvement in accuracy, but perhaps training more models will make more of a difference.

What we learned

From the preceding example, you may conclude that, for the neural networks with one hidden unit, accuracy improves as the number of boosting models increases, but after 100 models this tapers off and is actually slightly lower at 200 models. The improvement over the baseline of a single model is substantial for these networks. When we increase the complexity of our learner by using a hidden layer with three hidden neurons, we get a much smaller improvement in performance. At 200 models, both ensembles perform at a similar level, indicating that, at this point, our accuracy is being limited by the type of model trained.
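The comparison discussed above can be reproduced with a short loop. The following sketch is not part of the original excerpt; it simply re-runs the training function for a few ensemble sizes and both weak-learner complexities and prints the test accuracy for each combination. Exact figures will vary with the random initialization of nnet, and training 200 networks takes a while:

# Illustrative sketch (not from the book): compare ensemble sizes and
# weak-learner complexities on the held-out test set.
for (hidden in c(1, 3)) {
  for (M in c(10, 50, 100, 200)) {
    boosted <- AdaBoostNN(magic_train_df_pp, 'CLASS', M, hidden)
    preds   <- AdaBoostNN.predict(boosted, magic_test_pp)
    acc     <- mean(preds == magic_test_output)
    cat(sprintf("hidden units = %d, models = %3d, test accuracy = %.4f\n",
                hidden, M, acc))
  }
}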
If you found this article useful, make sure to check out the book Statistics for Data Science for interesting statistical techniques and their implementation in R.


Why choose R for your data mining project

Fatema Patrawala
15 Feb 2018
9 min read
[box type="note" align="" class="" width=""]Our article is an excerpt taken from the book R Data Mining, written by Andrea Cirillo. If you are a budding data scientist or a data analyst with basic knowledge of R, and you want to get into the intricacies of data mining in a practical manner, be sure to check out this book.[/box] In today’s post, we will analyze R's strengths, and understand why it is a savvy idea to learn this programming language for data mining. R's strengths You know that R is really popular, but why? R is not the only data analysis language out there, and neither is it the oldest one; so why is it so popular? If looking at the root causes of R's popularity, we definitely have to mention these three: Open source inside Plugin ready Data visualization friendly Open source inside One of the main reasons the adoption of R is spreading is its open source nature. R binary code is available for everyone to download, modify, and share back again (only in an open source way). Technically, R is released with a GNU general public license, meaning that you can take it and use it for whatever purpose; but you have to share every derivative with a GNU general public license as well. These attributes fit well for almost every target user of a statistical analysis language: Academic user: Knowledge sharing is a must for an academic environment, and having the ability to share work without the worry of copyright and license questions makes R very practical for academic research purposes Business user: Companies are always worried about budget constraints; having professional statistical analysis software at their disposal for free sounds like a dream come true Private user: This user merges together both of the benefits already mentioned, because they will find it great to have a free instrument with which to learn and share their own statistical analyses Plugin ready You could imagine the R language as an expandable board game. You know, games like 7 Wonders or Carcassonne, with a base set of characters and places and further optional places and characters, increasing the choices at your disposal and maximizing the fun. The R language can be compared to this kind of game. There is a base version of R, containing a group of default packages that are delivered along with the standard version of the software (you can skip to the Installing R and writing R code section for more on how to obtain and install it). The functionalities available through the base version are mainly related to file system manipulation, statistical analysis, and data Visualization. While this base version is regularly maintained and updated by the R core team, virtually every R user can add further new functionalities to those available within the package, developing and sharing custom packages. This is basically how the package development and sharing flow works: The R user develops a new package, for example a package introducing a new machine learning algorithm exposed within a freshly published academic paper. The user submits the package to the CRAN repository or a similar repository. The Comprehensive R Archive Network (CRAN) is the official repository for R related documents and packages. Every R user can gain access to the additional features introduced with any given package, installing and loading them into their R environment. 
If the package has been submitted to CRAN, installing and loading it requires just the two following lines of R code (similar commands are available for alternative repositories such as Bioconductor):

install.packages("ggplot2")
library(ggplot2)

As you can see, this is a really convenient and effective way to expand R's functionalities, and you will soon see how wide the range of functionalities added through packages developed by R users is. More than 9,000 packages are available on CRAN, and this number is sure to increase further, making more and more additional features available to the R community.

Data visualization friendly

As a discipline, data visualization encompasses all of the principles and techniques employable to effectively display the information and messages contained within a set of data. Since we are living in an information-heavy age, the ability to effectively and concisely communicate articulated and complex messages through data visualization is a core asset for any professional. This is exactly why R is experiencing a great response in academic and professional fields: the data visualization capabilities of R place it at the cutting edge of these fields.

R has been noticed for its amazing data visualization features right from its beginning; when some of its peers were still drawing x axes built by aggregating + signs, R was already able to produce astonishing 3D plots. Nevertheless, a major improvement in R's data visualization capabilities came when Auckland's Hadley Wickham developed the famous ggplot2 package based on The Grammar of Graphics, introducing into the R world an organic framework for data visualization tasks. This package alone gave the R community a highly flexible way of producing almost every kind of data visualization, and it was also designed as an expandable tool, so that new data visualization techniques can be incorporated as soon as they emerge. Finally, ggplot2 gives you the ability to highly customize your plots, adding every kind of graphical or textual annotation to them.

Nowadays, R is being used by the biggest tech companies, such as Facebook and Google, and by widely circulated publications such as the Economist and the New York Times, to visualize their data and convey their information to their stakeholders and readers.
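To give a flavour of the layered approach just described, here is a minimal ggplot2 sketch. It is not taken from the book; it simply uses R's built-in mtcars dataset as an illustrative example of building a plot from a base mapping plus layers:

# Minimal ggplot2 sketch (illustrative only, using the built-in mtcars data).
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +  # base mapping
  geom_point(size = 3) +                                      # scatter layer
  geom_smooth(method = "lm", se = FALSE) +                    # per-group linear fit
  labs(title = "Fuel economy versus weight",
       x = "Weight (1,000 lbs)", y = "Miles per gallon",
       colour = "Cylinders")

Each + adds a layer or annotation on top of the base mapping, which is the "grammar" idea that makes the package so flexible.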
To sum all this up: should you invest your precious time in learning R? If you are a professional or a student who could gain an advantage from knowing effective and cutting-edge techniques to manipulate, model, and present data, I can only give you a positive opinion: yes. You should definitely learn R, and consider it a long-term investment, since the points of strength we have seen place it in a great position to further expand its influence in the coming years in every industry and academic field.

Engaging with the community to learn R

Now that we are aware of R's popularity, we need to engage with the community to take advantage of it. We will look at alternative and non-exclusive ways of engaging with the community:

- Employing community-driven learning material
- Asking for help from the community
- Staying ahead of language developments

Employing community-driven learning material: There are two main kinds of R learning materials developed by the community:

- Papers, manuals, and books
- Online interactive courses

Papers, manuals, and books: The first kind is certainly the more traditional one, but you shouldn't neglect it, since these learning materials are always able to give you a more organic and systematic understanding of the topics they treat. You can find a lot of free material online in the form of papers, manuals, and books. Let me point out the more useful ones:

- Advanced R
- R for Data Science
- Introduction to Statistical Learning
- OpenIntro Statistics
- The R Journal

Online interactive courses: This is probably the most common learning material nowadays. You can find different platforms delivering good content on the R language, the most famous of which are probably DataCamp, Udemy, and Packt itself. What all of them share is a practical and interactive approach that lets you learn the topic directly, applying it through exercises rather than passively watching someone explain theory.

Asking for help from the community: As soon as you start writing your first lines of R code, and perhaps even before you actually start writing it, you will come up with questions related to your work. The best thing you can do when this happens is to turn to the community to solve those questions. You will probably not be the first one to come up with that question, so you should first of all look online for previous answers to it. Where should you look for answers? You can look everywhere, but most of the time you will find the answer you are looking for on one of the following (listed by the probability of finding the answer there):

- Stack Overflow
- The R-help mailing list
- R package documentation

I wouldn't suggest you look for answers on Twitter, G+, and similar networks, since they were not conceived to handle these kinds of processes and you will expose yourself to the peril of reading answers that are out of date, or simply incorrect, because no review system is in place. If you are asking an innovative question never previously asked by anyone, first of all, congratulations! In that happy circumstance, you can ask your question in the same places where you previously looked for answers.

Staying ahead of language developments: The R language landscape is constantly changing, thanks to the contributions of many enthusiastic users who take it a step further every day. How can you stay ahead of those changes? This is where social networks come in handy. Following the #rstats hashtag on Twitter, Google+ groups, and similar places will give you the pulse of the language. Moreover, you will find the R-bloggers aggregator really useful; it delivers a daily newsletter comprising the R-related blog posts published the previous day. Finally, annual R conferences and similar occasions constitute a great opportunity to get in touch with the most renowned R experts, gaining useful insights and hearing inspiring talks about the future of the language.

To summarize, we looked at why to choose R as your programming language for data mining and how to engage with the R community.
If you found this post useful, you may further check out the book R Data Mining to leverage data mining techniques across many different industries, including finance, medicine, scientific research, and more.


Sentiment Analysis of the 2017 US elections on Twitter

Vijin Boricha
14 Feb 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Andrew Morgan and Antoine Amend titled Mastering Spark for Data Science. In this book, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.[/box] In this article, we will learn to extract and analyse large number of tweets related to the 2017 US elections on Twitter. Following the US elections on Twitter On November 8, 2016, American citizens went in millions to polling stations to cast their votes for the next President of the United States. Counting began almost immediately and, although not officially confirmed until sometime later, the forecasted result was well known by the next morning. Let's start our investigation a couple of days before the major event itself, on November 6, 2016, so that we can preserve some context in the run-up. Although we do not exactly know what we will find in advance, we know that Twitter will play an oversized role in the political commentary given its influence in the build-up, and it makes sense to start collecting data as soon as possible. In fact, data scientists may sometimes experience this as a gut feeling - a strange and often exciting notion that compels us to commence working on something without a clear plan or absolute justification, just a sense that it will pay off. And actually, this approach can be vital since, given the normal time required to formulate and realize such a plan and the transient nature of events, a major news event may occur , a new product may have been released, or the stock market may be trending differently; by this time, the original dataset may no longer be available. Acquiring data in stream The first action is to start acquiring Twitter data. As we plan to download more than 48 hours worth of tweets, the code should be robust enough to not fail somewhere in the middle of the process; there is nothing more frustrating than a fatal NullPointerException occurring after many hours of intense processing. We know we will be working on sentiment analysis at some point down the line, but for now we do not wish to over-complicate our code with large dependencies as this can decrease stability and lead to more unchecked exceptions. Instead, we will start by collecting and storing the data and subsequent processing will be done offline on the collected data, rather than applying this logic to the live stream. We create a new Streaming context reading from Twitter 1% firehose using the utility methods. We also use the excellent GSON library to serialize Java class Status (Java class embedding Twitter4J records) to JSON objects. <dependency> <groupId>com.google.code.gson</groupId> <artifactId>gson</artifactId> <version>2.3.1</version> </dependency> We read Twitter data every 5 minutes and have a choice to optionally supply Twitter filters as command line arguments. Filters can be keywords such as Trump, Clinton or #MAGA, #StrongerTogether. However, we must bear in mind that by doing this we may not capture all relevant tweets as we can never be fully up to date with the latest hashtag trends (such as #DumpTrump, #DrainTheSwamp, #LockHerUp, or #LoveTrumpsHate) and many tweets will be overlooked with an inadequate filter, so we will use an empty filter list to ensure that we catch everything. 
val sparkConf = new SparkConf().setAppName("Twitter Extractor")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Minutes(5))

val filter = args

val twitterStream = createTwitterStream(ssc, filter)
  .mapPartitions { it =>
    val gson = new GsonBuilder().create()
    it.map { s: Status =>
      Try(gson.toJson(s)).toOption
    }
  }

We serialize our Status class using the GSON library and persist our JSON objects in HDFS. Note that the serialization occurs within a Try clause to ensure that unwanted exceptions are not thrown. Instead, we return the JSON as an optional String:

twitterStream
  .filter(_.isSuccess)
  .map(_.get)
  .saveAsTextFiles("/path/to/twitter")

Finally, we run our Spark Streaming context and keep it alive until a new president has been elected, no matter what happens!

ssc.start()
ssc.awaitTermination()

Acquiring data in batch

Only 1% of tweets are retrieved through the Spark Streaming API, meaning that 99% of records are discarded. Although we were able to download around 10 million tweets this way, we can potentially download more data, but this time only for a selected hashtag and within a small period of time. For example, we can download all tweets related to the #LockHerUp or #BuildTheWall hashtags.

The search API

For that purpose, we consume Twitter historical data through the twitter4j Java API. This library comes as a transitive dependency of spark-streaming-twitter_2.11. To use it outside of a Spark project, the following maven dependency should be used:

<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-core</artifactId>
  <version>4.0.4</version>
</dependency>

We create a Twitter4J client as follows:

ConfigurationBuilder builder = new ConfigurationBuilder();
builder.setOAuthConsumerKey(apiKey);
builder.setOAuthConsumerSecret(apiSecret);
Configuration configuration = builder.build();

AccessToken token = new AccessToken(
  accessToken,
  accessTokenSecret
);

Twitter twitter = new TwitterFactory(configuration)
  .getInstance(token);

Then, we consume the /search/tweets service through the Query object:

Query q = new Query(filter);
q.setSince(fromDate);
q.setUntil(toDate);
q.setCount(400);

QueryResult r = twitter.search(q);
List<Status> tweets = r.getTweets();

Finally, we get a list of Status objects that can easily be serialized using the GSON library introduced earlier.

Rate limit

Twitter is a fantastic resource for data science, but it is far from a non-profit organization, and as such, they know how to value and price data. Without any special agreement, the search API is limited to a few days of retrospective data, a maximum of 180 queries per 15-minute window and 450 records per query. This limit can be confirmed both on the Twitter DEV website (https://dev.twitter.com/rest/public/rate-limits) and from the API itself using the RateLimitStatus class:

Map<String, RateLimitStatus> rls = twitter.getRateLimitStatus("search");
System.out.println(rls.get("/search/tweets"));

/*
RateLimitStatusJSONImpl{remaining=179, limit=180,
resetTimeInSeconds=1482102697, secondsUntilReset=873}
*/

Unsurprisingly, any queries on popular terms, such as #MAGA on November 9, 2016, hit this threshold. To avoid a rate limit exception, we have to page and throttle our download requests by keeping track of the maximum tweet ID processed and monitoring our status limit after each search request.
RateLimitStatus strl = rls.get("/search/tweets");

int totalTweets = 0;
long maxID = -1;

for (int i = 0; i < 400; i++) {

  // throttling
  if (strl.getRemaining() == 0)
    Thread.sleep(strl.getSecondsUntilReset() * 1000L);

  Query q = new Query(filter);
  q.setSince(fromDate);
  q.setUntil(toDate);
  q.setCount(100);

  // paging
  if (maxID != -1) q.setMaxId(maxID - 1);

  QueryResult r = twitter.search(q);
  for (Status s: r.getTweets()) {
    totalTweets++;
    if (maxID == -1 || s.getId() < maxID) maxID = s.getId();
    writer.println(gson.toJson(s));
  }
  strl = r.getRateLimitStatus();
}

With around half a billion tweets a day, it would be optimistic, if not naive, to gather all US-related data. Instead, the simple ingest process detailed earlier should be used to intercept tweets matching specific queries only. Packaged as a main class in an assembly jar, it can be executed as follows:

java -Dtwitter.properties=twitter.properties \
  -jar trump-1.0.jar #maga 2016-11-08 2016-11-09 \
  /path/to/twitter-maga.json

Here, the twitter.properties file contains your Twitter API keys:

twitter.token = XXXXXXXXXXXXXX
twitter.token.secret = XXXXXXXXXXXXXX
twitter.api.key = XXXXXXXXXXXXXX
twitter.api.secret = XXXXXXXXXXXXXX

Analysing sentiment

After 4 days of intense processing, we extracted around 10 million tweets, representing approximately 30 GB worth of JSON data.

Massaging Twitter data

One of the key reasons Twitter became so popular is that any message has to fit into a maximum of 140 characters. The drawback is also that every message has to fit into a maximum of 140 characters! Hence, the result is a massive increase in the use of abbreviations, acronyms, slang words, emoticons, and hashtags. In this case, the main emotion may no longer come from the text itself, but rather from the emoticons used (http://dl.acm.org/citation.cfm?id=1628969), though some studies have shown that emoticons may sometimes lead to inadequate sentiment predictions (https://arxiv.org/pdf/1511.02556.pdf). Emojis are even broader than emoticons as they include pictures of animals, transportation, business icons, and so on. Also, while emoticons can easily be retrieved through simple regular expressions, emojis are usually encoded in Unicode and are more difficult to extract without a dedicated library.

<dependency>
  <groupId>com.kcthota</groupId>
  <artifactId>emoji4j</artifactId>
  <version>5.0</version>
</dependency>

The Emoji4J library is easy to use (although computationally expensive): given some text with emojis/emoticons, we can either codify (replace Unicode values with actual code names) or clean (simply remove any emojis). So firstly, let's clean our text of any junk (special characters, emojis, accents, URLs, and so on) to access plain English content:

import emoji4j.EmojiUtils

def clean = {
  var text = tweet.toLowerCase()
  text = text.replaceAll("https?://\\S+", "")
  text = StringUtils.stripAccents(text)
  EmojiUtils.removeAllEmojis(text)
    .trim
    .toLowerCase()
    .replaceAll("rt\\s+", "")
    .replaceAll("@[\\w\\d\\-_]+", "")
    .replaceAll("[^\\w#\\[\\]:'.!?,]+", " ")
    .replaceAll("\\s+([:'.!?,])\\1", "$1")
    .replaceAll("[\\s\\t]+", " ")
    .replaceAll("[\\r\\n]+", ". ")
    .replaceAll("(\\w)\\1{2,}", "$1$1") // avoid looooool
    .replaceAll("#\\W", "")
    .replaceAll("[#':,;.]$", "")
    .trim
}

Let's also codify and extract all emojis and emoticons and keep them aside as a list:

val eR = "(:\\w+:)".r

def emojis = {
  var text = tweet.toLowerCase()
  text = text.replaceAll("https?://\\S+", "")
  eR.findAllMatchIn(EmojiUtils.shortCodify(text))
    .map(_.group(1))
    .filter { emoji =>
      EmojiUtils.isEmoji(emoji)
    }.map(_.replaceAll("\\W", ""))
    .toArray
}

Writing these methods inside an implicit class means that they can be applied directly to a String through a simple import statement.

Using the Stanford NLP

Our next step is to pass our cleaned text through a sentiment annotator. We use the Stanford NLP library for that purpose:

<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.0</version>
  <classifier>models</classifier>
</dependency>

<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>3.5.0</version>
</dependency>

We create a Stanford annotator that tokenizes content into sentences (tokenize), splits sentences (ssplit), tags elements (pos), and lemmatizes each word (lemma) before analyzing the overall sentiment:

def getAnnotator: StanfordCoreNLP = {
  val p = new Properties()
  p.setProperty(
    "annotators",
    "tokenize, ssplit, pos, lemma, parse, sentiment"
  )
  new StanfordCoreNLP(p)
}

def lemmatize(text: String, annotator: StanfordCoreNLP = getAnnotator) = {
  val annotation = annotator.process(text.clean)
  val sentences = annotation.get(classOf[SentencesAnnotation])
  sentences.flatMap { sentence =>
    sentence.get(classOf[TokensAnnotation])
      .map { token => token.get(classOf[LemmaAnnotation]) }
      .mkString(" ")
  }
}

val text = "If you're bashing Trump and his voters and calling them a variety of hateful names, aren't you doing exactly what you accuse them?"
println(lemmatize(text))

/*
if you be bash trump and he voter and call they a variety of hateful name,
be not you do exactly what you accuse they
*/

Any word is replaced by its most basic form, that is, you're is replaced with you be and aren't you doing is replaced with be not you do.

def sentiment(coreMap: CoreMap) = {
  coreMap.get(classOf[SentimentCoreAnnotations.ClassName]) match {
    case "Very negative" => 0
    case "Negative" => 1
    case "Neutral" => 2
    case "Positive" => 3
    case "Very positive" => 4
    case _ => throw new IllegalArgumentException(
      s"Could not get sentiment for [${coreMap.toString}]"
    )
  }
}

def extractSentiment(text: String, annotator: StanfordCoreNLP = getSentimentAnnotator) = {
  val annotation = annotator.process(text)
  val sentences = annotation.get(classOf[SentencesAnnotation])
  val totalScore = sentences map sentiment
  if (sentences.nonEmpty) {
    totalScore.sum / sentences.size()
  } else {
    2.0f
  }
}

extractSentiment("God bless America. Thank you Donald Trump!") // 2.5
extractSentiment("This is the most horrible day ever") // 1.0

A sentiment score spans from Very Negative (0.0) to Very Positive (4.0) and is averaged per sentence. As we do not get more than 1 or 2 sentences per tweet, we expect a very small variance; most of the tweets should be Neutral (around 2.0), with only extreme tweets scoring below ~1.5 or above ~2.5.
Building the Pipeline

For each of our Twitter records (stored as JSON objects), we do the following things:

1. Parse the JSON object using the json4s library
2. Extract the date
3. Extract the text
4. Extract the location and map it to a US state
5. Clean the text
6. Extract emojis
7. Lemmatize the text
8. Analyze the sentiment

We then wrap all these values into the following Tweet case class:

case class Tweet(
  date: Long,
  body: String,
  sentiment: Float,
  state: Option[String],
  geoHash: Option[String],
  emojis: Array[String]
)

As mentioned in previous chapters, creating a new NLP instance for each record out of our dataset of 10 million records wouldn't scale. Instead, we create only one annotator per Iterator (which means one per partition):

val analyzeJson = (it: Iterator[String]) => {

  implicit val format = DefaultFormats
  val annotator = getAnnotator
  val sdf = new SimpleDateFormat("MMM d, yyyy hh:mm:ss a")

  it.map { tweet =>

    val json = parse(tweet)

    val dateStr = (json \ "createdAt").extract[String]
    val date = Try(
      sdf.parse(dateStr).getTime
    )
    .getOrElse(0L)

    val text = (json \ "text").extract[String]

    val location = Try(
      (json \ "user" \ "location").extract[String]
    )
    .getOrElse("")
    .toLowerCase()

    val state = Try {
      location.split("\\s")
        .map(_.toUpperCase())
        .filter { s => states.contains(s) }
        .head
    }
    .toOption

    val cleaned = text.clean

    Tweet(
      date,
      cleaned.lemmatize(annotator),
      cleaned.sentiment(annotator),
      state,
      None,          // geoHash is not populated in this excerpt
      text.emojis
    )
  }
}

val tweetJsonRDD = sc.textFile("/path/to/twitter")
val tweetRDD = tweetJsonRDD mapPartitions analyzeJson

tweetRDD.toDF().show(5)

/*
+-------------+---------------+---------+--------+----------+
|         date|           body|sentiment|   state|    emojis|
+-------------+---------------+---------+--------+----------+
|1478557859000|happy halloween|      2.0|    None|   [ghost]|
|1478557860000|slave to the gr|      2.5|    None|        []|
|1478557862000|why be he so pe|      3.0|Some(MD)|        []|
|1478557862000|marcador sentim|      2.0|    None|        []|
|1478557868000|you mindset tow|      2.0|    None|[sparkles]|
+-------------+---------------+---------+--------+----------+
*/

We learned about gathering sentiment data from Twitter; you can further learn about data acquisition and scraping link-based external data from Mastering Spark for Data Science.


Using Logistic regression to predict market direction in algorithmic trading

Richa Tripathi
14 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Dr. Param Jeet and Prashant Vats titled Learning Quantitative Finance with R. This book will help you learn about various algorithmic trading techniques and ways to optimize them using the tools available in R.[/box] In this tutorial we will learn how logistic regression is used to forecast market direction. Market direction is very important for investors or traders. Predicting market direction is quite a challenging task as market data involves lots of noise. The market moves either upward or downward and the nature of market movement is binary. A logistic regression model help us to fit a model using binary behavior and forecast market direction. Logistic regression is one of the probabilistic models which assigns probability to each event. We are going to use the quantmod package. The next three commands are used for loading the package into the workspace, importing data into R from the yahoo repository and extracting only the closing price from the data: >library("quantmod") >getSymbols("^DJI",src="yahoo") >dji<- DJI[,"DJI.Close"] The input data to the logistic regression is constructed using different indicators, such as moving average, standard deviation, RSI, MACD, Bollinger Bands, and so on, which has some predictive power in market direction, that is, Up or Down. These indicators can be constructed using the following commands: >avg10<- rollapply(dji,10,mean) >avg20<- rollapply(dji,20,mean) >std10<- rollapply(dji,10,sd) >std20<- rollapply(dji,20,sd) >rsi5<- RSI(dji,5,"SMA") >rsi14<- RSI(dji,14,"SMA") >macd12269<- MACD(dji,12,26,9,"SMA") >macd7205<- MACD(dji,7,20,5,"SMA") >bbands<- BBands(dji,20,"SMA",2) The following commands are to create variable direction with either Up direction (1) or Down direction (0). Up direction is created when the current price is greater than the 20 days previous price and Down direction is created when the current price is less than the 20 days previous price: >direction<- NULL >direction[dji> Lag(dji,20)] <- 1 >direction[dji< Lag(dji,20)] <- 0 Now we have to bind all columns consisting of price and indicators, which is shown in the following command: >dji<- cbind(dji,avg10,avg20,std10,std20,rsi5,rsi14,macd12269,macd7205,bbands,dire ction) The dimension of the dji object can be calculated using dim(). I used dim() over dji and saved the output in dm(). dm() has two values stored: the first value is the number of rows and the second value is the number of columns in dji. Column names can be extracted using colnames(). The third command is used to extract the name for the last column. Next I replaced the column name with a particular name, Direction: >dm<- dim(dji) >dm [1] 2493   16 >colnames(dji)[dm[2]] [1] "..11" >colnames(dji)[dm[2]] <- "Direction" >colnames(dji)[dm[2]] [1] "Direction" We have extracted the Dow Jones Index (DJI) data into the R workspace. Now, to implement logistic regression, we should divide the data into two parts. The first part is in- sample data and the second part is out-sample data. In-sample data is used for the model building process and out-sample data is used for evaluation purposes. This process also helps to control the variance and bias in the model. 
The next four lines set the in-sample start, in-sample end, out-sample start, and out-sample end dates:

> issd <- "2010-01-01"
> ised <- "2014-12-31"
> ossd <- "2015-01-01"
> osed <- "2015-12-31"

The following two commands get the row numbers for these dates, that is, the variable isrow extracts the row numbers for the in-sample date range and osrow extracts the row numbers for the out-sample date range:

> isrow <- which(index(dji) >= issd & index(dji) <= ised)
> osrow <- which(index(dji) >= ossd & index(dji) <= osed)

The variables isdji and osdji are the in-sample and out-sample datasets respectively:

> isdji <- dji[isrow, ]
> osdji <- dji[osrow, ]

If you look at the in-sample data, that is, isdji, you will realize that the scaling of each column is different: a few columns are on a scale of 100, a few others on a scale of 10,000, and a few others on a scale of 1. Differences in scaling can cause problems for your results, as higher weights end up being assigned to variables on larger scales. So before moving ahead, you should standardize the dataset. I will use the following formula:

standardized data = (data - mean) / standard deviation     ... (6.1)

The mean and standard deviation of each column are computed using apply():

> isme <- apply(isdji, 2, mean)
> isstd <- apply(isdji, 2, sd)

A matrix of ones with the same dimensions as the in-sample data is generated using the following command, which is going to be used for normalization:

> isidn <- matrix(1, dim(isdji)[1], dim(isdji)[2])

Use formula (6.1) to standardize the data:

> norm_isdji <- (isdji - t(isme*t(isidn))) / t(isstd*t(isidn))

The preceding line also standardizes the direction column, that is, the last column. We don't want the direction to be standardized, so I replace the last column again with the variable direction for the in-sample date range:

> dm <- dim(isdji)
> norm_isdji[, dm[2]] <- direction[isrow]

Now we have created all the data required for model building. You should build a logistic regression model, and it will help you to predict market direction based on the in-sample data. First, in this step, I created a formula which has Direction as the dependent variable and all other columns as independent variables. Then I used a generalized linear model, that is, glm(), to fit a model with this formula, a family, and the dataset:

> formula <- paste("Direction ~ .", sep = "")
> model <- glm(formula, family = "binomial", norm_isdji)

A summary of the model can be viewed using the following command:

> summary(model)

Next, use predict() on the same dataset to estimate the fitted values:

> pred <- predict(model, norm_isdji)

Once you have the fitted values, you should convert them to probabilities using the following command. This will convert the output into probabilistic form, with values in the range [0, 1]:

> prob <- 1 / (1 + exp(-(pred)))

The figure shown below is plotted using the following commands.
The first line of the code divides the figure into two rows and one column, where the first plot shows the model's predictions and the second shows the probabilities:

> par(mfrow = c(2, 1))
> plot(pred, type = "l")
> plot(prob, type = "l")

head() can be used to look at the first few values of a variable:

> head(prob)
2010-01-04 2010-01-05 2010-01-06 2010-01-07
 0.8019197  0.4610468  0.7397603  0.9821293

The figure referenced here (Figure 6.1: Prediction and probability distribution of DJI) shows the variable pred, which is a real number, and its transformation into a probability, prob, lying between 0 and 1.

Probabilities lie in the range (0, 1), and so do the values in our vector prob. Now, to classify them into one of the two classes, I considered the Up direction (1) when prob is greater than 0.5 and the Down direction (0) when prob is less than or equal to 0.5. This assignment can be done using the following commands. prob > 0.5 generates TRUE for the points where the probability is greater than 0.5, and pred_direction[prob > 0.5] assigns 1 to all such points. Similarly, the next statement assigns 0 when the probability is less than or equal to 0.5:

> pred_direction <- NULL
> pred_direction[prob > 0.5] <- 1
> pred_direction[prob <= 0.5] <- 0

Once we have figured out the predicted direction, we should check the model accuracy: how often our model has predicted the Up direction as Up and Down as Down. There might be scenarios where the prediction is the opposite of the actual value, such as predicting Down when it is actually Up, and vice versa. We can use the caret package and confusionMatrix(), which gives a matrix as output. All diagonal elements are correctly predicted and the off-diagonal elements are errors, or wrong predictions. One should aim to reduce the off-diagonal elements of a confusion matrix:

> install.packages('caret')
> library(caret)
> matrix <- confusionMatrix(pred_direction, norm_isdji$Direction)
> matrix

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 362  35
         1  42 819

               Accuracy : 0.9388
                 95% CI : (0.9241, 0.9514)
    No Information Rate : 0.6789
    P-Value [Acc > NIR] : <2e-16
                  Kappa : 0.859
 Mcnemar's Test P-Value : 0.4941
            Sensitivity : 0.8960
            Specificity : 0.9590
         Pos Pred Value : 0.9118
         Neg Pred Value : 0.9512
             Prevalence : 0.3211
         Detection Rate : 0.2878
   Detection Prevalence : 0.3156
      Balanced Accuracy : 0.9275

The preceding table shows that we have got about 94% correct predictions, as 362 + 819 = 1181 predictions are correct out of 1258 (the sum of all four values). A prediction rate above 80% on in-sample data is generally assumed to be good; however, 80% is not fixed, and one has to figure out this threshold based on the dataset and the industry. Now you have implemented the logistic regression model, which has predicted 94% correctly, and you need to test it for generalization power. One should test this model using out-sample data and check its accuracy. The first step is to standardize the out-sample data using formula (6.1).
Here the mean and standard deviation should be the same as those used for the in-sample normalization:

> osidn <- matrix(1, dim(osdji)[1], dim(osdji)[2])
> norm_osdji <- (osdji - t(isme*t(osidn))) / t(isstd*t(osidn))
> norm_osdji[, dm[2]] <- direction[osrow]

Next we use predict() on the out-sample data and use this value to calculate the probabilities:

> ospred <- predict(model, norm_osdji)
> osprob <- 1 / (1 + exp(-(ospred)))

Once the probabilities are determined for the out-sample data, you should assign them to the Up or Down class using the following commands. confusionMatrix() here will generate a matrix for the out-sample data:

> ospred_direction <- NULL
> ospred_direction[osprob > 0.5] <- 1
> ospred_direction[osprob <= 0.5] <- 0
> osmatrix <- confusionMatrix(ospred_direction, norm_osdji$Direction)
> osmatrix

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 115  26
         1  12  99

               Accuracy : 0.8492
                 95% CI : (0.7989, 0.891)

This shows 85% accuracy on the out-sample data. A realistic trading model would also account for trading costs and market slippage, which decrease the winning odds significantly.
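As a quick check of how the two figures above line up, the accuracies can also be computed directly from the predicted and actual directions. The lines below are a sketch added for illustration; they reuse the variables defined earlier and simply restate the confusion-matrix accuracies (roughly 0.94 in-sample and 0.85 out-of-sample):

> # direction[isrow] and direction[osrow] are the actual 0/1 directions used earlier
> in_acc <- mean(pred_direction == direction[isrow])
> out_acc <- mean(ospred_direction == direction[osrow])
> round(c(in_sample = in_acc, out_sample = out_acc), 4)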
We presented advanced techniques implemented in capital markets and also learned how a logistic regression model uses binary behavior to forecast market direction. If you enjoyed this excerpt, check out the book Learning Quantitative Finance with R to deep dive into the vast world of algorithmic and machine-learning based trading.


Hands on Table Calculation Techniques with Tableau

Amarabha Banerjee
14 Feb 2018
4 min read
[box type="note" align="" class="" width=""]This article is a book excerpt from the title Mastering Tableau written by David Baldwin. This book will help you master the intricacies of Tableau to create effective data visualizations.[/box] Today, we shall explore Table Calculation Techniques with Tableau and explore a real world example of using these techniques. In this article, we provide a simple schema for understanding table calculations. This schema is communicated via two questions: What is the function? How is the function applied? These two questions are inexorably connected. You cannot reliably apply something until you know what it is. And you cannot get useful results from something until you correctly apply it. The sections below will help you to get a head start into Table calculation techniques and how to use Tableau functions effectively for implementing Table calculation. Basics of Table Calculation Calculated fields can be categorized as: row level, aggregate level, and table level. For row- and aggregate-level calculations, the underlying data source engine does most (if not all) of the computational work, and Tableau merely visualizes the results. For table calculations, Tableau also relies on the underlying data source engine to execute computational tasks; however, after that work is completed and a dataset is returned, Tableau performs additional processing before rendering the results. This can be seen within the following process flow diagram, in the circled part titled Tableau performs additional processing. This is where table calculations are processed. We will continue with a definition: A table calculation is a function performed on a dataset in cache that has been generated as a result of a query from Tableau to the data source. Let's consider a couple of points regarding the dataset in cache mentioned in the preceding definition. First, it is important to understand that this cache is not simply the returned results of a query. Tableau may adjust the returned results. For example,, Tableau may expand the cache via data densification. Secondly, it's important to consider how the cache is structured. Basically, the dataset in cache is a table and, like all tables, is made up of rows and columns. This is particularly important for table calculations since a table calculation may be computed as it moves along the cache. Such a table calculation is directional. Alternatively, a table calculation may be computed based on the entire cache with no directional consideration. Table calculations such as this are non-directional. Directional and nondirectional table calculations will be considered more fully in the following section. Directional and non-directional table calculation functions As of Tableau 10, there are 32 table calculation functions in Tableau. However, many of these are simply variations of a theme; for example, there are five Running functions, including RUNNING_SUM and RUNNING_AVG. If we narrow our consideration to unique groups of table calculations functions, we will discover that there are only 11. The following table shows these 11 functions organized in two categories: As mentioned previously, non-directional table calculation functions operate on the entire cache and thus are not computed based on movement through the cache. For example, the SIZE function doesn't change based on the value of a previous row in the cache. On the other hand, RUNNING_SUM does change based on previous rows in the cache and is therefore considered directional. 
Practical Example

In this example, we'll see directional and non-directional table calculation functions in action:

1. Navigate to https://public.tableau.com/profile/david.baldwin#!/ to locate and download the workbook associated with this chapter.
2. Navigate to the worksheet entitled Directional/Non-Directional.
3. Create the calculated fields used in this exercise (Lookup, Size, Window Sum, Window Sum w/Start&End, and Running Sum), as defined in the workbook.
4. Place Category and Ship Mode on the Rows shelf.
5. Double-click on Sales, Lookup, Size, Window Sum, Window Sum w/Start&End, and Running Sum to populate the view.
6. Compare the resulting view (shown as a screenshot in the original article) with the notes in step 3 of this exercise for better understanding.

We discussed techniques for implementing table calculations with Tableau. If you liked our post, be sure to check out Mastering Tableau, which covers other useful data visualization and data analysis techniques.


How to perform Data exploration with RethinkDB

Vijin Boricha
14 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will let you master the capabilities of RethinkDB and implement them to develop efficient real-time web applications.[/box] In this article, we will learn to do data exploration in RethinkDB with the help of few use case. Executing data exploration use cases We have imported our database i.e. our mock data into our RethinkDB instance. Now it's time to run a use case query and make use of it. But before we do so, we need to figure out one data alteration. We have made a mistake while generating mock data (on purpose actually) we have a $ sign before ctc. Hence, it becomes tough to perform salary-level queries. Before we move ahead, we need to figure out this problem, and basically get rid of the $ sign and update the ctc value to an integer instead of a string. In order to do this, we need to perform the following operation: Traverse through each document in the database Split the ctc string into two parts, containing $ and the other value Update the ctc value in the document with a new data type and value Since we require the chaining of queries, I have written a small snippet in Node.js to achieve the previous scenario as follows: var rethinkdb = require('rethinkdb'); var connection = null; rethinkdb.connect({host : 'localhost', port : 28015},function(err,conn) { if(err) { throw new Error('Connection error'); } connection = conn; rethinkdb.db("company").table("employees") .run(connection,function(err,cursor) { if(err) { throw new Error(err); } cursor.each(function(err,data) { data.ctc = parseInt(data.ctc.split("$")[1]); rethinkdb.db("company").table("employees") .get(data.id) .update({ctc : data.ctc}) .run(connection,function(err,response) { if(err) { throw new Error(err); } console.log(response); }); }); }); }); As you can see in the preceding code, we first fetch all the documents and traverse them using cursor, one document at a time. We use the split() method as a $ separator and convert the outcome, which is salary, into an integer using the parseInt() method. We update each document at a time using the id value of the document: After selecting all the documents again, we can see an updated ctc value as an integer, as shown in the following figure: This is one of the practical examples where we perform some data manipulation before moving ahead with complex queries. Similarly, you can look for errors such as blank spaces in a specific field or duplicate elements in your record. Finding duplicate elements We can use distinct() to find out whether there is any duplicate element present in the table. Say you have 1,000 rows and there are 10 duplicates. In order to determine that, we just need to find out the unique rows (of course excluding the ID key, as that's unique by nature). Here is the query for the same: r.db("company").table('employees').without('id').distinct().count() As shown in the following screenshot, this query returns the count of unique rows, which should be 1,000 if there are no duplicates: This implies that our records contain no duplicate documents. Finding the list of countries We can write a query to find all the countries we have in our record and also use distinct again by just selecting the country field. 
Here is the query:

r.db("company").table('employees')("country").distinct()

This shows that we have 124 countries in our records.

Finding the top 10 employees with the highest salary

In this use case, we need to evaluate all the records and find the top 10 employees, ordered from the highest to the lowest pay. Here is the query:

r.db("company").table("employees").orderBy(r.desc("ctc")).limit(10)

Here we are using orderBy, which by default orders the records in ascending order. To get the highest pay in the first document, we need descending order; we do this using the desc() ReQL command. The query returns 10 rows. You can modify the same query, just by limiting the number of users to one, to get the highest-paid employee.

Displaying employee records with a specific name and location

To extract such records from our table, we need to perform a filter on the first_name and country fields. Here is the query to return those records:

r.db("company").table('employees').filter({"first_name" : "John", "country" : "Sweden"})

We are just performing a basic filter and comparing both fields. ReQL makes such queries really easy to write thanks to its chaining feature.

To summarize, we looked at a few use cases where we had to perform alteration and filtering of records in order to carry out an exploration task, such as stripping the $ sign from ctc, or converting base-256 IP addresses into base-10 values and then performing a query on them. We also covered a general use case in order to get a practical feel for ReQL. If you are interested in learning about the RethinkDB Query Language, extending RethinkDB, and more, you may check out the book Mastering RethinkDB.

How to convert Java code into Kotlin

Aanand Shekhar
14 Feb 2018
2 min read
This Kotlin programming tutorial has been taken from Kotlin Programming Cookbook.  One of the best things about Kotlin is its interoperability with Java. If you're a Java programmer, that should be a reason to start learning alone. If you're using an IntelliJ-based IDE, it's actually incredibly easy to convert Java code to Kotlin. In this step-by-step recipe, you'll find out how to do it. What you need to convert your Java code to Kotlin All you need to follow this recipe is an IntelliJ-based IDE installed, which compiles and runs Kotlin and Java. How to do it... Here are the steps you need to follow to convert a Java file to a Kotlin file: In your IntelliJ IDE, open the Java file that you want to convert to Kotlin. Note that it has a .java extension. Now, in the main menu, click on Code menu and choose the Convert Java File to Kotlin File option. Your Java file will be converted into Kotlin, and the extension will now be .kt.  Here is an example of a Java file: After converting to Kotlin, this is what we have: A Kotlin file can be converted into Java, but it's better if you can avoid it or find an alternative way to do it. If you have to absolutely convert your Kotlin code to Java, click on Tools | Kotlin | Show Kotlin Bytecode in the menu: After clicking on Show Kotlin Bytecode, a window will open with the title Kotlin Bytecode: Click on Decompile and a .java file will be generated, containing a decompiled Java bytecode from Kotlin code: Yes, it has a lot of unnecessary code that was not present in the original Java code, but that is the case with decompiled bytecode. At the moment, this is the only way to convert Kotlin code to Java. Copy the decompiled file into a .java file and remove the unnecessary code. How it works Kotlin is a statically-typed programming language that works on Java Virtual Machine and compiles into JVM compatible bytecode. This is the reason we can convert Java code to Kotlin and mix Java and Kotlin code together.  This is also the reason why you can, in a way, get Java code back from Kotlin (although the output is not completely desired).
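The before and after screenshots mentioned above do not carry over into text, so here is a small, made-up illustration of the kind of translation the converter performs. The User class is invented for this example, and the exact Kotlin that IntelliJ generates can differ between IDE versions:

// Java: a small value class with a constructor, getter, and setter.
public class User {
    private final String name;
    private int age;

    public User(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public String getName() { return name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}

A typical converted result collapses the fields, constructor, and accessors into a primary constructor with properties, roughly:

// Kotlin: fields plus getters/setters become properties; the read-only
// name becomes val and the mutable age becomes var.
class User(val name: String, var age: Int)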

How to deploy RethinkDB using Docker

Vijin Boricha
14 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will help you develop efficient and real-time applications in RethinkDB with ease.[/box] In today’s tutorial, we will learn to install Docker, create an Docker image and deploy RethinkDB using Docker. Your code is not working in Production? But it's working on the QA (quality analysis server)! I am sure you have heard statements like these in your team during the deployment phase. Well no more of that, Docker everything and forget about the infrastructure of different environments, say, QA, Staging and Production, because your code is going to run Docker container not in those machines, hence write once, run everywhere. In this section, we will learn how to use Docker to deploy a RethinkDB Server or PaaS services. I am going to cover a few docker basics too; if you are already aware of them, please skip to the next section. Installing Docker Docker is available for all major platforms, such as, Linux-based distributions, Mac, and Windows. Visit the official website at h t t p s ://w w w . d o c k e r . c o m / and download the package suitable for your platform. We are installing Docker in our machine to create a new Docker image. Docker images are independent of platform and should not be confused with Docker for Mac or Docker for Windows. It's referred to as a Docker client too. Once you have installed the Docker, you need to start the Daemon process first; I am using a Mac so I can view this in the launchpad, as shown here: Upon clicking that, it will open up a nice console showing the Docker official logo and an indication that Docker is successfully booted, as shown in the following screenshot: Now we can begin creating our Docker image that in turn will run RethinkDB. Creating a Docker image For installing our RethinkDB on Ubuntu inside the Docker, we need to install the Ubuntu operating system. Run the following command to install a Ubuntu image from the official Docker hub repository: docker pull ubuntu This will download and install the Ubuntu image in our system. We will later use this Ubuntu image and install our RethinkDB instance; you can choose different operating systems as well. Before going to the Docker configuration code, I would like to point out the steps we require to install RethinkDB on a fresh Ubuntu installation: Update the system Add the RethinkDB repository to the known repository list Install RethinkDB Set the data folder Expose the port We are going to do this using Docker. To create a Docker image, we require Dockerfile. Create a file called Dockerfile with no extension and apply the code shown here: FROM ubuntu:latest # Install RethinkDB. RUN apt-get update && echo "deb http://download.rethinkdb.com/apt `lsb_release -cs` main" > /etc/apt/sources.list.d/rethinkdb.list && apt-get install -y wget && wget -O- http://download.rethinkdb.com/apt/pubkey.gpg | apt-key add - && apt-get update && apt-get install -y rethinkdb python-pip && rm -rf /var/lib/apt/lists/* # Install python driver for rethinkdb RUN pip install rethinkdb # Define mountable directories. VOLUME ["/data"] # Define working directory. WORKDIR /data # Define default command. CMD ["rethinkdb", "--bind", "all"] # Expose ports. 
# - 8080: web UI # - 28015: process # - 29015: cluster EXPOSE 8080 EXPOSE 28015 EXPOSE 29015 The first line is our entry point to the Ubuntu operating system, then we are performing an update of the system and using the installation commands recommended by RethinkDB here: h t t p s ://w w w . r e t h i n k d b . c o m /d o c s /i n s t a l l /u b u n t u /. Once the installation is complete, we install the rethinkdb python driver to perform the import/export operation. The next two commands mount a new volume in Ubuntu and telling RethinkDB to use that volume. The next command runs rethinkdb by binding all the ports and exposing the ports to be used by the client driver and web console. In order to make this a docker image, save the file and run the following command within the project directory: docker build -t docker-rethinkdb. Here, we are building our docker image and giving it a name docker-rethinkdb; upon running this command, Docker will execute the Dockerfile and you're on. The representation of the previous steps is shown here: Once everything works, and I am sure it will, you will see a success message in the console, as shown here: Congratulations! You have successfully created a docker image for RethinkDB. If you want to see your image and its properties, run the following command: docker images And this will list all the images of Docker, as shown in the following screenshot: Awesome! Now let's run it. To access the web portal, we need to run our docker image and bind port 8080 of the docker image to some port of our machine; here is the command to do so: docker run -p 3000:8080 -d docker-rethinkdb As per the command above, -p is used to specify port binding, the first is the target and second port is source, that is, Docker port and -d is used to run it in the background or Daemon. This will run the docker image in the background; to extract more information about this process, we need to run the following command: docker ps This will list all the running images called as a container, along with the information, as shown in the following screenshot: You can also check the logs of specific containers using the following command: docker logs <container id> Now, in order to access the RethinkDB web console from our machine, we need to find out the IP address on which the Docker machine is running. To get that, we need to run the following command: docker-machine ip default This will print out the IP. Copy the IP and hit IP:3000 from the browser to view the RethinkDB web console, as shown here: So we have docker running and accessible from the browser. In order to import and export the data, we need to log in to our Docker image. To do that, run the following command: docker exec -i -t <container-id> /bin/bash This will log in to the docker image running Ubuntu; refer to the following screenshot: You can now run the rethinkdb command to perform the data import to the existing RethinkDB cluster. Deploying the Docker image Almost every PaaS service we have covered in earlier sections provides support for Docker. You can submit your Dockerfile to git and clone it anywhere if you want to create Docker image. You can submit the whole docker image (not Dockerfile) to Dockerhub and pull your docker image directly using the docker pull command, which is no doubt an easy way because you will be directly working on the image running on the server. We covered RethinkDB deployment using Docker and learned how to create our own RethinkDB image. 
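If you also want to reach the containerized instance from a client application rather than just the web console, the driver port has to be published as well. The following is a sketch, not part of the book's text: it reuses the docker-rethinkdb image built above and the official rethinkdb Node.js driver, and HOST stands for whatever docker-machine ip default printed (or localhost on a native Docker installation):

docker run -p 3000:8080 -p 28015:28015 -d docker-rethinkdb

var rethinkdb = require('rethinkdb');

// Replace HOST with the IP printed by `docker-machine ip default`,
// or use 'localhost' when Docker runs natively on your machine.
rethinkdb.connect({host : 'HOST', port : 28015}, function(err, conn) {
  if (err) { throw new Error('Could not reach the container: ' + err); }
  rethinkdb.dbList().run(conn, function(err, dbs) {
    if (err) { throw new Error(err); }
    console.log('Databases inside the container:', dbs);
    conn.close();
  });
});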
You can learn more about RethinkDB Query Language and Performance Tuning in RethinkDB from this book Mastering RethinkDB.  

Getting started with Data Visualization in Tableau

Amarabha Banerjee
13 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an book extract from Mastering Tableau, written by David Baldwin. Tableau has emerged as one of the most popular Business Intelligence solutions in recent times, thanks to its powerful and interactive data visualization capabilities. This book will empower you to become a master in Tableau by exploiting the many new features introduced in Tableau 10.0.[/box] In today’s post, we shall explore data visualization basics with Tableau and explore a real world example using these techniques. Tableau Software has a focused vision resulting in a small product line. The main product (and hence the center of the Tableau universe) is Tableau Desktop. Assuming you are a Tableau author, that's where almost all your time will be spent when working with Tableau. But of course you must be able to connect to data and output the results. Thus, as shown in the following figure, the Tableau universe encompasses data sources, Tableau Desktop, and output channels, which include the Tableau Server family and Tableau Reader: Worksheet and dashboard creation At the heart of Tableau are worksheets and dashboards. Worksheets contain individual visualizations and dashboards contain one or more worksheets. Additionally, worksheets and dashboards may be combined into stories to communicate specific insights to the end user via a presentation environment. Lastly, all worksheets, dashboards, and stories are organized in workbooks that can be accessed via Tableau Desktop, Server, or Reader. In this section, we will look at worksheet and dashboard creation with the intent of not only communicating the basics, but also providing some insight that may prove helpful to even more seasoned Tableau authors. Worksheet creation At the most fundamental level, a visualization in Tableau is created by placing one or more fields on one or more shelves. To state this as a pseudo-equation: Field(s) + shelf(s) = Viz As an example, note that the visualization created in the following screenshot is generated by placing the Sales field on the Text shelf. Although the results are quite simple – a single number – this does qualify as a view. In other words, a Field (Sales) placed on a shelf (Text) has generated a Viz: Exercise – fundamentals of visualizations Let's explore the basics of creating a visualization via an exercise: Navigate to h t t p s ://p u b l i c . t a b l e a u . c o m /p r o f i l e /d a v i d 1. . b a l d w i n #!/ to locate and download the workbook associated with this chapter. In the workbook, find the tab labeled Fundamentals of Visualizations: Locate Region within the Dimensions portion of the Data pane: Drag Region to the Color shelf; that is, Region + Color shelf = what is shown in the following screenshot: Click on the Color shelf and then on Edit Colors… to adjust colors as desired: Next, move Region to the Size, Label/Text, Detail, Columns, and Rows shelves. After placing Region on each shelf, click on the shelf to access additional options. Lastly, choose other fields to drop on various shelves to continue exploring Tableau's behavior. As you continue exploring Tableau's behavior by dragging and dropping different fields onto different shelves, you will notice that Tableau responds with default behaviors. These defaults, however, can be overridden, which we will explore in the following section. Dashboard Creation Although, as stated previously, a dashboard contains one or more worksheets, dashboards are much more than static presentations. 
They are an essential part of Tableau's interactivity. In this section, we will populate a dashboard with worksheets and then deploy actions for interactivity. Exercise – building a dashboard In the workbook associated with this chapter, navigate to the tab entitled Building a Dashboard. Within the Dashboard pane located on the left-hand portion of the screen, double-click on each of the following worksheets (in the order in which they are listed) to add them to the dashboard: US Sales Customer Segment Scatter Plot Customers In the lower right-hand corner of the dashboard, click in the blank area below Profit Ratio to select the vertical container: After clicking in the blank area, you should see a blue border around the filter and the legends. This indicates that the vertical container is selected. As shown in the following screenshot, select the vertical container handle and drag it to the left-hand side of the Customers worksheet. Note the gray shading, which communicates where the container will be placed: The gray shading (provided by Tableau when dragging elements such as worksheets and containers onto a dashboard) helpfully communicates where the element will be placed. Take your time and observe carefully when placing an element on a dashboard or the results may be Unexpected. 5. Format the dashboard as desired. The following tips may prove helpful: Adjust the sizes of the elements on the screen by hovering over the edges between each element and then clicking and dragging as Desired. 2. Note that the Sales and Profit legends in the following screenshot are floating elements. Make an element float by right-clicking on the element handle and selecting Floating. (See the previous screenshot and note that the handle is located immediately above Region, in the upper-right-hand corner). 3. Create Horizontal and Vertical containers by dragging those objects from the bottom portion of the Dashboard pane. 4. Drag the edges of containers to adjust the size of each worksheet. 5. Display the dashboard title via Dashboard | Show Title…: If you enjoyed our post, be sure to check out Mastering Tableau which consists of many useful data visualization and data analysis techniques.  

Getting Inside a C++ Multithreaded Application

Maya Posch
13 Feb 2018
8 min read
This C++ programming tutorial is taken from Maya Posch's Mastering C++ Multithreading. In its most basic form, a multithreaded application consists of a singular process with two or more threads. These threads can be used in a variety of ways, for example, to allow the process to respond to events in an asynchronous manner by using one thread per incoming event or type of event, or to speed up the processing of data by splitting the work across multiple threads. Examples of asynchronous response to events include the processing of user interface (GUI) and network events on separate threads so that neither type of event has to wait on the other, or can block events from being responded to in time. Generally, a single thread performs a single task, such as the processing of GUI or network events, or the processing of data. For this basic example, the application will start with a singular thread, which will then launch a number of threads, and wait for them to finish. Each of these new threads will perform its own task before finishing. Let's start with the includes and global variables for our application: #include <iostream> #include <thread> #include <mutex> #include <vector> #include <random> using namespace std; // --- Globals mutex values_mtx; mutex cout_mtx; Both the I/O stream and vector headers should be familiar to anyone who has ever used C++: the former is here used for the standard output (cout), and vector for storing a sequence of values. The random header is new in c++11, and as the name suggests, it offers classes and methods for generating random sequences. We use it here to make our threads do something interesting. Finally, the thread and mutex includes are the core of our multithreaded application; they provide the basic means for creating threads, and allow for thread-safe interactions between them. Moving on, we create two mutexes: one for the global vector and one for cout, since the latter is not thread-safe. Next we create the main function as follows: int main() { values.push_back(42); We then push a fixed value onto the vector instance; this one will be used by the threads we create in a moment: thread tr1(threadFnc, 1); thread tr2(threadFnc, 2); thread tr3(threadFnc, 3); thread tr4(threadFnc, 4); We create new threads, and provide them with the name of the method to use, passing along any parameters--in this case, just a single integer: tr1.join(); tr2.join(); tr3.join(); tr4.join(); Next, we wait for each thread to finish before we continue by calling join() on each thread instance: cout << "Input: " << values[0] << ", Result 1: " << values[1] << ", Result 2: " << values[2] << ", Result 3: " << values[3] << ", Result 4: " << values[4] << "n"; return 1; } At this point, we expect that each thread has done whatever it's supposed to do, and added the result to the vector, which we then read out and show the user. Of course, this shows almost nothing of what really happens in the application, mostly just the essential simplicity of using threads. 
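A small completeness note if you want to compile the excerpt exactly as shown: main() uses a shared values vector and calls threadFnc() and randGen() before their definitions appear, so the globals section also needs the shared vector plus forward declarations, along the following lines (written to match how the rest of the excerpt uses them):

// Shared storage for the input value and the per-thread results.
vector<int> values;

// Forward declarations so that main() can call these before their definitions.
void threadFnc(int tid);
int randGen(const int& min, const int& max);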
Next, let's see what happens inside this method that we pass to each thread instance: void threadFnc(int tid) { cout_mtx.lock(); cout << "Starting thread " << tid << ".n"; cout_mtx.unlock(); When we obtain the initial value set in the vector, we copy it to a local variable so that we can immediately release the mutex for the vector to enable other threads to use it: int rval = randGen(0, 10); val += rval; These last two lines contain the essence of what the threads created do: they take the initial value, and add a randomly generated value to it. The randGen() method takes two parameters, defining the range of the returned value: cout_mtx.lock(); cout << "Thread " << tid << " adding " << rval << ". New value: " << val << ".n"; cout_mtx.unlock(); values_mtx.lock(); values.push_back(val); values_mtx.unlock(); } Finally, we (safely) log a message informing the user of the result of this action before adding the new value to the vector. In both cases, we use the respective mutex to ensure that there can be no overlap with any of the other threads. Once the method reaches this point, the thread containing it will terminate, and the main thread will have one fewer thread to wait for to rejoin. Lastly, we'll take a look at the randGen() method. Here we can see some multithreaded specific additions as well: int randGen(const int& min, const int& max) { static thread_local mt19937 generator(hash<thread::id>()(this_thread::get_id())); uniform_int_distribution<int> distribution(min, max); return distribution(generator) } This preceding method takes a minimum and maximum value as explained earlier, which limit the range of the random numbers this method can return. At its core, it uses a mt19937-based generator, which employs a 32-bit Mersenne Twister algorithm with a state size of 19937 bits. This is a common and appropriate choice for most applications. Of note here is the use of the thread_local keyword. What this means is that even though it is defined as a static variable, its scope will be limited to the thread using it. Every thread will thus create its own generator instance, which is important when using the random number API in the STL. A hash of the internal thread identifier (not our own) is used as seed for the generator. This ensures that each thread gets a fairly unique seed for its generator instance, allowing for better random number sequences. Finally, we create a new uniform_int_distribution instance using the provided minimum and maximum limits, and use it together with the generator instance to generate the random number which we return. Makefile In order to compile the code described earlier, one could use an IDE, or type the command on the command line. As mentioned in the beginning of this chapter, we'll be using makefiles for the examples in this book. The big advantages of this are that one does not have to repeatedly type in the same extensive command, and it is portable to any system which supports make. The makefile for this example is rather basic: GCC := g++ OUTPUT := ch01_mt_example SOURCES := $(wildcard *.cpp) CCFLAGS := -std=c++11 all: $(OUTPUT) $(OUTPUT): clean: rm $(OUTPUT) .PHONY: all From the top down, we first define the compiler that we'll use (g++), set the name of the output binary (the .exe extension on Windows will be post-fixed automatically), followed by the gathering of the sources and any important compiler flags. 
The wildcard feature allows one to collect the names of all files matching the string following it in one go without having to define the name of each source file in the folder individually. For the compiler flags, we're only really interested in enabling the c++11 features, for which GCC still requires one to supply this compiler flag. For the all method, we just tell make to run g++ with the supplied information. Next we define a simple clean method which just removes the produced binary, and finally, we tell make to not interpret any folder or file named all in the folder, but to use the internal method with the .PHONY section. When we run this makefile, we see the following command-line output: $ make Afterwards, we find an executable file called ch01_mt_example (with the .exe extension attached on Windows) in the same folder. Executing this binary will result in a command-line output akin to the following: $ ./ch01_mt_example.exe Starting thread 1. Thread 1 adding 8. New value: 50. Starting thread 2. Thread 2 adding 2. New value: 44. Starting thread 3. Starting thread 4. Thread 3 adding 0. New value: 42. Thread 4 adding 8. New value: 50. Input: 42, Result 1: 50, Result 2: 44, Result 3: 42, Result 4: 50 What one can see here already is the somewhat asynchronous nature of threads and their output. While threads 1 and 2 appear to run synchronously, threads 3 and 4 clearly run asynchronously. For this reason, and especially in longer-running threads, it's virtually impossible to say in which order the log output and results will be returned. While we use a simple vector to collect the results of the threads, there is no saying whether Result 1 truly originates from the thread which we assigned ID 1 in the beginning. If we need this information, we need to extend the data we return by using an information structure with details on the processing thread or similar. One could, for example, use struct like this: struct result { int tid; int result; }; The vector would then be changed to contain result instances rather than integer instances. One could pass the initial integer value directly to the thread as part of its parameters, or pass it via some other way. Want to learn C++ multithreading in detail? You can find Mastering C++ Multithreading here, or explore all our latest C++ eBooks and videos here.
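To make that closing suggestion concrete, here is one way the worker function could be adapted so each thread reports its ID together with its value. This is a sketch building on the excerpt's code rather than the book's exact listing; note that the member holding the value needs a name other than result, since a C++ member cannot share its class's name:

struct result {
    int tid;
    int value;   // cannot be named "result": members may not reuse the class name
};

// Globals would then hold tagged results instead of plain integers.
vector<result> results;
mutex results_mtx;

void threadFnc(int tid) {
    int val = values[0];        // values is only written before the threads start here
    val += randGen(0, 10);      // same random increment as in the original version

    lock_guard<mutex> lock(results_mtx);
    results.push_back({tid, val});   // now we know which thread produced which value
}

With this variant, the final report in main() can iterate over results and print each value next to the thread ID that produced it, regardless of the order in which the threads finished.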

How to maintain Apache Mesos

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by David Blomquist and Tomasz Janiszewski, titled Apache Mesos Cookbook. Throughout the course of the book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box] In this article, we will learn about configuring logging options, setting up monitoring ecosystem, and upgrading your Mesos cluster. Logging and debugging Here we will configure logging options that will allow us to debug the state of Mesos. Getting ready We will assume Mesos is available on localhost port 5050. The steps provided here will work for either master or agents. How to do it... When Mesos is installed from pre-built packages, the logs are by default stored in /var/log/mesos/. When installing from a source build, storing logs is disabled by default. To change the log store location, we need to edit /etc/default/mesos and set the LOGS variable to the desired destination. For some reason, mesos-init-wrapper does not transfer the contents of /etc/mesos/log_dir to the --log_dir flag. That's why we need to set the log's destination in the environment variable. Remember that only Mesos logs will be stored there. Logs from third-party applications (for example, ZooKeeper) will still be sent to STDERR. Changing the default logging level can be done in one of two ways: by specifying the -- logging_level flag or by sending a request and changing the logging level at runtime for a specific period of time. For example, to change the logging level to INFO, just put it in the following code: /etc/mesos/logging_level echo INFO > /etc/mesos/logging_level The possible levels are INFO, WARNING, and ERROR. For example, to change the logging level to the most verbose for 15 minutes for debug purposes, we need to send the following request to the logging/toggle endpoint: curl -v -X POST localhost:5050/logging/toggle?level=3&duration=15mins How it works... Mesos uses the Google-glog library for debugging, but third-party dependencies such as ZooKeeper have their own logging solution. All configuration options are backed by glog and apply only to Mesos core code. Monitoring Now, we will set up monitoring for Mesos. Getting ready We must have a running monitoring ecosystem. Metrics storage could be a simple time- series database such as graphite, influxdb, or prometheus. In the following example, we are using graphite and our metrics are published with http://diamond.readthedocs.io/en/latest/. How to do it... Monitoring is enabled by default. Mesos does not provide any way to automatically push metrics to the registry. However, it exposes them as a JSON that can be periodically pulled and saved into the metrics registry:  Install Diamond using following command: pip install diamond  If additional packages are required to install them, run: sudo apt-get install python-pip python-dev build-essential. pip (Pip Installs Packages) is a Python package manager used to install software written in Python. Configure the metrics handler and interval. Open /etc/diamond/diamond.conf and ensure that there is a section for graphite configuration: [handler_graphite] class = handlers.GraphiteHandler host = <graphite.host> port = <graphite.port> Remember to replace graphite.host and graphite.port with real graphite details. Enable the default Mesos Collector. Create configuration files diamond-setup  - C MesosCollector. Check whether the configuration has proper values and edit them if needed. 
The configuration can be found in /etc/diamond/collectors/MesosCollector.conf. On master, this file should look like this: enabled = True host = localhost port = 5050 While on agent, the port could be different (5051), as follows: enabled = True host = localhost port = 5051 How it works... Mesos exposes metrics via the HTTP API. Diamond is a small process that periodically pulls metrics, parses them, and sends them to the metrics registry, in this case, graphite. The default implementation of Mesos Collector does not store all the available metrics so it's recommended to write a custom handler that will collect all the interesting information. See also... Metrics could be read from the following endpoints: http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/ http://mesos.apache.org/documentation/latest/endpoints/slave/monitor/statistics/  http://mesos.apache.org/documentation/latest/endpoints/slave/state/ Upgrading Mesos In this recipe, you will learn how to upgrade your Mesos cluster. How to do it... Mesos release cadence is at least one release per quarter. Minor releases are backward compatible, although there could be some small incompatibilities or the dropping of deprecated methods. The recommended method of upgrading is to apply all intermediate versions. For example, to upgrade from 0.27.2 to 1.0.0, we should apply 0.28.0, 0.28.1, 0.28.2, and finally 1.0.0. If the agent's configuration changes, clearing the metadata directory is required. You can do this with the following code: rm -rv {MESOS_DIR}/metadata Here, {MESOS_DIR} should be replaced with the configured Mesos directory. Rolling upgrades is the preferred method of upgrading clusters, starting with masters and then agents. To minimize the impact on running tasks, if an agent's configuration changes and it becomes inaccessible, then it should be switched to maintenance mode. How it works... Configuration changes may require clearing the metadata because the changes may not be backward compatible. For example, when an agent runs with different isolators, it shouldn't attach to the already running processes without this isolator. The Mesos architecture will guarantee that the executors that were not attached to the Mesos agent will commit suicide after a configurable amount of time (--executor_registration_timeout). Maintenance mode allows you to declare the time window during which the agent will be inaccessible. When this occurs, Mesos will send a reverse offer to all the frameworks to drain that particular agent. The frameworks are responsible for shutting down its task and spawning it on another agent. The Maintenance mode is applied, even if the framework does not implement the HTTP API or is explicitly declined. Using maintenance mode can prevent restarting tasks multiple times. Consider the following example with five agents and one task, X. We schedule the rolling upgrade of all the agents. Task X is deployed on agent 1. When it goes down, it's moved to 2, then to 3, and so on. This approach is extremely inefficient because the task is restarted five times, but it only needs to be restarted twice. Maintenance mode enables the framework to optimally schedule the task to run on agent 5 when 1 goes down, and then return to 1 when 5 goes down: Worst case scenario of rolling upgrade without maintenance mode legend optimal solution of rolling upgrade with maintenance mode. We have learnt about running and maintaining Mesos. 
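Returning to the monitoring recipe for a moment: if the stock Diamond collector does not capture every metric you care about, a small custom puller is easy to write. The sketch below is not from the book; it assumes a Mesos master on localhost:5050 and a Graphite server accepting the plaintext protocol on port 2003, and the host name and metric prefix are placeholders to adapt:

import json
import socket
import time
import urllib2  # Python 2, matching the python-pip tooling used in this recipe

GRAPHITE_HOST, GRAPHITE_PORT = "graphite.host", 2003
MESOS_ENDPOINT = "http://localhost:5050/metrics/snapshot"
PREFIX = "mesos.master"

def push_snapshot():
    # /metrics/snapshot returns a flat JSON object of metric name -> number.
    metrics = json.load(urllib2.urlopen(MESOS_ENDPOINT))
    now = int(time.time())
    lines = []
    for name, value in metrics.items():
        # Graphite's plaintext protocol: "<path> <value> <timestamp>";
        # slashes in Mesos metric names become dots in the Graphite path.
        path = "%s.%s" % (PREFIX, name.replace("/", "."))
        lines.append("%s %s %d" % (path, value, now))
    sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT))
    sock.sendall("\n".join(lines) + "\n")
    sock.close()

if __name__ == "__main__":
    while True:
        push_snapshot()
        time.sleep(10)  # poll interval; tune to your needs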
To know more about managing containers and understanding the scheduler API you may check out this book, Apache Mesos Cookbook.

Hypothesis testing with R

Richa Tripathi
13 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Learning Quantitative Finance with R written by Dr. Param Jeet and Prashant Vats. This book will help you understand the basics of R and how they can be applied in various Quantitative Finance scenarios.[/box] Hypothesis testing is used to reject or retain a hypothesis based upon the measurement of an observed sample. So in today’s tutorial we will discuss how to implement the various scenarios of hypothesis testing in R. Lower tail test of population mean with known variance The null hypothesis is given by where is the hypothesized lower bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $10. The average of 30 days' daily return sample is $9.9. Assume the population standard deviation is 0.011. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics z which can be computed by the following code in R: > xbar= 9.9 > mu0 = 10 > sig = 1.1 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z Here: xbar: Sample mean mu: Hypothesized value sig: Standard deviation of population n: Sample size z: Test statistics This gives the value of z the test statistics: [1] -0.4979296 Now let us find out the critical value at 0.05 significance level. It can be computed by the following code: > alpha = .05 > z.alpha = qnorm(1-alpha) > -z.alpha This gives the following output: [1] -1.644854 Since the value of the test statistics is greater than the critical value, we fail to reject the null hypothesis claim that the return is greater than $10. In place of using the critical value test, we can use the pnorm function to compute the lower tail of Pvalue test statistics. This can be computed by the following code: > pnorm(z) This gives the following output: [1] 0.3092668 Since the Pvalue is greater than 0.05, we fail to reject the null hypothesis. Upper tail test of population mean with known variance The null hypothesis is given by  where  is the hypothesized upper bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $5. The average of 30 days' daily return sample is $5.1. Assume the population standard deviation is 0.25. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics z, which can be computed by the following code in R: > xbar= 5.1 > mu0 = 5 > sig = .25 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z Here: xbar: Sample mean mu0: Hypothesized value sig: Standard deviation of population n: Sample size z: Test statistics It gives 2.19089 as the value of test statistics. Now let us calculate the critical value at .05 significance level, which is given by the following code: > alpha = .05 > z.alpha = qnorm(1-alpha) > z.alpha This gives 1.644854, which is less than the value computed for the test statistics. Hence we reject the null hypothesis claim. Also, the Pvalue of the test statistics is given as follows: >pnorm(z, lower.tail=FALSE) This gives 0.01422987, which is less than 0.05 and hence we reject the null hypothesis. Two-tailed test of population mean with known variance The null hypothesis is given by  where  is the hypothesized value of the population mean. Let us assume a scenario where the mean of daily returns of a stock last year is $2. The average of 30 days' daily return sample is $1.5 this year. 
Assume the population standard deviation is .1. Can we reject the null hypothesis that there is no significant difference between this year's returns and last year's at the .05 significance level? Let us calculate the test statistic z, which can be computed by the following code in R:

> xbar= 1.5
> mu0 = 2
> sig = .1
> n = 30
> z = (xbar-mu0)/(sig/sqrt(n))
> z

This gives the value of the test statistic as -27.38613. Now let us find the critical values for comparing the test statistic at the .05 significance level. This is given by the following code:

> alpha = .05
> z.half.alpha = qnorm(1-alpha/2)
> c(-z.half.alpha, z.half.alpha)

This gives the values -1.959964 and 1.959964. Since the value of the test statistic is not within the range (-1.959964, 1.959964), we reject the null hypothesis claim that there is no significant difference between this year's returns and last year's at the .05 significance level. The two-tailed Pvalue of the test statistic is given as follows:

> 2*pnorm(z)

This gives a value less than .05, so we reject the null hypothesis.

In all the preceding scenarios, the population variance is known and we use the normal distribution for hypothesis testing. In the next scenarios, we will not be given the variance of the population, so we will use the t distribution for testing the hypothesis.

Lower tail test of population mean with unknown variance

The null hypothesis is given by H0: μ ≥ μ0, where μ0 is the hypothesized lower bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $1. The average of a 30-day sample of daily returns is $0.9. Assume the sample standard deviation is 0.1. Can we reject the null hypothesis at the .05 significance level? In this scenario, we can compute the test statistic by executing the following code:

> xbar= .9
> mu0 = 1
> sig = .1
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

Here:
xbar: Sample mean
mu0: Hypothesized value
sig: Standard deviation of the sample
n: Sample size
t: Test statistic

This gives the value of the test statistic as -5.477226. Now let us compute the critical value at the .05 significance level. This is given by the following code:

> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> -t.alpha

We get the value -1.699127. Since the value of the test statistic is less than the critical value, we reject the null hypothesis claim. Instead of comparing against the critical value, we can use the Pvalue associated with the test statistic, which is given as follows:

> pt(t, df=n-1)

This results in a value less than .05, so we can reject the null hypothesis claim.

Upper tail test of population mean with unknown variance

The null hypothesis is given by H0: μ ≤ μ0, where μ0 is the hypothesized upper bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $3. The average of a 30-day sample of daily returns is $3.1. Assume the sample standard deviation is .2. Can we reject the null hypothesis at the .05 significance level? Let us calculate the test statistic t, which can be computed by the following code in R:

> xbar= 3.1
> mu0 = 3
> sig = .2
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

Here:
xbar: Sample mean
mu0: Hypothesized value
sig: Standard deviation of the sample
n: Sample size
t: Test statistic

This gives 2.738613 as the value of the test statistic. Now let us find the critical value associated with the .05 significance level for the test statistic.
It is given by the following code:

> alpha = .05
> t.alpha = qt(1-alpha, df=n-1)
> t.alpha

Since the critical value 1.699127 is less than the value of the test statistic, we reject the null hypothesis claim. Also, the Pvalue associated with the test statistic is given as follows:

> pt(t, df=n-1, lower.tail=FALSE)

This is less than .05, hence the null hypothesis claim gets rejected.

Two-tailed test of population mean with unknown variance

The null hypothesis is given by H0: μ = μ0, where μ0 is the hypothesized value of the population mean. Let us assume a scenario where the mean of daily returns of a stock last year was $2. The average of a 30-day sample of daily returns this year is $1.9. Assume the sample standard deviation is .1. Can we reject the null hypothesis that there is no significant difference between this year's returns and last year's at the .05 significance level? Let us calculate the test statistic t, which can be computed by the following code in R:

> xbar= 1.9
> mu0 = 2
> sig = .1
> n = 30
> t = (xbar-mu0)/(sig/sqrt(n))
> t

This gives -5.477226 as the value of the test statistic. Now let us find the critical value range for the comparison, which is given by the following code:

> alpha = .05
> t.half.alpha = qt(1-alpha/2, df=n-1)
> c(-t.half.alpha, t.half.alpha)

This gives the range (-2.04523, 2.04523). Since the value of the test statistic falls outside this range, we reject the null hypothesis claim.

We learned how to practically perform one-tailed and two-tailed hypothesis testing with known as well as unknown variance using R. If you enjoyed this excerpt, check out the book Learning Quantitative Finance with R to explore different methods to manage risks and trading using Machine Learning with R.
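As a convenience, the summary-statistic z-tests and t-tests walked through above can be wrapped into one small helper. This is a consolidation sketch rather than code from the book; it takes the same inputs (sample mean, hypothesized mean, standard deviation, and sample size) and returns the test statistic and Pvalue:

# One-sample test from summary statistics.
# known_sigma = TRUE  -> z-test (population sd supplied)
# known_sigma = FALSE -> t-test (sample sd supplied, df = n - 1)
# alternative is one of "less", "greater", "two.sided"
mean_test <- function(xbar, mu0, s, n, known_sigma = TRUE,
                      alternative = "two.sided") {
  stat <- (xbar - mu0) / (s / sqrt(n))
  cdf <- if (known_sigma) function(q, lower) pnorm(q, lower.tail = lower)
         else function(q, lower) pt(q, df = n - 1, lower.tail = lower)
  p <- switch(alternative,
              less      = cdf(stat, TRUE),
              greater   = cdf(stat, FALSE),
              two.sided = 2 * cdf(-abs(stat), TRUE))
  list(statistic = stat, p.value = p)
}

# Reproduces the first example above: z = -0.4979, Pvalue = 0.3093.
mean_test(xbar = 9.9, mu0 = 10, s = 1.1, n = 30,
          known_sigma = TRUE, alternative = "less")

The call shown reproduces the first example in this article, and the same helper covers the remaining scenarios by switching known_sigma and alternative.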

What Tableau Data Handling Engine has to offer

Amarabha Banerjee
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is taken from the book Mastering Tableau, written by David Baldwin. This book will equip you with all the information needed to create effective dashboards and data visualization solutions using Tableau.[/box] In today’s tutorial, we shall explore the Tableau data handling engine and a real world example of how to use it. Tableau's data-handling engine is usually not well comprehended by even advanced authors because it's not an overt part of day-to-day activities; however, for the author who wants to truly grasp how to ready data for Tableau, this understanding is indispensable. In this section, we will explore Tableau's data-handling engine and how it enables structured yet organic data mining processes in the enterprise. To begin, let's clarify a term. The phrase Data-Handling Engine (DHE) in this context references how Tableau interfaces with and processes data. This interfacing and processing is comprised of three major parts: Connection, Metadata, and VizQL. Each part is described in detail in the following section. In other publications, Tableau's DHE may be referred to as a metadata model or the Tableau infrastructure. I've elected not to use either term because each is frequently defined differently in different contexts, which can be quite confusing. Tableau's DHE (that is, the engine for interfacing with and processing data) differs from other broadly considered solutions in the marketplace. Legacy business intelligence solutions often start with structuring the data for an entire enterprise. Data sources are identified, connections are established, metadata is defined, a model is created, and more. The upfront challenges this approach presents are obvious: highly skilled professionals, time-intensive rollout, and associated high startup costs. The payoff is a scalable, structured solution with detailed documentation and process control. Many next generation business intelligence platforms claim to minimize or completely do away with the need for structuring data. The upfront challenges are minimized: specialized skillsets are not required and the rollout time and associated startup costs are low. However, the initial honeymoon is short-lived, since the total cost of ownership advances significantly when difficulties are encountered trying to maintain and scale the solution. Tableau's infrastructure represents a hybrid approach, which attempts to combine the advantages of legacy business intelligence solutions with those of next-generation platforms, while minimizing the shortcomings of both. The philosophical underpinnings of Tableau's hybrid approach include the following: Infrastructure present in current systems should be utilized when advantageous Data models should be accessible by Tableau but not required DHE components as represented in Tableau should be easy to modify DHE components should be adjustable by business users The Tableau Data-Handling Engine The preceding diagram shows that the DHE consists of a run time module (VizQL) and two layers of abstraction (Metadata and Connection). Let's begin at the bottom of the graphic by considering the first layer of abstraction, Connection. The most fundamental aspect of the Connection is a path to the data source. The path should include attributes for the database, tables, and views as applicable. The Connection may also include joins, custom SQL, data-source filters, and more. 
In keeping with Tableau's philosophy of easy to modify and adjustable by business users (see the previous section), each of these aspects of the Connection is easily modifiable. For example, an author may choose to add an additional table to a join or modify a data-source filter. Note that the Connection does not contain any of the actual data. Although an author may choose to create a data extract based on data accessed by the Connection, that extract is separate from the connection.

The next layer of abstraction is the metadata. The most fundamental aspect of the Metadata layer is the determination of each field as a measure or dimension. When connecting to relational data, Tableau makes the measure/dimension determination based on heuristics that consider the data itself as well as the data source's data types. Other aspects of the metadata include aliases, data types, defaults, roles, and more. Additionally, the Metadata layer encompasses author-generated fields such as calculations, sets, groups, hierarchies, bins, and so on. Because the Metadata layer is completely separate from the Connection layer, it can be used with other Connection layers; that is, the same metadata definitions can be used with different data sources.

VizQL is generated when a user places a field on a shelf. The VizQL is then translated into Structured Query Language (SQL), Multidimensional Expressions (MDX), or Tableau Query Language (TQL) and passed to the backend data source via a driver. The following two aspects of the VizQL module are of primary importance:

VizQL allows the author to change field attributions on the fly
VizQL enables table calculations

Let's consider the first of these aspects of VizQL via an example.

Changing field attribution example

An analyst is considering infant mortality rates around the world. Using data from http://data.worldbank.org/, they create a worksheet by placing AVG(Infant Mortality Rate) and Country on the Columns and Rows shelves, respectively. AVG(Infant Mortality Rate) is, of course, treated as a measure in this case. Next, they create a second worksheet to analyze the relationship between Infant Mortality Rate and Health Exp/Capita (that is, health expenditure per capita). In order to accomplish this, they define Infant Mortality Rate as a dimension. Studying the SQL generated by VizQL to create this second visualization is particularly insightful:

SELECT ['World Indicators$'].[Infant Mortality Rate] AS [Infant Mortality Rate],
AVG(['World Indicators$'].[Health Exp/Capita]) AS [avg:Health Exp/Capita:ok]
FROM [dbo].['World Indicators$'] ['World Indicators$']
GROUP BY ['World Indicators$'].[Infant Mortality Rate]

The GROUP BY clause clearly communicates that Infant Mortality Rate is treated as a dimension. The takeaway is to note that VizQL enabled the analyst to change the field usage from measure to dimension without adjusting the source metadata. This on-the-fly ability enables creative exploration of the data not possible with other tools and avoids lengthy exercises attempting to define all possible uses for each field. If you liked our article, be sure to check out Mastering Tableau, which consists of more useful data visualization and data analysis techniques.
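For contrast with the dimension case just shown, the first worksheet (where Infant Mortality Rate remains a measure and Country is the dimension) would produce a query of roughly the following shape. This is a hand-written approximation for illustration, not SQL captured from Tableau:

SELECT ['World Indicators$'].[Country] AS [Country],
AVG(['World Indicators$'].[Infant Mortality Rate]) AS [avg:Infant Mortality Rate:ok]
FROM [dbo].['World Indicators$'] ['World Indicators$']
GROUP BY ['World Indicators$'].[Country]

The aggregation now wraps Infant Mortality Rate and the GROUP BY moves to Country, which is exactly the measure/dimension flip that VizQL performs on the fly when the analyst redefines the field.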

How to Build a music recommendation system with PageRank Algorithm

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering Spark for Data Science written by Andrew Morgan and Antoine Amend. In this book, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms to scale them linearly.[/box] In today’s tutorial, we will learn to build a recommender with PageRank algorithm. The PageRank algorithm Instead of recommending a specific song, we will recommend playlists. A playlist would consist of a list of all our songs ranked by relevance, most to least relevant. Let's begin with the assumption that people listen to music in a similar way to how they browse articles on the web, that is, following a logical path from link to link, but occasionally switching direction, or teleporting, and browsing to a totally different website. Continuing with the analogy, while listening to music one can either carry on listening to music of a similar style (and hence follow their most expected journey), or skip to a random song in a totally different genre. It turns out that this is exactly how Google ranks websites by popularity using a PageRank algorithm. For more details on the PageRank algorithm visit: h t t p ://i l p u b s . s t a n f o r d . e d u :8090/422/1/1999- 66. p d f . The popularity of a website is measured by the number of links it points to (and is referred from). In our music use case, the popularity is built as the number hashes a given song shares with all its neighbors. Instead of popularity, we introduce the concept of song commonality. Building a Graph of Frequency Co-occurrence We start by reading our hash values back from Cassandra and re-establishing the list of song IDs for each distinct hash. Once we have this, we can count the number of hashes for each song using a simple reduceByKey function, and because the audio library is relatively small, we collect and broadcast it to our Spark executors: val hashSongsRDD = sc.cassandraTable[HashSongsPair]("gzet", "hashes") val songHashRDD = hashSongsRDD flatMap { hash => hash.songs map { song => ((hash, song), 1) } } val songTfRDD = songHashRDD map { case ((hash,songId),count) => (songId, count) } reduceByKey(_+_) val songTf = sc.broadcast(songTfRDD.collectAsMap()) Next, we build a co-occurrence matrix by getting the cross product of every song sharing a same hash value, and count how many times the same tuple is observed. Finally, we wrap the song IDs and the normalized (using the term frequency we just broadcast) frequency count inside of an Edge class from GraphX: implicit class Crossable[X](xs: Traversable[X]) { def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y) val crossSongRDD = songHashRDD.keys .groupByKey() .values .flatMap { songIds => songIds cross songIds filter { case (from, to) => from != to }.map(_ -> 1) }.reduceByKey(_+_) .map { case ((from, to), count) => val weight = count.toDouble / songTfB.value.getOrElse(from, 1) Edge(from, to, weight) }.filter { edge => edge.attr > minSimilarityB.value } val graph = Graph.fromEdges(crossSongRDD, 0L) We are only keeping edges with a weight (meaning a hash co-occurrence) greater than a predefined threshold in order to build our hash frequency graph. Running PageRank Contrary to what one would normally expect when running a PageRank, our graph is undirected. It turns out that for our recommender, the lack of direction does not matter, since we are simply trying to find similarities between Led Zeppelin and Spirit. 
A possible way of introducing direction could be to look at the song publishing date. In order to find musical influences, we could certainly introduce a chronology from the oldest to newest songs giving directionality to our edges. In the following pageRank, we define a probability of 15% to skip, or teleport as it is known, to any random song, but this can be obviously tuned for different needs: val prGraph = graph.pageRank(0.001, 0.15) Finally, we extract the page ranked vertices and save them as a playlist in Cassandra via an RDD of the Song case class: case class Song(id: Long, name: String, commonality: Double) val vertices = prGraph .vertices .mapPartitions { vertices => val songIds = songIdsB .value .vertices .map { case (songId, pr) => val songName = songIds.get(vId).get Song(songId, songName, pr) } } vertices.saveAsCassandraTable("gzet", "playlist") The reader may be pondering the exact purpose of PageRank here, and how it could be used as a recommender? In fact, our use of PageRank means that the highest ranking songs would be the ones that share many frequencies with other songs. This could be due to a common arrangement, key theme, or melody; or maybe because a particular artist was a major influence on a musical trend. However, these songs should be, at least in theory, more popular (by virtue of the fact they occur more often), meaning that they are more likely to have mass appeal. On the other end of the spectrum, low ranking songs are ones where we did not find any similarity with anything we know. Either these songs are so avant-garde that no one has explored these musical ideas before, or alternatively are so bad that no one ever wanted to copy them! Maybe they were even composed by that up-and-coming artist you were listening to in your rebellious teenage years. Either way, the chance of a random user liking these songs is treated as negligible. Surprisingly, whether it is a pure coincidence or whether this assumption really makes sense, the lowest ranked song from this particular audio library is Daft Punk's–Motherboard it is a title that is quite original (a brilliant one though) and a definite unique sound. To summarize, we have learnt how to build a complete recommendation system for a song playlist. You can check out the book Mastering Spark for Data Science to deep dive into Spark and deliver other production grade data science solutions. Read our post on how deep learning is revolutionizing the music industry. And here is how you can analyze big data using the pagerank algorithm.  
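To close the loop on the recipe, the playlist saved to Cassandra can be read back and used directly as a recommendation list. Below is a small sketch, assuming the gzet.playlist table and the Song case class defined above, with the same spark-cassandra-connector imports in scope (topN is an arbitrary choice):

// Pull the ranked playlist back from Cassandra and keep the most "common" songs.
val topN = 10
val playlist = sc.cassandraTable[Song]("gzet", "playlist")

val recommended = playlist
  .sortBy(song => song.commonality, ascending = false)  // highest PageRank first
  .take(topN)

recommended.zipWithIndex.foreach { case (song, rank) =>
  println(s"#${rank + 1} ${song.name} (commonality: ${song.commonality})")
}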

Neural Network Architectures 101: Understanding Perceptrons

Kunal Chaudhari
12 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Neural Network Programming with Java Second Edition written by Fabio M. Soares and Alan M. F. Souza. This book is for Java developers who want to master developing smarter applications like weather forecasting, pattern recognition etc using neural networks. [/box] In this article we will discuss about perceptrons along with their features, applications and limitations. Perceptrons are a very popular neural network architecture that implements supervised learning. Projected by Frank Rosenblatt in 1957, it has just one layer of neurons, receiving a set of inputs and producing another set of outputs. This was one of the first representations of neural networks to gain attention, especially because of their simplicity. In our Java implementation, this is illustrated with one neural layer (the output layer). The following code creates a perceptron with three inputs and two outputs, having the linear function at the output layer: int numberOfInputs=3; int numberOfOutputs=2; Linear outputAcFnc = new Linear(1.0); NeuralNet perceptron = new NeuralNet(numberOfInputs,numberOfOutputs,             outputAcFnc); Applications and limitations However, scientists did not take long to conclude that a perceptron neural network could only be applied to simple tasks, according to that simplicity. At that time, neural networks were being used for simple classification problems, but perceptrons usually failed when faced with more complex datasets. Let's illustrate this with a very basic example (an AND function) to understand better this issue. Linear separation The example consists of an AND function that takes two inputs, x1 and x2. That function can be plotted in a two-dimensional chart as follows: And now let's examine how the neural network evolves the training using the perceptron rule, considering a pair of two weights, w1 and w2, initially 0.5, and bias valued 0.5 as well. Assume learning rate η equals 0.2: Epoch x1 x2 w1 w2 b y t E Δw1 Δw2 Δb 1 0 0 0.5 0.5 0.5 0.5 0 -0.5 0 0 -0.1 1 0 1 0.5 0.5 0.4 0.9 0 -0.9 0 -0.18 -0.18 1 1 0 0.5 0.32 0.22 0.72 0 -0.72 -0.144 0 -0.144 1 1 1 0.356 0.32 0.076 0.752 1 0.248 0.0496 0.0496 0.0496 2 0 0 0.406 0.370 0.126 0.126 0 -0.126 0.000 0.000 -0.025 2 0 1 0.406 0.370 0.100 0.470 0 -0.470 0.000 -0.094 -0.094 2 1 0 0.406 0.276 0.006 0.412 0 -0.412 -0.082 0.000 -0.082 2 1 1 0.323 0.276 -0.076 0.523 1 0.477 0.095 0.095 0.095 … … 89 0 0 0.625 0.562 -0.312 -0.312 0 0.312 0 0 0.062 89 0 1 0.625 0.562 -0.25 0.313 0 -0.313 0 -0.063 -0.063 89 1 0 0.625 0.500 -0.312 0.313 0 -0.313 -0.063 0 -0.063 89 1 1 0.562 0.500 -0.375 0.687 1 0.313 0.063 0.063 0.063 After 89 epochs, we find the network to produce values near to the desired output. Since in this example the outputs are binary (zero or one), we can assume that any value produced by the network that is below 0.5 is considered to be 0 and any value above 0.5 is considered to be 1. So, we can draw a function , with the final weights and bias found by the learning algorithm w1=0.562, w2=0.5 and b=-0.375, defining the linear boundary in the chart: This boundary is a definition of all classifications given by the network. You can see that the boundary is linear, given that the function is also linear. Thus, the perceptron network is really suitable for problems whose patterns are linearly separable. The XOR case Now let's analyze the XOR case: We see that in two dimensions, it is impossible to draw a line to separate the two patterns. 
What would happen if we tried to train a single-layer perceptron to learn the XOR function? Let's see what happens in the following table:

Epoch | x1 | x2 | w1     | w2     | b     | y     | t | E      | Δw1    | Δw2    | Δb
1     | 0  | 0  | 0.5    | 0.5    | 0.5   | 0.5   | 0 | -0.5   | 0      | 0      | -0.1
1     | 0  | 1  | 0.5    | 0.5    | 0.4   | 0.9   | 1 | 0.1    | 0      | 0.02   | 0.02
1     | 1  | 0  | 0.5    | 0.52   | 0.42  | 0.92  | 1 | 0.08   | 0.016  | 0      | 0.016
1     | 1  | 1  | 0.516  | 0.52   | 0.436 | 1.472 | 0 | -1.472 | -0.294 | -0.294 | -0.294
2     | 0  | 0  | 0.222  | 0.226  | 0.142 | 0.142 | 0 | -0.142 | 0.000  | 0.000  | -0.028
2     | 0  | 1  | 0.222  | 0.226  | 0.113 | 0.339 | 1 | 0.661  | 0.000  | 0.132  | 0.132
2     | 1  | 0  | 0.222  | 0.358  | 0.246 | 0.467 | 1 | 0.533  | 0.107  | 0.000  | 0.107
2     | 1  | 1  | 0.328  | 0.358  | 0.352 | 1.038 | 0 | -1.038 | -0.208 | -0.208 | -0.208
…     |    |    |        |        |       |       |   |        |        |        |
127   | 0  | 0  | -0.250 | -0.125 | 0.625 | 0.625 | 0 | -0.625 | 0.000  | 0.000  | -0.125
127   | 0  | 1  | -0.250 | -0.125 | 0.500 | 0.375 | 1 | 0.625  | 0.000  | 0.125  | 0.125
127   | 1  | 0  | -0.250 | 0.000  | 0.625 | 0.375 | 1 | 0.625  | 0.125  | 0.000  | 0.125
127   | 1  | 1  | -0.125 | 0.000  | 0.750 | 0.625 | 0 | -0.625 | -0.125 | -0.125 | -0.125

The perceptron simply could not find any pair of weights (and bias) that would drive the error below 0.625; after 127 epochs it is still cycling through the same values. This can be explained mathematically: as we already saw from the chart, this function is not linearly separable in two dimensions.

So what if we add another dimension? If we look at the chart in three dimensions, it becomes possible to draw a plane that separates the patterns, provided that this additional dimension transforms the input data appropriately. Okay, but now there is an additional problem: how could we derive this additional dimension, since we have only two input variables? One obvious, if somewhat makeshift, answer would be to add a third variable derived from the two original ones. With this derived third variable, the perceptron now has three inputs, one of them being a composition of the other two. This leads to a new question: how should that composition be processed? We can see that this component could act as a neuron, giving the neural network a nested architecture. And if so, there would be yet another question: how would the weights of this new neuron be trained, since the error is measured on the output neuron?

Multi-layer perceptrons

As we can see, one simple example in which the patterns are not linearly separable has led us to more and more issues with the perceptron architecture. That need led to the multi-layer perceptron. It is already well established that natural neural networks are structured in layers, and that each layer captures pieces of information from a specific environment. In artificial neural networks, layers of neurons act in the same way, extracting and abstracting information from data and transforming it into another dimension or shape.

In the XOR example, we found the solution to be the addition of a third component that would make a linear separation possible, but a few questions remained about how that third component should be computed. Now let's consider the same solution as a two-layer perceptron. We now have three neurons instead of just one, and the information transferred by the previous layer is transformed into another dimension or shape at the output, where it becomes theoretically possible to establish a linear boundary on those data points. However, the question of how to find the weights for the first layer remains unanswered: can we apply the same training rule to neurons other than the output? We are going to deal with this issue in the Generalized delta rule section.
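As a preview, such a two-layer solution could be declared with the same classes used for the single-layer perceptron earlier, using the NeuralNet constructor that appears in the Coding an MLP section below. This is only a sketch: the choice of two sigmoid neurons in the hidden layer is our assumption (the classic minimal XOR network), and actually training the network requires the generalized delta rule mentioned above:

int numberOfInputs = 2;
int numberOfOutputs = 1;
int[] numberOfHiddenNeurons = {2};   // one hidden layer with two neurons (our assumption)

Sigmoid hiddenAcFnc = new Sigmoid(1.0);
Linear outputAcFnc = new Linear(1.0);
NeuralNet xorNet = new NeuralNet(numberOfInputs, numberOfOutputs,
        numberOfHiddenNeurons, hiddenAcFnc, outputAcFnc);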
MLP properties

Multi-layer perceptrons can have any number of layers and any number of neurons in each layer, and the activation functions may differ from layer to layer. An MLP network is usually composed of at least two layers: one output layer and one hidden layer. Some references also count the input layer, the nodes that collect the input data, so in those cases the MLP is considered to have at least three layers. For the purposes of this article, let's treat the input layer as a special type of layer that has no weights, and consider the hidden and output layers as the effective layers, that is, the layers that can actually be trained.

A hidden layer is called that because it hides its outputs from the external world. Hidden layers can be connected in series in any number, thus forming a deep neural network. However, the more layers a neural network has, the slower both training and running become, and, according to theoretical results, a neural network with only one or two hidden layers may learn as well as a deep neural network with dozens of hidden layers, although this depends on several factors.

MLP weights

In an MLP feedforward network, a particular neuron i receives data from a neuron j of the previous layer and forwards its output to a neuron k of the next layer. The mathematical description of such a network is recursive; for the case of a single hidden layer (l = 1), the output can be written as

y_o = f_o( Σ_{i=1..n_h1} w_i · f_i( Σ_j w_ij · x_j + b_i ) + b_o )

Here, y_o is the network output (should we have multiple outputs, we can replace y_o with Y, representing a vector); f_o is the activation function of the output; l is the number of hidden layers; n_hi is the number of neurons in hidden layer i; w_i is the weight connecting the i-th neuron of the last hidden layer to the output; f_i is the activation function of neuron i; and b_i is the bias of neuron i (w_ij denotes the weight connecting neuron j of the previous layer to neuron i, and b_o the bias of the output). With more hidden layers, the inner term f_i(·) is expanded recursively in the same way, so the equation gets larger as the number of layers increases; in the last (innermost) summing operation, the terms are the inputs x_i.

Recurrent MLP

The neurons of an MLP may feed signals not only to neurons in the next layers (a feedforward network), but also to neurons in the same or previous layers (a feedback, or recurrent, network). This behavior allows the neural network to maintain state over a data sequence, a feature that is especially exploited when dealing with time series or handwriting recognition. Recurrent networks are usually harder to train, and the computer may eventually run out of memory while executing them. In addition, there are recurrent architectures better suited to such tasks than a recurrent MLP, such as Elman, Hopfield, echo state, and bidirectional recurrent neural networks (RNNs), but we are not going to dive deep into these architectures here.

Coding an MLP

Bringing these concepts into the object-oriented point of view, we can review the classes already designed so far. One can see that the neural network structure is hierarchical: a neural network is composed of layers, which are composed of neurons. In the MLP architecture, there are three types of layers: input, hidden, and output. So suppose that, in Java, we would like to define a neural network consisting of three inputs, one output (linear activation function), and one hidden layer (sigmoid function) containing five neurons.
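Before writing that declaration with the book's NeuralNet class, it may help to see what the recursive formula from the MLP weights section boils down to for this three-input network with five sigmoid hidden neurons and a linear output. The following standalone sketch is ours for illustration only; the class and variable names, the hard-coded sizes, and the randomly initialized weights are assumptions, not the book's implementation:

import java.util.Random;

public class MlpForwardPassSketch {
    public static void main(String[] args) {
        double[] x = {0.1, 0.2, 0.3};           // three inputs (placeholder values)
        double[][] wHidden = new double[5][3];  // weights from the inputs to the 5 hidden neurons
        double[] bHidden = new double[5];       // hidden biases b_i
        double[] wOut = new double[5];          // weights w_i from the hidden neurons to the output
        double bOut = 0.1;                      // output bias (placeholder)

        Random rnd = new Random(42);            // random placeholder weights
        for (int i = 0; i < 5; i++) {
            bHidden[i] = rnd.nextDouble() - 0.5;
            wOut[i] = rnd.nextDouble() - 0.5;
            for (int j = 0; j < 3; j++) {
                wHidden[i][j] = rnd.nextDouble() - 0.5;
            }
        }

        double sumOut = bOut;
        for (int i = 0; i < 5; i++) {
            double sumHidden = bHidden[i];
            for (int j = 0; j < 3; j++) {
                sumHidden += wHidden[i][j] * x[j];          // innermost sum over the inputs x_j
            }
            double hiddenOut = 1.0 / (1.0 + Math.exp(-sumHidden)); // sigmoid activation f_i
            sumOut += wOut[i] * hiddenOut;                  // outer sum over the hidden neurons
        }
        double yo = sumOut;                                 // linear output activation f_o
        System.out.println("network output: " + yo);
    }
}

This is, in essence, the computation that the layer and neuron classes discussed above encapsulate.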
Using the classes reviewed above, the resulting code would be as follows:

int numberOfInputs = 3;
int numberOfOutputs = 1;
int[] numberOfHiddenNeurons = {5};

Linear outputAcFnc = new Linear(1.0);
Sigmoid hiddenAcFnc = new Sigmoid(1.0);
NeuralNet neuralnet = new NeuralNet(numberOfInputs, numberOfOutputs,
        numberOfHiddenNeurons, hiddenAcFnc, outputAcFnc);

To summarize, we saw how perceptrons can be applied to solve linear separation problems, their limitations in classifying nonlinear data, and how to overcome those limitations with multi-layer perceptrons (MLPs). If you enjoyed this excerpt, check out the book Neural Network Programming with Java, Second Edition for a better understanding of neural networks and how they fit into different real-world projects.