You're reading from Learning Spark SQL

Product type: Book
Published in: Sep 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785888359
Edition: 1st

Chapter 9. Developing Applications with Spark SQL

In this chapter, we will present several examples of developing applications using Spark SQL. We will primarily focus on text analysis-based applications, including preprocessing pipelines, bag-of-words techniques, computing readability metrics for financial documents, identifying themes in document corpuses, and using Naive Bayes classifiers. Additionally, we will describe the implementation of a machine learning example.

More specifically, you will learn about the following in this chapter:

  • Spark SQL-based application development
  • Preprocessing textual data
  • Building preprocessing data pipelines
  • Identifying themes in document corpuses
  • Using Naive Bayes classifiers
  • Developing a machine learning application

Introducing Spark SQL applications


Machine learning, predictive analytics, and related data science topics are becoming increasingly popular for solving real-world problems across business domains. These applications are driving mission-critical business decision making in many organizations. Examples of such applications include recommendation engines, targeted advertising, speech recognition, fraud detection, image recognition and categorization, and so on. Spark (and Spark SQL) is increasingly becoming the platform of choice for these large-scale distributed applications.

With the availability of online data sources for financial news, earnings conference calls, regulatory filings, social media, and so on, interest in the automated and intelligent analysis of textual and other unstructured data, available in various formats including text, audio, and video, has proliferated. These applications include sentiment analysis of regulatory filings, large-scale automated analysis of news articles and...

Understanding text analysis applications


The inherent nature of language and writing leads to problems of high dimensionality when analyzing documents. Hence, some of the most widely used textual methods rely on the critical assumption of independence, where the order and direct context of a word are not important. Methods in which word sequence is ignored are typically labeled "bag-of-words" techniques.

Textual analysis is a lot more imprecise than quantitative analysis. Textual data requires an additional step of translating the text into quantitative measures, which are then used as inputs for various text-based analytics or ML methods. Many of these methods are based on deconstructing a document into a term-document matrix, consisting of rows of terms and columns of documents, with the cells holding word counts.
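As a concrete illustration of this deconstruction, the following is a minimal sketch in plain Scala (not Spark, and not the book's code) that builds a small term-document matrix; the sample documents are hypothetical:

```scala
// A minimal sketch of a term-document matrix: rows are vocabulary terms,
// columns are documents, and each cell holds that term's count in that
// document.
object TermDocMatrix {
  def tokenize(doc: String): Seq[String] =
    doc.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Returns (vocabulary, matrix) where matrix(i)(j) is the count of
  // vocabulary(i) in document j.
  def build(docs: Seq[String]): (Seq[String], Array[Array[Int]]) = {
    val tokenized = docs.map(tokenize)
    val vocab = tokenized.flatten.distinct.sorted
    val matrix = vocab.map { term =>
      tokenized.map(tokens => tokens.count(_ == term)).toArray
    }.toArray
    (vocab, matrix)
  }
}
```

Each row of the resulting matrix corresponds to one vocabulary term, and each column to one input document.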

In applications using a bag of words, the approach to normalizing the word counts is important because the raw counts are directly dependent on the document length. A simple use of proportions can address this problem, however...
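One simple way to apply such length normalization, sketched here in plain Scala with illustrative counts, is to divide each term's count by the document's total token count:

```scala
// A sketch of length-normalizing a bag of words: raw counts become
// proportions of the document's total tokens, so documents of different
// lengths become comparable. The counts below are illustrative.
def toProportions(counts: Map[String, Int]): Map[String, Double] = {
  val total = counts.values.sum.toDouble
  counts.map { case (term, count) => term -> count / total }
}

val normalized = toProportions(Map("spark" -> 2, "sql" -> 1, "fast" -> 1))
// "spark" now carries weight 0.5 regardless of document length
```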

Understanding themes in document corpuses


Bag-of-words-based techniques can also be used to classify common themes in documents, or to identify themes within a corpus of documents. Broadly, these techniques, like most others, attempt to reduce the dimensionality of the term-document matrix, in this case based on each word's relation to latent variables.

One of the earliest approaches to this type of classification was Latent Semantic Analysis (LSA). LSA can avoid the limitations of count-based methods associated with synonyms and with terms that have multiple meanings. Over the years, the concept of LSA has evolved into another model called Latent Dirichlet Allocation (LDA).

LDA allows us to identify the latent thematic structure in a collection of documents. Both LSA and LDA use the term-document matrix to reduce the dimensionality of the term space and to produce the topic weights. A constraint of both the LSA and LDA techniques is that they work best when applied to large documents.
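As a hedged sketch of how this can look on Spark (not the book's code), the ML library's LDA estimator can be fit on term-count vectors; a SparkSession `spark` and a DataFrame `docsDF` with a tokenized "words" column are assumed, and the parameters are illustrative:

```scala
// Sketch: fit Spark ML's LDA on term-count vectors to extract topic
// weights. `docsDF` (with an array-of-strings "words" column) is an
// assumed input, not defined in the book's text.
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.clustering.LDA

val cv = new CountVectorizer()
  .setInputCol("words")     // tokenized documents (assumed column name)
  .setOutputCol("features") // term-count vectors consumed by LDA
val countsDF = cv.fit(docsDF).transform(docsDF)

val lda = new LDA().setK(9).setMaxIter(20) // 9 latent topics (illustrative)
val model = lda.fit(countsDF)
model.describeTopics(5).show() // top terms per discovered topic
```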

Note

For more detailed explanation...

Using Naive Bayes classifiers


Naive Bayes classifiers are a family of probabilistic classifiers based on applying Bayes' theorem of conditional probability. These classifiers assume independence between the features. Naive Bayes is often the baseline method for text categorization, with word frequencies as the feature set. Despite their strong independence assumptions, Naive Bayes classifiers are fast and easy to implement; hence, they are used very commonly in practice.
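To make the word-frequency formulation concrete, here is a compact multinomial Naive Bayes sketch in plain Scala (not Spark's implementation), with add-one (Laplace) smoothing; the training documents are hypothetical:

```scala
// Multinomial Naive Bayes over word counts: score(class) =
// log P(class) + sum over tokens of log P(token | class), with add-one
// smoothing so unseen (class, word) pairs get nonzero probability.
class SimpleNaiveBayes(docs: Seq[(String, String)]) { // (label, text) pairs
  private def tokenize(t: String): Seq[String] =
    t.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  private val tokenized = docs.map { case (label, text) => (label, tokenize(text)) }
  private val vocab = tokenized.flatMap(_._2).distinct
  private val labels = docs.map(_._1).distinct

  // Log prior: fraction of training documents per class
  private val logPrior: Map[String, Double] =
    labels.map(l => l -> math.log(docs.count(_._1 == l).toDouble / docs.size)).toMap

  // Log likelihood per (label, word) with add-one smoothing
  private val logLikelihood: Map[(String, String), Double] = {
    val wordsByLabel = tokenized.groupBy(_._1).map { case (l, ds) => l -> ds.flatMap(_._2) }
    (for {
      l <- labels
      words = wordsByLabel(l)
      total = words.size + vocab.size
      w <- vocab
    } yield (l, w) -> math.log((words.count(_ == w) + 1.0) / total)).toMap
  }

  def predict(text: String): String = {
    val tokens = tokenize(text).filter(vocab.contains)
    labels.maxBy(l => logPrior(l) + tokens.map(w => logLikelihood((l, w))).sum)
  }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.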

While Naive Bayes is very popular, it also suffers from errors that can lead to favoring one class over the other(s). For example, skewed data can cause the classifier to favor one class over another. Similarly, the independence assumption can lead to erroneous classification weights that favor one class over another.

Note

For specific heuristics for dealing with problems associated with Naive Bayes classifiers, refer to Tackling the Poor Assumptions of Naive Bayes Text Classifiers, by Rennie, Shih, et al., at https://people.csail.mit.edu...

Developing a machine learning application


In this section, we will present a machine learning example for textual analysis. Refer to Chapter 6, Using Spark SQL in Machine Learning Applications, for more details on the machine learning code presented in this section.

The dataset used in the following example contains 1,080 documents of free-text business descriptions of Brazilian companies, categorized into one of nine categories. You can download this dataset from https://archive.ics.uci.edu/ml/datasets/CNAE-9.

scala> import org.apache.spark.sql.Row

scala> val inRDD = spark.sparkContext.textFile("file:///Users/aurobindosarkar/Downloads/CNAE-9.data")

scala> val rowRDD = inRDD.map(_.split(",")).map(attributes => Row(attributes(0).toDouble, attributes(1).toDouble, attributes(2).toDouble, attributes(3).toDouble, attributes(4).toDouble, attributes(5).toDouble,
...
attributes(852).toDouble, attributes(853).toDouble, attributes(854).toDouble, attributes(855).toDouble, attributes(856).toDouble))

Next, we define a schema for the input...
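The book's schema code is elided here; one hedged sketch of how a schema for the 857 numeric columns might be built programmatically (the field names `w0` through `w856` are illustrative, not the book's) is:

```scala
// Sketch: generate 857 DoubleType fields rather than writing them out
// by hand. Field names here are illustrative placeholders.
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val schema = StructType(
  (0 to 856).map(i => StructField(s"w$i", DoubleType, nullable = false))
)
// val inDF = spark.createDataFrame(rowRDD, schema)
```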

Summary


In this chapter, we introduced a few Spark SQL applications in the textual analysis space. We provided detailed code examples, including building a data preprocessing pipeline, implementing sentiment analysis, using a Naive Bayes classifier with n-grams, and implementing an LDA application to identify themes in a document corpus. Additionally, we worked through the details of implementing a machine learning example.

In the next chapter, we will focus on use cases for using Spark SQL in deep learning applications. We will explore a few of the emerging deep learning libraries and present examples of implementing deep learning-related applications.
