Chapter 6. NLP with Spark

In this chapter, we will see how to run NLP algorithms over Spark. We will cover the following recipes:

  • Installing NLTK on Linux

  • Installing Anaconda on Linux

  • Anaconda for cluster management

  • POS tagging with PySpark on an Anaconda cluster

  • Named Entity Recognition with IPython over Spark

  • Implementing OpenNLP - chunker over Spark

  • Implementing OpenNLP - sentence detector over Spark

  • Implementing Stanford NLP - lemmatization over Spark

  • Implementing sentiment analysis using Stanford NLP over Spark

Introduction


Natural language processing (NLP) is the study of how computers can work with the nuances of human language, and of building real-world applications using NLP techniques. NLP is analogous to teaching a language to a child: the most common tasks, such as understanding words and sentences or forming grammatically and structurally correct sentences, are natural to humans. In NLP, some of these tasks translate to tokenization, chunking, part-of-speech tagging, parsing, machine translation, and speech recognition, and these are tough challenges for computers.

Currently, NLP is one of the rarest skill sets required in the industry. With the advent of big data, the major challenge is the need for people who are good not just with structured data, but also with semi-structured and unstructured data. Petabytes of weblogs, tweets, Facebook feeds, chats, e-mails, and reviews are generated continuously. Companies are collecting all these different kinds of data for better customer...

Installing NLTK on Linux


In this recipe, we will see how to install NLTK on Linux. Before proceeding with the installation, let's consider the version of Python we're going to use. There are two versions, or flavors, of Python, namely Python 2.7.x and Python 3.x. Although the latest version, Python 3.x, appears to be the better choice, Python 2.7 is recommended for scientific, numeric, and data analysis work.

Getting ready

To step through this recipe, you need Ubuntu 14.04 (Linux flavor) installed on the machine. Python comes pre-installed. The python --version command gives the version of Python installed. If the version is 2.6.x, upgrade it to Python 2.7 as follows:

    sudo apt-get install python2.7

How to do it…

Let's see the installation process for NLTK:

  1. Once the Python 2.7.x version is available, install NLTK as follows:

          sudo pip install -U nltk
    
  2. The preceding installation may throw an error such as the following:

           Could not find any downloads that satisfy the requirement...
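If the command does succeed, a quick sanity check is to import NLTK and fetch the data packages that the later recipes in this chapter rely on. The following is a minimal verification sketch, not part of the installation itself (the package names are NLTK's standard data identifiers):

    # Verify the NLTK install and download the data used later in this
    # chapter: tokenizer models, POS tagger, NER chunker, and WordNet.
    import nltk

    print(nltk.__version__)

    for pkg in ['punkt', 'averaged_perceptron_tagger',
                'maxent_ne_chunker', 'words', 'wordnet']:
        nltk.download(pkg)

    # Smoke test: tokenize and POS-tag a sentence.
    tokens = nltk.word_tokenize("Spark makes large-scale NLP practical.")
    print(nltk.pos_tag(tokens))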

Installing Anaconda on Linux


Anaconda is a free, enterprise-ready Python distribution for data analytics, processing, and scientific computing. In this recipe, we will see how to install Anaconda on Linux. Before proceeding with the installation, let's consider the version of Python we're going to use. There are two versions, or flavors, of Python, namely Python 2.7.x and Python 3.x. Although the latest version, Python 3.x, appears to be the better choice, Python 2.7 is recommended for scientific, numeric, and data analysis work.

Getting ready

To step through this recipe, you need Ubuntu 14.04 (Linux flavor) installed on the machine. Python comes pre-installed. python --version gives the version of Python installed. If the version is 2.6.x, upgrade it to Python 2.7 as follows:

    sudo apt-get install python2.7

How to do it…

Once Python version 2.7.x is available, download the Anaconda installer from https://www.continuum.io/downloads and type the following in the terminal window at the path...
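Once the installer has finished and the shell has been restarted, you can confirm from Python itself that the Anaconda interpreter is now the active one; on an Anaconda build, the version string typically mentions Anaconda. This is a minimal check rather than part of the installer's own steps:

    # Check which Python interpreter is active. With Anaconda on the
    # PATH, sys.executable points into the anaconda directory and
    # sys.version typically mentions Anaconda.
    import sys

    print(sys.executable)
    print(sys.version)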

Anaconda for cluster management


Anaconda for cluster management provides resource management tools that allow users to easily create, provision, and manage bare-metal or cloud-based clusters. It enables the management of conda environments on clusters and provides integration, configuration, and setup management of Hadoop services. It can be installed alongside enterprise Hadoop distributions such as Cloudera CDH or Hortonworks HDP, and is used to manage conda packages and environments across a cluster.

Getting ready

To step through this recipe, you need Ubuntu 14.04 (Linux flavor) installed on the machine. Python comes pre-installed. python --version gives the version of Python installed. If the version is 2.6.x, upgrade it to Python 2.7 as follows:

    sudo apt-get install python2.7

For installing Anaconda, please refer to the earlier Installing Anaconda on Linux recipe.

How to do it…

Let's look at the process for installing Anaconda for cluster management:

  1. You can create...

POS tagging with PySpark on an Anaconda cluster


Part-of-speech (POS) tagging is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on. Tagging is a necessary step before chunking: with part-of-speech tags, a chunker knows how to identify phrases based on tag patterns. POS tags are also used for grammar analysis and word sense disambiguation.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, have PySpark and Anaconda installed on the Linux machine, that is, Ubuntu 14.04. For installing Anaconda, please refer to the earlier recipes.

How to do it…

Let's see how to implement POS tagging using PySpark:

  1. Activate the Anaconda cluster as follows:

            source activate acluster
    
  2. Install the...
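The heart of this recipe can be sketched end to end as follows. This is a minimal illustration, assuming it runs in a PySpark shell (so sc, the SparkContext, is already defined) and that NLTK, together with its punkt and averaged_perceptron_tagger data, is available on every worker node, for example through the shared Anaconda environment:

    # Minimal sketch: POS tagging sentences in parallel with NLTK on
    # PySpark. The import happens inside the function so that it runs
    # on the workers, once per partition.

    def tag_partition(sentences):
        import nltk
        for sentence in sentences:
            tokens = nltk.word_tokenize(sentence)
            yield nltk.pos_tag(tokens)   # a list of (word, tag) tuples

    sentences = sc.parallelize([
        "Spark runs NLP algorithms at scale.",
        "POS tags drive chunking and disambiguation."
    ])

    for tagged in sentences.mapPartitions(tag_partition).collect():
        print(tagged)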

Named Entity Recognition with IPython over Spark


Apart from POS tagging, one of the most common labeling problems is finding entities in the text. Typically, named entities are names of people, locations, and organizations. There are NER systems that tag more entity types than just these three, labeling named entities using the context and other features. There is a lot more research going on in this area of NLP, where people are trying to tag biomedical entities, product entities, and so on.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, have PySpark and IPython installed on the Linux machine, that is, Ubuntu 14.04. For installing IPython, please refer to the Using IPython with PySpark recipe in Chapter 2, Tricky Statistics with Spark.

How to do it…

  1. Download and install the NLTK data as follows:

          ipython console --profile=pyspark
          In [1]: from...
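The core idea of the recipe, running NLTK's named-entity chunker inside a Spark job, can be sketched as follows. It assumes a PySpark shell (sc already defined) and the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words data on every worker:

    # Minimal sketch: named entity recognition with NLTK on PySpark.
    # Keeps only the labelled subtrees (PERSON, GPE, ORGANIZATION, ...)
    # from the chunk tree that nltk.ne_chunk returns.

    def extract_entities(sentences):
        import nltk
        for sentence in sentences:
            tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
            tree = nltk.ne_chunk(tagged)
            yield [(subtree.label(),
                    ' '.join(word for word, tag in subtree.leaves()))
                   for subtree in tree.subtrees()
                   if subtree.label() != 'S']

    docs = sc.parallelize([
        "Barack Obama visited Paris with officials from Google."
    ])
    print(docs.mapPartitions(extract_entities).collect())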

Implementing OpenNLP - chunker over Spark


Chunking is shallow parsing: instead of retrieving the deep structure of a sentence, we try to group together the chunks of the sentence that constitute some meaning. A chunk is defined as the minimal unit that can be processed. The conventional pipeline in chunking is to tokenize the input string and POS-tag the tokens before they are given to any chunker.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Install Hadoop (optionally), Scala, and Java.

How to do it…

Let's see how to run the OpenNLP chunker over Spark:

  1. Let's start an application named SparkNLP. Initially specify the following libraries in the build.sbt file:

         libraryDependencies ++= Seq(
         "org.apache.spark" %% "spark-core" % "1.6.0",...

Implementing OpenNLP - sentence detector over Spark


Partitioning text into sentences is called Sentence Boundary Disambiguation (SBD), or sentence detection. This process is useful for many downstream NLP tasks that require analysis within sentences, for instance, POS tagging and phrase analysis. The sentence detection process is language-dependent. Most search engines are not concerned with sentence detection; they are only interested in a query's tokens and their respective positions. POS taggers and other NLP tasks that perform extraction of data, however, will frequently process individual sentences. The detection of sentence boundaries will help separate phrases that might otherwise appear to span sentences.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. ...
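To illustrate the task itself, here is a minimal sketch of sentence detection over Spark that uses NLTK's pre-trained Punkt model in place of OpenNLP's sentence detector; it assumes a PySpark shell (sc already defined) and the punkt data on every worker:

    # Conceptual stand-in for the OpenNLP sentence detector: split raw
    # documents into sentences with NLTK's Punkt model, in parallel.

    def split_sentences(documents):
        import nltk
        for document in documents:
            for sentence in nltk.sent_tokenize(document):
                yield sentence

    raw = sc.parallelize([
        "Mr. Smith went to Washington. He arrived at 10 a.m. and left at noon."
    ])
    for sentence in raw.mapPartitions(split_sentences).collect():
        print(sentence)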

Implementing Stanford NLP - lemmatization over Spark


Lemmatization is a pre-processing step that offers a more methodical way of reducing all the grammatical/inflected forms of a word to its root form. It uses context and part of speech to determine the inflected form of the word and applies different normalization rules for each part of speech to arrive at the root word (the lemma). In this recipe, we'll see how to lemmatize text using the Stanford NLP API.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Install Hadoop (optionally), Scala, and Java.

How to do it…

Let's see how to apply lemmatization using Stanford NLP over Spark:

  1. Let's start an application named SparkCoreNLP. Initially specify the following libraries in build.sbt file:

        libraryDependencies...
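The recipe builds a Scala application against the Stanford CoreNLP API. To illustrate what lemmatization does (use the POS tag to pick the right normalization rule), the following sketch substitutes NLTK's WordNet lemmatizer on PySpark for the Stanford API; the Penn Treebank-to-WordNet tag mapping is an illustrative simplification:

    # Conceptual stand-in for Stanford NLP lemmatization: POS-tag the
    # tokens, then lemmatize each with NLTK's WordNet lemmatizer.
    # Assumes the punkt, averaged_perceptron_tagger, and wordnet data
    # are present on every worker.

    def lemmatize_partition(sentences):
        import nltk
        from nltk.corpus import wordnet
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
        # Map Penn Treebank tag prefixes to WordNet POS constants;
        # unknown tags fall back to noun (a simplification).
        tag_map = {'J': wordnet.ADJ, 'V': wordnet.VERB,
                   'N': wordnet.NOUN, 'R': wordnet.ADV}
        for sentence in sentences:
            tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
            yield [(word, lemmatizer.lemmatize(word,
                                               tag_map.get(tag[0], wordnet.NOUN)))
                   for word, tag in tagged]

    docs = sc.parallelize(["The striped bats were hanging on their feet."])
    print(docs.mapPartitions(lemmatize_partition).collect())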

Implementing sentiment analysis using Stanford NLP over Spark


Sentiment analysis, or opinion mining, involves building a system to collect and categorize opinions about a product. This can be used in several ways that help marketers evaluate the success of an ad campaign or new product launch, determine which versions of a product or service are popular, and identify which demographics like or dislike particular product features. In this recipe, we will see how the Stanford NLP API performs sentiment analysis.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Install Hadoop (optionally), Scala, and Java.

How to do it…

Let's see how to apply sentiment analysis using Stanford NLP over Spark:

  1. Let's start an application named SparkCoreNLP. Initially specify...
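Stanford NLP scores sentiment with a recursive neural model over parse trees. As a lightweight conceptual stand-in, the following sketch scores text on PySpark with NLTK's lexicon-based VADER analyzer instead; it assumes a PySpark shell (sc already defined) and the vader_lexicon data on every worker:

    # Conceptual stand-in for Stanford NLP sentiment analysis: score
    # each review with NLTK's VADER analyzer, in parallel on Spark.
    # The compound score falls in [-1, 1], negative to positive.

    def score_partition(reviews):
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        analyzer = SentimentIntensityAnalyzer()   # build once per partition
        for review in reviews:
            yield (review, analyzer.polarity_scores(review)['compound'])

    reviews = sc.parallelize([
        "The product is absolutely wonderful.",
        "Terrible battery life and poor support."
    ])
    for review, score in reviews.mapPartitions(score_partition).collect():
        print('%+.3f  %s' % (score, review))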
