Chapter 6. NLP with Spark

In this chapter, we will see how to run NLP algorithms over Spark. We will cover the following recipes:

  • Installing NLTK on Linux

  • Installing Anaconda on Linux

  • Anaconda for cluster management

  • POS tagging with PySpark on an Anaconda cluster

  • Named Entity Recognition with IPython over Spark

  • Implementing OpenNLP - chunker over Spark

  • Implementing OpenNLP - sentence detector over Spark

  • Implementing Stanford NLP - lemmatization over Spark

  • Implementing sentiment analysis using Stanford NLP over Spark

Introduction


Natural language processing (NLP) is the study of how computers can work with the nuances of human language, and of building real-world applications using NLP techniques. NLP is analogous to teaching a language to a child: the most common tasks, such as understanding words and sentences or forming grammatically and structurally correct sentences, are natural to humans. In NLP, some of these tasks translate to tokenization, chunking, part-of-speech tagging, parsing, machine translation, and speech recognition, and these are tough challenges for computers.

Currently, NLP is one of the rarest skill sets required in the industry. With the advent of big data, the major challenge is the need for people who are good not just with structured data, but also with semi-structured and unstructured data. Petabytes of weblogs, tweets, Facebook feeds, chats, e-mails, and reviews are generated continuously. Companies are collecting all these different kinds of data for better customer...

Installing NLTK on Linux


In this recipe, we will see how to install NLTK on Linux. Before proceeding with the installation, let's consider the version of Python we're going to use. There are two versions, or flavors, of Python, namely Python 2.7.x and Python 3.x. Although the latest version, Python 3.x, appears to be the better choice, Python 2.7 is recommended for scientific, numeric, and data analysis work.

Getting ready

To step through this recipe, you need Ubuntu 14.04 (Linux flavor) installed on the machine. Python comes pre-installed. The python --version command gives the version of Python installed. If the version is 2.6.x, upgrade it to Python 2.7 as follows:

    sudo apt-get install python2.7

How to do it…

Let's see the installation process for NLTK:

  1. Once the Python 2.7.x version is available, install NLTK as follows:

          sudo pip install -U nltk
    
  2. The preceding installation may throw an error such as the following:

           Could not find any downloads that satisfy the requirement...
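If the command does succeed, a quick sanity check is to import NLTK and fetch the data packages that the later recipes in this chapter rely on. The following is a minimal verification sketch, not part of the installation itself (the package names are NLTK's standard data identifiers):

    # Verify the NLTK install and download the data used later in this
    # chapter: tokenizer models, POS tagger, NER chunker, and WordNet.
    import nltk

    print(nltk.__version__)

    for pkg in ['punkt', 'averaged_perceptron_tagger',
                'maxent_ne_chunker', 'words', 'wordnet']:
        nltk.download(pkg)

    # Smoke test: tokenize and POS-tag a sentence.
    tokens = nltk.word_tokenize("Spark makes large-scale NLP practical.")
    print(nltk.pos_tag(tokens))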

Installing Anaconda on Linux


Anaconda is a free, enterprise-ready Python distribution for data analytics, processing, and scientific computing. In this recipe, we will see how to install Anaconda on Linux. Before proceeding with the installation, let's consider the version of Python we're going to use. There are two versions, or flavors, of Python, namely Python 2.7.x and Python 3.x. Although the latest version, Python 3.x, appears to be the better choice, Python 2.7 is recommended for scientific, numeric, and data analysis work.

Getting ready

To step through this recipe, you need Ubuntu 14.04 (Linux flavor) installed on the machine. Python comes pre-installed. python --version gives the version of Python installed. If the version is 2.6.x, upgrade it to Python 2.7 as follows:

    sudo apt-get install python2.7

How to do it…

Once Python version 2.7.x is available, download the Anaconda installer from https://www.continuum.io/downloads and type the following in the terminal window at the path...
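Once the installer has finished and the shell has been restarted, you can confirm from Python itself that the Anaconda interpreter is now the active one; on an Anaconda build, the version string typically mentions Anaconda. This is a minimal check rather than part of the installer's own steps:

    # Check which Python interpreter is active. With Anaconda on the
    # PATH, sys.executable points into the anaconda directory and
    # sys.version typically mentions Anaconda.
    import sys

    print(sys.executable)
    print(sys.version)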

Anaconda for cluster management


Anaconda for cluster management provides resource management tools that allow users to easily create, provision, and manage bare-metal or cloud-based clusters. It enables the management of conda environments on clusters and provides integration, configuration, and setup management of Hadoop services. It can be installed alongside enterprise Hadoop distributions such as Cloudera CDH or Hortonworks HDP, and is used to manage conda packages and environments across a cluster.

Getting ready

To step through this recipe, you need Ubuntu 14.04 (Linux flavor) installed on the machine. Python comes pre-installed. python --version gives the version of Python installed. If the version is 2.6.x, upgrade it to Python 2.7 as follows:

    sudo apt-get install python2.7

For installing Anaconda, please refer to the earlier Installing Anaconda on Linux recipe.

How to do it…

Let's look at the process for installing Anaconda for cluster management:

  1. You can create...

POS tagging with PySpark on an Anaconda cluster


Part-of-speech (POS) tagging is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on. Tagging is a necessary step before chunking: with part-of-speech tags, a chunker knows how to identify phrases based on tag patterns. POS tags are also used for grammar analysis and word sense disambiguation.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, have PySpark and Anaconda installed on the Linux machine, that is, Ubuntu 14.04. For installing Anaconda, please refer to the earlier recipes.

How to do it…

Let's see how to implement POS tagging using PySpark:

  1. Activate the Anaconda cluster as follows:

            source activate acluster
    
  2. Install the...
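The heart of this recipe can be sketched end to end as follows. This is a minimal illustration, assuming it runs in a PySpark shell (so sc, the SparkContext, is already defined) and that NLTK, together with its punkt and averaged_perceptron_tagger data, is available on every worker node, for example through the shared Anaconda environment:

    # Minimal sketch: POS tagging sentences in parallel with NLTK on
    # PySpark. The import happens inside the function so that it runs
    # on the workers, once per partition.

    def tag_partition(sentences):
        import nltk
        for sentence in sentences:
            tokens = nltk.word_tokenize(sentence)
            yield nltk.pos_tag(tokens)   # a list of (word, tag) tuples

    sentences = sc.parallelize([
        "Spark runs NLP algorithms at scale.",
        "POS tags drive chunking and disambiguation."
    ])

    for tagged in sentences.mapPartitions(tag_partition).collect():
        print(tagged)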

Named Entity Recognition with IPython over Spark


Apart from POS tagging, one of the most common labeling problems is finding entities in the text. Typically, named entities are names of people, locations, and organizations. There are NER systems that tag more entity types than just these three, labeling named entities using the context and other features. There is a lot more research going on in this area of NLP, where people are trying to tag biomedical entities, product entities, and so on.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, have PySpark and IPython installed on the Linux machine, that is, Ubuntu 14.04. For installing IPython, please refer to the Using IPython with PySpark recipe in Chapter 2, Tricky Statistics with Spark.

How to do it…

  1. Download and install the NLTK data as follows:

          ipython console --profile=pyspark
          In [1]: from...
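The core idea of the recipe, running NLTK's named-entity chunker inside a Spark job, can be sketched as follows. It assumes a PySpark shell (sc already defined) and the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words data on every worker:

    # Minimal sketch: named entity recognition with NLTK on PySpark.
    # Keeps only the labelled subtrees (PERSON, GPE, ORGANIZATION, ...)
    # from the chunk tree that nltk.ne_chunk returns.

    def extract_entities(sentences):
        import nltk
        for sentence in sentences:
            tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
            tree = nltk.ne_chunk(tagged)
            yield [(subtree.label(),
                    ' '.join(word for word, tag in subtree.leaves()))
                   for subtree in tree.subtrees()
                   if subtree.label() != 'S']

    docs = sc.parallelize([
        "Barack Obama visited Paris with officials from Google."
    ])
    print(docs.mapPartitions(extract_entities).collect())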

Implementing OpenNLP - chunker over Spark


Chunking is shallow parsing: instead of retrieving the deep structure of a sentence, we try to group together the chunks of the sentence that constitute some meaning. A chunk is defined as the minimal unit that can be processed. The conventional pipeline in chunking is to tokenize the input string and POS-tag the tokens before they are given to any chunker.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Install Hadoop (optionally), Scala, and Java.

How to do it…

Let's see how to run the OpenNLP chunker over Spark:

  1. Let's start an application named SparkNLP. Initially specify the following libraries in the build.sbt file:

         libraryDependencies ++= Seq(
         "org.apache.spark" %% "spark-core" % "1.6.0",...

Implementing OpenNLP - sentence detector over Spark


Partitioning text into sentences is called Sentence Boundary Disambiguation (SBD), or sentence detection. This process is useful for many downstream NLP tasks that require analysis within sentences, for instance, POS tagging and phrase analysis. The sentence detection process is language-dependent. Most search engines are not concerned with sentence detection; they are only interested in a query's tokens and their respective positions. POS taggers and other NLP tasks that perform extraction of data, however, will frequently process individual sentences. The detection of sentence boundaries will help separate phrases that might otherwise appear to span sentences.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. ...
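To illustrate the task itself, here is a minimal sketch of sentence detection over Spark that uses NLTK's pre-trained Punkt model in place of OpenNLP's sentence detector; it assumes a PySpark shell (sc already defined) and the punkt data on every worker:

    # Conceptual stand-in for the OpenNLP sentence detector: split raw
    # documents into sentences with NLTK's Punkt model, in parallel.

    def split_sentences(documents):
        import nltk
        for document in documents:
            for sentence in nltk.sent_tokenize(document):
                yield sentence

    raw = sc.parallelize([
        "Mr. Smith went to Washington. He arrived at 10 a.m. and left at noon."
    ])
    for sentence in raw.mapPartitions(split_sentences).collect():
        print(sentence)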

Implementing Stanford NLP - lemmatization over Spark


Lemmatization is a pre-processing step that offers a more methodical way of reducing all the grammatical/inflected forms of a word to its root form. It uses context and part of speech to determine the inflected form of the word and applies different normalization rules for each part of speech to arrive at the root word (the lemma). In this recipe, we'll see how to lemmatize text using the Stanford NLP API.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Install Hadoop (optionally), Scala, and Java.

How to do it…

Let's see how to apply lemmatization using Stanford NLP over Spark:

  1. Let's start an application named SparkCoreNLP. Initially specify the following libraries in build.sbt file:

        libraryDependencies...
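The recipe builds a Scala application against the Stanford CoreNLP API. To illustrate what lemmatization does (use the POS tag to pick the right normalization rule), the following sketch substitutes NLTK's WordNet lemmatizer on PySpark for the Stanford API; the Penn Treebank-to-WordNet tag mapping is an illustrative simplification:

    # Conceptual stand-in for Stanford NLP lemmatization: POS-tag the
    # tokens, then lemmatize each with NLTK's WordNet lemmatizer.
    # Assumes the punkt, averaged_perceptron_tagger, and wordnet data
    # are present on every worker.

    def lemmatize_partition(sentences):
        import nltk
        from nltk.corpus import wordnet
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
        # Map Penn Treebank tag prefixes to WordNet POS constants;
        # unknown tags fall back to noun (a simplification).
        tag_map = {'J': wordnet.ADJ, 'V': wordnet.VERB,
                   'N': wordnet.NOUN, 'R': wordnet.ADV}
        for sentence in sentences:
            tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
            yield [(word, lemmatizer.lemmatize(word,
                                               tag_map.get(tag[0], wordnet.NOUN)))
                   for word, tag in tagged]

    docs = sc.parallelize(["The striped bats were hanging on their feet."])
    print(docs.mapPartitions(lemmatize_partition).collect())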

Implementing sentiment analysis using Stanford NLP over Spark


Sentiment analysis, or opinion mining, involves building a system to collect and categorize opinions about a product. This can be used in several ways that help marketers evaluate the success of an ad campaign or new product launch, determine which versions of a product or service are popular, and identify which demographics like or dislike particular product features. In this recipe, we will see how the Stanford NLP API performs sentiment analysis.

Getting ready

To step through this recipe, you will need a running Spark cluster either in pseudo distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Install Hadoop (optionally), Scala, and Java.

How to do it…

Let's see how to apply sentiment analysis using Stanford NLP over Spark:

  1. Let's start an application named SparkCoreNLP. Initially specify...
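Stanford NLP scores sentiment with a recursive neural model over parse trees. As a lightweight conceptual stand-in, the following sketch scores text on PySpark with NLTK's lexicon-based VADER analyzer instead; it assumes a PySpark shell (sc already defined) and the vader_lexicon data on every worker:

    # Conceptual stand-in for Stanford NLP sentiment analysis: score
    # each review with NLTK's VADER analyzer, in parallel on Spark.
    # The compound score falls in [-1, 1], negative to positive.

    def score_partition(reviews):
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        analyzer = SentimentIntensityAnalyzer()   # build once per partition
        for review in reviews:
            yield (review, analyzer.polarity_scores(review)['compound'])

    reviews = sc.parallelize([
        "The product is absolutely wonderful.",
        "Terrible battery life and poor support."
    ])
    for review, score in reviews.mapPartitions(score_partition).collect():
        print('%+.3f  %s' % (score, review))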
