You're reading from Natural Language Processing and Computational Linguistics

Product type: Book
Published in: Jun 2018
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781788838535
Edition: 1st Edition
Author (1)
Bhargav Srinivasa-Desikan

Bhargav Srinivasa-Desikan is a research engineer working for INRIA in Lille, France. He is a part of the MODAL (Models of Data Analysis and Learning) team, and he works on metric learning, predictor aggregation, and data visualization. He is a regular contributor to the Python open source community, and completed Google Summer of Code in 2016 with Gensim where he implemented Dynamic Topic Models. He is a regular speaker at PyCons and PyDatas across Europe and Asia, and conducts tutorials on text analysis using Python.

Chapter 5. POS-Tagging and Its Applications

Chapter 1, What is Text Analysis, and Chapter 2, Python Tips for Text Analysis, introduced text analysis and Python, while Chapter 3, SpaCy's Language Models, and Chapter 4, Gensim - Vectorizing Text and Transformations and n-grams, helped us set up our code for more advanced text analysis. This chapter discusses the first of these advanced techniques: part-of-speech tagging, popularly called POS-tagging. We will study what parts of speech exist, how to identify them in our documents, and what these POS-tags can be used for. The chapter covers the following topics:

  • What is POS-tagging?
  • spaCy for POS-tagging
  • Training your POS-tagger
  • POS-tagging examples

Training our own POS-taggers


The POS-tags that spaCy's models assign are statistical predictions, unlike, say, whether or not a token is a stop word, which is just a check against a list of words. Because the tag is a statistical prediction, we can train a model to make better predictions, or predictions that are more relevant to the dataset we intend to use it on. Here, better isn't meant to be taken too literally: the current spaCy model already reaches 97% tagging accuracy.

Before we dive into the training process, let's clarify a few terms commonly used in machine learning, and in machine learning for text.

Training - the process of teaching your machine learning model how to make the right prediction. In text analysis, we do this by providing classified (labeled) data to the model. In the setting of POS-tagging, this would be a list of words and their POS-tags. This labeled information is then used to...
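
To make "a list of words and their tagged POS" concrete, here is a small sketch of labeled data and a toy training step. It is not spaCy's training procedure: it only learns the most frequent tag for each word and ignores context, and the sentences, tags, and the names train_most_frequent_tag and predict are made up for illustration.

```python
from collections import Counter, defaultdict

# Hand-labeled training data: each sentence is a list of (word, POS-tag) pairs.
# The sentences and tags are illustrative, not taken from a real corpus.
TRAIN_DATA = [
    [("Tom", "PROPN"), ("ran", "VERB"), ("swiftly", "ADV")],
    [("Tom", "PROPN"), ("walked", "VERB"), ("slowly", "ADV")],
    [("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
    [("they", "PRON"), ("run", "VERB"), ("daily", "ADV")],
    [("we", "PRON"), ("run", "VERB"), ("slowly", "ADV")],
]


def train_most_frequent_tag(tagged_sentences):
    """Learn, for every word, the tag it carries most often in the data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    return {
        word: counts.most_common(1)[0][0]
        for word, counts in tag_counts.items()
    }


def predict(model, words, default="NOUN"):
    """Tag each word with its learned tag, or a default tag if unseen."""
    return [(word, model.get(word.lower(), default)) for word in words]


model = train_most_frequent_tag(TRAIN_DATA)
print(predict(model, ["Tom", "run", "quickly"]))
```

The word "run" appears with two different tags in the data, which is why the prediction has to be statistical rather than a simple list lookup. A real tagger would also take the surrounding words into account.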

POS-tagging code examples


The following code snippets illustrate some of the simple tasks we can do with knowledge of POS-tags. These examples don't achieve too much in terms of in-depth text analysis, but offer a quick glance at text manipulation once we have our text processed.

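import spacy
nlp = spacy.load('en_core_web_sm')  # assumes this English model is installed

def make_verb_upper(text, pos):
    # Uppercase the token text only if it is tagged as a verb.
    return text.upper() if pos == "VERB" else text

doc = nlp(u'Tom ran swiftly and walked slowly')
text = ''.join(make_verb_upper(w.text_with_ws, w.pos_) for w in doc)
print(text)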

As the function name suggests, the preceding code changes all the verbs in the sentence to uppercase. With a quick check of the POS-tag and the basic string method upper, we can achieve this in just a few lines!

Another popular task in text analysis is counting the occurrences of each kind of POS. This can be done quite quickly with the following code snippet, where we find the number of occurrences of each POS in the first Harry Potter book (which you would buy/download and save...
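
A minimal sketch of this kind of count, assuming a loaded spaCy pipeline is passed in as nlp. The helper names count_pos and count_pos_in_file and the file path argument are hypothetical. The counting step only relies on each token exposing a pos_ attribute, so the short demonstration at the end uses hand-tagged stand-in tokens rather than real model output.

```python
from collections import Counter, namedtuple


def count_pos(tokens):
    """Count how often each coarse POS tag occurs among the tokens."""
    return Counter(token.pos_ for token in tokens)


def count_pos_in_file(path, nlp):
    """Read a text file, run it through a spaCy pipeline, and count POS tags.

    `nlp` is assumed to be an already loaded pipeline (for example, the
    object returned by spacy.load); `path` points to a plain-text file.
    """
    with open(path, encoding="utf-8") as handle:
        doc = nlp(handle.read())
    return count_pos(doc)


# Stand-in tokens (hand-tagged, not model output) to show the counting step.
Token = namedtuple("Token", ["text", "pos_"])
sample = [
    Token("Tom", "PROPN"), Token("ran", "VERB"), Token("swiftly", "ADV"),
    Token("and", "CCONJ"), Token("walked", "VERB"), Token("slowly", "ADV"),
]

pos_counts = count_pos(sample)
for tag, count in pos_counts.most_common():
    print(tag, count)
```

With a real pipeline, calling count_pos_in_file on the saved book text would return the same kind of Counter, and its most_common method lists the tags from most to least frequent.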

Summary


In this chapter, we explored how to use spaCy as part of our pipelines, and in particular how to extract POS-tags. We discussed what POS-tags are and how they can be useful in different kinds of analysis. We then moved on to training our own POS-tagger in spaCy and looked at examples that use POS-tags. Next, we will explore other spaCy functionalities, such as NER-tagging and dependency parsing.
