Frank Kane's Taming Big Data with Apache Spark and Python

Product type: Book
Published: Jun 2017
Publisher: Packt
ISBN-13: 9781787287945
Pages: 296
Edition: 1st Edition
Author: Frank Kane

Table of Contents (13 chapters)

Title Page
Credits
About the Author
www.PacktPub.com
Customer Feedback
Preface
1. Getting Started with Spark
2. Spark Basics and Spark Examples
3. Advanced Examples of Spark Programs
4. Running Spark on a Cluster
5. SparkSQL, DataFrames, and DataSets
6. Other Spark Technologies and Libraries
7. Where to Go From Here? – Learning More About Spark and Data Science

Improving the word-count script with regular expressions


The main problem with the initial results from our word-count script is that we didn't account for punctuation and capitalization. There are sophisticated ways to deal with that problem in text processing, but we're going to use a simple one for now: regular expressions in Python. So let's look at how that works, then run it and see it in action.
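As a rough illustration of the idea, here is a minimal sketch of regex-based normalization in plain Python. The function name `normalize_words`, the pattern `\W+`, and the sample sentence are assumptions made for this example, not necessarily what the chapter's script uses.

```python
import re
from collections import Counter

# Assumption: any run of non-word characters (punctuation, whitespace)
# is treated as a separator between words.
WORD_SEPARATOR = re.compile(r"\W+")

def normalize_words(text):
    """Lowercase the text and split it on runs of non-word characters."""
    # Splitting can leave empty strings at the edges, so filter them out.
    return [word for word in WORD_SEPARATOR.split(text.lower()) if word]

# Hypothetical sample text, used only to compare the two approaches.
sample = "Spark is fast. spark, Spark! Is it?"

naive_counts = Counter(sample.split())            # splits on whitespace only
clean_counts = Counter(normalize_words(sample))   # lowercased, punctuation removed

print(naive_counts["Spark"])   # 1: "spark," and "Spark!" are counted separately
print(clean_counts["spark"])   # 3: all three variants are counted together
```

In a Spark job, a function like this could be passed to `flatMap` in place of a plain whitespace split. One limitation of this simple pattern is that it also splits on apostrophes, so a word such as "didn't" becomes "didn" and "t".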

Text normalization

In the previous section, we had a first crack at counting the number of times each word occurred in our book, but the results weren't that great. Every variant of a word with different capitalization or surrounding punctuation was counted as a word of its own, and that's not what we want. We want all of those variants to be counted as the same word, no matter how it's capitalized or what punctuation might surround it, so that no word shows up more than once in the results. There are toolkits you can get for Python such as NLTK (Natural Language Toolkit...
