Learning Apache Mahout


Product type: Book
Published: Mar 2015
ISBN-13: 9781783555215
Pages: 250
Edition: 1st Edition

Categorizing text


Text categorization, or classification, assigns documents to predefined classes. One of the most common text classification tasks is labeling e-mail as ham or spam. We will discuss how to implement text classification in Mahout.

The dataset

For the text classification case study, we are going to use the 20 newsgroups dataset. The data consists of several months of postings made to 20 Usenet newsgroups, each covering a different topic. Download the dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz.
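The download step can also be scripted. The following is a minimal Python sketch, not part of the book's Mahout workflow: the helper names `download` and `extract` are hypothetical, and only the URL comes from the text above. It fetches the archive if it is not already on disk and unpacks it into a working directory.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL quoted in the text above.
DATASET_URL = "http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz"


def download(url: str, target: Path) -> Path:
    """Download url to target unless the file already exists (hypothetical helper)."""
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(target))
    return target


def extract(archive: Path, dest: Path) -> list[str]:
    """Unpack a .tar.gz archive into dest and return the sorted top-level entry names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())
```

A typical call would be `extract(download(DATASET_URL, Path("data/20news-bydate.tar.gz")), Path("data/20news"))`; the `data/` paths are placeholders, and the returned list shows which top-level folders the archive produced.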

The dataset is divided into train and test sets, and each set has 20 subdirectories. Each subdirectory is treated as a class label, and every file in a subdirectory belongs to that class. The following screenshot displays the folders in which files of the respective classes are present. The folder name is the class label for documents present...
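The folder-name-as-label convention described above can be sketched in a few lines of Python. This is an illustration only, not the book's Mahout code: the function names are hypothetical, and the directory passed in is assumed to be one extracted split (train or test) whose immediate subdirectories are the classes.

```python
from pathlib import Path


def list_labeled_documents(split_dir):
    """Return sorted (label, path) pairs, one per file under split_dir/<label>/."""
    pairs = []
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        for doc in sorted(class_dir.iterdir()):
            if doc.is_file():
                # The subdirectory name is the class label for the document.
                pairs.append((class_dir.name, doc))
    return pairs


def count_per_class(pairs):
    """Count how many documents carry each label."""
    counts = {}
    for label, _ in pairs:
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling `count_per_class(list_labeled_documents(train_dir))` on the training folder is a quick way to check that all 20 labels are present and to see how many documents each class contains.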
