You're reading from Mastering Java Machine Learning: A Java developer's guide to implementing machine learning and big data architectures

Product type Paperback

Published in Jul 2017

Publisher Packt

ISBN-13 9781785880513

Length 556 pages

Edition 1st Edition

Languages

Java

Concepts

Big Data

Authors (2):

Kamath

Krishna Choppella

Table of Contents (13) Chapters

Preface

1. Machine Learning Review

2. Practical Approach to Real-World Supervised Learning

3. Unsupervised Machine Learning Techniques

4. Semi-Supervised and Active Learning

5. Real-Time Stream Machine Learning

6. Probabilistic Graph Modeling

7. Deep Learning

8. Text Mining and Natural Language Processing

9. Big Data Machine Learning – The Final Frontier

A. Linear Algebra

B. Probability

Index

Text processing components and transformations

In this section, we discuss common preprocessing and transformation steps that appear in most text mining processes. The general idea is to convert documents into structured datasets whose features or attributes can be used by most Machine Learning algorithms to perform different kinds of learning.

The subsections that follow briefly describe some of the most commonly used techniques. Different applications of text mining might use different pieces or variations of the components shown in the following figure:

Figure 10: Text Processing components and the flow
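To make the idea of converting documents into a structured dataset concrete, here is a minimal sketch of a bag-of-words representation: each distinct term becomes a feature, and each document becomes a row of term counts. The class and method names (BagOfWordsSketch, tokenize, vocabulary, termCounts) are hypothetical, and the simple lowercase-and-split tokenizer is an assumption made for illustration; it is not taken from the book or from any particular library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper, not from the book: turns raw documents into a term-count matrix.
class BagOfWordsSketch {

    // Lowercase the text and split on runs of characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Sorted vocabulary across all documents; each term becomes one feature (column).
    static List<String> vocabulary(List<String> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String d : docs) {
            vocab.addAll(tokenize(d));
        }
        return new ArrayList<>(vocab);
    }

    // One row per document, one column per vocabulary term, cell = raw term count.
    static int[][] termCounts(List<String> docs, List<String> vocab) {
        int[][] matrix = new int[docs.size()][vocab.size()];
        for (int i = 0; i < docs.size(); i++) {
            for (String t : tokenize(docs.get(i))) {
                int j = vocab.indexOf(t);
                if (j >= 0) {
                    matrix[i][j]++;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Java is fun.", "Mining text with Java");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (int[] row : termCounts(docs, vocab)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The resulting matrix is the kind of structured dataset a learning algorithm can consume directly. Real pipelines usually add further steps, such as stop word removal or term weighting, before this point.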

Document collection and standardization

One of the first steps in most text mining applications is the collection of data in the form of a body of documents, often referred to as a corpus in the text mining world. These documents can have predefined categories associated with them, or they can simply form an unlabeled corpus. The documents can be of heterogeneous formats or standardized...
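As a rough illustration of a corpus that mixes labeled and unlabeled documents, the following sketch stores each document with an optional category and applies one possible standardization step on the way in. Everything here is an assumption made for illustration: the names CorpusSketch, Doc, and standardize are hypothetical, and the chosen normalization (Unicode NFKC, naive tag stripping, whitespace collapsing) is only one reasonable option, not the procedure the book prescribes.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; none of these names come from the book.
class CorpusSketch {

    // A document with an id, standardized text, and an optional category.
    static final class Doc {
        final String id;
        final String text;
        final String label; // null means the document is unlabeled

        Doc(String id, String text, String label) {
            this.id = id;
            this.text = text;
            this.label = label;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    // Assumed standardization: Unicode NFKC form, crude tag stripping, whitespace collapsing.
    static String standardize(String raw) {
        String s = Normalizer.normalize(raw, Normalizer.Form.NFKC);
        s = s.replaceAll("<[^>]*>", " "); // naive removal of HTML/XML-like tags
        return s.replaceAll("\\s+", " ").trim();
    }

    // Standardize the raw text, then store it with its (possibly null) label.
    void add(String id, String rawText, String label) {
        docs.add(new Doc(id, standardize(rawText), label));
    }

    int size() {
        return docs.size();
    }

    Doc get(int index) {
        return docs.get(index);
    }

    // Count how many documents carry a predefined category.
    long labeledCount() {
        long n = 0;
        for (Doc d : docs) {
            if (d.label != null) {
                n++;
            }
        }
        return n;
    }
}
```

Calling add("d1", "<b>Spam</b> offer", "spam") followed by add("d2", "Meeting   notes", null) would give a two-document corpus in which only the first document is labeled and both texts are stored in the same plain form.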