
Machine Learning: End-to-End guide for Java developers

By Boštjan Kaluža, Krishna Choppella, and Uday Kamath
About this book
Machine Learning is one of the core areas of Artificial Intelligence, in which computers are trained to learn, grow, change, and develop on their own without being explicitly programmed. In this course, we cover how Java is employed to build powerful machine learning models to address the problems faced in the world of Data Science. The course demonstrates complex data extraction and statistical analysis techniques supported by Java, applies various machine learning methods, explores machine learning sub-domains, and examines real-world use cases such as recommendation systems, fraud detection, natural language processing, and more, using Java programming.

The course begins with an introduction to data science and basic data science tasks such as data collection, data cleaning, data analysis, and data visualization. The next section provides a detailed overview of statistical techniques, covering machine learning, neural networks, and deep learning. The following sections cover applying machine learning methods using Java to a variety of tasks, including classification, prediction, forecasting, market basket analysis, clustering, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, and deep learning. The last section highlights real-world use cases such as performing activity recognition, developing image recognition, text classification, and anomaly detection.

The course includes premium content from three of our most popular books:

  • Java for Data Science
  • Machine Learning in Java
  • Mastering Java Machine Learning

On completion of this course, you will understand various machine learning techniques, the different Java machine learning algorithms you can use to gain data insights, how to build data models to analyze large, complex datasets, and how to develop applications using Java and machine learning algorithms in the field of artificial intelligence.
Publication date:
October 2017
Publisher
Packt
ISBN
9781788622219

 

Part 1. Module 1

Java for Data Science

Examine the techniques and Java tools supporting the growing field of data science

 

Chapter 1. Getting Started with Data Science

Data science is not a single science so much as a collection of various scientific disciplines integrated for the purpose of analyzing data. In addition to various statistical and mathematical techniques, these disciplines include:

  • Computer science
  • Data engineering
  • Visualization
  • Domain-specific knowledge and approaches

With the advent of cheaper storage technology, more and more data has been collected and stored, permitting previously unfeasible processing and analysis. With this analysis came the need for various techniques to make sense of the data. These large sets of data, along with the techniques used to analyze them and identify trends and patterns, have become known as big data.

This in turn gave rise to cloud computing and concurrent techniques such as map-reduce, which distributed the analysis process across a large number of processors, taking advantage of the power of parallel processing.

The process of analyzing big data is not simple and has led to the evolution of a specialized breed of developers known as data scientists. Drawing upon a myriad of technologies and areas of expertise, they are able to analyze data to solve problems that previously were either not envisioned or were too difficult to solve.

Early big data applications were typified by the emergence of search engines capable of more powerful and accurate searches than their predecessors. For example, AltaVista was an early popular search engine that was eventually superseded by Google. While big data applications were not limited to these search engine functionalities, these applications laid the groundwork for future work in big data.

The term data science has been in use since 1974 and has evolved over time to include the statistical analysis of data. The concepts of data mining and data analytics have been associated with data science. Around 2008, the term data scientist appeared and was used to describe a person who performs data analysis. A more in-depth discussion of the history of data science can be found at http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#3d9ea08369fd.

This book aims to take a broad look at data science using Java and will briefly touch on many topics. It is likely that the reader may find topics of interest and pursue these at greater depth independently. The purpose of this book, however, is simply to introduce the reader to the significant data science topics and to illustrate how they can be addressed using Java.

There are many algorithms used in data science. In this book, we do not attempt to explain how they work except at an introductory level. Rather, we are more interested in explaining how they can be used to solve problems. Specifically, we are interested in knowing how they can be used with Java.

Problems solved using data science

The various data science techniques that we will illustrate have been used to solve a variety of problems. Many of these techniques are motivated to achieve some economic gain, but they have also been used to solve many pressing social and environmental problems. Problem domains where these techniques have been used include finance, optimizing business processes, understanding customer needs, performing DNA analysis, foiling terrorist plots, and finding relationships between transactions to detect fraud, among many other data-intensive problems.

Data mining is a popular application area for data science. In this activity, large quantities of data are processed and analyzed to glean information about the dataset, provide meaningful insights, and develop meaningful conclusions and predictions. It has been used to analyze customer behavior, to detect relationships between what may appear to be unrelated events, and to make predictions about future behavior.

Machine learning is an important aspect of data science. This technique allows the computer to solve various problems without needing to be explicitly programmed. It has been used in self-driving cars, speech recognition, and in web searches. In data mining, the data is extracted and processed. With machine learning, computers use the data to take some sort of action.

Understanding the data science problem-solving approach

Data science is concerned with the processing and analysis of large quantities of data to create models that can be used to make predictions or otherwise support a specific goal. This process often involves the building and training of models. The specific approach to solve a problem is dependent on the nature of the problem. However, in general, the following are the high-level tasks that are used in the analysis process:

  • Acquiring the data: Before we can process the data, it must be acquired. The data is frequently stored in a variety of formats and will come from a wide range of data sources.
  • Cleaning the data: Once the data has been acquired, it often needs to be converted to a different format before it can be used. In addition, the data needs to be processed, or cleaned, so as to remove errors, resolve inconsistencies, and otherwise put it in a form ready for analysis.
  • Analyzing the data: This can be performed using a number of techniques including:
    • Statistical analysis: This uses a multitude of statistical approaches to provide insight into data. It includes simple techniques and more advanced techniques such as regression analysis.
    • AI analysis: These can be grouped as machine learning, neural networks, and deep learning techniques:
      • Machine learning approaches are characterized by programs that can learn without being specifically programmed to complete a specific task
      • Neural networks are built around models patterned after the neural connection of the brain
      • Deep learning attempts to identify higher levels of abstraction within a set of data

    • Text analysis: This is a common form of analysis, which works with natural languages to identify features such as the names of people and places, the relationship between parts of text, and the implied meaning of text.
    • Data visualization: This is an important analysis tool. By displaying the data in a visual form, a hard-to-understand set of numbers can be more readily understood.
    • Video, image, and audio processing and analysis: This is a more specialized form of analysis, which is becoming more common as better analysis techniques are discovered and faster processors become available. This is in contrast to the more common text processing and analysis tasks.

Complementing this set of tasks is the need to develop applications that are efficient. The introduction of machines with multiple processors and GPUs contributes significantly to the end result.

While the exact steps used will vary by application, understanding these basic steps provides the basis for constructing solutions to many data science problems.

Using Java to support data science

Java and its associated third-party libraries provide a range of support for the development of data science applications. There are numerous core Java capabilities that can be used, such as the basic string processing methods. The introduction of lambda expressions in Java 8 helps enable more powerful and expressive means of building applications. In many of the examples that follow in subsequent chapters, we will show alternative techniques using lambda expressions.
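
For instance, a conventional loop and its stream-based equivalent using lambda-style method references can be compared in a minimal sketch such as the following, where the data is purely illustrative:

import java.util.Arrays;
import java.util.List;

public class LambdaExample {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("alpha", "beta", "gamma");

        // Traditional loop
        for (String line : lines) {
            System.out.println(line.toUpperCase());
        }

        // Equivalent Java 8 stream pipeline
        lines.stream()
            .map(String::toUpperCase)
            .forEach(System.out::println);
    }
}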

There is ample support provided for the basic data science tasks. This includes multiple ways of acquiring data, libraries for cleaning data, and a wide variety of analysis approaches for tasks such as natural language processing and statistical analysis. There are also myriad libraries supporting neural network types of analysis.

Java can be a very good choice for data science problems. The language provides both object-oriented and functional support for solving problems. There is a large developer community to draw upon and there exist multiple APIs that support data science tasks. These are but a few reasons as to why Java should be used.

The remainder of this chapter will provide an overview of the data science tasks and Java support demonstrated in the book. Each section is only able to present a brief introduction to the topics and the available support. The subsequent chapter will go into considerably more depth regarding these topics.

Acquiring data for an application

Data acquisition is an important step in the data analysis process. When data is acquired, it is often in a specialized form, and its contents may be inconsistent or differ from an application's needs. Many sources of data can be found on the Internet. Several examples will be demonstrated in Chapter 2, Data Acquisition.

Data may be stored in a variety of formats. Popular formats for text data include HTML, Comma Separated Values (CSV), JavaScript Object Notation (JSON), and XML. Image and audio data are stored in a number of formats. However, it is frequently necessary to convert one data format into another format, typically plain text.

For example, JSON (http://www.JSON.org/) is stored using blocks of curly braces containing key-value pairs. In the following example, part of a YouTube search result is shown:

    {
      "kind": "youtube#searchResult",
      "etag": etag,
      "id": {
        "kind": string,
        "videoId": string,
        "channelId": string,
        "playlistId": string
      },
      ...
    }

Data is acquired using techniques such as processing live streams, downloading compressed files, and through screen scraping, where the information on a web page is extracted. Web crawling is a technique where a program examines a series of web pages, moving from one page to another, acquiring the data that it needs.

With many popular media sites, it is necessary to acquire a user ID and password to access data. A commonly used technique is OAuth, which is an open standard used to authenticate users to many different websites. The technique delegates access to a server resource and works over HTTPS. Several companies use OAuth 2.0, including PayPal, Facebook, Twitter, and Yelp.
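
While each provider defines its own endpoints and parameters, the general shape of an OAuth 2.0 client credentials token request can be sketched with core Java classes. The endpoint URL, client ID, and secret below are placeholders, not real values:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class OAuthTokenExample {
    public static void main(String[] args) throws IOException {
        // Hypothetical token endpoint; real providers document their own URLs
        URL tokenUrl = new URL("https://api.example.com/oauth2/token");
        String credentials = Base64.getEncoder().encodeToString(
            "myClientId:myClientSecret".getBytes(StandardCharsets.UTF_8));

        HttpURLConnection conn = (HttpURLConnection) tokenUrl.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Authorization", "Basic " + credentials);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write("grant_type=client_credentials".getBytes(StandardCharsets.UTF_8));
        }

        // The response body is a JSON document containing the bearer token
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}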

The importance and process of cleaning data

Once the data has been acquired, it will need to be cleaned. Frequently, the data will contain errors, duplicate entries, or inconsistencies. It often needs to be converted to a simpler data type such as text. Data cleaning is often referred to as data wrangling, reshaping, or munging; the terms are effectively synonyms.

When data is cleaned, there are several tasks that often need to be performed, including checking its validity, accuracy, completeness, consistency, and uniformity. For example, when the data is incomplete, it may be necessary to provide substitute values.
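
For example, a minimal sketch of providing substitute values might scan records for empty fields and fill in a default. The records and the default used here are hypothetical:

public class ImputationExample {
    public static void main(String[] args) {
        // Hypothetical records: a name followed by an age; Bob's age is missing
        String[][] records = {
            {"Alice", "34"},
            {"Bob", ""},
            {"Carol", "29"}
        };
        String substituteAge = "0";
        for (String[] record : records) {
            if (record[1] == null || record[1].isEmpty()) {
                record[1] = substituteAge;  // provide a substitute value
            }
            System.out.println(record[0] + " : " + record[1]);
        }
    }
}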

Consider CSV data. It can be handled in one of several ways. We can use simple Java techniques such as the String class' split method. In the following sequence, a string array, csvArray, is assumed to hold lines of comma-delimited data. The split method populates a two-dimensional array, tokenArray, with the individual fields of each line:

String[][] tokenArray = new String[csvArray.length][];
for(int i=0; i<csvArray.length; i++) { 
    tokenArray[i] = csvArray[i].split(","); 
} 

More complex data types require APIs to retrieve the data. For example, in Chapter 3, Data Cleaning, we will use the Jackson Project (https://github.com/FasterXML/jackson) to retrieve fields from a JSON file. The example uses a file containing a JSON-formatted representation of a person, as shown next:

{ 
 "firstname":"Smith",
 "lastname":"Peter", 
 "phone":8475552222,
 "address":["100 Main Street","Corpus","Oklahoma"] 
}

The code sequence that follows shows how to extract the values for fields of a person. A parser is created, which uses getCurrentName to retrieve a field name. If the name is firstname, then the getText method returns the value for that field. The other fields are handled in a similar manner.

try { 
    JsonFactory jsonfactory = new JsonFactory(); 
    JsonParser parser = jsonfactory.createParser( 
        new File("Person.json")); 
    while (parser.nextToken() != JsonToken.END_OBJECT) { 
        String token = parser.getCurrentName(); 
        if ("firstname".equals(token)) { 
            parser.nextToken(); 
            String fname = parser.getText(); 
            out.println("firstname : " + fname); 
        } 
        ... 
    } 
    parser.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

The output of this example is as follows:

firstname : Smith

Simple data cleaning may involve converting the text to lowercase, replacing certain text with blanks, and removing multiple whitespace characters with a single blank. One way of doing this is shown next, where a combination of the String class' toLowerCase, replaceAll, and trim methods are used. Here, a string containing dirty text is processed:

dirtyText = dirtyText 
    .toLowerCase() 
    .replaceAll("[\\d[^\\w\\s]]+", " ") 
    .trim(); 
while(dirtyText.contains("  ")){ 
      dirtyText = dirtyText.replaceAll("  ", " "); 
}            

Stop words are common words, such as the, and, or but, that do not always contribute to the analysis of text. Removing these stop words can often improve the results and speed up processing.

The LingPipe API can be used to remove stop words. In the next code sequence, a TokenizerFactory class instance is used to tokenize text. Tokenization is the process of returning individual words. The EnglishStopTokenizerFactory class is a special tokenizer that removes common English stop words.

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer( 
    text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
      out.print(word + " "); 
} 

Consider the following text, which was pulled from the book, Moby Dick:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.

The output will be as follows:

call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

These are just a couple of the data cleaning tasks discussed in Chapter 3, Data Cleaning.

Visualizing data to enhance understanding

The analysis of data often results in a series of numbers representing the results of the analysis. However, for most people, this way of expressing results is not always intuitive. A better way to understand the results is to create graphs and charts to depict the results and the relationship between the elements of the result.

The human mind is often good at seeing patterns, trends, and outliers in visual representation. The large amount of data present in many data science problems can be analyzed using visualization techniques. Visualization is appropriate for a wide range of audiences ranging from analysts to upper-level management to clientele. In this chapter, we present various visualization techniques and demonstrate how they are supported in Java.

In Chapter 4, Data Visualization, we illustrate how to create different types of graphs, plots, and charts. These examples use JavaFX and a free library called GRAL (http://trac.erichseifert.de/gral/).

Visualization allows users to examine large datasets in ways that provide insights not apparent in the raw mass of data. Visualization tools help us identify potential problems or unexpected data results, and develop meaningful interpretations of the data.

For example, outliers, which are values that lie outside of the normal range of values, can be hard to spot from a sea of numbers. Creating a graph based on the data allows users to quickly see outliers. It can also help spot errors quickly and more easily classify data.

For example, the following chart might suggest that the upper two values should be outliers that need to be dealt with:

[Figure: a chart of the data showing two outlying values]

The use of statistical methods in data science

Statistical analysis is the key to many data science tasks. It is used for many types of analysis, ranging from the computation of a simple mean and median to complex multiple regression analysis. Chapter 5, Statistical Data Analysis Techniques, introduces this type of analysis and the Java support available.

Statistical analysis is not always an easy task. Advanced statistical techniques often require a particular mindset to fully comprehend and can be difficult to master. Fortunately, many techniques are not that difficult to use, and various libraries mitigate some of their inherent complexity.

Regression analysis, in particular, is an important technique for analyzing data. The technique attempts to draw a line that best fits a set of data points. An equation representing the line can then be calculated and used to predict future behavior. There are several types of regression analysis, including simple and multiple regression; they vary by the number of variables being considered.

The following graph shows the straight line that closely matches a set of data points representing the population of Belgium over several decades:

[Figure: a straight line fitted to data points representing the population of Belgium over several decades]

Simple statistical techniques, such as mean and standard deviation, can be computed using basic Java. They can also be handled by libraries such as Apache Commons. For example, to calculate the median, we can use the Apache Commons DescriptiveStatistics class. This is illustrated next where the median of an array of doubles is calculated. The numbers are added to an instance of this class, as shown here:

double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3, 16.2, 
    12.5, 17.8, 16.5, 12.5}; 
DescriptiveStatistics statTest =  
    new SynchronizedDescriptiveStatistics(); 
for(double num : testData){ 
   statTest.addValue(num); 
} 

The getPercentile method returns the value stored at the percentile specified in its argument. To find the median, we use the value of 50.

out.println("The median is " + statTest.getPercentile(50)); 
 

Our output is as follows:

The median is 16.2

In Chapter 5, Statistical Data Analysis Techniques, we will demonstrate how to perform regression analysis using the Apache Commons SimpleRegression class.
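
As a preview, a minimal sketch using SimpleRegression (from the org.apache.commons.math3 packages) fits a line to a handful of hypothetical observations and makes a prediction:

import org.apache.commons.math3.stat.regression.SimpleRegression;

public class RegressionExample {
    public static void main(String[] args) {
        SimpleRegression regression = new SimpleRegression();
        // Hypothetical (x, y) observations
        regression.addData(1, 1.2);
        regression.addData(2, 2.1);
        regression.addData(3, 2.9);
        regression.addData(4, 4.2);

        System.out.println("Slope: " + regression.getSlope());
        System.out.println("Intercept: " + regression.getIntercept());
        System.out.println("Prediction at x=5: " + regression.predict(5));
    }
}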

Machine learning applied to data science

Machine learning has become increasingly important for data science analysis, as it has for a multitude of other fields. A defining characteristic of machine learning is the ability of a model to be trained on a set of representative data and then later used to solve similar problems. There is no need to explicitly program an application to solve the problem. A model is a representation of a real-world object or process.

For example, customer purchases can be used to train a model. Predictions can then be made about the types of purchases a customer might make in the future. This allows an organization to tailor ads and coupons for a customer and potentially provide a better customer experience.

Training can be performed in one of several different approaches:

  • Supervised learning: The model is trained with annotated, labeled, data showing corresponding correct results
  • Unsupervised learning: The data does not contain results, but the model is expected to find relationships on its own
  • Semi-supervised learning: A small amount of labeled data is combined with a larger amount of unlabeled data
  • Reinforcement learning: This is similar to supervised learning, but a reward is provided for good results

There are several approaches that support machine learning. In Chapter 6, Machine Learning, we will illustrate three techniques:

  • Decision trees: A tree is constructed using features of the problem as internal nodes and the results as leaves
  • Support vector machines: This is used for classification by creating a hyperplane that partitions the dataset and then makes predictions
  • Bayesian networks: This is used to depict probabilistic relationships between events

A Support Vector Machine (SVM) is used primarily for classification-type problems. The approach creates a hyperplane to categorize data, which can be envisioned as a geometric plane that separates two regions. In a two-dimensional space, it will be a line. In a three-dimensional space, it will be a two-dimensional plane. In Chapter 6, Machine Learning, we will demonstrate this approach using a set of data relating to the propensity of individuals to camp. We will use the Weka class, SMO, to demonstrate this type of analysis.

The following figure depicts a hyperplane using a distribution of two types of data points. The lines represent possible hyperplanes that separate these points. The lines clearly separate the data points except for one outlier.

[Figure: two classes of data points with lines representing candidate hyperplanes]

Once the model has been trained, the possible hyperplanes are considered and predictions can then be made using similar data.
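
A minimal sketch of this type of analysis, assuming a hypothetical ARFF file named camping.arff whose last attribute is the class to predict, might look like this:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.functions.SMO;
import weka.core.Instances;

public class SvmExample {
    public static void main(String[] args) throws Exception {
        try (BufferedReader reader =
                new BufferedReader(new FileReader("camping.arff"))) {
            Instances data = new Instances(reader);
            data.setClassIndex(data.numAttributes() - 1);

            // Train the SVM on the full dataset
            SMO smo = new SMO();
            smo.buildClassifier(data);

            // Classify the first instance using the trained model
            double label = smo.classifyInstance(data.instance(0));
            System.out.println("Predicted class: "
                + data.classAttribute().value((int) label));
        }
    }
}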

Using neural networks in data science

An Artificial Neural Network (ANN), which we will call a neural network, is based on the neurons found in the brain. A neuron is a cell that has dendrites connecting it to input sources and to other neurons. Depending on the input sources and the weights allocated to them, the neuron is activated and fires a signal to other neurons. A collection of neurons can be trained to respond to a set of input signals.

An artificial neuron is a node that has one or more inputs and a single output. Each input has a weight assigned to it that can change over time. A neural network learns by feeding an input into the network, invoking an activation function, and comparing the results against the expected output. This function combines the inputs and creates an output. If the outputs of multiple neurons match the expected result, then the network has been trained correctly. If they don't match, then the network's weights are modified.
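
The behavior of a single artificial neuron can be sketched directly: a weighted sum of the inputs, plus a bias, is passed through an activation function such as the sigmoid. The inputs and weights here are arbitrary:

public class NeuronExample {
    // Sigmoid activation squashes the weighted sum into the range (0, 1)
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static void main(String[] args) {
        double[] inputs  = {0.8, 0.3, 0.5};   // hypothetical input signals
        double[] weights = {0.4, -0.6, 0.9};  // one weight per input
        double bias = 0.1;

        double sum = bias;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];    // weighted sum of the inputs
        }
        System.out.println("Neuron output: " + sigmoid(sum));
    }
}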

A neural network can be visualized as shown in the following figure, where Hidden Layer is used to augment the process:

[Figure: a neural network with an input layer, a Hidden Layer, and an output layer]

In Chapter 7, Neural Networks, we will use the Weka class, MultilayerPerceptron, to illustrate the creation and use of a Multi Layer Perceptron (MLP) network. As we will explain, this type of network is a feedforward neural network with multiple layers. The network uses supervised learning with backpropagation. The example uses a dataset called dermatology.arff that contains 366 instances that are used to diagnose erythemato-squamous diseases. It uses 34 attributes to classify the disease into one of the five different categories.

The dataset is split into a training set and a testing set. Once the data has been read, the MLP instance is created and initialized using methods that configure the attributes of the model, including how quickly the model learns and the amount of time spent training it.

String trainingFileName = "dermatologyTrainingSet.arff"; 
String testingFileName = "dermatologyTestingSet.arff"; 
 
try (FileReader trainingReader = new FileReader(trainingFileName); 
        FileReader testingReader =  
            new FileReader(testingFileName)) { 
    Instances trainingInstances = new Instances(trainingReader); 
    trainingInstances.setClassIndex( 
        trainingInstances.numAttributes() - 1); 
    Instances testingInstances = new Instances(testingReader); 
    testingInstances.setClassIndex( 
        testingInstances.numAttributes() - 1); 
 
    MultilayerPerceptron mlp = new MultilayerPerceptron(); 
    mlp.setLearningRate(0.1); 
    mlp.setMomentum(0.2); 
    mlp.setTrainingTime(2000); 
    mlp.setHiddenLayers("3"); 
    mlp.buildClassifier(trainingInstances); 
    ... 
} catch (Exception ex) { 
    // Handle exceptions 
} 

The model is then evaluated using the testing data:

Evaluation evaluation = new Evaluation(trainingInstances); 
evaluation.evaluateModel(mlp, testingInstances); 

The results can then be displayed:

System.out.println(evaluation.toSummaryString()); 

The truncated output of this example is shown here where the number of correctly and incorrectly identified diseases are listed:

Correctly Classified Instances 73 98.6486 %

Incorrectly Classified Instances 1 1.3514 %

The various attributes of the model can be tweaked to improve the model. In Chapter 7, Neural Networks, we will discuss this and other techniques in more depth.

Deep learning approaches

Deep learning networks are often described as neural networks that use multiple intermediate layers. Each layer trains on the outputs of the previous layer, potentially identifying features and subfeatures of the dataset. The features refer to those aspects of the data that may be of interest. In Chapter 8, Deep Learning, we will examine these types of networks and how they can support several different data science tasks.

These networks often work with unstructured and unlabeled datasets, which constitute the vast majority of the data available today. A typical approach is to take the data, identify features, and then use these features and their corresponding layers to reconstruct the original dataset, thus validating the network. The Restricted Boltzmann Machine (RBM) is a good example of the application of this approach.

The deep learning network needs to ensure that the results are accurate and to minimize any error that can creep into the process. This is accomplished by adjusting the internal weights assigned to neurons using a technique known as gradient descent, which follows the slope of the error with respect to the weights. The approach modifies the weights so as to minimize the error and also speeds up the learning process.
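
The core of the idea can be sketched with a single weight: the weight is repeatedly nudged in the direction that reduces the squared error. The values below are arbitrary:

public class GradientDescentExample {
    public static void main(String[] args) {
        // Minimize the squared error between a prediction and a target
        double weight = 0.5;          // initial weight
        double input = 1.5, target = 3.0;
        double learningRate = 0.1;

        for (int step = 0; step < 20; step++) {
            double prediction = weight * input;
            double error = prediction - target;
            // Gradient of the squared error with respect to the weight
            double gradient = 2 * error * input;
            weight -= learningRate * gradient;  // move against the slope
        }
        System.out.println("Learned weight: " + weight); // approaches 2.0
    }
}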

There are several types of networks that have been classified as deep learning networks. One of these is an autoencoder network. In this network, the layers are symmetrical: the number of input values is the same as the number of output values, and the intermediate layers effectively compress the data into a single smaller internal layer. Each layer of the autoencoder is an RBM.

This structure is reflected in the following example, which will extract the numbers found in a set of images containing handwritten digits. The details of the complete example are not shown here, but notice that 1,000 input and output values are used along with internal layers consisting of RBMs. The sizes of the layers are specified using the nIn and nOut methods.

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder() 
        .seed(seed) 
        .iterations(numberOfIterations) 
        .optimizationAlgo( 
           OptimizationAlgorithm.LINE_GRADIENT_DESCENT) 
        .list() 
        .layer(0, new RBM.Builder() 
            .nIn(numberOfRows * numberOfColumns).nOut(1000) 
            .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
            .build()) 
        .layer(1, new RBM.Builder().nIn(1000).nOut(500) 
            .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
            .build()) 
        .layer(2, new RBM.Builder().nIn(500).nOut(250) 
            .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
            .build()) 
        .layer(3, new RBM.Builder().nIn(250).nOut(100) 
            .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
            .build()) 
        .layer(4, new RBM.Builder().nIn(100).nOut(30) 
            .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
            .build()) //encoding stops 
        .layer(5, new RBM.Builder().nIn(30).nOut(100) 
            .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
            .build()) //decoding starts 
        .layer(6, new RBM.Builder().nIn(100).nOut(250) 
            .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
            .build()) 
        .layer(7, new RBM.Builder().nIn(250).nOut(500) 
            .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
            .build()) 
        .layer(8, new RBM.Builder().nIn(500).nOut(1000) 
            .lossFunction(LossFunctions.LossFunction.RMSE_XENT) 
            .build()) 
        .layer(9, new OutputLayer.Builder( 
                LossFunctions.LossFunction.RMSE_XENT).nIn(1000) 
                .nOut(numberOfRows * numberOfColumns).build()) 
        .pretrain(true).backprop(true) 
        .build(); 

Once the model has been trained, it can be used for predictive and searching tasks. With a search, the compressed middle layer can be used to match other compressed images that need to be classified.

Performing text analysis

The field of Natural Language Processing (NLP) is used for many different tasks, including text searching, language translation, sentiment analysis, speech recognition, and classification, to mention just a few. Processing text is difficult for a number of reasons, including the inherent ambiguity of natural languages.

There are several different types of processing that can be performed such as:

  • Identifying stop words: These are common words that may not be necessary for processing
  • Named Entity Recognition (NER): This is the process of identifying elements of text such as people's names, locations, or things
  • Parts of Speech (POS): This identifies the grammatical parts of a sentence, such as nouns, verbs, adjectives, and so on
  • Relationships: Here we are concerned with identifying how parts of text are related to each other, such as the subject and object of a sentence

As with most data science problems, it is important to preprocess and clean text. In Chapter 9, Text Analysis, we examine the support Java provides for this area of data science.

For example, we will use Apache's OpenNLP (https://opennlp.apache.org/) library to find the parts of speech. This is just one of several NLP APIs that we could have used, including LingPipe (http://alias-i.com/lingpipe/), Apache UIMA (https://uima.apache.org/), and Stanford NLP (http://nlp.stanford.edu/). We chose OpenNLP because it is easy to use for this example.

In the following example, a model used to identify POS elements is found in the en-pos-maxent.bin file. An array of words is initialized and the POS model is created:

try (InputStream input = new FileInputStream( 
        new File("en-pos-maxent.bin"));) { 
    String sentence = "Let's parse this sentence."; 
    ... 
    String[] words; 
    ...  
    list.toArray(words); 
    POSModel posModel = new POSModel(input); 
    ... 
} catch (IOException ex) { 
    // Handle exceptions 
} 

The tag method is passed an array of words and returns an array of tags. The words and tags are then displayed.

String[] posTags = posTagger.tag(words); 
for(int i=0; i<posTags.length; i++) { 
    out.println(words[i] + " - " + posTags[i]); 
} 

The output for this example is as follows:

Let's - NNP

parse - NN

this - DT

sentence. - NN

The abbreviations NNP and DT stand for a singular proper noun and determiner respectively. We examine several other NLP techniques in Chapter 9, Text Analysis.

Visual and audio analysis

In Chapter 10, Visual and Audio Analysis, we demonstrate several Java techniques for processing sounds and images. We begin by demonstrating techniques for sound processing, including speech recognition and text-to-speech APIs. Specifically, we will use the FreeTTS (http://freetts.sourceforge.net/docs/index.php) API to convert text to speech. We also include a demonstration of the CMU Sphinx toolkit for speech recognition.

The Java Speech API (JSAPI) (http://www.oracle.com/technetwork/java/index-140170.html) supports speech technology. This API, implemented by third-party vendors, supports speech recognition and speech synthesis. FreeTTS and Festival (http://www.cstr.ed.ac.uk/projects/festival/) are examples of vendors supporting JSAPI.
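
As a preview of the FreeTTS demonstration, a minimal sketch that speaks a phrase using the kevin16 voice bundled with FreeTTS might look like this:

import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

public class TextToSpeechExample {
    public static void main(String[] args) {
        // kevin16 is one of the voices distributed with FreeTTS
        Voice voice = VoiceManager.getInstance().getVoice("kevin16");
        if (voice != null) {
            voice.allocate();   // load the voice resources
            voice.speak("Hello from Java text to speech.");
            voice.deallocate();
        } else {
            System.err.println("Voice not found");
        }
    }
}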

In the second part of the chapter, we examine image processing techniques such as facial recognition. This demonstration involves identifying faces within an image and is easy to accomplish using OpenCV (http://opencv.org/).

Also in Chapter 10, Visual and Audio Analysis, we demonstrate how to extract text from images, a process known as optical character recognition (OCR). A common data science problem involves extracting and analyzing text embedded in an image. For example, the information contained in license plates, road signs, and directions can be significant.

The following example, explained in more detail in Chapter 11, Mathematical and Parallel Techniques for Data Analysis, accomplishes OCR using Tess4j (http://tess4j.sourceforge.net/), a Java JNA wrapper for the Tesseract OCR API. We perform OCR on an image captured from the Wikipedia article on OCR (https://en.wikipedia.org/wiki/Optical_character_recognition#Applications), shown here:

[Image: sample text captured from the Wikipedia article on OCR]

The ITesseract interface provides numerous OCR methods. The doOCR method takes a file and returns a string containing the words found in the file as shown here:

ITesseract instance = new Tesseract();  
try { 
    String result = instance.doOCR(new File("OCRExample.png")); 
    System.out.println(result); 
} catch (TesseractException e) { 
    System.err.println(e.getMessage()); 
} 

A part of the output is shown next:

OCR engines nave been developed into many lunds oiobiectorlented OCR applicatlons, sucn as reoeipt OCR, involoe OCR, check OCR, legal billing document OCR

They can be used ior

- Data entry ior business documents, e g check, passport, involoe, bank statement and receipt

- Automatic number plate recognnlon

As you can see, there are numerous errors in this example that need to be addressed. We build upon this example in Chapter 11, Mathematical and Parallel Techniques for Data Analysis, with a discussion of enhancements and considerations to ensure the OCR process is as effective as possible.

We will conclude the chapter with a discussion of Neuroph Studio, a Java-based neural network editor, used to classify images and perform image recognition. In this section, we train a neural network to recognize and classify faces.

Improving application performance using parallel techniques

In Chapter 11, Mathematical and Parallel Techniques for Data Analysis, we consider some of the parallel techniques available for data science applications. Concurrent execution of a program can significantly improve performance. In relation to data science, these techniques range from low-level mathematical calculations to higher-level API-specific options.

This chapter includes a discussion of basic performance enhancement considerations. Algorithms and application architecture matter as much as enhanced code, and this should be considered when attempting to integrate parallel techniques. If an application does not behave in the expected or desired manner, any gains from parallel optimization are irrelevant.

Matrix operations are essential to many data applications and supporting APIs. We will include a discussion in this chapter about matrix multiplication and how it is handled using a variety of approaches. Even though these operations are often hidden within the API, it can be useful to understand how they are supported.

One approach we demonstrate utilizes the Apache Commons Math API (http://commons.apache.org/proper/commons-math/). This API supports a large number of mathematical and statistical operations, including matrix multiplication. The following example illustrates how to perform matrix multiplication.

We first declare and initialize matrices A and B:

double[][] A = { 
    {0.1950, 0.0311}, 
    {0.3588, 0.2203}, 
    {0.1716, 0.5931}, 
    {0.2105, 0.3242}}; 
 
double[][] B = { 
    {0.0502, 0.9823, 0.9472}, 
    {0.5732, 0.2694, 0.916}}; 

Apache Commons uses the RealMatrix class to store a matrix. Next, we use the Array2DRowRealMatrix constructor to create the corresponding matrices for A and B:

RealMatrix aRealMatrix = new Array2DRowRealMatrix(A); 
RealMatrix bRealMatrix = new Array2DRowRealMatrix(B); 

We perform multiplication simply using the multiply method:

RealMatrix cRealMatrix = aRealMatrix.multiply(bRealMatrix); 

Finally, we use a for loop to display the results:

for (int i = 0; i < cRealMatrix.getRowDimension(); i++) { 
    System.out.println(cRealMatrix.getRowVector(i)); 
} 

The output is as follows:

{0.02761552; 0.19992684; 0.2131916}
{0.14428772; 0.41179806; 0.54165016}
{0.34857924; 0.32834382; 0.70581912}
{0.19639854; 0.29411363; 0.4963528}

Another approach to concurrent processing involves the use of Java threads. Threads are used by APIs such as Aparapi when multiple CPUs or GPUs are not available.

Data science applications often take advantage of the map-reduce algorithm. We will demonstrate parallel processing by using Apache's Hadoop to perform map-reduce. Designed specifically for large datasets, Hadoop reduces processing time for large scale data science projects. We demonstrate a technique for calculating the average value of a large dataset.
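
The following minimal sketch of a mapper and reducer pair could compute such an average, assuming each input line holds a single numeric value; the class names and the shared key are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits every value under a single shared key
public class AverageMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private static final Text KEY = new Text("average");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(KEY,
            new DoubleWritable(Double.parseDouble(value.toString())));
    }
}

// Reducer: sums the values sent to the shared key and divides by the count
class AverageReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values,
            Context context) throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable value : values) {
            sum += value.get();
            count++;
        }
        context.write(key, new DoubleWritable(sum / count));
    }
}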

We also include examples of APIs that support multiple processors, including CUDA and OpenCL. CUDA is supported using Java bindings for CUDA (JCuda) (http://jcuda.org/). We also discuss OpenCL and its Java support. The Aparapi API provides high-level support for using multiple CPUs or GPUs and we include a demonstration of Aparapi in support of matrix multiplication.

Assembling the pieces

In the final chapter of this book, we will tie together many of the techniques explored in the previous chapters. We will create a simple console-based application for acquiring data from Twitter and performing various types of data manipulation and analysis. Our goal in this chapter is to demonstrate a simple project exploring a variety of data science concepts and provide insights and considerations for future projects.

Specifically, the application developed in the final chapter performs several high-level tasks, including data acquisition, data cleaning, sentiment analysis, and basic statistical collection. We demonstrate these techniques using Java 8 Streams and focus on Java 8 approaches whenever possible.

Summary

Data science is a broad, diverse field of study, and it would be impossible to explore it exhaustively within this book. We hope to provide a solid understanding of important data science concepts and equip the reader for further study. In particular, this book will provide concrete examples of different techniques for all stages of data science related inquiries, ranging from data acquisition and cleaning to detailed statistical analysis.

So let's start with a discussion of data acquisition and how Java supports it as illustrated in the next chapter.

 

Chapter 2. Data Acquisition

It is never much fun to work with code that is not formatted properly or that uses variable names that do not convey their intended purpose. The same can be said of data, except that bad data can result in inaccurate results. Thus, data acquisition is an important step in the analysis of data. Data is available from a variety of sources, but it must be retrieved and ultimately processed before it can be useful. We can find it in numerous public data sources as simple files, or it may be found in more complex forms across the Internet. In this chapter, we will demonstrate how to acquire data from several of these sources, including various Internet sites and several social media sites.

We can access data from the Internet by downloading specific files or through a process known as web scraping, which involves extracting the contents of a web page. We also explore a related topic known as web crawling, which involves applications that examine a website to determine whether it is of interest and then follow embedded links to identify other potentially relevant pages.

We can also extract data from social media sites. These types of sites often hold a treasure trove of data that is readily available if we know how to access it. In this chapter, we will demonstrate how to extract data from several sites, including:

  • Twitter
  • Wikipedia
  • Flickr
  • YouTube

When extracting data from a site, many different data formats may be encountered. We will examine three basic types: text, audio, and video. However, even within text, audio, and video data, many formats exist. For audio data alone, there are 45 audio coding formats compared at https://en.wikipedia.org/wiki/Comparison_of_audio_coding_formats. For textual data, there are almost 300 formats listed at http://fileinfo.com/filetypes/text. In this chapter, we will focus on how to download and extract these types of data as plain text for eventual processing.

We will briefly examine different data formats, followed by an examination of possible data sources. We need this knowledge to demonstrate how to obtain data using different data acquisition techniques.

Understanding the data formats used in data science applications

When we discuss data formats, we are referring to content format, as opposed to the underlying file format, which may not even be visible to most developers. We cannot examine every available format due to their vast number. Instead, we will tackle several of the more common formats, providing adequate examples to address the most common data retrieval needs. Specifically, we will demonstrate how to retrieve data stored in the following formats:

  • HTML
  • PDF
  • CSV/TSV
  • Spreadsheets
  • Databases
  • JSON
  • XML

Some of these formats are well supported and documented elsewhere. For example, XML has been in use for years, and there are several well-established techniques for accessing XML data in Java. For these types of data, we will outline the major techniques available and show a few examples to illustrate how they work. This will provide readers who are not familiar with these technologies some insight into their nature.

The most common data formats are binary files. For example, Word, Excel, and PDF documents are all stored in binary form and require special software to extract information from them. Text data is also very common.

Overview of CSV data

Comma Separated Values (CSV) files contain tabular data organized in a row-column format. The data, stored as plain text, is organized into rows, also called records. Each record contains fields separated by commas. These files are closely related to other delimited files, most notably Tab-Separated Values (TSV) files. The following is part of a simple CSV file; these numbers are not intended to represent any specific type of data:

JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE 
10001,44,22,0.5 
10002,35,19,0.54 
10003,1,1,1 

Notice that the first row contains header data to describe the subsequent records. Each value is separated by a comma and corresponds to the header in the same position. In Chapter 3, Data Cleaning, we will discuss CSV files in more depth and examine the support available for different types of delimiters.
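
As a simple preview, the records above can be read using core Java alone, pairing each field with its header; the filename is hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvReaderExample {
    public static void main(String[] args) {
        try (BufferedReader reader =
                new BufferedReader(new FileReader("participants.csv"))) {
            // The first row describes the fields
            String[] header = reader.readLine().split(",");
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                System.out.println(header[0] + ": " + fields[0]
                    + ", " + header[3] + ": " + fields[3]);
            }
        } catch (IOException ex) {
            // Handle exceptions
        }
    }
}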

Overview of spreadsheets

Spreadsheets are a form of tabular data where information is stored in rows and columns, much like a two-dimensional array. They typically contain numeric and textual information and use formulas to summarize and analyze their contents. Most people are familiar with Excel spreadsheets, but they are also found as part of other product suites, such as OpenOffice.

Spreadsheets are an important data source because they have been used for the past several decades to store information in many industries and applications. Their tabular nature makes them easy to process and analyze. It is important to know how to extract data from this ubiquitous data source so that we can take advantage of the wealth of information that is stored in them.

For some of our examples, we will use a simple Excel spreadsheet that consists of a series of rows containing an ID, along with minimum, maximum, and average values. These numbers are not intended to represent any specific type of data. The spreadsheet looks like this:

ID       Minimum   Maximum   Average
12345    45        89        65.55
23456    78        96        86.75
34567    56        89        67.44
45678    86        99        95.67

In Chapter 3, Data Cleaning, we will learn how to extract data from spreadsheets.
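
As a preview, a minimal sketch using Apache POI's XSSFWorkbook class, assuming the table above is saved in a hypothetical values.xlsx file, might print every cell as follows:

import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class SpreadsheetExample {
    public static void main(String[] args) {
        try (FileInputStream input = new FileInputStream("values.xlsx");
                XSSFWorkbook workbook = new XSSFWorkbook(input)) {
            Sheet sheet = workbook.getSheetAt(0);
            // Iterate over every row and cell in the first sheet
            for (Row row : sheet) {
                for (Cell cell : row) {
                    System.out.print(cell + "\t");
                }
                System.out.println();
            }
        } catch (IOException ex) {
            // Handle exceptions
        }
    }
}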

Overview of databases

Data can be found in Database Management Systems (DBMS), which, like spreadsheets, are ubiquitous. Java provides a rich set of options for accessing and processing data in a DBMS. The intent of this section is to provide a basic introduction to database access using Java.

We will demonstrate the essence of connecting to a database, storing information, and retrieving information using JDBC. For this example, we used the MySQL DBMS. However, it will work for other DBMSes as well with a change in the database driver. We created a database called example and a table called URLTABLE using the following command within the MySQL Workbench. There are other tools that can achieve the same results:

CREATE TABLE IF NOT EXISTS `URLTABLE` ( 
  `RecordID` INT(11) NOT NULL AUTO_INCREMENT, 
  `URL` text NOT NULL, 
  PRIMARY KEY (`RecordID`) 
); 

We start with a try block to handle exceptions. A driver is needed to connect to the DBMS. In this example, we used com.mysql.jdbc.Driver. To connect to the database, the getConnection method is used, where the database server location, user ID, and password are passed. These values depend on the DBMS used:

    try { 
        Class.forName("com.mysql.jdbc.Driver"); 
        String url = "jdbc:mysql://localhost:3306/example"; 
        connection = DriverManager.getConnection(url, "user ID", 
            "password"); 
            ... 
    } catch (SQLException | ClassNotFoundException ex) { 
        // Handle exceptions 
    } 

Next, we will illustrate how to add information to the database and then how to read it. The SQL INSERT command is constructed in a string. The name of the MySQL database is example. This command will insert values into the URLTABLE table in the database where the question mark is a placeholder for the value to be inserted:

    String insertSQL = "INSERT INTO  `example`.`URLTABLE` " 
        + "(`url`) VALUES " + "(?);"; 

The PreparedStatement class represents an SQL statement to execute. The prepareStatement method creates an instance of the class using the INSERT SQL statement:

    PreparedStatement stmt = connection.prepareStatement(insertSQL); 

We then add URLs to the table using the setString method and the execute method. The setString method takes two arguments. The first specifies the column index at which to insert the data, and the second is the value to be inserted. The execute method performs the actual insertion. We add two URLs in the next sequence:

    stmt.setString(1, "https://en.wikipedia.org/wiki/Data_science"); 
    stmt.execute(); 
    stmt.setString(1,  
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 
    stmt.execute(); 

To read the data, we use a SQL SELECT statement as declared in the selectSQL string. This will return all the rows and columns from the URLTABLE table. The createStatement method creates an instance of the Statement class, which is used to execute the query. The executeQuery method executes the query and returns a ResultSet instance that holds the contents of the table:

    String selectSQL = "select * from URLTABLE"; 
    Statement statement = connection.createStatement(); 
    ResultSet resultSet = statement.executeQuery(selectSQL); 

The following sequence iterates through the table, displaying one row at a time. The argument of the getString method specifies that we want to use the second column of the result set, which corresponds to the URL field:

    out.println("List of URLs"); 
    while (resultSet.next()) { 
        out.println(resultSet.getString(2)); 
    }  

The output of this example, when executed, is as follows:

List of URLs
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
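
Columns can also be retrieved by name rather than by index, which is less brittle if the table layout changes. The following is a minimal variation of the loop above:

    while (resultSet.next()) { 
        // Access the column by its name rather than its position 
        out.println(resultSet.getString("URL")); 
    } 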

If you need to empty the contents of the table, use the following sequence:

    Statement statement = connection.createStatement(); 
    statement.execute("TRUNCATE URLTABLE;"); 

This was a brief introduction to database access using Java. There are many resources available that will provide more in-depth coverage of this topic. For example, Oracle provides a more in-depth introduction to this topic at https://docs.oracle.com/javase/tutorial/jdbc/.

Overview of PDF files

The Portable Document Format (PDF) is a format not tied to a specific platform or software application. A PDF document can hold formatted text and images. PDF is an open standard, making it useful in a variety of places.

There are a large number of documents stored as PDF, making it a valuable source of data. There are several Java APIs that allow access to PDF documents, including Apache PDFBox and iText. Techniques for extracting information from a PDF document are illustrated in Chapter 3, Data Cleaning.
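
As a small preview, the following is a minimal text-extraction sketch using Apache PDFBox. The library choice and the sample.pdf filename are assumptions made for illustration:

    import java.io.File; 
    import java.io.IOException; 
    import org.apache.pdfbox.pdmodel.PDDocument; 
    import org.apache.pdfbox.text.PDFTextStripper; 
 
    try (PDDocument document = PDDocument.load(new File("sample.pdf"))) { 
        // Extract the text content of every page 
        PDFTextStripper stripper = new PDFTextStripper(); 
        out.println(stripper.getText(document)); 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 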

Overview of JSON

JavaScript Object Notation (JSON) (http://www.JSON.org/) is a data format used to interchange data. It is easy for humans or machines to read and write. JSON is supported by many languages, including Java, which has several JSON libraries listed at http://www.JSON.org/.

A JSON entity is composed of a set of name-value pairs enclosed in curly braces. We will use this format in several of our examples. When handling YouTube later in this chapter, we will use a JSON object, part of which is shown next, representing the results of a search request for a YouTube video:

{ 
  "kind": "youtube#searchResult", 
  "etag": "etag", 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  ... 
} 

Accessing the fields and values of such a document is not hard and is illustrated in Chapter 3, Data Cleaning.
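
As a small preview of that processing, the following sketch reads a few fields from a JSON string using the javax.json (JSON-P) API, one of the Java JSON libraries listed on the site above. The library choice and the hardcoded string are assumptions made for illustration:

    import java.io.StringReader; 
    import javax.json.Json; 
    import javax.json.JsonObject; 
    import javax.json.JsonReader; 
 
    String json = "{\"kind\": \"youtube#searchResult\", " 
        + "\"id\": {\"videoId\": \"abc123\"}}"; 
    try (JsonReader reader = Json.createReader(new StringReader(json))) { 
        // Navigate the name-value pairs, including the nested id object 
        JsonObject result = reader.readObject(); 
        out.println("kind: " + result.getString("kind")); 
        out.println("videoId: " 
            + result.getJsonObject("id").getString("videoId")); 
    } 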

Overview of XML

Extensible Markup Language (XML) is a markup language that specifies a standard document format. Widely used to communicate between applications and across the Internet, XML is popular due to its relative simplicity and flexibility. Documents encoded in XML are character-based and easily read by machines and humans.

XML documents contain markup and content characters. These characters allow parsers to classify the information contained within the document. A document consists of tags that delimit elements; elements may in turn contain other tags, forming child elements. Additionally, elements may contain attributes, specific characteristics stored as name-value pairs.

An XML document must be well-formed. This means it must follow certain rules such as always using closing tags and only a single root tag. Other rules are discussed at https://en.wikipedia.org/wiki/XML#Well-formedness_and_error-handling.

The Java API for XML Processing (JAXP) consists of three interfaces for parsing XML data. The Document Object Model (DOM) interface parses an XML document and returns a tree structure delineating the structure of the document. The DOM interface parses an entire document as a whole. Alternatively, the Simple API for XML (SAX) parses a document one element at a time. SAX is preferable when memory usage is a concern as DOM requires more resources to construct the tree. DOM, however, offers flexibility over SAX in that any element can be accessed at any time and in any order.

The third Java API is known as Streaming API for XML (StAX). This streaming model was designed to accommodate the best parts of DOM and SAX models by granting flexibility without sacrificing resources. StAX exhibits higher performance, with the trade-off being that access is only available to one location in a document at a time. StAX is the preferred technique if you already know how you want to process the document, but it is also popular for applications with limited available memory.

The following is a simple XML file. Each name enclosed in angle brackets, such as <music>, is a tag labelling the element it delimits. In this case, the root element of our file is <music>, and contained within it are sets of song data. Each tag within a <song> tag describes another element corresponding to that song. Every opening tag eventually has a matching closing tag, such as </song>. Notice that the first line contains information about which XML version should be used to parse the file:

<?xml version="1.0"?> 
<music> 
   <song id="1234"> 
      <artist>Patton, Courtney</artist> 
      <name>So This Is Life</name> 
      <genre>Country</genre> 
      <price>2.99</price> 
   </song> 
   <song id="5678"> 
      <artist>Eady, Jason</artist> 
      <name>AM Country Heaven</name> 
      <genre>Country</genre> 
      <price>2.99</price> 
   </song> 
</music> 
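
To make the DOM approach concrete, the following sketch parses the file above, assumed here to be saved as music.xml, using the standard JAXP DocumentBuilder. It prints each song's id attribute together with its name element:

    import java.io.File; 
    import javax.xml.parsers.DocumentBuilder; 
    import javax.xml.parsers.DocumentBuilderFactory; 
    import org.w3c.dom.Document; 
    import org.w3c.dom.Element; 
    import org.w3c.dom.NodeList; 
 
    try { 
        DocumentBuilder builder = 
            DocumentBuilderFactory.newInstance().newDocumentBuilder(); 
        // DOM parses the entire document into an in-memory tree 
        Document doc = builder.parse(new File("music.xml")); 
        NodeList songs = doc.getElementsByTagName("song"); 
        for (int i = 0; i < songs.getLength(); i++) { 
            Element song = (Element) songs.item(i); 
            out.println(song.getAttribute("id") + ": " 
                + song.getElementsByTagName("name") 
                    .item(0).getTextContent()); 
        } 
    } catch (Exception ex) { 
        // Handle exceptions 
    } 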

There are numerous other XML-related technologies. For example, we can validate a specific XML document using either a DTD or an XML schema written specifically for that document. XML documents can also be transformed into a different format using XSLT.

Overview of streaming data

Streaming data refers to data generated in a continuous stream and accessed in a sequential, piece-by-piece manner. Much of the data the average Internet user accesses is streamed, including video and audio channels, or text and image data on social media sites. Streaming data is the preferred method when the data is new and changing quickly, or when large data collections are sought.

Streamed data is often ideal for data science research because it generally exists in large quantities and raw format. Much public streaming data is available for free and supported by Java APIs. In this chapter, we are going to examine how to acquire data from streaming sources, including Twitter, Flickr, and YouTube. Despite the use of different techniques and APIs, you will notice similarities between the techniques used to pull data from these sites.

Overview of audio/video/images in Java

There are a large number of formats used to represent images, videos, and audio. This type of data is typically stored in binary format. Analog audio streams are sampled and digitized. Images are often simply collections of bits representing the color of a pixel.

Frequently, this type of data can be quite large and must be compressed. Two approaches to compression are used. The first is lossless compression, where less space is used and no information is lost. The second is lossy compression, where information is lost. Losing information is not always a bad thing, as sometimes the loss is not noticeable to humans.
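
To illustrate the difference, the following sketch writes the same image in a lossless format (PNG) and a lossy format (JPEG) using the standard ImageIO class. The input filename is an assumption made for illustration:

    import java.awt.image.BufferedImage; 
    import java.io.File; 
    import java.io.IOException; 
    import javax.imageio.ImageIO; 
 
    try { 
        BufferedImage image = ImageIO.read(new File("input.jpg")); 
        // PNG preserves every pixel; JPEG discards detail to save space 
        ImageIO.write(image, "png", new File("lossless.png")); 
        ImageIO.write(image, "jpg", new File("lossy.jpg")); 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 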

As we will demonstrate in Chapter 3, Data Cleaning, this type of data is often flawed in some inconvenient fashion and may need to be cleaned. For example, there may be background noise in an audio recording, or an image may need to be smoothed before it can be processed. Image smoothing is demonstrated in Chapter 3, Data Cleaning, using the OpenCV library.

Data acquisition techniques

In this section, we will illustrate how to acquire data from web pages. Web pages contain a potential bounty of useful information. We will demonstrate how to access web pages using several technologies, starting with a low-level approach supported by the HttpUrlConnection class. To find pages, a web crawler application is often used. Once a useful page has been identified, then information needs to be extracted from the page. This is often performed using an HTML parser. Extracting this information is important because it is often buried amid a clutter of HTML tags and JavaScript code.

Using the HttpURLConnection class

The contents of a web page can be accessed using the HttpURLConnection class. This is a low-level approach that requires the developer to do a lot of footwork to extract relevant content. In return, the developer can exercise greater control over how the content is handled. In some situations, this approach may be preferable to using other API libraries.

We will demonstrate how to download the content of Wikipedia's data science page using this class. We start with a try/catch block to handle exceptions. A URL object is created using the data science URL string. The openConnection method will create a connection to the Wikipedia server as shown here:

    try { 
        URL url = new URL( 
            "https://en.wikipedia.org/wiki/Data_science"); 
        HttpURLConnection connection = (HttpURLConnection)  
            url.openConnection(); 
       ... 
    } catch (MalformedURLException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

The connection object is initialized with an HTTP GET command. The connect method is then executed to connect to the server:

    connection.setRequestMethod("GET"); 
    connection.connect(); 

Assuming no errors were encountered, we can determine whether the response was successful using the getResponseCode method. A normal return value is 200. The content of a web page can vary. For example, the getContentType method returns a string describing the page's content. The getContentLength method returns its length:

    out.println("Response Code: " + connection.getResponseCode()); 
    out.println("Content Type: " + connection.getContentType()); 
    out.println("Content Length: " + connection.getContentLength()); 

Assuming that we get an HTML formatted page, the next sequence illustrates how to get this content. A BufferedReader instance is created, one line at a time is read from the website, and each line is appended to a StringBuilder instance. The buffer is then displayed:

    InputStreamReader isr = new InputStreamReader((InputStream)  
        connection.getContent()); 
    BufferedReader br = new BufferedReader(isr); 
    StringBuilder buffer = new StringBuilder(); 
    String line; 
    // Read until readLine returns null to avoid appending a trailing "null" 
    while ((line = br.readLine()) != null) { 
        buffer.append(line).append("\n"); 
    } 
    out.println(buffer.toString()); 

The abbreviated output is shown here:

Response Code: 200 
Content Type: text/html; charset=UTF-8 
Content Length: -1
<!DOCTYPE html> 
<html lang="en" dir="ltr" class="client-nojs"> 
<head> 
<meta charset="UTF-8"/>
<title>Data science - Wikipedia, the free encyclopedia</title> 
<script>document.documentElement.className =
...
"wgHostname":"mw1251"});});</script> 
</body>
</html>

While this is feasible, there are easier methods for getting the contents of a web page. One of these techniques is discussed in the next section.

Web crawlers in Java

Web crawling is the process of traversing a series of interconnected web pages and extracting relevant information from those pages. It does this by isolating and then following links on a page. While there are many precompiled datasets readily available, it may still be necessary to collect data directly off the Internet. Some sources such as news sites are continually being updated and need to be revisited from time to time.

A web crawler is an application that visits various sites and collects information. The web crawling process consists of a series of steps:

  1. Select a URL to visit
  2. Fetch the page
  3. Parse the page
  4. Extract relevant content
  5. Extract relevant URLs to visit

This process is repeated for each URL visited. A minimal sketch of the loop follows.
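
In this sketch, a queue serves as the URL frontier and a set tracks visited pages. The fetchPage and extractLinks methods are hypothetical stand-ins for steps 2 through 5:

    import java.util.ArrayDeque; 
    import java.util.Collections; 
    import java.util.HashSet; 
    import java.util.List; 
    import java.util.Queue; 
    import java.util.Set; 
 
    public static void crawl(String seed, int pageLimit) { 
        Queue<String> frontier = new ArrayDeque<>(); 
        Set<String> visited = new HashSet<>(); 
        frontier.add(seed); 
        while (!frontier.isEmpty() && visited.size() < pageLimit) { 
            String url = frontier.poll();        // 1. Select a URL to visit 
            if (!visited.add(url)) { 
                continue;                        // Skip pages already seen 
            } 
            String page = fetchPage(url);        // 2. Fetch the page 
            frontier.addAll(extractLinks(page)); // 3-5. Parse the page and 
                                                 //      extract content/URLs 
        } 
    } 
 
    // Hypothetical helpers; concrete implementations appear later 
    private static String fetchPage(String url) { return ""; } 
    private static List<String> extractLinks(String page) { 
        return Collections.emptyList(); 
    } 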

There are several issues that need to be considered when fetching and parsing a page such as:

  • Page importance: We do not want to process irrelevant pages.
  • Exclusively HTML: We will not normally follow links to images, for example.
  • Spider traps: We want to bypass sites that may result in an infinite number of requests. This can occur with dynamically generated pages where one request leads to another.
  • Repetition: It is important to avoid crawling the same page more than once.
  • Politeness: Do not make an excessive number of requests to a website. Observe the robots.txt file; it specifies which parts of a site should not be crawled.

The process of creating a web crawler can be daunting. For all but the simplest needs, it is recommended that one of several open source web crawlers, such as crawler4j, be used.

We can either create our own web crawler or use an existing crawler and in this chapter we will examine both approaches. For specialized processing, it can be desirable to use a custom crawler. We will demonstrate how to create a simple web crawler in Java to provide more insight into how web crawlers work. This will be followed by a brief discussion of other web crawlers.

Creating your own web crawler

Now that we have a basic understanding of web crawlers, we are ready to create our own. In this simple web crawler, we will keep track of the pages visited using ArrayList instances. In addition, jsoup will be used to parse a web page and we will limit the number of pages we visit. Jsoup (https://jsoup.org/) is an open source HTML parser. This example demonstrates the basic structure of a web crawler and also highlights some of the issues involved in creating a web crawler.

We will use the SimpleWebCrawler class, as declared here:

public class SimpleWebCrawler { 
 
    private String topic; 
    private String startingURL; 
    private String urlLimiter; 
    private final int pageLimit = 20; 
    private ArrayList<String> visitedList = new ArrayList<>(); 
    private ArrayList<String> pageList = new ArrayList<>(); 
    ... 
    public static void main(String[] args) { 
        new SimpleWebCrawler(); 
    } 
 
} 

The instance variables are detailed here:

  • topic: The keyword that needs to be in a page for the page to be accepted
  • startingURL: The URL of the first page
  • urlLimiter: A string that must be contained in a link before it will be followed
  • pageLimit: The maximum number of pages to retrieve
  • visitedList: An ArrayList containing the pages that have already been visited
  • pageList: An ArrayList containing the URLs of the pages of interest

In the SimpleWebCrawler constructor, we initialize the instance variables to begin the search from the Wikipedia page for Bishop Rock, a rocky islet in the Isles of Scilly, off the southwestern coast of England. This was chosen to minimize the number of pages that might be retrieved. As we will see, there are many more Wikipedia pages dealing with Bishop Rock than one might think.

The urlLimiter variable is set to Bishop_Rock, which will restrict the embedded links to follow to just those containing that string. Each page of interest must contain the value stored in the topic variable. The visitPage method performs the actual crawl:

    public SimpleWebCrawler() { 
        startingURL = "https://en.wikipedia.org/wiki/Bishop_Rock,_" 
            + "Isles_of_Scilly"; 
        urlLimiter = "Bishop_Rock"; 
        topic = "shipping route"; 
        visitPage(startingURL); 
    } 

In the visitPage method, the pageList ArrayList is checked to see whether the maximum number of accepted pages has been exceeded. If the limit has been exceeded, then the search terminates:

    public void visitPage(String url) { 
        if (pageList.size() >= pageLimit) { 
            return; 
        } 
       ... 
    } 

If the page has already been visited, then we ignore it. Otherwise, it is added to the visited list:

    if (visitedList.contains(url)) { 
        // URL already visited 
    } else { 
        visitedList.add(url); 
            ... 
    } 

Jsoup is used to parse the page and return a Document object. There are many different exceptions and problems that can occur, such as a malformed URL, retrieval timeouts, or simply bad links. The catch block needs to handle these types of problems. We will provide a more in-depth explanation of jsoup in the Web scraping in Java section:

    try { 
        Document doc = Jsoup.connect(url).get(); 
        ... 
    } catch (Exception ex) { 
        // Handle exceptions 
    } 

If the document contains the topic text, then the link is displayed and added to the pageList ArrayList. Each embedded link is obtained, and if the link contains the limiting text, then the visitPage method is called recursively:

    if (doc.text().contains(topic)) { 
        out.println((pageList.size() + 1) + ": [" + url + "]"); 
        pageList.add(url); 
 
        // Process page links 
        Elements questions = doc.select("a[href]"); 
        for (Element link : questions) { 
            if (link.attr("href").contains(urlLimiter)) { 
                visitPage(link.attr("abs:href")); 
            } 
        } 
    } 

This approach only examines links in those pages that contain the topic text. Moving the for loop outside of the if statement will test the links for all pages.

The output follows:

1: [https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly]
2: [https://en.wikipedia.org/wiki/Bishop_Rock_Lighthouse]
3: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=717634231#Lighthouse]
4: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=717634231]
5: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716622943]
6: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716622943]
7: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716608512]
8: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716608512]
...
20: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716603919]

In this example, we did not save the results of the crawl to an external store. Normally this is necessary, and the results can be stored in a file or database.
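
For example, the URLs collected in pageList could be written out with the NIO Files class, as sketched here; the crawl.txt filename is an assumption made for illustration:

    import java.io.IOException; 
    import java.nio.file.Files; 
    import java.nio.file.Paths; 
 
    try { 
        // Write each collected URL on its own line 
        Files.write(Paths.get("crawl.txt"), pageList); 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 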

Using the crawler4j web crawler

Here we will illustrate the use of the crawler4j (https://github.com/yasserg/crawler4j) web crawler. We will use an adapted version of the basic crawler found at https://github.com/yasserg/crawler4j/tree/master/src/test/java/edu/uci/ics/crawler4j/examples/basic. We will create two classes: CrawlerController and SampleCrawler. The former sets up the crawler, while the latter contains the logic that controls which pages will be processed.

As with our previous crawler, we will crawl the Wikipedia article dealing with Bishop Rock. The results using this crawler will be smaller as many extraneous pages are ignored.

Let's look at the CrawlerController class first. There are several parameters that are used with the crawler as detailed here:

  • Crawl storage folder: The location where crawl data is stored
  • Number of crawlers: This controls the number of threads used for the crawl
  • Politeness delay: The number of milliseconds to pause between requests
  • Crawl depth: How deep the crawl will go
  • Maximum number of pages to fetch: How many pages to fetch
  • Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController { 
 
  public static void main(String[] args) throws Exception { 
    int numberOfCrawlers = 2; 
    CrawlConfig config = new CrawlConfig(); 
    String crawlStorageFolder = "data"; 
     
    config.setCrawlStorageFolder(crawlStorageFolder); 
    config.setPolitenessDelay(500); 
    config.setMaxDepthOfCrawling(2); 
    config.setMaxPagesToFetch(20); 
    config.setIncludeBinaryContentInCrawling(false); 
    ... 
  } 
}

Next, the CrawlController class is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes used to handle robots.txt files. These files contain instructions that are intended to be read by a web crawler. They provide direction to help a crawler do a better job, such as specifying which parts of a site should not be crawled. This is useful for automatically generated pages:

    PageFetcher pageFetcher = new PageFetcher(config); 
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
    RobotstxtServer robotstxtServer =  
        new RobotstxtServer(robotstxtConfig, pageFetcher); 
    CrawlController controller =  
        new CrawlController(config, pageFetcher, robotstxtServer); 

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

    controller.addSeed( 
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 

The start method will begin the crawling process:

    controller.start(SampleCrawler.class, numberOfCrawlers); 

The SampleCrawler class contains two methods of interest. The first is the shouldVisit method, which determines whether a page will be visited; the second is the visit method, which actually handles the page. We start with the class declaration and the declaration of a Java regular expression Pattern object. It provides one way of determining whether a page will be visited. In this declaration, common image file extensions are specified and will be ignored:

    public class SampleCrawler extends WebCrawler { 
        private static final Pattern IMAGE_EXTENSIONS =  
            Pattern.compile(".*\\.(bmp|gif|jpg|png)$"); 
 
        ... 
    } 

The shouldVisit method is passed a reference to the page where the URL was found, along with the URL itself. If the URL matches one of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/. We added this to restrict our searches to the Wikipedia website:

    public boolean shouldVisit(Page referringPage, WebURL url) { 
        String href = url.getURL().toLowerCase(); 
        if (IMAGE_EXTENSIONS.matcher(href).matches()) { 
            return false; 
        } 
        return href.startsWith("https://en.wikipedia.org/wiki/"); 
    }

The visit method is passed a Page object representing the page being visited. In this implementation, only those pages containing the string shipping route will be processed. This further restricts the pages visited. When we find such a page, its URL, text, and text length are displayed:

    public void visit(Page page) { 
        String url = page.getWebURL().getURL(); 
 
        if (page.getParseData() instanceof HtmlParseData) { 
            HtmlParseData htmlParseData =  
                (HtmlParseData) page.getParseData(); 
            String text = htmlParseData.getText(); 
            if (text.contains("shipping route")) { 
                out.println("\nURL: " + url); 
                out.println("Text: " + text); 
                out.println("Text length: " + text.length()); 
            } 
        } 
    } 

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)
...
Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page.

We could perform further processing, but this example provides some insight into how the API works. Significant amounts of information can be obtained when visiting a page. In this example, we only used the URL and the length of the text. The following is a sample of other data that you may be interested in obtaining; a sketch of retrieving these values follows the list:

  • URL path
  • Parent URL
  • Anchor
  • HTML text
  • Outgoing links
  • Document ID
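
The following sketch, placed inside the visit method shown earlier, suggests how some of these values might be retrieved; the method names follow the crawler4j WebURL and HtmlParseData classes:

    WebURL webURL = page.getWebURL(); 
    out.println("URL path: " + webURL.getPath()); 
    out.println("Parent URL: " + webURL.getParentUrl()); 
    out.println("Anchor: " + webURL.getAnchor()); 
    out.println("HTML length: " + htmlParseData.getHtml().length()); 
    out.println("Outgoing links: " 
        + htmlParseData.getOutgoingUrls().size()); 
    out.println("Document ID: " + webURL.getDocid()); 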

Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.

There are several ways of obtaining an HTML document that may be useful. The HTML document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
     } catch (IOException ex) { 
        // Handle exception 
    } 

Loading from a file uses the File class as shown next. The overloaded parse method uses the file to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence where the parse method processes the string that duplicates the previous HTML file:

    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href="https://en.wikipedia.org/wiki/Data_science">" + 
          "DataScience</a>\n"
        + "<br>\n" 
        + "<a href="https://en.wikipedia.org/wiki/Jsoup">" + 
          "Jsoup</a>\n"
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + " <img src="eyechart.jpg" alt="Eye Chart"> \n"
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method uses a string specifying the element of a document to retrieve:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The parameter type of the select method is a string. By using a string, the type of information selected is easily changed. Details on how to formulate this string are found in the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/.
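
A few other selector strings, shown here for illustration, give a sense of what is possible:

    Elements paragraphs = document.select("p");        // All <p> elements 
    Elements secureLinks = 
        document.select("a[href^=https]");             // href starts with https 
    Elements headings = document.select("h1, h2");     // Top-level headings 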

We can use the select method to retrieve the images in a document; here, only images whose src attribute ends in .png are selected:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

jsoup possesses many additional capabilities. However, this example demonstrates the web scraping process. There are also other Java HTML parsers available. A comparison of Java HTML parsers, among others, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Social media sites contain a wealth of information that can be processed and used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion of OAuth, which provides one approach to authenticating access to a data source.

When working with these types of data sources, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites have limits on the number of requests that can be made. Keep this in mind when pulling data from a site. If these limits need to be exceeded, then most sites provide a way of doing this.

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth and is not backwards compatible. It provides client developers with a simple way of handling authentication. Several companies use OAuth 2.0, including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handling Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limited in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access secret token. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON files.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site, as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki, we will use Bliki, found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query, including latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media; there are three possible arguments: "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. An instance of the class can be obtained from the Flickr object's getPhotosInterface method or created directly with the constructor, as shown next. The SearchParameters instance is the first parameter of the search method. The second parameter determines how many photos are retrieved per page, and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The YouTube class's setApplicationName method gives it a name, and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Using the HttpUrlConnection class

The contents of a web page can be accessed using the HttpUrlConnection class. This is a low-level approach that requires the developer to do a lot of footwork to extract relevant content. However, he or she is able to exercise greater control over how the content is handled. In some situations, this approach may be preferable to using other API libraries.

We will demonstrate how to download the content of Wikipedia's data science page using this class. We start with a try/catch block to handle exceptions. A URL object is created using the data science URL string. The openConnection method will create a connection to the Wikipedia server as shown here:

    try { 
        URL url = new URL( 
            "https://en.wikipedia.org/wiki/Data_science"); 
        HttpURLConnection connection = (HttpURLConnection)  
            url.openConnection(); 
       ... 
    } catch (MalformedURLException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

The connection object is initialized with an HTTP GET command. The connect method is then executed to connect to the server:

    connection.setRequestMethod("GET"); 
    connection.connect(); 

Assuming no errors were encountered, we can determine whether the response was successful using the getResponseCode method. A normal return value is 200. The content of a web page can vary. For example, the getContentType method returns a string describing the page's content. The getContentLength method returns its length:

    out.println("Response Code: " + connection.getResponseCode()); 
    out.println("Content Type: " + connection.getContentType()); 
    out.println("Content Length: " + connection.getContentLength()); 

Assuming that we get an HTML formatted page, the next sequence illustrates how to get this content. A BufferedReader instance is created where one line at a time is read in from the web site and appended to a BufferedReader instance. The buffer is then displayed:

    InputStreamReader isr = new InputStreamReader((InputStream)  
        connection.getContent()); 
    BufferedReader br = new BufferedReader(isr); 
    StringBuilder buffer = new StringBuilder(); 
    String line; 
    do { 
        line = br.readLine(); 
        buffer.append(line + "\n"); 
    } while (line != null); 
    out.println(buffer.toString()); 

The abbreviated output is shown here:

Response Code: 200 
Content Type: text/html; charset=UTF-8 
Content Length: -1
<!DOCTYPE html> 
<html lang="en" dir="ltr" class="client-nojs"> 
<head> 
<meta charset="UTF-8"/>
<title>Data science - Wikipedia, the free encyclopedia</title> 
<script>document.documentElement.className =
...
"wgHostname":"mw1251"});});</script> 
</body>
</html>

While this is feasible, there are easier methods for getting the contents of a web page. One of these techniques is discussed in the next section.

Web crawlers in Java

Web crawling is the process of traversing a series of interconnected web pages and extracting relevant information from those pages. It does this by isolating and then following links on a page. While there are many precompiled datasets readily available, it may still be necessary to collect data directly off the Internet. Some sources such as news sites are continually being updated and need to be revisited from time to time.

A web crawler is an application that visits various sites and collects information. The web crawling process consists of a series of steps:

  1. Select a URL to visit
  2. Fetch the page
  3. Parse the page
  4. Extract relevant content
  5. Extract relevant URLs to visit

This process is repeated for each URL visited.

There are several issues that need to be considered when fetching and parsing a page such as:

  • Page importance: We do not want to process irrelevant pages.
  • Exclusively HTML: We will not normally follow links to images, for example.
  • Spider traps: We want to bypass sites that may result in an infinite number of requests. This can occur with dynamically generated pages where one request leads to another.
  • Repetition: It is important to avoid crawling the same page more than once.
  • Politeness: Do not make an excessive number of requests to a website. Observe the robot.txt files; they specify which parts of a site should not be crawled.

The process of creating a web crawler can be daunting. For all but the simplest needs, it is recommended that one of several open source web crawlers be used, such as crawler4j, which is demonstrated later in this chapter.

We can either create our own web crawler or use an existing crawler, and in this chapter we will examine both approaches. For specialized processing, it can be desirable to use a custom crawler. We will demonstrate how to create a simple web crawler in Java to provide more insight into how web crawlers work. This will be followed by a brief discussion of other web crawlers.

Creating your own web crawler

Now that we have a basic understanding of web crawlers, we are ready to create our own. In this simple web crawler, we will keep track of the pages visited using ArrayList instances. In addition, jsoup will be used to parse a web page and we will limit the number of pages we visit. Jsoup (https://jsoup.org/) is an open source HTML parser. This example demonstrates the basic structure of a web crawler and also highlights some of the issues involved in creating a web crawler.

We will use the SimpleWebCrawler class, as declared here:

public class SimpleWebCrawler { 
 
    private String topic; 
    private String startingURL; 
    private String urlLimiter; 
    private final int pageLimit = 20; 
    private ArrayList<String> visitedList = new ArrayList<>(); 
    private ArrayList<String> pageList = new ArrayList<>(); 
    ... 
    public static void main(String[] args) { 
        new SimpleWebCrawler(); 
    } 
 
} 

The instance variables are detailed here:

  • topic: The keyword that needs to be in a page for the page to be accepted
  • startingURL: The URL of the first page
  • urlLimiter: A string that must be contained in a link before it will be followed
  • pageLimit: The maximum number of pages to retrieve
  • visitedList: An ArrayList containing pages that have already been visited
  • pageList: An ArrayList containing the URLs of the pages of interest

In the SimpleWebCrawler constructor, we initialize the instance variables to begin the search from the Wikipedia page for Bishop Rock, an islet in the Isles of Scilly, off the southwestern coast of England. This was chosen to minimize the number of pages that might be retrieved. As we will see, there are many more Wikipedia pages dealing with Bishop Rock than one might think.

The urlLimiter variable is set to Bishop_Rock, which will restrict the embedded links to follow to just those containing that string. Each page of interest must contain the value stored in the topic variable. The visitPage method performs the actual crawl:

    public SimpleWebCrawler() { 
        startingURL = "https://en.wikipedia.org/wiki/Bishop_Rock,"  
            + "_Isles_of_Scilly"; 
        urlLimiter = "Bishop_Rock"; 
        topic = "shipping route"; 
        visitPage(startingURL); 
    } 

In the visitPage method, the pageList ArrayList is checked to see whether the maximum number of accepted pages has been reached. If it has, then the search terminates:

    public void visitPage(String url) { 
        if (pageList.size() >= pageLimit) { 
            return; 
        } 
       ... 
    } 

If the page has already been visited, then we ignore it. Otherwise, it is added to the visited list:

    if (visitedList.contains(url)) { 
        // URL already visited 
    } else { 
        visitedList.add(url); 
            ... 
    } 

Jsoup is used to parse the page and return a Document object. There are many different exceptions and problems that can occur, such as a malformed URL, retrieval timeouts, or simply bad links. The catch block needs to handle these types of problems. We will provide a more in-depth explanation of jsoup in the Web scraping in Java section:

    try { 
        Document doc = Jsoup.connect(url).get(); 
        ... 
    } catch (Exception ex) { 
        // Handle exceptions 
    } 

If the document contains the topic text, then the link is displayed and added to the pageList ArrayList. Each embedded link is obtained, and if the link contains the limiting text, then the visitPage method is called recursively:

    if (doc.text().contains(topic)) { 
        out.println((pageList.size() + 1) + ": [" + url + "]"); 
        pageList.add(url); 
 
        // Process page links 
        Elements questions = doc.select("a[href]"); 
        for (Element link : questions) { 
            if (link.attr("href").contains(urlLimiter)) { 
                visitPage(link.attr("abs:href")); 
            } 
        } 
    } 

This approach only examines links in those pages that contain the topic text. Moving the for loop outside of the if statement will test the links for all pages.
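For reference, that variant looks like this, using the same doc, topic, and urlLimiter variables:

    // Record only pages containing the topic text 
    if (doc.text().contains(topic)) { 
        out.println((pageList.size() + 1) + ": [" + url + "]"); 
        pageList.add(url); 
    } 
    // Follow links for all pages, not just matching ones 
    Elements questions = doc.select("a[href]"); 
    for (Element link : questions) { 
        if (link.attr("href").contains(urlLimiter)) { 
            visitPage(link.attr("abs:href")); 
        } 
    } 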

The output follows:

1: [https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly]
2: [https://en.wikipedia.org/wiki/Bishop_Rock_Lighthouse]
3: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=717634231#Lighthouse]
4: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=717634231]
5: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716622943]
6: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716622943]
7: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716608512]
8: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716608512]
...
20: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716603919]

In this example, we did not save the results of the crawl in an external source. Normally this is necessary and can be stored in a file or database.
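For instance, a minimal sketch for saving the collected URLs to a text file might use the java.nio.file classes; the results.txt filename is an arbitrary choice:

    // Write one collected URL per line; results.txt is an arbitrary name 
    try { 
        Files.write(Paths.get("results.txt"), pageList); 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 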

Using the crawler4j web crawler

Here we will illustrate the use of the crawler4j (https://github.com/yasserg/crawler4j) web crawler. We will use an adapted version of the basic crawler found at https://github.com/yasserg/crawler4j/tree/master/src/test/java/edu/uci/ics/crawler4j/examples/basic. We will create two classes: CrawlerController and SampleCrawler. The former class sets up the crawler while the latter contains the logic that controls what pages will be processed.

As with our previous crawler, we will crawl the Wikipedia article dealing with Bishop Rock. The results using this crawler will be smaller as many extraneous pages are ignored.

Let's look at the CrawlerController class first. There are several parameters that are used with the crawler as detailed here:

  • Crawl storage folder: The location where crawl data is stored
  • Number of crawlers: This controls the number of threads used for the crawl
  • Politeness delay: How many milliseconds to pause between requests
  • Crawl depth: How deep the crawl will go
  • Maximum number of pages to fetch: How many pages to fetch
  • Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController { 
 
  public static void main(String[] args) throws Exception { 
    int numberOfCrawlers = 2; 
    CrawlConfig config = new CrawlConfig(); 
    String crawlStorageFolder = "data"; 
     
    config.setCrawlStorageFolder(crawlStorageFolder); 
    config.setPolitenessDelay(500); 
    config.setMaxDepthOfCrawling(2); 
    config.setMaxPagesToFetch(20); 
    config.setIncludeBinaryContentInCrawling(false); 
    ... 
  } 
}

Next, the CrawlController class is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes used to handle robots.txt files. These files contain instructions that are intended to be read by a web crawler. They provide direction to help a crawler do a better job, such as specifying which parts of a site should not be crawled. This is useful for auto-generated pages:

    PageFetcher pageFetcher = new PageFetcher(config); 
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
    RobotstxtServer robotstxtServer =  
        new RobotstxtServer(robotstxtConfig, pageFetcher); 
    CrawlController controller =  
        new CrawlController(config, pageFetcher, robotstxtServer); 

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

    controller.addSeed( 
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 

The start method will begin the crawling process:

    controller.start(SampleCrawler.class, numberOfCrawlers); 

The SampleCrawler class contains two methods of interest: the shouldVisit method, which determines whether a page will be visited, and the visit method, which actually handles the page. We start with the class declaration and the declaration of a Java regular expression class Pattern object. It will be one way of determining whether a page will be visited. In this declaration, standard image extensions are specified and will be ignored:

    public class SampleCrawler extends WebCrawler { 
        private static final Pattern IMAGE_EXTENSIONS =  
            Pattern.compile(".*\\.(bmp|gif|jpg|png)$"); 
 
        ... 
    } 

The shouldVisit method is passed a reference to the page where this URL was found, along with the URL. If the URL matches any of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/. We added this to restrict our searches to the Wikipedia website:

    public boolean shouldVisit(Page referringPage, WebURL url) { 
        String href = url.getURL().toLowerCase(); 
        if (IMAGE_EXTENSIONS.matcher(href).matches()) { 
            return false; 
        } 
        return href.startsWith("https://en.wikipedia.org/wiki/"); 
    }

The visit method is passed a Page object representing the page being visited. In this implementation, only those pages containing the string shipping route will be processed. This further restricts the pages visited. When we find such a page, its URL, Text, and Text length are displayed:

    public void visit(Page page) { 
        String url = page.getWebURL().getURL(); 
 
        if (page.getParseData() instanceof HtmlParseData) { 
            HtmlParseData htmlParseData =  
                (HtmlParseData) page.getParseData(); 
            String text = htmlParseData.getText(); 
            if (text.contains("shipping route")) { 
                out.println("\nURL: " + url); 
                out.println("Text: " + text); 
                out.println("Text length: " + text.length()); 
            } 
        } 
    } 

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)
...
Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page.

We could perform further processing, but this example provides some insight into how the API works. Significant amounts of information can be obtained when visiting a page. In the example, we only used the URL and the length of the text. The following is a sample of other data that you may be interested in obtaining; a sketch of how to access it follows the list:

  • URL path
  • Parent URL
  • Anchor
  • HTML text
  • Outgoing links
  • Document ID
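The following sketch suggests how this data might be accessed inside the visit method shown earlier. The accessor names are based on the crawler4j WebURL and HtmlParseData classes, so check the crawler4j Javadocs for your version:

    // Inside visit(Page page), after the HtmlParseData cast 
    WebURL webUrl = page.getWebURL(); 
    out.println("URL path: " + webUrl.getPath()); 
    out.println("Parent URL: " + webUrl.getParentUrl()); 
    out.println("Anchor: " + webUrl.getAnchor()); 
    out.println("Document ID: " + webUrl.getDocid()); 
    out.println("HTML length: " + htmlParseData.getHtml().length()); 
    out.println("Outgoing links: "  
        + htmlParseData.getOutgoingUrls().size()); 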

Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.
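As a brief illustration of the clean-up capability, the Jsoup.clean method sanitizes untrusted HTML against a whitelist of allowed tags. The Whitelist class name used here matches the jsoup releases contemporary with this example; later releases renamed it Safelist:

    // Strip disallowed tags (the script element and its contents are removed) 
    String unsafe =  
        "<p>The body <script>alert('x');</script>of the document</p>"; 
    String safe = Jsoup.clean(unsafe, Whitelist.basic()); 
    out.println(safe); 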

There are several ways of obtaining an HTML document that may be useful. The HTML document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
     } catch (IOException ex) { 
        // Handle exception 
    } 

Loading from a file uses the File class as shown next. The overloaded parse method uses the file to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence where the parse method processes the string that duplicates the previous HTML file:

    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href="https://en.wikipedia.org/wiki/Data_science">" + 
          "DataScience</a>\n"
        + "<br>\n" 
        + "<a href="https://en.wikipedia.org/wiki/Jsoup">" + 
          "Jsoup</a>\n"
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + " <img src="eyechart.jpg" alt="Eye Chart"> \n"
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method uses a string specifying the element of a document to retrieve:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The parameter type of the select method is a string. By using a string, the type of information selected is easily changed. Details on how to formulate this string are found at the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/:
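For example, each of the following selector strings retrieves a different kind of element:

    Elements paragraphs = document.select("p");          // All paragraphs 
    Elements secure = document.select("a[href^=https]"); // Links starting with https 
    Elements headings = document.select("h1, h2");       // Level 1 and 2 headings 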

We can use the select method to retrieve the images in a document, as shown here:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

jsoup possesses many additional capabilities. However, this example demonstrates the web scraping process. There are also other Java HTML parsers available. A comparison of Java HTML parsers, among others, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Social media contain a wealth of information that can be processed and is used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion on the OAuth class, which provides one approach to authenticating access to a data source.

When working with these types of data sources, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites have limits on the number of requests that can be made. Keep this in mind when pulling data from a site. If these limits need to be exceeded, then most sites provide a way of doing this.

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth 1.0 and is not backwards compatible. It gives client developers a simple way of providing authentication. Several companies use OAuth 2.0, including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handling Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limited in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
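As one illustration of a possible processing step, the following sketch pulls the text field out of a received message. It assumes the org.json library is available, which is an added dependency, not part of the original example:

    // Inside the read loop, in place of out.println(msg): parse the 
    // JSON message with org.json and extract the tweet text 
    JSONObject json = new JSONObject(msg); 
    out.println("Tweet text: " + json.optString("text")); 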

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON files.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site, as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki, we will use Bliki, found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List<Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The Flickr class's getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter. The second parameter determines how many photos are retrieved per page and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then use its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.
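That is, reusing ImageIO with an arbitrary filename such as small_image.jpg:

    File smallFile = new File("small_image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", smallFile); 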

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The YouTube class's setApplicationName method gives the application a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies that the id and snippet portions of the search result are to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.


We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access secret token. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON files.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your request.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs is found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter. The second parameter determines how many photos are retrieved per page and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then its getUrl to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analysis and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API is found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the APIs responses will be in the form of JSON objects. The YouTube class' setApplicationName method gives it a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
        ... 
    } catch (GoogleJSONResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Creating your own web crawler

Now that we have a basic understanding of web crawlers, we are ready to create our own. In this simple web crawler, we will keep track of the pages visited using ArrayList instances. In addition, jsoup will be used to parse a web page and we will limit the number of pages we visit. Jsoup (https://jsoup.org/) is an open source HTML parser. This example demonstrates the basic structure of a web crawler and also highlights some of the issues involved in creating a web crawler.

We will use the SimpleWebCrawler class, as declared here:

public class SimpleWebCrawler { 
 
    private String topic; 
    private String startingURL; 
    private String urlLimiter; 
    private final int pageLimit = 20; 
    private ArrayList<String> visitedList = new ArrayList<>(); 
    private ArrayList<String> pageList = new ArrayList<>(); 
    ... 
    public static void main(String[] args) { 
        new SimpleWebCrawler(); 
    } 
 
} 

The instance variables are detailed here:

  • topic: The keyword that needs to be in a page for the page to be accepted
  • startingURL: The URL of the first page
  • urlLimiter: A string that must be contained in a link before it will be followed
  • pageLimit: The maximum number of pages to retrieve
  • visitedList: The ArrayList containing pages that have already been visited
  • pageList: An ArrayList containing the URLs of the pages of interest

In the SimpleWebCrawler constructor, we initialize the instance variables to begin the search from the Wikipedia page for Bishop Rock, an islet in the Isles of Scilly off the southwest coast of England. This was chosen to minimize the number of pages that might be retrieved. As we will see, there are many more Wikipedia pages dealing with Bishop Rock than one might think.

The urlLimiter variable is set to Bishop_Rock, which restricts the crawler to following only those embedded links that contain this string. Each page of interest must contain the value stored in the topic variable. The visitPage method performs the actual crawl:

    public SimpleWebCrawler() { 
        startingURL = "https://en.wikipedia.org/wiki/Bishop_Rock,"
            + "_Isles_of_Scilly"; 
        urlLimiter = "Bishop_Rock"; 
        topic = "shipping route"; 
        visitPage(startingURL); 
    } 

In the visitPage method, the pageList ArrayList is checked to see whether the maximum number of accepted pages has been reached. If it has, then the search terminates:

    public void visitPage(String url) { 
        if (pageList.size() >= pageLimit) { 
            return; 
        } 
       ... 
    } 

If the page has already been visited, then we ignore it. Otherwise, it is added to the visited list:

    if (visitedList.contains(url)) { 
        // URL already visited 
    } else { 
        visitedList.add(url); 
            ... 
    } 

Jsoup is used to parse the page and return a Document object. There are many different exceptions and problems that can occur, such as a malformed URL, retrieval timeouts, or simply bad links. The catch block needs to handle these types of problems. A more in-depth explanation of jsoup is provided in the Web scraping in Java section that follows:

    try { 
        Document doc = Jsoup.connect(url).get(); 
        ... 
    } catch (Exception ex) { 
        // Handle exceptions 
    } 

If the document contains the topic text, then the link is displayed and added to the pageList ArrayList. Each embedded link is obtained, and if the link contains the limiting text, then the visitPage method is called recursively:

    if (doc.text().contains(topic)) { 
        out.println((pageList.size() + 1) + ": [" + url + "]"); 
        pageList.add(url); 
 
        // Process page links 
        Elements questions = doc.select("a[href]"); 
        for (Element link : questions) { 
            if (link.attr("href").contains(urlLimiter)) { 
                visitPage(link.attr("abs:href")); 
            } 
        } 
    } 

This approach only examines links in those pages that contain the topic text. Moving the for loop outside of the if statement will test the links for all pages.

The output follows:

1: [https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly]
2: [https://en.wikipedia.org/wiki/Bishop_Rock_Lighthouse]
3: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=717634231#Lighthouse]
4: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=717634231]
5: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716622943]
6: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716622943]
7: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716608512]
8: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716608512]
...
20: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716603919]

In this example, we did not save the results of the crawl to an external source. Normally this is necessary; the results can be stored in a file or a database.
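
As a minimal sketch of that step, the collected URLs could be written to a text file at the end of the crawl using the standard java.nio.file API; the results.txt filename here is illustrative:

    try { 
        // Requires java.nio.file.Files and java.nio.file.Paths; 
        // pageList is the field from the SimpleWebCrawler class above 
        Files.write(Paths.get("results.txt"), pageList); 
    } catch (IOException ex) { 
        // Handle exception 
    } 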

Using the crawler4j web crawler

Here we will illustrate the use of the crawler4j (https://github.com/yasserg/crawler4j) web crawler. We will use an adapted version of the basic crawler found at https://github.com/yasserg/crawler4j/tree/master/src/test/java/edu/uci/ics/crawler4j/examples/basic. We will create two classes: CrawlerController and SampleCrawler. The former class sets up the crawler, while the latter contains the logic that controls which pages will be processed.

As with our previous crawler, we will crawl the Wikipedia article dealing with Bishop Rock. The results using this crawler will be smaller as many extraneous pages are ignored.

Let's look at the CrawlerController class first. There are several parameters that are used with the crawler as detailed here:

  • Crawl storage folder: The location where crawl data is stored
  • Number of crawlers: This controls the number of threads used for the crawl
  • Politeness delay: The delay, in milliseconds, to pause between requests
  • Crawl depth: How deep the crawl will go
  • Maximum number of pages to fetch: The upper limit on the number of pages retrieved
  • Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController { 
 
  public static void main(String[] args) throws Exception { 
    int numberOfCrawlers = 2; 
    CrawlConfig config = new CrawlConfig(); 
    String crawlStorageFolder = "data"; 
     
    config.setCrawlStorageFolder(crawlStorageFolder); 
    config.setPolitenessDelay(500); 
    config.setMaxDepthOfCrawling(2); 
    config.setMaxPagesToFetch(20); 
    config.setIncludeBinaryContentInCrawling(false); 
    ... 
  } 
}

Next, the CrawlController class is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes used to handle robots.txt files. These files contain instructions that are intended to be read by a web crawler. They provide directions to help a crawler do a better job, such as specifying which parts of a site should not be crawled. This is useful for auto-generated pages:

    PageFetcher pageFetcher = new PageFetcher(config); 
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
    RobotstxtServer robotstxtServer =  
        new RobotstxtServer(robotstxtConfig, pageFetcher); 
    CrawlController controller =  
        new CrawlController(config, pageFetcher, robotstxtServer); 

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

    controller.addSeed( 
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 

The start method will begin the crawling process:

    controller.start(SampleCrawler.class, numberOfCrawlers); 
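
Note that start blocks until the crawl completes. If the crawl should run in the background instead, crawler4j also provides a non-blocking variant; the following is a sketch assuming the same controller instance:

    controller.startNonBlocking(SampleCrawler.class, numberOfCrawlers); 
    // Perform other work while the crawl proceeds 
    controller.waitUntilFinish(); 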

The SampleCrawler class contains two methods of interest. The first is the shouldVisit method, which determines whether a page will be visited; the second is the visit method, which actually handles the page. We start with the class declaration and the declaration of a Java regular expression Pattern object. It will be one way of determining whether a page will be visited. In this declaration, standard image extensions are specified and will be ignored:

    public class SampleCrawler extends WebCrawler { 
        private static final Pattern IMAGE_EXTENSIONS =  
            Pattern.compile(".*\\.(bmp|gif|jpg|png)$"); 
 
        ... 
    } 

The shouldVisit method is passed a reference to the page where this URL was found, along with the URL itself. If the URL matches one of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/. We added this to restrict our searches to the Wikipedia website:

    public boolean shouldVisit(Page referringPage, WebURL url) { 
        String href = url.getURL().toLowerCase(); 
        if (IMAGE_EXTENSIONS.matcher(href).matches()) { 
            return false; 
        } 
        return href.startsWith("https://en.wikipedia.org/wiki/"); 
    }

The visit method is passed a Page object representing the page being visited. In this implementation, only those pages containing the string shipping route will be processed. This further restricts the pages visited. When we find such a page, its URL, text, and text length are displayed:

    public void visit(Page page) { 
        String url = page.getWebURL().getURL(); 
 
        if (page.getParseData() instanceof HtmlParseData) { 
            HtmlParseData htmlParseData =  
                (HtmlParseData) page.getParseData(); 
            String text = htmlParseData.getText(); 
            if (text.contains("shipping route")) { 
                out.println("\nURL: " + url); 
                out.println("Text: " + text); 
                out.println("Text length: " + text.length()); 
            } 
        } 
    } 

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)
...
Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page.

We could perform further processing, but this example provides some insight into how the API works. Significant amounts of information can be obtained when visiting a page. In the example, we only used the URL and the length of the text. The following is a sample of other data that you may be interested in obtaining, as illustrated in the sketch after this list:

  • URL path
  • Parent URL
  • Anchor
  • HTML text
  • Outgoing links
  • Document ID
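
Most of these items are available through the Page and HtmlParseData objects already used in the visit method. The following sketch, which assumes those same objects, shows where each can be obtained:

    WebURL webUrl = page.getWebURL(); 
    out.println("URL path: " + webUrl.getPath()); 
    out.println("Parent URL: " + webUrl.getParentUrl()); 
    out.println("Anchor: " + webUrl.getAnchor()); 
    out.println("Document ID: " + webUrl.getDocid()); 
    out.println("HTML text length: " + htmlParseData.getHtml().length()); 
    out.println("Outgoing links: " + htmlParseData.getOutgoingUrls().size()); 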

Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.

There are several ways of obtaining an HTML document. The document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

Loading from a file uses the File class as shown next. The overloaded parse method takes the file, the character set used to read it, and a base URI for resolving relative links, and uses these to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence where the parse method processes the string that duplicates the previous HTML file:

    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Data_science\">" 
        + "Data Science</a>\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Jsoup\">" 
        + "Jsoup</a>\n" 
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + "<img src=\"eyechart.jpg\" alt=\"Eye Chart\">\n" 
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method takes a CSS-like selector string specifying the elements of the document to retrieve:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The parameter type of the select method is a string. By using a string, the type of information selected is easily changed. Details on how to formulate this string are found in the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/.
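
A few common selector forms follow as a short sketch; the element names and attribute values here are illustrative:

    Elements paragraphs = document.select("p");            // all <p> elements 
    Elements secure = document.select("a[href^=https]");   // links whose href starts with https 
    Elements headings = document.select("h1, h2");         // multiple element types at once 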

We can use the select method to retrieve the images in a document whose source file ends with .png, as shown here:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>
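
The src values above are protocol-relative. When a document is loaded from a URL, jsoup records its base URI, so an absolute form can be requested with the abs: attribute prefix, as in this short sketch:

    for (Element image : images) { 
        out.println("Absolute src: " + image.attr("abs:src")); 
    } 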

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

jsoup possesses many additional capabilities. However, this example demonstrates the web scraping process. There are also other Java HTML parsers available. A comparison of Java HTML parsers, among others, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.
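
One of those capabilities is the HTML cleanup mentioned earlier. As a brief sketch, the Jsoup.clean method reduces untrusted markup to a safe subset of tags defined by the Whitelist class (renamed Safelist in newer jsoup releases); the sample markup here is illustrative:

    String dirtyHtml = 
        "<p>Hello <script>alert('x');</script>world</p>"; 
    String cleanHtml = Jsoup.clean(dirtyHtml, Whitelist.basic()); 
    out.println(cleanHtml); 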

Using API calls to access common social media sites

Social media sites contain a wealth of information that can be processed and used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion on the OAuth class, which provides one approach to authenticating access to a data source.

When working with these types of data sources, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites have limits on the number of requests that can be made. Keep this in mind when pulling data from a site. If these limits need to be exceeded, then most sites provide a way of doing this.

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth and is not backwards compatible. It gives client developers a simple way of handling authentication. Several companies use OAuth 2.0, including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handling Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limited in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user, as well as posting data to a specific account, but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is a truncated sample of data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.
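
If the stream should be refined to specific keywords rather than a random sample, HBC's StatusesFilterEndpoint can replace the sample endpoint used earlier. The following is a sketch with illustrative track terms, assuming the same ClientBuilder setup:

    StatusesFilterEndpoint filterEndpoint = new StatusesFilterEndpoint(); 
    filterEndpoint.trackTerms( 
        Arrays.asList("data science", "machine learning")); 
    filterEndpoint.stallWarnings(false); 
    // Pass filterEndpoint to the ClientBuilder's endpoint method as before 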

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki-style sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application using this API can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...
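
As an aside, Bliki can also render a raw wikitext string directly. This sketch assumes the static WikiModel.toHtml helper and an illustrative snippet of wiki markup:

    String wikiText = 
        "This is '''bold''' text with a [[Data science]] link."; 
    out.println(WikiModel.toHtml(wikiText)); 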

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List<Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 
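
The search can be narrowed further; for example, tag and free-text criteria can be added as in this sketch, where the values are illustrative:

    searchParameters.setTags(new String[]{"lighthouse", "coast"}); 
    searchParameters.setText("sunset"); 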

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The Flickr class's getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter of the search method. The second parameter determines how many photos are retrieved per page, and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.
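
For instance, a minimal sketch that writes the smaller image to its own file (the image_small.jpg filename is our own choice):

    // Save the SMALL-sized image under a separate name 
    File smallFile = new File("image_small.jpg"); 
    ImageIO.write(bufferedImage, "jpg", smallFile); 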

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze its content and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The builder's setApplicationName method gives the application a name, and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
                .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method returns a YouTube.Search object, whose list method specifies the resource parts to be returned. In this case, the string specifies that the id and snippet portions of the search result are to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.
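
As one example of an omitted check, the result list may be null or empty if nothing matched the query, in which case the iterator().next() call used earlier would fail. A minimal guard might look like this sketch:

    // Guard against an empty result before touching the iterator 
    if (searchResultList == null || searchResultList.isEmpty()) { 
        out.println("No videos matched the query"); 
        return; // assumes this code runs inside a void method 
    } 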

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 
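
If the search example shown earlier has already been run, the ID of the video it returned can be substituted into the URL. A short sketch, assuming the video variable from that example is still in scope:

    // Build the watch URL from the ID returned by the earlier search 
    String videoId = video.getId().getVideoId(); 
    VGet vget = new VGet(new URL( 
        "http://www.youtube.com/watch?v=" + videoId), new File(".")); 
    vget.download(); 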

There are other more sophisticated download techniques found at the GitHub site.

Using the crawler4j web crawler

Here we will illustrate the use of the crawler4j (https://github.com/yasserg/crawler4j) web crawler. We will use an adapted version of the basic crawler found at https://github.com/yasserg/crawler4j/tree/master/src/test/java/edu/uci/ics/crawler4j/examples/basic. We will create two classes: CrawlerController and SampleCrawler. The former class sets up the crawler, while the latter contains the logic that controls which pages will be processed.

As with our previous crawler, we will crawl the Wikipedia article dealing with Bishop Rock. The results using this crawler will be smaller as many extraneous pages are ignored.

Let's look at the CrawlerController class first. There are several parameters that are used with the crawler as detailed here:

  • Crawl storage folder: The location where crawl data is stored
  • Number of crawlers: This controls the number of threads used for the crawl
  • Politeness delay: How many milliseconds to pause between requests
  • Crawl depth: How deep the crawl will go
  • Maximum number of pages to fetch: How many pages to fetch
  • Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController { 
 
  public static void main(String[] args) throws Exception { 
    int numberOfCrawlers = 2; 
    CrawlConfig config = new CrawlConfig(); 
    String crawlStorageFolder = "data"; 
     
    config.setCrawlStorageFolder(crawlStorageFolder); 
    config.setPolitenessDelay(500); 
    config.setMaxDepthOfCrawling(2); 
    config.setMaxPagesToFetch(20); 
    config.setIncludeBinaryContentInCrawling(false); 
    ... 
  } 
}

Next, the CrawlController class is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes used to handle robots.txt files. These files contain instructions that are intended to be read by a web crawler. They provide direction to help a crawler do a better job, such as specifying which parts of a site should not be crawled. This is useful for auto-generated pages:

    PageFetcher pageFetcher = new PageFetcher(config); 
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
    RobotstxtServer robotstxtServer =  
        new RobotstxtServer(robotstxtConfig, pageFetcher); 
    CrawlController controller =  
        new CrawlController(config, pageFetcher, robotstxtServer); 

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

    controller.addSeed( 
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 

The start method will begin the crawling process:

    controller.start(SampleCrawler.class, numberOfCrawlers); 

The SampleCrawler class contains two methods of interest: the shouldVisit method, which determines whether a page will be visited, and the visit method, which actually handles the page. We start with the class declaration and the declaration of a Java regular expression Pattern object. It provides one way of determining whether a page will be visited. In this declaration, standard image extensions are specified; URLs ending with these extensions will be ignored:

    public class SampleCrawler extends WebCrawler { 
        private static final Pattern IMAGE_EXTENSIONS =  
            Pattern.compile(".*\\.(bmp|gif|jpg|png)$"); 
 
        ... 
    } 

The shouldVisit method is passed a reference to the page where this URL was found, along with the URL. If the URL matches any of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/. We added this to restrict our searches to the Wikipedia website:

    public boolean shouldVisit(Page referringPage, WebURL url) { 
        String href = url.getURL().toLowerCase(); 
        if (IMAGE_EXTENSIONS.matcher(href).matches()) { 
            return false; 
        } 
        return href.startsWith("https://en.wikipedia.org/wiki/"); 
    }

The visit method is passed a Page object representing the page being visited. In this implementation, only those pages containing the string shipping route will be processed. This further restricts the pages visited. When we find such a page, its URL, text, and text length are displayed:

    public void visit(Page page) { 
        String url = page.getWebURL().getURL(); 
 
        if (page.getParseData() instanceof HtmlParseData) { 
            HtmlParseData htmlParseData =  
                (HtmlParseData) page.getParseData(); 
            String text = htmlParseData.getText(); 
            if (text.contains("shipping route")) { 
                out.println("\nURL: " + url); 
                out.println("Text: " + text); 
                out.println("Text length: " + text.length()); 
            } 
        } 
    } 

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)
...
Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page.

We could perform further processing, but this example provides some insight into how the API works. Significant amounts of information can be obtained when visiting a page. In the example, we only used the URL and the length of the text. The following is a sample of other data that you may be interested in obtaining, as illustrated in the sketch after this list:

  • URL path
  • Parent URL
  • Anchor
  • HTML text
  • Outgoing links
  • Document ID
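
The following is a minimal sketch of how these items might be accessed inside the visit method. The getter names reflect the crawler4j API, but it is worth verifying them against the version of the library you are using:

    public void visit(Page page) { 
        WebURL webUrl = page.getWebURL(); 
        out.println("URL path: " + webUrl.getPath()); 
        out.println("Parent URL: " + webUrl.getParentUrl()); 
        out.println("Anchor: " + webUrl.getAnchor()); 
        out.println("Document ID: " + webUrl.getDocid()); 
 
        if (page.getParseData() instanceof HtmlParseData) { 
            HtmlParseData htmlParseData = 
                (HtmlParseData) page.getParseData(); 
            // The raw HTML and the links leaving this page 
            out.println("HTML length: " 
                + htmlParseData.getHtml().length()); 
            out.println("Outgoing links: " 
                + htmlParseData.getOutgoingUrls().size()); 
        } 
    } 
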
Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.

There are several ways of obtaining an HTML document. The document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next, where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
     } catch (IOException ex) { 
        // Handle exception 
    } 
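
Some sites reject requests that do not appear to come from a browser, or respond slowly. The connect method returns a Connection object whose settings can be chained before get is invoked; in this sketch, the user agent string and the five-second timeout are illustrative values:

    // Optional connection settings; the values here are examples 
    Document document = Jsoup.connect( 
        "https://en.wikipedia.org/wiki/Data_science") 
        .userAgent("Mozilla/5.0") 
        .timeout(5000) 
        .get(); 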

Loading from a file uses the File class as shown next. The overloaded parse method uses the file to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence, where the parse method processes a string that duplicates the previous HTML file:

    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href="https://en.wikipedia.org/wiki/Data_science">" + 
          "DataScience</a>\n"
        + "<br>\n" 
        + "<a href="https://en.wikipedia.org/wiki/Jsoup">" + 
          "Jsoup</a>\n"
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + " <img src="eyechart.jpg" alt="Eye Chart"> \n"
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method uses a string specifying the element of a document to retrieve:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The parameter type of the select method is a string. By using a string, the type of information selected is easily changed. Details on how to formulate this string are found in the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/.
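
For instance, the following sketch retrieves all paragraph elements, and all links whose href attribute starts with a given prefix; both selector strings are illustrative:

    // All <p> elements, and links pointing into the English Wikipedia 
    Elements paragraphs = document.select("p"); 
    Elements wikiLinks = 
        document.select("a[href^=https://en.wikipedia.org]"); 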

We can use the select method to retrieve the images in a document, as shown here:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

jsoup possesses many additional capabilities. However, this example demonstrates the web scraping process. There are also other Java HTML parsers available. A comparison of Java HTML parsers, among others, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Social media contain a wealth of information that can be processed and used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion on the OAuth class, which provides one approach to authenticating access to a data source.

When working with this type of data source, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites have limits on the number of requests that can be made. Keep this in mind when pulling data from a site. If these limits need to be exceeded, then most sites provide a way of doing this.

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth 1.0 and is not backwards compatible with it. It gives client developers a simple way of providing authentication. Several companies use OAuth 2.0, including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handling Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limited in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
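
If only part of each message is wanted, such as the tweet text, the JSON can be parsed with any JSON library. This sketch assumes the org.json library is on the classpath and would replace the out.println(msg) call in the else branch:

    // Parse the raw JSON message and pull out the tweet text 
    JSONObject status = new JSONObject(msg); 
    out.println("Tweet text: " + status.optString("text")); 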

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 
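
One common way to keep these credentials out of the source code is to read them from the environment. This sketch assumes the four values have been exported under the variable names shown, which are our own choice:

    // Read credentials from environment variables (names are ours) 
    String myKey = System.getenv("TWITTER_CONSUMER_KEY"); 
    String mySecret = System.getenv("TWITTER_CONSUMER_SECRET"); 
    String myToken = System.getenv("TWITTER_ACCESS_TOKEN"); 
    String myAccess = System.getenv("TWITTER_ACCESS_SECRET"); 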

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site, as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki, we will use Bliki, found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...
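
If plain text is preferred to HTML, Bliki also provides a PlainTextConverter class, in the info.bliki.wiki.filter package, that can be passed to an overloaded render method. A minimal sketch using the same page object:

    // Render the wiki markup to plain text instead of HTML 
    String plainText = wikiModel.render(new PlainTextConverter(), 
        page.toString()); 
    out.println(plainText); 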

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List<Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for (Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for (SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs is found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter. The second parameter determines how many photos are retrieved per page and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then its getUrl to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analysis and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API is found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the APIs responses will be in the form of JSON objects. The YouTube class' setApplicationName method gives it a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
        ... 
    } catch (GoogleJSONResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.

There are several ways of obtaining an HTML document that may be useful. The HTML document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
     } catch (IOException ex) { 
        // Handle exception 
    } 

Loading from a file uses the File class as shown next. The overloaded parse method uses the file to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence where the parse method processes the string that duplicates the previous HTML file:

    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href="https://en.wikipedia.org/wiki/Data_science">" + 
          "DataScience</a>\n"
        + "<br>\n" 
        + "<a href="https://en.wikipedia.org/wiki/Jsoup">" + 
          "Jsoup</a>\n"
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + " <img src="eyechart.jpg" alt="Eye Chart"> \n"
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method uses a string specifying the element of a document to retrieve:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The parameter type of the select method is a string. By using a string, the type of information selected is easily changed. Details on how to formulate this string are found at the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/:

We can use the select method to retrieve the images in a document, as shown here:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

jsoup possesses many additional capabilities. However, this example demonstrates the web scraping process. There are also other Java HTML parsers available. A comparison of Java HTML parser, among others, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Social media contain a wealth of information that can be processed and is used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion on the OAuth class, which provides one approach to authenticating access to a data source.

When working with the type of data source, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites have limits on the number of requests that can be made. Keep this in mind when pulling data from a site. If these limits need to be exceeded, then most sites provide a way of doing this.

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth and is not backwards compatible. It provides client developers a simple way of providing authentication. Several companies use OAuth 2.0 including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handing Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limiting in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access secret token. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON files.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...
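If plain text is more useful than HTML for later analysis, WikiModel also has an overloaded render method that accepts a converter. The following sketch assumes Bliki's PlainTextConverter class from the info.bliki.wiki.filter package:

    String plainText = wikiModel.render( 
        new PlainTextConverter(), page.toString()); 
    out.println(plainText); 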

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
        ... 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 
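The remaining snippets in this section belong inside this try block. They also assume imports along the following lines, based on the Flickr4Java packages:

    import com.flickr4java.flickr.Flickr; 
    import com.flickr4java.flickr.FlickrException; 
    import com.flickr4java.flickr.REST; 
    import com.flickr4java.flickr.photos.Photo; 
    import com.flickr4java.flickr.photos.PhotoList; 
    import com.flickr4java.flickr.photos.PhotosInterface; 
    import com.flickr4java.flickr.photos.SearchParameters; 
    import com.flickr4java.flickr.photos.Size; 
    import java.awt.image.BufferedImage; 
    import java.io.File; 
    import java.io.IOException; 
    import javax.imageio.ImageIO; 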

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses a SearchParameters instance to retrieve a list of photos. The Flickr class's getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter of the search method. The second parameter determines how many photos are retrieved per page, and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class's getUrl method in this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to obtain the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.
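For example, the smaller image can be written with the same ImageIO call; the filename here is arbitrary:

    ImageIO.write(bufferedImage, "jpg", new File("image_small.jpg")); 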

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze its content and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create a project in the Google Developers Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • HttpTransport: The object used for HTTP transport
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The builder's setApplicationName method names the application, and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 
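The snippets in this section assume imports such as the following. The Auth helper class, which supplies HTTP_TRANSPORT and JSON_FACTORY, comes from Google's YouTube sample code; the remaining locations are based on the Google API client libraries:

    import com.google.api.client.googleapis.json.GoogleJsonResponseException; 
    import com.google.api.client.http.HttpRequest; 
    import com.google.api.client.http.HttpRequestInitializer; 
    import com.google.api.services.youtube.YouTube; 
    import com.google.api.services.youtube.model.SearchListResponse; 
    import com.google.api.services.youtube.model.SearchResult; 
    import com.google.api.services.youtube.model.Thumbnail; 
    import java.io.IOException; 
    import java.util.List; 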

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class represents a request for a collection of search results. The YouTube class's search method creates the request, and the string passed to its list method specifies which portions of each search result to return. In this case, the id and snippet portions will be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned; the other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error-checking steps to simplify the example, but these should be considered when implementing this in a business application.
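One simple check guards against an empty result list before calling iterator().next(), which would otherwise throw a NoSuchElementException. A minimal sketch:

    if (searchResultList == null || searchResultList.isEmpty()) { 
        out.println("No videos found for: " + queryTerm); 
        return; 
    } 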

If we need to download the video, one of the simplest ways is to use the axet/wget library, found at https://github.com/axet/wget. It provides an easy-to-use technique to download a video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 
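As written, this snippet needs a surrounding try block and an import for the VGet class. A minimal runnable wrapper might look like the following sketch, which assumes the com.github.axet.vget.VGet class from the axet/wget project and a real video ID substituted for videoID:

    import com.github.axet.vget.VGet; 
    import java.io.File; 
    import java.net.URL; 

    public class VideoDownloadSketch { 
        public static void main(String[] args) { 
            try { 
                // Replace videoID with an actual YouTube video ID 
                String url = "http://www.youtube.com/watch?v=videoID"; 
                VGet vget = new VGet(new URL(url), new File(".")); 
                vget.download(); 
            } catch (Exception ex) { 
                // MalformedURLException and download failures end up here 
                System.out.println(ex); 
            } 
        } 
    } 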

There are other more sophisticated download techniques found at the GitHub site.

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON files.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your request.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs is found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter. The second parameter determines how many photos are retrieved per page and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then its getUrl to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analysis and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API is found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the APIs responses will be in the form of JSON objects. The YouTube class' setApplicationName method gives it a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
        ... 
    } catch (GoogleJSONResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth and is not backwards compatible. It provides client developers a simple way of providing authentication. Several companies use OAuth 2.0 including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handing Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limiting in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access secret token. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON files.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your request.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs is found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter. The second parameter determines how many photos are retrieved per page and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then its getUrl to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analysis and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API is found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the APIs responses will be in the form of JSON objects. The YouTube class' setApplicationName method gives it a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
            .setApplicationName("application_name") 
            .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class maintains a collection of search results. The YouTube class's search method creates a search request, and the string passed to its list method specifies the id and snippet portions of each search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

We further specify the fields to be returned, as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about it. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows, where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error-checking steps to simplify the example, but these should be considered when implementing this in a business application.
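
For instance, a minimal guard (our own addition, not part of the Google sample) prevents a NoSuchElementException when the search returns no results:

    if (searchResultList == null || searchResultList.isEmpty()) { 
        out.println("No videos found for: " + queryTerm); 
        return; 
    } 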

If we need to download the video, one of the simplest ways is to use axet/wget, found at https://github.com/axet/wget. It provides an easy-to-use technique for downloading a video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    // Replace videoID with an actual YouTube video ID 
    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 
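
To tie this to the earlier search example, the ID of the first search result can be substituted in. This sketch assumes the search code shown previously has already run and that its video variable is still in scope:

    String videoId = video.getId().getVideoId(); 
    VGet downloader = new VGet( 
        new URL("http://www.youtube.com/watch?v=" + videoId), 
        new File(".")); 
    downloader.download(); 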

There are other more sophisticated download techniques found at the GitHub site.

Handling Twitter

Twitter is a popular social media platform that allows users to read and post short messages called tweets. The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limited in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user, as well as for posting data to a specific account, but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and locations.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, HBC's implementation of OAuth 1.0a. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format, and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
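
As a minimal sketch of such processing, assuming the org.json library is on the classpath (our choice; any JSON parser would work), the tweet's language and text can be extracted from each message:

    // Uses org.json.JSONObject; optString returns "" if a key is absent 
    try { 
        JSONObject json = new JSONObject(msg); 
        out.println("[" + json.optString("lang") + "] " 
            + json.optString("text")); 
    } catch (JSONException e) { 
        out.println("Skipping malformed message"); 
    } 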

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 
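
One way to keep these keys out of source code (our suggestion; the environment variable names are our own) is to read them from the environment:

    String myKey = System.getenv("TWITTER_CONSUMER_KEY"); 
    String mySecret = System.getenv("TWITTER_CONSUMER_SECRET"); 
    String myToken = System.getenv("TWITTER_ACCESS_TOKEN"); 
    String myAccess = System.getenv("TWITTER_ACCESS_SECRET"); 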

Here is a truncated sample of the data retrieved using the method above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that powers Wikipedia and many other wiki sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application using this API can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site, as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki, we will use Bliki, found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a retrieved page, and several of its methods return that content. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...
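
Rather than printing it, the rendered HTML could equally be written to a file. This is our own addition using the standard java.nio API, and it would sit inside a try block that handles IOException:

    Files.write(Paths.get("DataScience.html"), 
        htmlText.getBytes(StandardCharsets.UTF_8)); 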

We can also obtain basic information about the article using one of several methods, as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List<Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for (Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for (SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs is found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter. The second parameter determines how many photos are retrieved per page and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then its getUrl to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analysis and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API is found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the APIs responses will be in the form of JSON objects. The YouTube class' setApplicationName method gives it a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
        ... 
    } catch (GoogleJSONResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your request.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs is found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter. The second parameter determines how many photos are retrieved per page and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then its getUrl to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analysis and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API is found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the APIs responses will be in the form of JSON objects. The YouTube class' setApplicationName method gives it a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
        ... 
    } catch (GoogleJSONResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs is found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter. The second parameter determines how many photos are retrieved per page and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            `"\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then its getUrl to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analysis and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API is found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the APIs responses will be in the form of JSON objects. The YouTube class' setApplicationName method gives it a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
        ... 
    } catch (GoogleJSONResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analysis and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • HttpTransport: The object used for the HTTP transport
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The builder's setApplicationName method gives the application a name, and its build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
            .setApplicationName("application_name") 
            .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class maintains a collection of search results. The YouTube class's search method creates a search request, and the string passed to its list method specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

We further specify the fields to be returned, as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about it. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows, where parts of the output have been modified. Note that Kind is null because the earlier call to setFields requested only the id and snippet parts, not the result's top-level kind field:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error-checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to substitute a real video ID for the videoID placeholder for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Summary

In this chapter, we discussed types of data that are useful for data science and readily accessible on the Internet. This discussion included details about file specifications and formats for the most common types of data sources.

We also examined Java APIs and other techniques for retrieving data, and illustrated this process with multiple sources. In particular, we focused on types of text-based document formats and multimedia files. We used web crawlers to access websites and then performed web scraping to retrieve data from the sites we encountered.

Finally, we extracted data from social media sites, including Twitter, Wikipedia, Flickr, and YouTube, and examined the available Java API support.

 

Chapter 3. Data Cleaning

Real-world data is frequently dirty and unstructured, and must be reworked before it is usable. Data may contain errors, have duplicate entries, exist in the wrong format, or be inconsistent. The process of addressing these types of issues is called data cleaning. Data cleaning is also referred to as data wrangling, massaging, reshaping, or munging. Data merging, where data from multiple sources is combined, is often considered to be a data cleaning activity.

We need to clean data because any analysis based on inaccurate data can produce misleading results. We want to ensure that the data we work with is quality data. Data quality involves:

  • Validity: Ensuring that the data possesses the correct form or structure
  • Accuracy: The values within the data are truly representative of the dataset
  • Completeness: There are no missing elements
  • Consistency: Changes to data are in sync
  • Uniformity: The same units of measurement are used

There are several techniques and tools used to clean data. We will examine the following approaches:

  • Handling different types of data
  • Cleaning and manipulating text data
  • Filling in missing data
  • Validating data

In addition, we will briefly examine several image enhancement techniques.

There are often many ways to accomplish the same cleaning task. For example, there are a number of GUI tools that support data cleaning, such as OpenRefine (http://openrefine.org/). This tool allows a user to read in a dataset and clean it using a variety of techniques. However, it requires a user to interact with the application for each dataset that needs to be cleaned. It is not conducive to automation.

We will focus on how to clean data using Java code. Even then, there may be different techniques to clean the data. We will show multiple approaches to provide the reader with insights on how it can be done. Sometimes, this will use core Java string classes, and at other times, it may use specialized libraries.

These libraries are often more expressive and efficient. However, there are times when using a simple string function is more than adequate to address the problem. Showing complementary techniques will improve the reader's skill set.

The basic text-based tasks include:

  • Data transformation
  • Data imputation (handling missing data)
  • Subsetting data
  • Sorting data
  • Validating data

In this chapter, we are interested in cleaning data. However, part of this process is extracting information from various data sources. The data may be stored in plaintext or in binary form. We need to understand the various formats used to store data before we can begin the cleaning process. Many of these formats were introduced in Chapter 2, Data Acquisition, but we will go into greater detail in the following sections.

Handling data formats

Data comes in all types of forms. We will examine the more commonly used formats and show how they can be extracted from various data sources. Before we can clean data, it needs to be extracted from a data source such as a file. In this section, we will build upon the introduction to data formats found in Chapter 2, Data Acquisition, and show how to extract all or part of a dataset. For example, from an HTML page we may want to extract only the text without markup. Or perhaps we are only interested in its figures.

These data formats can be quite complex. The intent of this section is to illustrate the basic techniques commonly used with that data format. Full treatment of a specific data format is beyond the scope of this book. Specifically, we will introduce how the following data formats can be processed from Java:

  • CSV data
  • Spreadsheets
  • Portable Document Format, or PDF files
  • JavaScript Object Notation, or JSON files

There are many other file types not addressed here. For example, jsoup is useful for parsing HTML documents. Since we introduced how this is done in the Web scraping in Java section of Chapter 2, Data Acquisition, we will not duplicate the effort here.

Handling CSV data

A common technique for separating information is to use commas or similar separators. Knowing how to work with CSV data allows us to utilize this type of data in our analysis efforts. When we deal with CSV data, there are several issues, including escaped data and embedded commas.

We will examine a few basic techniques for processing comma-separated data. Due to the row-column structure of CSV data, these techniques will read data from a file and place the data in a two-dimensional array. First, we will use a combination of the Scanner class to read in tokens and the String class split method to separate the data and store it in the array. Next, we will explore using the third-party library, OpenCSV, which offers a more efficient technique.

However, the first approach may only be appropriate for quick and dirty processing of data. We will discuss each of these techniques since they are useful in different situations.

We will use a dataset downloaded from https://www.data.gov/ containing U.S. demographic statistics sorted by ZIP code. This dataset can be downloaded at https://catalog.data.gov/dataset/demographic-statistics-by-zip-code-acfc9. For our purposes, this dataset has been stored in the file Demographics.csv. In this particular file, every row contains the same number of columns. However, not all data will be this clean and the solutions shown next take into account the possibility for jagged arrays.

Note

A jagged array is an array where the number of columns may be different for different rows. For example, row 2 may have 5 elements while row 3 may have 6 elements. When using jagged arrays you have to be careful with your column indexes.
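
A brief sketch of the idea:

// Rows of a jagged array may have different lengths 
String[][] jagged = new String[2][]; 
jagged[0] = new String[]{"a", "b", "c", "d", "e"};       // 5 elements 
jagged[1] = new String[]{"a", "b", "c", "d", "e", "f"};  // 6 elements 
out.println(jagged[0].length + " vs " + jagged[1].length); // 5 vs 6 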

First, we use the Scanner class to read in data from our data file. We will temporarily store the data in an ArrayList since we will not always know how many rows our data contains.

try (Scanner csvData = new Scanner(new File("Demographics.csv"))) {
    ArrayList<String> list = new ArrayList<String>();
    while (csvData.hasNextLine()) {
        list.add(csvData.nextLine());
    }
} catch (FileNotFoundException ex) {
    // Handle exceptions 
} 

The list is converted to an array using the toArray method. This version of the method uses a String array as an argument so that the method will know what type of array to create. A two-dimensional array is then created to hold the CSV data.

String[] tempArray = list.toArray(new String[1]); 
String[][] csvArray = new String[tempArray.length][]; 

The split method is used to create an array of Strings for each row. This array is assigned to a row of the csvArray.

for(int i=0; i<tempArray.length; i++) { 
    csvArray[i] = tempArray[i].split(","); 
} 

Our next technique will use a third-party library to read in and process CSV data. There are multiple options available, but we will focus on the popular OpenCSV (http://opencsv.sourceforge.net). This library offers several advantages over our previous technique. We can have an arbitrary number of items on each row without worrying about handling exceptions. We also do not need to worry about embedded commas or embedded carriage returns within the data tokens. The library also allows us to choose between reading the entire file at once or using an iterator to process data line-by-line.

First, we need to create an instance of the CSVReader class. Notice the second parameter allows us to specify the delimiter, a useful feature if we have a similar file format delimited by tabs or dashes, for example. If we want to read the entire file at one time, we use the readAll method.

CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ','); 
List<String[]> holdData = dataReader.readAll();

We can then process the rows as we did earlier, since each element of the returned list is already an array holding the tokens for one line. Alternatively, we can process the data one line at a time. In the example that follows, each token is printed out individually, but the tokens can also be stored in a two-dimensional array or other data structure as appropriate.

CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ','); 
String[] nextLine; 
while ((nextLine = dataReader.readNext()) != null) { 
    for (String token : nextLine) { 
        out.println(token); 
    } 
} 
dataReader.close(); 

We can now clean or otherwise process the array.
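
For instance, a minimal sketch, assuming the csvArray built earlier and that its first row holds the column headers from Demographics.csv:

// Print the header row, then the first field of each data row 
String[] header = csvArray[0]; 
out.println(String.join(", ", header)); 
for (int i = 1; i < csvArray.length; i++) { 
    out.println(csvArray[i][0]); 
} 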

Handling spreadsheets

Spreadsheets have proven to be a very popular tool for processing numeric and textual data. Due to the wealth of information that has been stored in spreadsheets over the past decades, knowing how to extract information from spreadsheets enables us to take advantage of this widely available data source. In this section, we will demonstrate how this is accomplished using the Apache POI API.

OpenOffice also provides a spreadsheet application. OpenOffice documents are stored in an XML format, which makes them readily accessible using XML parsing technologies. However, the Apache ODF Toolkit (http://incubator.apache.org/odftoolkit/) provides a means of accessing data within a document without knowing the format of the OpenOffice document. This is currently an incubator project and is not fully mature. There are a number of other APIs that can assist in processing OpenOffice documents, as detailed on the Open Document Format (ODF) for developers (http://www.opendocumentformat.org/developers/) page.
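
While we do not cover the toolkit further, a minimal sketch using its Simple API might look like the following; the file name Sample.ods and the cell position are assumptions for illustration:

// A sketch using the ODF Toolkit Simple API; Sample.ods is hypothetical 
try { 
    SpreadsheetDocument document = 
        SpreadsheetDocument.loadDocument(new File("Sample.ods")); 
    Table sheet = document.getSheetByIndex(0); 
    Cell cell = sheet.getCellByPosition(0, 0); 
    out.println(cell.getDisplayText()); 
    document.close(); 
} catch (Exception e) { 
    // Handle exceptions 
} 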

Handling Excel spreadsheets

Apache POI (http://poi.apache.org/index.html) is a set of APIs providing access to many Microsoft products including Excel and Word. It consists of a series of components designed to access a specific Microsoft product. An overview of these components is found at http://poi.apache.org/overview.html.

In this section we will demonstrate how to read a simple Excel spreadsheet using the XSSF component to access Excel 2007+ spreadsheets. The Javadocs for the Apache POI API is found at https://poi.apache.org/apidocs/index.html.

We will use a simple Excel spreadsheet consisting of a series of rows containing an ID along with minimum, maximum, and average values. These numbers are not intended to represent any specific type of data. The spreadsheet follows:

ID     Minimum  Maximum  Average
12345  45       89       65.55
23456  78       96       86.75
34567  56       89       67.44
45678  86       99       95.67

We start with a try-with-resources block to handle any IOExceptions that may occur:

try (FileInputStream file = new FileInputStream( 
        new File("Sample.xlsx"))) { 
    ... 
} catch (IOException e) { 
    // Handle exceptions 
} 

An instance of the XSSFWorkbook class is created using the spreadsheet. Since a workbook may consist of multiple sheets, we select the first one using the getSheetAt method.

XSSFWorkbook workbook = new XSSFWorkbook(file); 
XSSFSheet sheet = workbook.getSheetAt(0); 

The next step is to iterate through the rows, and then each column, of the spreadsheet:

for (Row row : sheet) { 
    for (Cell cell : row) { 
        ... 
    } 
    out.println(); 
} 

Each cell of the spreadsheet may use a different format. We use the getCellType method to determine its type and then use the appropriate method to extract the data in the cell. In this example we are only working with numeric and text data.

switch (cell.getCellType()) { 
    case Cell.CELL_TYPE_NUMERIC: 
        out.print(cell.getNumericCellValue() + "\t"); 
        break; 
    case Cell.CELL_TYPE_STRING: 
        out.print(cell.getStringCellValue() + "\t"); 
        break; 
} 

When executed we get the following output:

ID Minimum Maximum Average 
12345.0 45.0 89.0 65.55
23456.0 78.0 96.0 86.75
34567.0 56.0 89.0 67.44
45678.0 86.0 99.0 95.67

POI supports other more sophisticated classes and methods to extract data.
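
One example is the DataFormatter class, which returns each cell's value as the string that Excel would display, avoiding the trailing .0 seen in the numeric output above. A minimal sketch, reusing the sheet from the previous example:

// DataFormatter renders each cell as Excel would display it 
DataFormatter formatter = new DataFormatter(); 
for (Row row : sheet) { 
    for (Cell cell : row) { 
        out.print(formatter.formatCellValue(cell) + "\t"); 
    } 
    out.println(); 
} 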

Handling PDF files

There are several APIs supporting the extraction of text from a PDF file. Here we will use PDFBox. Apache PDFBox (https://pdfbox.apache.org/) is an open source API that allows Java programmers to work with PDF documents. In this section we will illustrate how to extract simple text from a PDF document. Javadocs for the PDFBox API are found at https://pdfbox.apache.org/docs/2.0.1/javadocs/.

This is a simple PDF file. It consists of several bullets:

  • Line 1
  • Line 2
  • Line 3

This is the end of the document.

A try block is used to catch IOExceptions. The PDDocument class will represent the PDF document being processed. Its load method will load in the PDF file specified by the File object:

try { 
    PDDocument document = PDDocument.load(new File("PDF File.pdf")); 
    ... 
} catch (IOException e) { 
    // Handle exceptions 
} 

Once loaded, the PDFTextStripper class getText method will extract the text from the file. The text is then displayed as shown here:

PDFTextStripper stripper = new PDFTextStripper(); 
String documentText = stripper.getText(document); 
out.println(documentText); 

The output of this example follows. Notice that the bullets are returned as question marks.

This is a simple PDF file. It consists of several bullets: 
? Line 1 
? Line 2 
? Line 3 
This is the end of the document.

This is a brief introduction to the use of PDFBox. It is a very powerful tool when we need to extract and otherwise manipulate PDF documents.
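
For example, the PDFTextStripper class can restrict extraction to a page range, which is useful for large documents. A minimal sketch, assuming the document variable from the previous example refers to a multi-page PDF:

// Restrict extraction to the first page only 
PDFTextStripper stripper = new PDFTextStripper(); 
stripper.setStartPage(1); 
stripper.setEndPage(1); 
String firstPage = stripper.getText(document); 
out.println(firstPage); 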

Handling JSON

In Chapter 2, Data Acquisition, we learned that certain YouTube searches return JSON formatted results. Specifically, the SearchResult class holds information relating to a specific search. In that section, we illustrated how to use YouTube-specific techniques to extract information. In this section, we will illustrate how to extract JSON information using the Jackson JSON implementation.

JSON supports three models for processing data:

  • Streaming API - JSON data is processed token by token
  • Tree model - The JSON data is held entirely in memory and then processed
  • Data binding - The JSON data is transformed to a Java object

Using the JSON streaming API

We will illustrate the first two approaches. The first approach is more efficient and is used when a large amount of data is processed. The second technique is convenient, but the data must not be too large. The third technique is useful when it is more convenient to use specific Java classes to process data. For example, if the JSON data represents an address, then a specific Java address class can be defined to hold and process the data.

There are several Java libraries that support JSON processing. We will use the Jackson Project (https://github.com/FasterXML/jackson). Documentation is found at https://github.com/FasterXML/jackson-docs. We will use two JSON files to demonstrate how it can be used. The first file, Person.json, is shown next; it stores data for a single person. It consists of four fields, where the last field is an array of location information.

{  
   "firstname":"Smith", 
   "lastname":"Peter",  
   "phone":8475552222, 
   "address":["100 Main Street","Corpus","Oklahoma"]  
} 

The code sequence that follows shows how to extract the values for each of the fields. Within the try-catch block, a JsonFactory instance is created, which then creates a JsonParser instance based on the Person.json file.

try { 
    JsonFactory jsonfactory = new JsonFactory(); 
    JsonParser parser = jsonfactory.createParser(new File("Person.json")); 
    ... 
    parser.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

The nextToken method advances the parser to the next token and returns it; the JsonParser object keeps track of the current token. The getCurrentName method returns the field name for the current token. The while loop terminates when the closing token of the object is reached.

while (parser.nextToken() != JsonToken.END_OBJECT) { 
    String token = parser.getCurrentName(); 
    ... 
} 

The body of the loop consists of a series of if statements that process each field by name. Since the address field is an array, another loop will extract each of its elements until the ending array token is reached.

if ("firstname".equals(token)) { 
    parser.nextToken(); 
    String fname = parser.getText(); 
    out.println("firstname : " + fname); 
} 
if ("lastname".equals(token)) { 
    parser.nextToken(); 
    String lname = parser.getText(); 
    out.println("lastname : " + lname); 
} 
if ("phone".equals(token)) { 
    parser.nextToken(); 
    long phone = parser.getLongValue(); 
    out.println("phone : " + phone); 
} 
if ("address".equals(token)) { 
    out.println("address :"); 
    parser.nextToken(); 
    while (parser.nextToken() != JsonToken.END_ARRAY) { 
        out.println(parser.getText()); 
    } 
} 

The output of this example follows:

firstname : Smith
lastname : Peter
phone : 8475552222
address :
100 Main Street
Corpus
Oklahoma

However, JSON objects are frequently more complex than the previous example. Here a Persons.json file consists of an array of three persons:

{ 
   "persons": { 
      "groupname": "school", 
      "person": 
         [  
            {"firstname":"Smith", 
              "lastname":"Peter",  
              "phone":8475552222, 
              "address":["100 Main Street","Corpus","Oklahoma"] }, 
           {"firstname":"King", 
              "lastname":"Sarah",  
              "phone":8475551111, 
              "address":["200 Main Street","Corpus","Oklahoma"] }, 
           {"firstname":"Frost", 
              "lastname":"Nathan",  
              "phone":8475553333, 
              "address":["300 Main Street","Corpus","Oklahoma"] } 
         ] 
   } 
} 

To process this file, we use a similar set of code as shown previously. We create the parser and then enter a loop as before:

try { 
    JsonFactory jsonfactory = new JsonFactory(); 
    JsonParser parser = jsonfactory.createParser(new File("Persons.json")); 
    while (parser.nextToken() != JsonToken.END_OBJECT) { 
        String token = parser.getCurrentName(); 
        ... 
    } 
    parser.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

However, we need to find the persons field and then extract each of its elements. The groupname field is extracted and displayed as shown here:

if ("persons".equals(token)) { 
    JsonToken jsonToken = parser.nextToken(); 
    jsonToken = parser.nextToken(); 
    token = parser.getCurrentName(); 
    if ("groupname".equals(token)) { 
        parser.nextToken(); 
        String groupname = parser.getText(); 
        out.println("Group : " + groupname); 
        ... 
    } 
} 

Next, we find the person field and call a parsePerson method to better organize the code:

parser.nextToken(); 
token = parser.getCurrentName(); 
if ("person".equals(token)) { 
    out.println("Found person"); 
    parsePerson(parser); 
} 

The parsePerson method, which follows, is very similar to the process used in the first example.

public void parsePerson(JsonParser parser) throws IOException { 
    while (parser.nextToken() != JsonToken.END_ARRAY) { 
        String token = parser.getCurrentName(); 
        if ("firstname".equals(token)) { 
            parser.nextToken(); 
            String fname = parser.getText(); 
            out.println("firstname : " + fname); 
        } 
        if ("lastname".equals(token)) { 
            parser.nextToken(); 
            String lname = parser.getText(); 
            out.println("lastname : " + lname); 
        } 
        if ("phone".equals(token)) { 
            parser.nextToken(); 
            long phone = parser.getLongValue(); 
            out.println("phone : " + phone); 
        } 
        if ("address".equals(token)) { 
            out.println("address :"); 
            parser.nextToken(); 
            while (parser.nextToken() != JsonToken.END_ARRAY) { 
                out.println(parser.getText()); 
            } 
        } 
    } 
} 

The output follows:

Group : school
Found person
firstname : Smith
lastname : Peter
phone : 8475552222
address :
100 Main Street
Corpus
Oklahoma
firstname : King
lastname : Sarah
phone : 8475551111
address :
200 Main Street
Corpus
Oklahoma
firstname : Frost
lastname : Nathan
phone : 8475553333
address :
300 Main Street
Corpus
Oklahoma

Using the JSON tree API

The second approach is to use the tree model. An ObjectMapper instance is used to create a JsonNode instance from the Persons.json file. The fieldNames method returns an Iterator, allowing us to process each field of the file.

try { 
    ObjectMapper mapper = new ObjectMapper(); 
    JsonNode node = mapper.readTree(new File("Persons.json")); 
    Iterator<String> fieldNames = node.fieldNames(); 
    while (fieldNames.hasNext()) { 
        ... 
        fieldNames.next(); 
    } 
} catch (IOException ex) { 
    // Handle exceptions 
} 

Since the JSON file contains a persons field, we will obtain a JsonNode instance representing the field and then iterate over each of its elements.

JsonNode personsNode = node.get("persons"); 
Iterator<JsonNode> elements = personsNode.iterator(); 
while (elements.hasNext()) { 
    ... 
} 

Each element is processed one at a time. If the element type is a string, we assume that this is the groupname field.

JsonNode element = elements.next(); 
JsonNodeType nodeType = element.getNodeType(); 
 
if (nodeType == JsonNodeType.STRING) { 
    out.println("Group: " + element.textValue()); 
} 

If the element is an array, we assume it contains a series of persons where each person is processed by the parsePerson method:

if (nodeType == JsonNodeType.ARRAY) { 
    Iterator<JsonNode> fields = element.iterator(); 
    while (fields.hasNext()) { 
        parsePerson(fields.next()); 
    } 
}

The parsePerson method is shown next:

public void parsePerson(JsonNode node) { 
    Iterator<JsonNode> fields = node.iterator(); 
    while(fields.hasNext()) { 
        JsonNode subNode = fields.next(); 
        out.println(subNode.asText()); 
    } 
}

The output follows:

Group: school
Smith
Peter
8475552222
King
Sarah
8475551111
Frost
Nathan
8475553333

There is much more to JSON than we are able to illustrate here. However, this should give you an idea of how this type of data can be handled.
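
As a final illustration, the data-binding model mentioned earlier maps JSON directly onto Java objects. A minimal sketch for Person.json follows, assuming a hypothetical Person class whose public fields mirror the JSON fields:

// Hypothetical POJO whose public fields match Person.json 
public class Person { 
    public String firstname; 
    public String lastname; 
    public long phone; 
    public String[] address; 
} 
 
// Jackson's ObjectMapper binds the file directly to a Person instance 
// (inside a try-catch block, as in the previous examples) 
ObjectMapper mapper = new ObjectMapper(); 
Person person = mapper.readValue(new File("Person.json"), Person.class); 
out.println(person.firstname + " " + person.lastname); 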

The nitty gritty of cleaning text

Strings are used to support text processing, so using a good string library is important. Unfortunately, the java.lang.String class has some limitations. To address these limitations, you can either implement your own special string functions as needed or you can use a third-party library.

Creating your own library can be useful, but you will basically be reinventing the wheel. It may be faster to write a simple code sequence to implement some functionality, but to do things right, you will need to test it. Third-party libraries have already been tested and have been used on hundreds of projects. They provide a more efficient way of processing text.

There are several text processing APIs available in addition to those found in Java. We will demonstrate two of these later in this chapter.

Java itself provides considerable support for cleaning text data, including methods in the String class. These methods are ideal for simple text cleaning and small amounts of data, but can also be effective with larger, complex datasets. We will demonstrate several String class methods in a moment. Some of the most helpful String class methods are summarized in the following table:

Method Name                    Return Type  Description
trim                           String       Removes leading and trailing blank spaces
toUpperCase/toLowerCase        String       Changes the casing of the entire string
replaceAll                     String       Replaces all occurrences of a character sequence within the string
contains                       boolean      Determines whether a given character sequence exists within the string
compareTo/compareToIgnoreCase  int          Compares two strings lexicographically and returns an integer representing their relationship
matches                        boolean      Determines whether the string matches a given regular expression
join                           String       Combines two or more strings with a specified delimiter
split                          String[]     Separates elements of a given string using a specified delimiter

Many text operations are simplified by the use of regular expressions. Regular expressions use standardized syntax to represent patterns in text, which can be used to locate and manipulate text matching the pattern.

A regular expression is simply a string itself. For example, the string Hello, my name is Sally can be used as a regular expression to find those exact words within a given text. This is very specific and not broadly applicable, but we can use a different regular expression to make our code more effective. Hello, my name is \\w will match any text that starts with Hello, my name is and ends with a word character.

We will use several examples of more complex regular expressions, and some of the more useful syntax options are summarized in the following table. Note each must be double-escaped when used in a Java application.

Option  Description
\d      Any digit: 0-9
\D      Any non-digit
\s      Any whitespace character
\S      Any non-whitespace character
\w      Any word character (including digits): A-Z, a-z, and 0-9
\W      Any non-word character
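
To see these options in action, the following sketch uses the java.util.regex classes to count occurrences of the pattern discussed earlier; the sample sentence is made up for illustration:

// Count matches of "Hello, my name is " followed by a word character 
String sample = "Hello, my name is Sally. Hello, my name is Sam."; 
Pattern pattern = Pattern.compile("Hello, my name is \\w"); 
Matcher matcher = pattern.matcher(sample); 
int count = 0; 
while (matcher.find()) { 
    count++; 
} 
out.println("Found " + count + " matches"); // Found 2 matches 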

The size and source of text data varies wildly from application to application, but the methods used to transform the data remain the same. You may actually need to read data from a file, but for simplicity's sake, we will be using a string containing the beginning sentences of Herman Melville's Moby Dick for several examples within this chapter. Unless otherwise specified, the text will be assumed to be as shown next:

String dirtyText = "Call me Ishmael. Some years ago- never mind how"; 
dirtyText += " long precisely - having little or no money in my purse,"; 
dirtyText += " and nothing particular to interest me on shore, I thought";  
dirtyText += " I would sail about a little and see the watery part of the world."; 

Using Java tokenizers to extract words

Often it is most efficient to analyze text data as tokens. There are multiple tokenizers available in the core Java libraries as well as third-party tokenizers. We will demonstrate various tokenizers throughout this chapter. The ideal tokenizer will depend upon the limitations and requirements of an individual application.

Java core tokenizers

StringTokenizer was the first and most basic tokenizer and has been available since Java 1. It is not recommended for use in new development as the String class's split method is considered more efficient. While it does provide a speed advantage for files with narrowly defined and set delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer class that splits a string on spaces:

StringTokenizer tokenizer = new StringTokenizer(dirtyText," "); 
while(tokenizer.hasMoreTokens()){ 
  out.print(tokenizer.nextToken() + " "); 
} 

When we set the dirtyText variable to hold our text from Moby Dick, shown previously, we get the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely...

StreamTokenizer is another core Java tokenizer. StreamTokenizer grants more information about the tokens retrieved, and allows the user to specify data types to parse, but is considered more difficult to use than StringTokenizer or the split method. The String class split method is the simplest way to split strings up based on a delimiter, but it does not provide a way to parse the split strings and you can only specify one delimiter for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.
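
For completeness, a minimal StreamTokenizer sketch follows; note how the ttype field reports whether each token was parsed as a word or a number:

// StreamTokenizer reports a type along with each token it reads 
try { 
    StreamTokenizer tokenizer = 
        new StreamTokenizer(new StringReader(dirtyText)); 
    while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) { 
        if (tokenizer.ttype == StreamTokenizer.TT_WORD) { 
            out.print(tokenizer.sval + " "); 
        } else if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) { 
            out.print(tokenizer.nval + " "); 
        } 
    } 
} catch (IOException ex) { 
    // Handle exceptions 
} 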

The Scanner class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.
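
For example, a brief sketch, using a made-up string of mixed tokens:

// Scanner parses successive tokens as the requested types 
Scanner scanner = new Scanner("42 3.14 tokens"); 
int number = scanner.nextInt();       // 42 
double ratio = scanner.nextDouble();  // 3.14 
String word = scanner.next();         // "tokens" 
out.println(number + " " + ratio + " " + word); 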

Third-party tokenizers and libraries

Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer. This class provides more advanced support than the standard StringTokenizer class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer:

StrTokenizer tokenizer = new StrTokenizer(text); 
while (tokenizer.hasNext()) { 
  out.print(tokenizer.next() + " "); 
} 

This operates in a similar fashion to StringTokenizer and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.

When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...

We can modify our constructor as follows:

StrTokenizer tokenizer = new StrTokenizer(text,","); 

The output for this implementation is:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse
and nothing particular to interest me on shore
I thought I would sail about a little and see the watery part of the world.

Notice how each line is split where commas existed in the original text. This delimiter can be a simple char, as we have shown, or a more complex StrMatcher object.

Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner class and the Splitter class. Tokenization is accomplished in Guava using its Splitter class's split method. The following is a simple example:

Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); 
Iterable<String> words = simpleSplit.split(dirtyText);  
for(String token: words){ 
  out.println(token); 
} 

This splits the text on commas and produces output like our last example. We can modify the parameter of the on method to split on the character of our choosing. Notice the method chaining, which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.

LingPipe is a linguistic toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Simple text cleaning section.

Transforming data into a usable form

Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform, split it into words, and store the words in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely...

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin searching within the text, and the last parameter specifies where to end:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences between our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if (dirtyText.contains(toFind)) { 
    String[] words = dirtyText.split(" "); 
    for (String word : words) { 
        if (word.equals(toFind)) { 
            count++; 
        } 
    } 
    out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 10)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as the next line is not empty:

String path = "C://MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0; // Tracks the current line number 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      text = text.replaceAll(toFind, replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although this will miss any instances where me falls at the end of a sentence and is followed by punctuation. We will examine a better alternative using Google Guava in a moment.

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace all non-word characters, such as hyphens and commas, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of extra spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call   Ishmael. So  years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is being conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing it with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data, and we cannot illustrate every possibility in this chapter. For example, however, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date could be assigned the average of the other temperatures in the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it matters: a single substituted zero lowered the computed average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach is to calculate the average of the values in the array, excluding zeros or nulls, and then substitute that average at each position with missing data. It is important to consider the type of data and the purpose of the data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not, if the temperatures were averages for Antarctica.
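As an illustration, the following sketch reuses the temperature data from above, with zero standing in for the missing January reading. It averages only the known values and then substitutes that mean for the missing entry:

double[] tempList = {0,56,65,70,74,80,82,90,83,78,64,52}; 
double sum = 0; 
int known = 0; 
for(double d : tempList){ 
      if(d != 0){  // treat zero as missing for this dataset 
            sum += d; 
            known++; 
      } 
} 
double mean = sum / known; 
for(int i = 0; i < tempList.length; i++){ 
      if(tempList[i] == 0){ 
            tempList[i] = mean;  // impute the mean for the missing month 
      } 
} 
out.printf("The imputed average temperature is %1$,.2f", mean); 

For these values the imputed mean is 72.18, far closer to the original average of 70.33 than the zero-skewed 66.17.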

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also created an instance of the Optional class called tempName. We will use this to test whether a value in the array is null or not. We then loop through our array and call the Optional class's ofNullable method, which returns an Optional describing the given value, or an empty Optional if the value is null. On the next line, we call the orElse method to either assign the value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
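For instance, orElseGet defers computing the default until the Optional is actually empty, and ifPresent executes code only when a value exists. A brief sketch:

Optional<String> tempName = Optional.ofNullable(null); 
String useName = tempName.orElseGet(() -> "DEFAULT");  // supplier runs only when empty 
tempName.ifPresent(n -> out.println("Name to use = " + n));  // skipped when empty 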

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the set. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<>(Arrays.asList(nums)); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()); 

The subSet method takes two parameters, which specify the range of elements we want to retrieve. The first parameter is inclusive while the second is exclusive. In the example that follows, we want to retrieve the subset of numbers from the first (lowest) number in our set, 12, up to but not including 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()); 

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 
SubSet of List: [12, 14, 34, 44]
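The SortedSet interface also provides the related headSet and tailSet methods, which return a view bounded on one side only. A short sketch using the same fullNumsList, where headSet excludes its argument and tailSet includes it:

out.println("HeadSet: " + fullNumsList.headSet(46));  // [12, 14, 34, 44] 
out.println("TailSet: " + fullNumsList.tailSet(46));  // [46, 52, 87, 123] 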

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance that iterates over the set. We will use the same numbers as in the previous example, this time held in a List called numsList, and we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the remaining elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
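If the file is instead known to begin with a header line, a skip call can discard it before the blank lines are filtered. A sketch, assuming the first line of text.txt is a header:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .skip(1)  // discard the header line 
         .filter(s -> !s.trim().isEmpty())  // discard blank or whitespace-only lines 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 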

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The parentheses after the equals sign contain the parameters to be passed to the lambda expression. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now call the Collections class's sort method, passing in our comparator:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
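Since both lambdas simply delegate to the elements' own comparison logic, Java 8's built-in natural-order comparator is an equivalent, shorter alternative:

Collections.sort(wordsList, Comparator.naturalOrder()); 
Collections.sort(numsList, Comparator.naturalOrder()); 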

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances, initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used anywhere a lambda expression can be used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each dog:
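The Dogs class itself is not shown in the text; a minimal sketch consistent with the calls that follow might look like this:

public class Dogs { 
      private String name; 
      private int age; 

      public Dogs(String name, int age){ 
            this.name = name; 
            this.age = age; 
      } 

      public String getName(){ return name; }  // accessor used when sorting by name 
      public int getAge(){ return age; }  // accessor used when sorting by age 
} 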

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.
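For example, a Double version differs only in the parse call and the output message. A minimal sketch:

public static void validateDouble(String toValidate){ 
      try{ 
            double validDouble = Double.parseDouble(toValidate); 
            out.println(validDouble + " is a valid double"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid double"); 
      } 
} 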

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

Apache Commons contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
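As a quick illustration of the range checks just mentioned, the following sketch assumes the method names from the Commons Validator routines API (validate, isInRange, and minValue):

IntegerValidator intValidator = IntegerValidator.getInstance(); 
Integer value = intValidator.validate("42");  // null if not a valid integer 
out.println(intValidator.isInRange(value, 0, 100));  // true: 42 lies within [0, 100] 
out.println(intValidator.minValue(value, 50));  // false: 42 is below the minimum of 50 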

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate + " is a valid date"; 
            }else{ 
                  return theDate + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
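As an alternative, the java.time API introduced in Java 8 performs the same check with strict parsing and no lenient date rollover. A sketch, assuming imports of LocalDate, DateTimeFormatter, ResolverStyle, and DateTimeParseException from java.time and java.time.format:

public static String validateDateModern(String theDate){ 
      DateTimeFormatter format = DateTimeFormatter.ofPattern("MM/dd/uuuu") 
            .withResolverStyle(ResolverStyle.STRICT); 
      try { 
            LocalDate.parse(theDate, format); 
            return theDate + " is a valid date"; 
      } catch (DateTimeParseException e) { 
            return theDate + " is not a valid date"; 
      } 
} 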

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

  • myemail@mail.com
  • MyEmail@some.mail.com
  • My.Email.123!@mail.net

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("myemail@mail.com")); 
out.println(validateEmail("My.Email.123!@mail.net")); 
out.println(validateEmail("myEmail")); 

The output follows:

myemail@mail.com is a valid email address
My.Email.123!@mail.net is a valid email address
myEmail is not a valid email address

The JavaMail API also provides support for validating e-mail addresses. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
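Testing this version against the same data as before, we would expect the local address to be rejected, since by default EmailValidator requires a top-level domain; passing true to getInstance (an overload assumed here) relaxes that for local addresses:

out.println(validateEmailApache("myemail@mail.com"));  // a valid email address 
out.println(validateEmailApache("myEmail@mail"));  // not a valid email address 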
 

Validating ZIP codes

Postal codes are generally formatted according to country-specific or local requirements. For this reason, regular expressions are useful, because they can accommodate whatever postal code format is required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
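Adapting the expression to another country is mostly a matter of swapping the pattern. For example, a sketch for Canadian postal codes, which follow a letter-digit-letter space digit-letter-digit format such as K1A 0B1:

String caPostalRegex = "^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"; 
out.println("K1A 0B1".matches(caPostalRegex));  // true 
out.println("12345".matches(caPostalRegex));  // false 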

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L} provides this flexibility. We also use \\s.,'- to allow spaces, periods, commas, apostrophes, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.,'-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 10)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as the next line is not empty:

String path = "C:// MobyDick.txt"; 
try { 
    String textLine = ""; 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      text = text.replaceAll(toFind, replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me , although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace all non-word characters, such as hyphens and commas, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modify text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it is statistically significant. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also created an instance of the Optional class called tempName. We will use this to test whether a value in the array is null or not. We then loop through our array and create and call the Optional class ofNullable method. This method tests whether a particular value is null or not. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then create a new TreeSet object to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers between the first number in our array 12 and 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 
SubSet of List: [12, 14, 34, 44]

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList as in the previous example, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parenthesis after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now call the sort method as we did previously:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordList and numsList are both ArrayList and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. This is can be used where a lambda expression is used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dog class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the names and ages of each Dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dog class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dog objects sorted first by Name and then by Age:

      dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

   dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
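
As a quick illustration of the modification mentioned above, a minimal sketch for double values simply swaps in the parseDouble method of the Double class; Float.parseFloat follows the same pattern:

public static void validateDouble(String toValidate){ 
      try{ 
            // parseDouble throws NumberFormatException for non-numeric text 
            double validDouble = Double.parseDouble(toValidate); 
            out.println(validDouble + " is a valid double"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid double"); 
      } 
} 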

Apache Commons contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again test our method with the same inputs, this time printing the String that is returned:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
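
Before moving on, here is a brief sketch of those comparison methods, assuming the minValue, maxValue, and isInRange methods of the routines package's IntegerValidator behave as their names suggest:

IntegerValidator intValidator = IntegerValidator.getInstance(); 
Integer parsed = intValidator.validate("42");         // null if the text is not an integer 
out.println(intValidator.minValue(parsed, 0));        // true: 42 is at least 0 
out.println(intValidator.maxValue(parsed, 100));      // true: 42 is at most 100 
out.println(intValidator.isInRange(parsed, 50, 100)); // false: 42 falls below the range 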

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
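
Apache Commons also bundles this kind of pattern check into its DateValidator class. A brief sketch, assuming the isValid(value, pattern) overload from the routines package, follows:

DateValidator dateValidator = DateValidator.getInstance(); 
// true only when the text parses against the given pattern 
out.println(dateValidator.isValid("12/12/1982", "MM/dd/yyyy")); 
out.println(dateValidator.isValid("12/12/82", "MM/dd/yyyy")); 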

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

  • myemail@mail.com
  • MyEmail@some.mail.com
  • My.Email.123!@mail.net

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" + 
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("myemail@mail.com")); 
out.println(validateEmail("My.Email.123!@mail.net")); 
out.println(validateEmail("myEmail")); 

The output follows:

myemail@mail.com is a valid email address
My.Email.123!@mail.net is a valid email address
myEmail is not a valid email address

The JavaMail API also provides support for validating e-mail addresses. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
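
When run against the same test data, this version should reject myEmail@mail, unlike the InternetAddress approach. The getInstance method also has a boolean overload that controls whether local addresses are accepted; the following sketch assumes that getInstance(true) permits addresses such as user@localhost:

out.println(validateEmailApache("myemail@mail.com")); 
out.println(validateEmailApache("myEmail@mail")); 
 
EmailValidator localValidator = EmailValidator.getInstance(true); 
out.println(localValidator.isValid("user@localhost")); // true when local addresses are allowed 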
 

Validating ZIP codes

Postal codes are generally formatted according to country-specific or local requirements. Regular expressions are useful here because they can accommodate whatever format is required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

out.println(validateZip("12345")); 
out.println(validateZip("12345-6789")); 
out.println(validateZip("123")); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s-',. in the character class to allow spaces, hyphens, apostrophes, commas, and periods in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s-',.]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Java core tokenizers

StringTokenizer was the first and most basic tokenizer and has been available since Java 1. It is not recommended for use in new development, as the String class's split method is considered more efficient. While StringTokenizer does provide a speed advantage for input with simple, fixed delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer class that splits a string on spaces:

StringTokenizer tokenizer = new StringTokenizer(dirtyText," "); 
while(tokenizer.hasMoreTokens()){ 
  out.print(tokenizer.nextToken() + " "); 
} 

When we set the dirtyText variable to hold our text from Moby Dick, shown previously, we get the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely...
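
For comparison, the split method recommended above accomplishes the same task; a minimal sketch follows:

for(String token : dirtyText.split(" ")){ 
  out.print(token + " "); 
} 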

StreamTokenizer is another core Java tokenizer. StreamTokenizer grants more information about the tokens retrieved, and allows the user to specify data types to parse, but it is considered more difficult to use than StringTokenizer or the split method. The String class split method is the simplest way to split strings based on a delimiter, but it does not provide a way to parse the split strings, and you can only specify one delimiter pattern for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.
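
A minimal StreamTokenizer sketch follows; it reads from a Reader (here a StringReader wrapped around our text) and reports each token's type, which is where the extra information comes from:

StreamTokenizer streamTokenizer = new StreamTokenizer(new StringReader(dirtyText)); 
try { 
    while(streamTokenizer.nextToken() != StreamTokenizer.TT_EOF){ 
        if(streamTokenizer.ttype == StreamTokenizer.TT_WORD){ 
            out.print(streamTokenizer.sval + " ");   // a word token 
        }else if(streamTokenizer.ttype == StreamTokenizer.TT_NUMBER){ 
            out.print(streamTokenizer.nval + " ");   // a numeric token 
        } 
    } 
} catch (IOException ex) { 
    // Handle exceptions 
} 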

The Scanner class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.
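
As a short sketch of that typed parsing, a Scanner can pull numbers and words out of the same string; the input here is purely hypothetical:

Scanner scanner = new Scanner("some years 8 ago"); // hypothetical input 
while(scanner.hasNext()){ 
    if(scanner.hasNextInt()){ 
        out.println("Number: " + scanner.nextInt()); 
    }else{ 
        out.println("Word: " + scanner.next()); 
    } 
} 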

Third-party tokenizers and libraries

Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer. This class provides more advanced support than the standard StringTokenizer class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer:

StrTokenizer tokenizer = new StrTokenizer(text); 
while (tokenizer.hasNext()) { 
  out.print(tokenizer.next() + " "); 
} 

This operates in a similar fashion to StringTokenizer and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.

When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...

We can modify our constructor as follows:

StrTokenizer tokenizer = new StrTokenizer(text,","); 

The output for this implementation is:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse
and nothing particular to interest me on shore
I thought I would sail about a little and see the watery part of the world.

Notice how each line is split where commas existed in the original text. This delimiter can be a simple char, as we have shown, or a more complex StrMatcher object.
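
A sketch of the StrMatcher variant, assuming the charSetMatcher factory method, treats both commas and periods as delimiters:

StrTokenizer tokenizer = new StrTokenizer(text, StrMatcher.charSetMatcher(",.")); 
while (tokenizer.hasNext()) { 
  out.println(tokenizer.next()); 
} 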

Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner class and the Splitter class. Tokenization is accomplished in Guava using its Splitter class's split method. The following is a simple example:

Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); 
Iterable<String> words = simpleSplit.split(dirtyText);  
for(String token: words){ 
  out.println(token); 
} 

This splits each token on commas and produces output like our last example. We can modify the parameter of the on method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.
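
For instance, a sketch that tokenizes on single spaces rather than commas only requires changing the argument passed to the on method:

Splitter wordSplit = Splitter.on(' ').omitEmptyStrings().trimResults(); 
for(String token : wordSplit.split(dirtyText)){ 
  out.print(token + " "); 
} 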

LingPipe is a linguistic toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Removing stop words section.

Transforming data into a usable form

Data often needs to be cleaned once it has been acquired. Datasets are frequently inconsistent, missing information, or cluttered with extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyze.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but approaches the problem using the split method. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods; and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new ArrayList<String>( 
      Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList<String>(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin searching within the text, and the last parameter specifies where to end:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences between our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 10)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process the file for as long as there are lines left to read:

String path = "C://MobyDick.txt"; 
try { 
    String textLine = ""; 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method, wrapping our search term in \\b word boundaries so that only whole-word matches are modified:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although that approach will miss any instance where me is immediately followed by punctuation, such as at the end of a sentence. We will examine a better alternative using Google Guava in a moment.

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace every non-word character that is followed by a whitespace character with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of extra spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analyzed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is being conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data, and we cannot illustrate every case in this chapter. For example, however, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date could be assigned the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it is statistically significant. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
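
A minimal sketch of that alternative, treating zero as the missing-value marker in our tempList, computes the average of the non-zero readings and writes it back into the empty position:

double sum = 0; 
int count = 0; 
for(double d : tempList){ 
      if(d != 0){            // skip the missing reading 
            sum += d; 
            count++; 
      } 
} 
double average = sum / count;    // average of the known readings only 
for(int i = 0; i < tempList.length; i++){ 
      if(tempList[i] == 0){ 
            tempList[i] = average;    // substitute the representative value 
      } 
} 
out.printf("The imputed average temperature is %1$,.2f", average); 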

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also created an instance of the Optional class called tempName. We will use this to test whether a value in the array is null or not. We then loop through our array and call the Optional class's ofNullable method. This method tests whether a particular value is null or not. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
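
As a short sketch of two of those methods, isPresent tests whether a value exists and map applies a transformation only when one does:

Optional<String> maybeName = Optional.ofNullable(null); 
out.println("Has value: " + maybeName.isPresent());    // false 
String upperName = Optional.ofNullable("Betsy") 
      .map(String::toUpperCase)    // skipped when the Optional is empty 
      .orElse("DEFAULT"); 
out.println("Name to use = " + upperName); 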

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then create a new TreeSet object to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>( 
    new ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results, while the second is exclusive. In the example that follows, we want to retrieve a subset containing all numbers from the first number in our array, 12, up to but not including 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()); 

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 
SubSet of List: [12, 14, 34, 44]

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList as in the previous example, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
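
Conversely, if we wanted to keep the five lowest numbers rather than discard them, a sketch using the limit method performs the opposite of skip:

Set<Integer> lowestFive = fullNumsList 
         .stream() 
         .limit(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("Lowest five: " + lowestFive.toString()); 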

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parenthesis after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now call the sort method as we did previously:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordList and numsList are both ArrayList and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. This is can be used where a lambda expression is used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dog class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the names and ages of each Dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dog class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dog objects sorted first by Name and then by Age:

      dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

   dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
try{ 
      int validInt = Integer.parseInt(toValidate); 
      out.println(validInt + " is a valid integer"); 
}catch(NumberFormatException e){ 
      out.println(toValidate + " is not a valid integer"); 
 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The Apache Commons contain an IntegerValidator class with additional useful functionalities. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a ranger of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

  • myemail@mail.com
  • MyEmail@some.mail.com
  • My.Email.123!@mail.net

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zAZ\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("myemail@mail.com")); 
out.println(validateEmail("My.Email.123!@mail.net")); 
out.println(validateEmail("myEmail")); 

The output follows:

myemail@mail.com is a valid email address
My.Email.123!@mail.net is a valid email address
myEmail is not a valid email address

There is a standard Java library for validating e-mail addresses as well. In this example, we use the InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

out.println(validateZip("12345")); 
out.println(validateZip("12345-6789")); 
out.println(validateZip("123")); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L} provides this flexibility. We also use  \\s-', to allow spaces, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s-',]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Third-party tokenizers and libraries

Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer. This class provides more advanced support than the standard StringTokenizer class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer:

StrTokenizer tokenizer = new StrTokenizer(text); 
while (tokenizer.hasNext()) { 
  out.print(tokenizer.next() + " "); 
} 

This operates in a similar fashion to StringTokenizer and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.

When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...

We can modify our constructor as follows:

StrTokenizer tokenizer = new StrTokenizer(text,","); 

The output for this implementation is:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse
and nothing particular to interest me on shore
I thought I would sail about a little and see the watery part of the world.

Notice how each line is split where commas existed in the original text. This delimiter can be a simple char, as we have shown, or a more complex StrMatcher object.

Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner class and the Splitter class. Tokenization is accomplished in Guava using its Splitter class's split method. The following is a simple example:

Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); 
Iterable<String> words = simpleSplit.split(dirtyText);  
for(String token: words){ 
  out.print(token); 
} 

This splits each token on commas and produces output like our last example. We can modify the parameter of the on method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.

LingPipe is a linguistical toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Simple text cleaning section.

Transforming data into a usable form

Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing in information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText =    dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.
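
When nulls should be kept visible rather than skipped, Joiner's useForNull method substitutes a placeholder instead. A minimal sketch, with an invented array for illustration:

String[] mixed = {"call", null, "ishmael"};
// Prints: call <missing> ishmael
out.println(Joiner.on(" ").useForNull("<missing>").join(mixed));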

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform, split it into individual words, and store those words in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList<String>(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely

There is also support outside the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized to lowercase and trimmed. Then we create a new instance of the TokenizerFactory interface. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method takes a char array, so we call toCharArray against our text. The second parameter specifies where to begin within the text, and the last parameter specifies how many characters to process:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences between our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
      out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 10)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
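
If positions and counts matter, the standard Pattern and Matcher classes can report both. A brief sketch, reusing dirtyText and toFind from above (\\b restricts matching to whole words):

Pattern pattern = Pattern.compile("\\b" + Pattern.quote(toFind) + "\\b");
Matcher matcher = pattern.matcher(dirtyText);
int hits = 0;
while (matcher.find()) {
    hits++;
    out.println("Found at index " + matcher.start());
}
out.println("Total matches: " + hits);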

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as the next line is not empty:

String path = "C://MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0; 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, the text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string, wrapping the search term in \\b word boundaries so that only whole words are replaced:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even when it is contained within other words, such as Some. This can be avoided by adding spaces around me, although that will miss instances where me ends a sentence and is followed by punctuation. We will examine a better alternative using Google Guava in a moment.
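
In the meantime, the word-boundary technique from the replaceAll example also works here; a one-line sketch (\\b matches the edge of a word, so Some is left alone while a trailing me. still matches):

out.println(text.replaceAll("\\bme\\b", "X"));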

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace all non-word characters, such as hyphens and commas, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options for handling the data. These include retaining the data, replacing the data, and trimming whitespace from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of extra spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call   Ishmael. So  years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation in which every once in a while a word is dropped. Sometimes we can infer what is intended; in other situations, we may be completely lost as to what is being conveyed.

Among statistical analysts, there are differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data, and we cannot illustrate every case in this chapter. For example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date could be assigned the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it can be statistically significant. Depending upon the variation within a given dataset, and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
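
A minimal sketch of that approach, treating zero as the missing-value sentinel from the earlier temperature example:

// Average the non-zero readings, then substitute that average
// wherever the zero sentinel appears.
double total = 0;
int present = 0;
for (double d : tempList) {
    if (d != 0) { total += d; present++; }
}
double mean = present > 0 ? total / present : 0;
for (int i = 0; i < tempList.length; i++) {
    if (tempList[i] == 0) { tempList[i] = mean; }
}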

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also declared an Optional<String> variable called tempName, which we use to test whether a value in the array is null or not. We then loop through our array, calling the Optional class's ofNullable method, which wraps a value that may be null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
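
For instance, the map, orElseGet, and ifPresent methods compose cleanly with the example above; a brief sketch:

Optional<String> maybe = Optional.ofNullable(nameList[0]);
// map transforms the value only when one is present;
// orElseGet supplies a default lazily.
String upper = maybe.map(String::toUpperCase).orElseGet(() -> "DEFAULT");
// ifPresent runs the action only for non-null values.
maybe.ifPresent(n -> out.println("Have a name: " + n));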

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet reference to hold the subset retrieved from the list. Next, we print out our original list together with its last (largest) element:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>( 
    new ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In the example that follows, we retrieve the subset running from the first number in our sorted set, 12, up to but not including 46, and print it along with its size:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, stored this time in a List called numsList, but now we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the remaining elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
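
Conversely, if we wanted to keep the five lowest numbers rather than discard them, the limit method is the natural complement to skip; a brief sketch:

Set<Integer> lowestFive = fullNumsList 
         .stream() 
         .limit(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("Lowest five: " + lowestFive); // [12, 14, 34, 44, 46]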

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
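
The same filtering can be written with the NIO Files class, which streams the file and closes it automatically inside try-with-resources; a brief sketch, assuming java.nio.file.Files and Paths are imported:

try (Stream<String> lines = Files.lines(Paths.get("C:\\text.txt"))) { 
    lines.filter(s -> !s.isEmpty()) 
         .forEach(out::println); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
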
Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. The numsList and wordsList variables used throughout this section are ArrayLists of integers and strings; their initialization is shown in the Collections example that follows. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to the lambda expression. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now pass this comparator to the Collections.sort method:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both ArrayLists and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First, we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference, which can be used wherever a lambda expression could appear:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
    try{ 
        int validInt = Integer.parseInt(toValidate); 
        out.println(validInt + " is a valid integer"); 
    }catch(NumberFormatException e){ 
        out.println(toValidate + " is not a valid integer"); 
    } 
} 
We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
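
As an illustration of how easily the technique adapts, here is a sketch of a hypothetical validateDouble method, identical in shape but built around Double.parseDouble:

public static void validateDouble(String toValidate){ 
    try{ 
        double validDouble = Double.parseDouble(toValidate); 
        out.println(validDouble + " is a valid double"); 
    }catch(NumberFormatException e){ 
        out.println(toValidate + " is not a valid double"); 
    } 
} 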

Apache Commons contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method, this time printing the returned strings:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
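
A brief sketch of the range-checking methods (isInRange, minValue, and maxValue belong to the same IntegerValidator API):

IntegerValidator intValidator = IntegerValidator.getInstance(); 
out.println(intValidator.isInRange(1234, 0, 9999)); // true 
out.println(intValidator.minValue(1234, 2000)); // false: below the minimum 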

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient simply to verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate + " is a valid date"; 
            }else{ 
                  return theDate + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
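
One way to loosen the restriction is to test a date against a list of acceptable formats, reusing the validateDate method above; a brief sketch:

String[] formats = {"MM/dd/yyyy", "MM/dd/yy"}; 
for (String fmt : formats) { 
    // 12/12/82 fails the four-digit format but passes the two-digit one. 
    out.println(validateDate("12/12/82", fmt)); 
} 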

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

  • myemail@mail.com
  • MyEmail@some.mail.com
  • My.Email.123!@mail.net

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("myemail@mail.com")); 
out.println(validateEmail("My.Email.123!@mail.net")); 
out.println(validateEmail("myEmail")); 

The output follows:

myemail@mail.com is a valid email address
My.Email.123!@mail.net is a valid email address
myEmail is not a valid email address

E-mail addresses can also be validated using the JavaMail API rather than a hand-written regular expression. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
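
When we test this version against the same addresses, EmailValidator rejects the local address, because by default it requires a valid top-level domain (an overload of getInstance accepts an allowLocal flag to relax this):

out.println(validateEmailApache("myemail@mail.com")); 
out.println(validateEmailApache("myEmail@mail")); 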
 

Validating ZIP codes

Postal codes are generally formatted according to country-specific or local requirements. For this reason, regular expressions are useful, because they can accommodate whatever postal code format is required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

validateZip("12345"); 
validateZip("12345-6789"); 
validateZip("123"); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s, the hyphen, the apostrophe, and the comma in the character class to allow spaces, hyphens, apostrophes, and commas in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s-',]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the accented characters in François and Müller are considered valid.

Transforming data into a usable form

Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing in information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText =    dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words =    dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words =    dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words are there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods—and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new    ArrayList<String>(Arrays.asList((dirtyText)); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin searching within the text, and the last parameter specifies where to end:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences between our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 10)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as the next line is not empty:

String path = "C:// MobyDick.txt"; 
try { 
    String textLine = ""; 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      text = text.replaceAll(toFind, replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me , although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace all non-word characters, such as hyphens and commas, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modify text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it is statistically significant. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also created an instance of the Optional class called tempName. We will use this to test whether a value in the array is null or not. We then loop through our array and create and call the Optional class ofNullable method. This method tests whether a particular value is null or not. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then create a new TreeSet object to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers between the first number in our array 12 and 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()); 

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 
SubSet of List: [12, 14, 34, 44]

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance that iterates over the set. We will use the same numbers as in the previous example, now held in a List called numsList, and specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the remaining elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
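
The same stream pipeline can also drop the header lines mentioned above; a hedged variation, assuming a single header line at the top of the file:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .skip(1)                          // drop the assumed single header line 
         .filter(s -> !s.trim().isEmpty()) // drop blank or whitespace-only lines 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 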

Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added in Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression, applied to numsList and wordsList, two lists whose initialization is shown in the second example below.

We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now call the sort method as we did previously:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List objects and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First, we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference, which can be used wherever a lambda expression is expected:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods.
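
The following is our own minimal sketch of such a class; the field and accessor names are inferred from how the class is used below:

public class Dogs { 
      private String name; 
      private int age; 
 
      public Dogs(String name, int age) { 
            this.name = name; 
            this.age = age; 
      } 
 
      public String getName() { return name; } 
 
      public int getAge() { return age; } 
} 

With the class in place, we add elements to a new ArrayList and print the name and age of each Dogs object: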

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14

Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
      try{ 
            int validInt = Integer.parseInt(toValidate); 
            out.println(validInt + " is a valid integer"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid integer"); 
      } 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer
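
As noted above, the same pattern adapts readily to other numeric types; for example, a sketch for double values:

public static void validateDouble(String toValidate){ 
      try{ 
            double validDouble = Double.parseDouble(toValidate); 
            out.println(validDouble + " is a valid double"); 
      }catch(NumberFormatException e){ 
            out.println(toValidate + " is not a valid double"); 
      } 
} 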

Apache Commons contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again test our method, this time printing the String it returns:

out.println(validateInt("1234")); 
out.println(validateInt("Ishmael")); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, check whether it falls within a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes; we will examine a few more in the rest of this section.
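
As a brief, hedged illustration of those range checks (method names per the Commons Validator API as we understand it):

IntegerValidator intValidator = IntegerValidator.getInstance(); 
Integer value = intValidator.validate("42");  // parses the text, or returns null 
if(value != null && intValidator.isInRange(value, 0, 100)){ 
      out.println(value + " is an integer between 0 and 100"); 
} 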

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to verify only that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we format the resulting Date back into a String and compare it with the original input; only if the two match exactly is the date considered valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate + " is a valid date"; 
            }else{ 
                  return theDate + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
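
If several formats are acceptable, one simple, hedged approach is to test the candidate against each format in turn using the validateDate method above:

String[] acceptedFormats = {"MM/dd/yyyy", "MM/dd/yy"}; 
for(String format : acceptedFormats){ 
      out.println(validateDate("12/12/82", format)); 
} 

With the second format in the list, 12/12/82 would now be reported as valid.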

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

  • myemail@mail.com
  • MyEmail@some.mail.com
  • My.Email.123!@mail.net

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern pattern = Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("myemail@mail.com")); 
out.println(validateEmail("My.Email.123!@mail.net")); 
out.println(validateEmail("myEmail")); 

The output follows:

myemail@mail.com is a valid email address
My.Email.123!@mail.net is a valid email address
myEmail is not a valid email address

The InternetAddress class from the JavaMail API can also be used to validate e-mail addresses. In this example, we use it to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
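
When we rerun the earlier problem case with this version, we would expect the stricter result, since the default EmailValidator instance does not accept local addresses:

out.println(validateEmailApache("myEmail@mail")); 

This should report that myEmail@mail is not a valid email address.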
 

Validating ZIP codes

Postal code formats are generally specific to a country or locality. For this reason, regular expressions are useful, because the pattern can be tailored to whatever postal code format is required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

out.println(validateZip("12345")); 
out.println(validateZip("12345-6789")); 
out.println(validateZip("123")); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
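
To illustrate how the expression can be tailored to other countries, here is a hedged sketch for Canadian postal codes, which take the form A1A 1A1:

public static void validateCanadianPostal(String code){ 
      String postalRegex = "^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"; 
      Pattern pattern = Pattern.compile(postalRegex); 
      Matcher matcher = pattern.matcher(code); 
      if(matcher.matches()){ 
            out.println(code + " is a valid postal code"); 
      }else{ 
            out.println(code + " is not a valid postal code"); 
      } 
} 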

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s.,'- in the character class to allow spaces, periods, commas, apostrophes, and hyphens in our name fields. It is possible to perform string cleaning, as discussed elsewhere in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s.,'-]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Simple text cleaning

We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:

out.println(dirtyText); 
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " "); 
dirtyText = dirtyText.trim(); 
while(dirtyText.contains("  ")){ 
  dirtyText = dirtyText.replaceAll("  ", " "); 
} 
out.println(dirtyText);  

When executed, the code produces the following output, truncated:

Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely

Our next example produces the same result but approaches the problem by splitting the text with a regular expression. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:

out.println(dirtyText); 
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); 
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); 
for(String clean : cleanText){ 
  out.print(clean + " ");
} 

This code produces the same output as shown previously.

Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:

out.println(dirtyText); 
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = String.join(" ", words); 
out.println(cleanText); 

Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:

out.println(dirtyText);  
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); 
String cleanText = Joiner.on(" ").skipNulls().join(words); 
out.println(cleanText); 

This version provides additional options, including skipping nulls, as shown before. The output remains the same.
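
For instance, rather than skipping nulls, the Joiner can substitute a placeholder for them; a small sketch:

String[] wordsWithNull = {"call", null, "ishmael"}; 
out.println(Joiner.on(" ").useForNull("[missing]").join(wordsWithNull)); 

This would print call [missing] ishmael.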

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform, split it into words, and store the result in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods (and is not the same as AND or And, although all three may be stop words you wish to eliminate):

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList<>(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin tokenizing within the text, and the last parameter specifies how many characters to process:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences between our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 10)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
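
If a count is still needed, repeated calls will work, because each successful call advances the Scanner past the match. A sketch (note this counts substring occurrences rather than whole words, unlike the loop above):

Scanner counter = new Scanner(dirtyText); 
int found = 0; 
while(counter.findWithinHorizon(toFind, 0) != null){ 
      found++; 
} 
out.println("Found " + toFind + " " + found + " times in the text."); 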

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process the file line by line until readLine returns null at the end of the file:

String path = "C://MobyDick.txt"; 
try { 
    String textLine = ""; 
    int line = 0; // track the current line number 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also replace it with something else. We begin our next example much like the previous ones, by specifying our text, the text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string, wrapping the search term in word-boundary anchors (\\b) so that only standalone occurrences are replaced:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although that will miss any instance where me ends a sentence and is followed by punctuation. We will examine a better alternative using Google Guava in a moment.
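
Before that, a regex-based alternative, sketched here with the replacePattern method described next, uses word boundaries to handle both cases:

out.println(StringUtils.replacePattern(text, "\\bme\\b", "X")); 

This should replace me only where it stands alone, leaving words such as some intact.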

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace all non-word characters, such as hyphens and commas, with a single space:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options for how to handle the data, including retaining the data, replacing it, and trimming whitespace from within a particular string.

In this example, we are going to use the replace method to replace all instances of the word me with a single space. This will produce runs of extra spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...
Notice that while this change may seem rather minor, it is statistically significant. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also created an instance of the Optional class called tempName. We will use this to test whether a value in the array is null or not. We then loop through our array and create and call the Optional class ofNullable method. This method tests whether a particular value is null or not. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then create a new TreeSet object to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new 
ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()  
    + " " + fullNumsList.last()); 

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers between the first number in our array 12 and 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()  
    + " " + partNumsList.size());       

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 
SubSet of List: [12, 14, 34, 44]

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList as in the previous example, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
   br 
         .lines() 
         .filter(s -> !s.equals("")) 
         .forEach(s -> out.println(s)); 
} catch (IOException ex) { 
   // Handle exceptions 
} 
Sorting text

Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.

We start by declaring our Comparator variable compareInts. The first set of parenthesis after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:

 Comparator<Integer> compareInts = (Integer first, Integer second) ->
   Integer.compare(first, second); 
 

We can now call the sort method as we did previously:

 
Collections.sort(numsList,compareInts); 
out.println("Sorted integers using Lambda: " + numsList.toString()); 
 

Our output follows:

Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]

We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

 
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); 
Collections.sort(wordsList,compareWords); 
out.println("Sorted words using Lambda: " + wordsList.toString()); 

When this code is executed, we should see the following output:

Sorted words using Lambda: [boat, cat, dog, house, road, zoo]

In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordList and numsList are both ArrayList and are initialized as follows:

List<String> wordsList 
        = Stream.of("cat", "dog", "house", "boat", "road", "zoo") 
        .collect(Collectors.toList()); 
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) 
        .collect(Collectors.toList()); 

First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

out.println("Original Word List: " + wordsList.toString()); 
Collections.sort(wordsList); 
out.println("Ascending Word List: " + wordsList.toString()); 
out.println("Original Integer List: " + numsList.toString()); 
Collections.sort(numsList); 
out.println("Ascending Integer List: " + numsList.toString()); 

The output follows:

Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]

Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

 out.println("Original Integer List: " + numsList.toString()); 
 Collections.reverse(numsList); 
 out.println("Reversed Integer List: " + numsList.toString()); 
 
 

The output displays our new numsList:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]

In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. This is can be used where a lambda expression is used:

out.println("Original Integer List: " + numsList.toString()); 
Comparator<Integer> basicOrder = Integer::compare; 
Comparator<Integer> descendOrder = basicOrder.reversed(); 
Collections.sort(numsList,descendOrder); 
out.println("Descending Integer List: " + numsList.toString()); 
 

After we execute this code, we will see the following output:

Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]

In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dog class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the names and ages of each Dog:

 
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); 
dogs.add(new Dogs("Zoey", 8)); 
dogs.add(new Dogs("Roxie", 10)); 
dogs.add(new Dogs("Kylie", 7)); 
dogs.add(new Dogs("Shorty", 14)); 
dogs.add(new Dogs("Ginger", 7)); 
dogs.add(new Dogs("Penny", 7)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output should resemble:

Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7

Next, we are going to use method chaining and the double colon operator to reference methods from the Dog class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dog objects sorted first by Name and then by Age:

      dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

Our output follows:

Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8

Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:

   dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); 
out.println("Name " + " Age"); 
for(Dogs d : dogs){ 
      out.println(d.getName() + " " + d.getAge()); 
} 

And our output is:

Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
Data validation

Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.

Validating data types

Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.

We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

public static void validateInt(String toValidate){ 
try{ 
      int validInt = Integer.parseInt(toValidate); 
      out.println(validInt + " is a valid integer"); 
}catch(NumberFormatException e){ 
      out.println(toValidate + " is not a valid integer"); 
 
} 

We will use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The Apache Commons contain an IntegerValidator class with additional useful functionalities. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

public static String validateInt(String text){ 
      IntegerValidator intValidator = IntegerValidator.getInstance(); 
      if(intValidator.isValid(text)){ 
            return text + " is a valid integer"; 
      }else{ 
            return text + " is not a valid integer"; 
      }      
} 

We again use the following method calls to test our method:

validateInt("1234"); 
validateInt("Ishmael"); 

The output follows:

1234 is a valid integer
Ishmael is not a valid integer

The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a ranger of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.

Validating dates

Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.

To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:

 
public static String validateDate(String theDate, String dateFormat){ 
      try { 
            SimpleDateFormat format = new SimpleDateFormat(dateFormat); 
            Date test = format.parse(theDate); 
            if(format.format(test).equals(theDate)){ 
                  return theDate.toString() + " is a valid date"; 
            }else{ 
                  return theDate.toString() + " is not a valid date"; 
            } 
      } catch (ParseException e) { 
            return theDate.toString() + " is not a valid date"; 
      } 
} 

We make the following method calls to test our method:

String dateFormat = "MM/dd/yyyy"; 
out.println(validateDate("12/12/1982",dateFormat)); 
out.println(validateDate("12/12/82",dateFormat)); 
out.println(validateDate("Ishmael",dateFormat)); 

The output follows:

12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date

This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.

Validating e-mail addresses

It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:

  • myemail@mail.com
  • MyEmail@some.mail.com
  • My.Email.123!@mail.net

One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.

We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

public static String validateEmail(String email) { 
      String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" +
          "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + 
          "[0-9]{1,3}\\])|(([a-zAZ\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; 
      Pattern.compile(emailRegex); 
      Matcher matcher = pattern.matcher(email); 
      if(matcher.matches()){ 
            return email + " is a valid email address"; 
      }else{ 
            return email + " is not a valid email address"; 
      } 
} 

We make the following method calls to test our data:

out.println(validateEmail("myemail@mail.com")); 
out.println(validateEmail("My.Email.123!@mail.net")); 
out.println(validateEmail("myEmail")); 

The output follows:

myemail@mail.com is a valid email address
My.Email.123!@mail.net is a valid email address
myEmail is not a valid email address

There is a standard Java library for validating e-mail addresses as well. In this example, we use the InternetAddress class to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email){ 
        try{ 
            InternetAddress testEmail = new InternetAddress(email); 
            testEmail.validate(); 
            return email + " is a valid email address"; 
        }catch(AddressException e){ 
            return email + " is not a valid email address"; 
        } 
    } 

When tested against the same data as in the previous example, our output is identical. However, consider the following method call:

    out.println(validateEmailStandard("myEmail@mail")); 
 

Despite not being in standard e-mail format, the output is as follows:

myEmail@mail is a valid email address

Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.

One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

public static String validateEmailApache(String email){ 
      email = email.trim(); 
      EmailValidator eValidator = EmailValidator.getInstance(); 
      if(eValidator.isValid(email)){ 
            return email + " is a valid email address."; 
      }else{ 
            return email + " is not a valid email address."; 
      } 
} 
 

Validating ZIP codes

Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

public static void validateZip(String zip){ 
      String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; 
      Pattern pattern = Pattern.compile(zipRegex); 
      Matcher matcher = pattern.matcher(zip); 
      if(matcher.matches()){ 
            out.println(zip + " is a valid zip code"); 
      }else{ 
            out.println(zip + " is not a valid zip code"); 
      } 
} 

We make the following method calls to test our data:

out.println(validateZip("12345")); 
out.println(validateZip("12345-6789")); 
out.println(validateZip("123")); 

The output follows:

12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code

Validating names

Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L} provides this flexibility. We also use  \\s-', to allow spaces, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

public static void validateName(String name){ 
      String nameRegex = "^[\\p{L}\\s-',]+$"; 
      Pattern pattern = Pattern.compile(nameRegex); 
      Matcher matcher = pattern.matcher(name); 
      if(matcher.matches()){ 
            out.println(name + " is a valid name"); 
      }else{ 
            out.println(name + " is not a valid name"); 
      } 
} 

We make the following method calls to test our data:

validateName("Bobby Smith, Jr."); 
validateName("Bobby Smith the 4th"); 
validateName("Albrecht Müller"); 
validateName("François Moreau"); 

The output follows:

Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name

Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.

Removing stop words

Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words are there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods—and is not the same as AND or And, although all three may be stop words you wish to eliminate:

Scanner readStop = new Scanner(new File("C://stopwords.txt")); 
// Split the cleaned text into individual words 
ArrayList<String> words = new ArrayList<String>( 
    Arrays.asList(dirtyText.split(" "))); 
out.println("Original clean text: " + words.toString()); 

We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:

ArrayList<String> foundWords = new ArrayList<>(); 
while(readStop.hasNextLine()){ 
  String stopWord = readStop.nextLine().toLowerCase().trim(); 
  if(words.contains(stopWord)){ 
    foundWords.add(stopWord); 
  } 
} 
words.removeAll(foundWords); 
out.println("Text without stop words: " + words.toString()); 

The output will depend upon the words designated as stop words. If your stop words file contains different words than those used in this example, your output will differ. Our truncated output follows:

Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely

There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized to lowercase and trimmed. Then we create a new instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin processing within the text, and the last parameter specifies how many characters to process:

text = text.toLowerCase().trim(); 
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; 
fact = new EnglishStopTokenizerFactory(fact); 
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); 
for(String word : tok){ 
  out.print(word + " "); 
} 

The output follows; the first line shows the original text for comparison:

Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .

Notice the differences between our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.

Finding words in text

The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
int count = 0; 

Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

 
if(dirtyText.contains(toFind)){ 
      String[] words = dirtyText.split(" "); 
      for(String word : words){ 
            if(word.equals(toFind)){ 
                  count++; 
            } 
      } 
      out.println("Found " + toFind + " " + count + " times in the text."); 
} 

In this example, we set toFind to the letter I. This produced the following output:

Found i 2 times in the text.

We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner will be searched by default:

dirtyText = dirtyText.toLowerCase().trim();  
toFind = toFind.toLowerCase().trim(); 
Scanner textLine = new Scanner(dirtyText); 
out.println("Found " + textLine.findWithinHorizon(toFind, 10)); 

This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
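If we do need a count, one option is to call findWithinHorizon in a loop; each successful call advances the Scanner past the match, so the loop ends when no further match is found. A minimal sketch follows; note that, unlike our earlier word-by-word comparison, this counts every substring match, because findWithinHorizon treats its first argument as a regular expression:

dirtyText = dirtyText.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
Scanner scanText = new Scanner(dirtyText); 
int count = 0; 
// Each successful call consumes input up to and including the match 
while(scanText.findWithinHorizon(toFind, 0) != null){ 
      count++; 
} 
out.println("Found " + toFind + " " + count + " times in the text."); 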

It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as readLine continues to return lines:

String path = "C:// MobyDick.txt"; 
try { 
    String textLine = ""; 
    toFind = toFind.toLowerCase().trim(); 
    BufferedReader textToClean = new BufferedReader( 
        new FileReader(path)); 
    while((textLine = textToClean.readLine()) != null){ 
        line++; 
        if(textLine.toLowerCase().trim().contains(toFind)){ 
            out.println("Found " + toFind + " in " + textLine); 
           } 
    } 
    textToClean.close(); 
} catch (IOException ex) { 
    // Handle exceptions 
} 

We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:

Found i in Call me Ishmael...

Finding and replacing text

We often not only want to find text but also to replace it with something else. We begin our next example much like we did the previous ones, by specifying our text, the text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string. Because replaceAll treats its first argument as a regular expression, we wrap the search term in \\b word boundaries so that only whole words are replaced:

text = text.toLowerCase().trim(); 
toFind = toFind.toLowerCase().trim(); 
out.println(text); 
 
if(text.contains(toFind)){ 
      // \b ensures only whole words are replaced 
      text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); 
      out.println(text); 
} 

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:

call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.

Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:

out.println(text); 
out.println(StringUtils.replace(text, "me", "X")); 

The truncated output follows:

Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -

Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This could be avoided by adding spaces around me, but that approach would miss any occurrence followed by punctuation, such as me. at the end of a sentence. We will examine a better alternative using Google Guava in a moment; the sketch below shows a standard-library workaround first.
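As the replaceAll example above suggested, the standard library's \\b word boundary offers a cleaner fix than padding with spaces. A short sketch with a literal search term:

out.println(text); 
// \b matches word boundaries, so the me in some is left alone 
out.println(text.replaceAll("\\bme\\b", "X")); 

This replaces me in call me but leaves some, and any trailing punctuation, untouched.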

The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace any non-word character that is followed by whitespace, which removes the hyphens, commas, and periods in our text and leaves a single space in their place:

out.println(text); 
text = StringUtils.replacePattern(text, "\\W\\s", " "); 
out.println(text); 

This will produce the following truncated output:

Call me Ishmael. Some years ago- never mind how long precisely - 
Call me Ishmael Some years ago never mind how long precisely

Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.
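As a brief aside, retainFrom and removeFrom illustrate the retain-or-remove options just mentioned. This sketch uses the constant-style CharMatcher fields from the Guava version used in this chapter; newer Guava releases prefer factory methods such as CharMatcher.digit():

String mixed = "abc123def456"; 
// Keep only the characters the matcher accepts 
out.println(CharMatcher.DIGIT.retainFrom(mixed)); 
// Remove the matching characters instead 
out.println(CharMatcher.DIGIT.removeFrom(mixed)); 

This prints 123456 followed by abcdef.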

In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of extra spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

text = text.replace("me", " "); 
out.println("With double spaces: " + text); 
String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' '); 
out.println("With double spaces removed: " + spaced); 

Our output is truncated as follows:

With double spaces: Call Ishmael. So years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...

Data imputation

Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial data analysis, missing data will be an issue, and it needs to be addressed before the data can be properly analyzed. Trying to process data that is missing information is a lot like trying to understand a conversation in which every once in a while a word is dropped. Sometimes we can understand what is intended; in other situations, we may be completely lost as to what is being conveyed.

Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing it with a reasonable estimate or with an empty or null value.

To prevent skewing and misalignment of data, many statisticians advocate replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data, and we cannot illustrate every case in this chapter. For example, if a dataset contained a list of temperatures across a range of dates, and one date was missing its temperature, that date could be assigned the average of the temperatures within the dataset.

We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

   double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; 
   double sum = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 

Notice that for the numbers used in this execution, the output is as follows:

The average temperature is 70.33

Next, we will mimic missing data by changing the first element of our array to zero before we calculate our sum:

   double sum = 0; 
   tempList[0] = 0; 
   for(double d : tempList){ 
         sum += d; 
   } 
   out.printf("The average temperature is %1$,.2f", sum/12); 
 

This will change the average temperature displayed in our output:

The average temperature is 66.17

Notice that while this change may seem rather minor, it can substantially distort the result. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.

One alternative approach is to calculate the average of the values in the array, excluding zeros or nulls, and then substitute that average in each position with missing data, as the sketch below illustrates. It is important to consider the type of data and the purpose of the analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not, if the temperatures were averages for Antarctica.
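Here is a minimal sketch of that approach, assuming, as in our example, that zero marks a missing reading and is otherwise invalid:

double[] tempList = {0,56,65,70,74,80,82,90,83,78,64,52}; 
double sum = 0; 
int count = 0; 
// Average only the known (non-zero) readings 
for(double d : tempList){ 
      if(d != 0){ 
            sum += d; 
            count++; 
      } 
} 
double mean = sum / count; 
// Substitute the mean into each missing position 
for(int i = 0; i < tempList.length; i++){ 
      if(tempList[i] == 0){ 
            tempList[i] = mean; 
      } 
} 
out.printf("The imputed average temperature is %1$,.2f", mean); 

The eleven known values average 72.18, so the missing reading is replaced with that value and the overall average is no longer dragged down by the zero.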

When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

   String useName = ""; 
   String[] nameList =
         {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; 
   Optional<String> tempName; 
   for(String name : nameList){ 
         tempName = Optional.ofNullable(name); 
         useName = tempName.orElse("DEFAULT"); 
         out.println("Name to use = " + useName); 
   } 

We first created a variable called useName to hold the name we will actually print out. We also created an instance of the Optional class called tempName. We will use this to test whether a value in the array is null or not. We then loop through our array and call the Optional class ofNullable method. This method returns an Optional containing the given value, or an empty Optional if the value is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy

The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
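For instance, orElseGet defers computing the fallback until it is actually needed, and ifPresent runs an action only when a value exists. A brief sketch, separate from the preceding example:

Optional<String> missingName = Optional.ofNullable(null); 
// The supplier is invoked only because the Optional is empty 
out.println(missingName.orElseGet(() -> "DEFAULT")); 
Optional<String> presentName = Optional.of("Amy"); 
// The consumer runs only when a value is present 
presentName.ifPresent(name -> out.println("Hello, " + name)); 

This prints DEFAULT followed by Hello, Amy.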

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then create a new TreeSet object to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>( 
    new ArrayList<>(Arrays.asList(nums))); 
SortedSet<Integer> partNumsList; 
out.println("Original List: " + fullNumsList.toString()); 

The subSet method takes two parameters, which specify the range of elements we want to retrieve. The first parameter is inclusive while the second is exclusive. In the example that follows, we retrieve the subset ranging from the first (lowest) number in our set, 12, up to but not including 46:

 
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()); 

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 
SubSet of List: [12, 14, 34, 44]
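The SortedSet interface also offers headSet and tailSet, which are sometimes more convenient than subSet when only one bound matters. A short sketch against the same fullNumsList:

// Elements strictly less than 46 
out.println("Head: " + fullNumsList.headSet(46).toString()); 
// Elements greater than or equal to 46 
out.println("Tail: " + fullNumsList.tailSet(46).toString()); 

This prints [12, 14, 34, 44] and [46, 52, 87, 123] respectively.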

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance, which iterates over the set. This time, we first keep the nums array in a List called numsList so that we can print the values in their original order, and then we specify how many elements to skip with the skip method. We also use the collect method, with Collectors.toCollection, to create a new Set to hold the remaining elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); 
Set<Integer> partNumsList = fullNumsList 
         .stream() 
         .skip(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("SubSet of List: " + partNumsList.toString());  

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
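The stream's limit method is the natural complement to skip: it keeps the first n elements of the sorted set instead of discarding them. A short sketch, using the same fullNumsList:

Set<Integer> firstFive = fullNumsList 
         .stream() 
         .limit(5) 
         .collect(toCollection(TreeSet::new)); 
out.println("First five: " + firstFive.toString()); 

With our data, this retains the five lowest numbers: [12, 14, 34, 44, 46].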

At times, data will begin with blank lines or header line