Reader small image

You're reading from  Java for Data Science

Product typeBook
Published inJan 2017
Reading LevelIntermediate
PublisherPackt
ISBN-139781785280115
Edition1st Edition
Languages
Concepts
Right arrow
Authors (2):
Richard M. Reese
Richard M. Reese
author image
Richard M. Reese

Richard Reese has worked in the industry and academics for the past 29 years. For 10 years he provided software development support at Lockheed and at one point developed a C based network application. He was a contract instructor providing software training to industry for 5 years. Richard is currently an Associate Professor at Tarleton State University in Stephenville Texas. Richard is the author of various books and video courses some of which are as follows: Natural Language Processing with Java. Java for Data Science Getting Started with Natural Language Processing in Java
Read more about Richard M. Reese

Jennifer L. Reese
Jennifer L. Reese
author image
Jennifer L. Reese

Jennifer L. Reese studied computer science at Tarleton State University. She also earned her M.Ed. from Tarleton in December 2016. She currently teaches computer science to high-school students. Her interests include the integration of computer science concepts with other academic disciplines, increasing diversity in computer science courses, and the application of data science to the field of education. She has co-authored two books: Java for Data Science and Java 7 New Features Cookbook. She previously worked as a software engineer. In her free time she enjoys reading, cooking, and traveling—especially to any destination with a beach. She is a musician and appreciates a variety of musical genres.
Read more about Jennifer L. Reese

View More author details
Right arrow

Chapter 2. Data Acquisition

It is never much fun to work with code that is not formatted properly or uses variable names that do not convey their intended purpose. The same can be said of data, except that bad data can result in inaccurate results. Thus, data acquisition is an important step in the analysis of data. Data is available from a number of sources but must be retrieved and ultimately processed before it can be useful. It is available from a variety of sources. We can find it in numerous public data sources as simple files, or it may be found in more complex forms across the Internet. In this chapter, we will demonstrate how to acquire data from several of these, including various Internet sites and several social media sites.

We can access data from the Internet by downloading specific files or through a process known as web scraping, which involves extracting the contents of a web page. We also explore a related topic known as web crawling, which involves applications that examine...

Understanding the data formats used in data science applications


When we discuss data formats, we are referring to content format, as opposed to the underlying file format, which may not even be visible to most developers. We cannot examine all available formats due to the vast number of formats available. Instead, we will tackle several of the more common formats, providing adequate examples to address the most common data retrieval needs. Specifically, we will demonstrate how to retrieve data stored in the following formats:

  • HTML

  • PDF

  • CSV/TSV

  • Spreadsheets

  • Databases

  • JSON

  • XML

Some of these formats are well supported and documented elsewhere. For example, XML has been in use for years and there are several well-established techniques for accessing XML data in Java. For these types of data, we will outline the major techniques available and show a few examples to illustrate how they work. This will provide those readers who are not familiar with the technology some insight into their nature.

The most...

Data acquisition techniques


In this section, we will illustrate how to acquire data from web pages. Web pages contain a potential bounty of useful information. We will demonstrate how to access web pages using several technologies, starting with a low-level approach supported by the HttpUrlConnection class. To find pages, a web crawler application is often used. Once a useful page has been identified, then information needs to be extracted from the page. This is often performed using an HTML parser. Extracting this information is important because it is often buried amid a clutter of HTML tags and JavaScript code.

Using the HttpUrlConnection class

The contents of a web page can be accessed using the HttpUrlConnection class. This is a low-level approach that requires the developer to do a lot of footwork to extract relevant content. However, he or she is able to exercise greater control over how the content is handled. In some situations, this approach may be preferable to using other API libraries...

Summary


In this chapter, we discussed types of data that are useful for data science and readily accessible on the Internet. This discussion included details about file specifications and formats for the most common types of data sources.

We also examined Java APIs and other techniques for retrieving data, and illustrated this process with multiple sources. In particular, we focused on types of text-based document formats and multimedia files. We used web crawlers to access websites and then performed web scraping to retrieve data from the sites we encountered.

Finally, we extracted data from social media sites and examined the available Java support. We retrieved data from Twitter, Wikipedia, Flickr, and YouTube and examined the available API support.

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Java for Data Science
Published in: Jan 2017Publisher: PacktISBN-13: 9781785280115
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (2)

author image
Richard M. Reese

Richard Reese has worked in the industry and academics for the past 29 years. For 10 years he provided software development support at Lockheed and at one point developed a C based network application. He was a contract instructor providing software training to industry for 5 years. Richard is currently an Associate Professor at Tarleton State University in Stephenville Texas. Richard is the author of various books and video courses some of which are as follows: Natural Language Processing with Java. Java for Data Science Getting Started with Natural Language Processing in Java
Read more about Richard M. Reese

author image
Jennifer L. Reese

Jennifer L. Reese studied computer science at Tarleton State University. She also earned her M.Ed. from Tarleton in December 2016. She currently teaches computer science to high-school students. Her interests include the integration of computer science concepts with other academic disciplines, increasing diversity in computer science courses, and the application of data science to the field of education. She has co-authored two books: Java for Data Science and Java 7 New Features Cookbook. She previously worked as a software engineer. In her free time she enjoys reading, cooking, and traveling—especially to any destination with a beach. She is a musician and appreciates a variety of musical genres.
Read more about Jennifer L. Reese