Data analysis is the process of organizing, cleaning, transforming, and modeling data to obtain useful information and, ultimately, new knowledge. The terms data analytics, business analytics, data mining, artificial intelligence, machine learning, knowledge discovery, and big data are also used to describe similar processes. The distinctions among these fields probably lie more in their areas of application than in their fundamental nature. Some argue that these are all part of the new discipline of data science.
The central process of gaining useful information from organized data is carried out by applying algorithms from computer science. Consequently, these algorithms will be a central focus of this book.
Data analysis is both an old field and a new one. Its origins lie among the mathematical fields of numerical methods and statistical analysis, which reach back into the eighteenth century. But many of the methods that we shall study gained prominence much more recently, with the ubiquitous force of the internet and the consequent availability of massive datasets.
In this first chapter, we look at a few famous historical examples of data analysis. These can help us appreciate the importance of the science and its promise for the future.
Data is as old as civilization itself, maybe even older. The 17,000-year-old paintings in the Lascaux caves in France could well have been attempts by those primitive dwellers to record their greatest hunting triumphs. Those records provide us with data about humanity in the Paleolithic era. That data was not analyzed, in the modern sense, to obtain new knowledge. But its existence does attest to the need humans have to preserve their ideas in data.
Five thousand years ago, the Sumerians of ancient Mesopotamia recorded far more important data on clay tablets. That cuneiform writing included substantial accounting data about daily business transactions. To apply that data, the Sumerians invented not only text writing, but also the first number system.
In 1086, King William the Conqueror ordered a massive collection of data to determine the extent of the lands and properties of the crown and of his subjects. This was called the Domesday Book, because it was a final tallying of people's (material) lives. That data was analyzed to determine ownership and tax obligations for centuries to follow.
On November 11, 1572, a young Danish nobleman named Tycho Brahe observed the supernova of a star that we now call SN 1572. From that time until his death 30 years later, he devoted his wealth and energies to the accumulation of astronomical data. His young German assistant, Johannes Kepler, spent 18 years analyzing that data before he finally completed his three laws of planetary motion in 1618, when he formulated the third.
Historians of science usually regard Kepler's achievement as the beginning of the Scientific Revolution. Here were the essential steps of the scientific method: observe nature, collect the data, analyze the data, formulate a theory, and then test that theory with more data. Note the central step here: data analysis.
Of course, Kepler did not have either of the modern tools that data analysts use today: algorithms and computers on which to implement them. He did, however, apply one technological breakthrough that surely facilitated his number crunching: logarithms. In 1620, he stated that Napier's invention of logarithms in 1614 had been essential to his discovery of the third law of planetary motion.
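To see why logarithms helped, recall that the third law says the square of a planet's orbital period T is proportional to the cube of its mean distance a from the sun. With T measured in years and a in astronomical units, taking logarithms turns that into the simple relation log T = 1.5 log a. The following is a minimal sketch of that check, not a reconstruction of Kepler's own computation: the class name, the method name, and the rounded modern orbital values are assumptions chosen for illustration.

```java
public class KeplerThirdLaw {

    // Rounded modern values, assumed for illustration (not Brahe's observations).
    static final String[] PLANETS = {"Mercury", "Venus", "Mars", "Jupiter", "Saturn"};
    static final double[] DISTANCE_AU = {0.387, 0.723, 1.524, 5.203, 9.537};
    static final double[] PERIOD_YEARS = {0.241, 0.615, 1.881, 11.862, 29.457};

    /** Returns log(T) / log(a), which the third law predicts to be 1.5. */
    public static double exponent(double distanceAu, double periodYears) {
        return Math.log10(periodYears) / Math.log10(distanceAu);
    }

    public static void main(String[] args) {
        // Earth is omitted: a = T = 1 gives log 1 = 0, so the ratio is undefined.
        for (int i = 0; i < PLANETS.length; i++) {
            double k = exponent(DISTANCE_AU[i], PERIOD_YEARS[i]);
            System.out.printf("%-8s log T / log a = %.3f%n", PLANETS[i], k);
        }
    }
}
```

For every planet listed, the ratio comes out within about 0.01 of 1.5. With a table of logarithms, each such check reduces to a lookup and a single division.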
Kepler's achievements had a profound effect upon Galileo Galilei a generation later, and upon Isaac Newton a generation after him. Both men practiced the scientific method with spectacular success.
One of Newton's few friends was Edmund Halley, the man who first computed the orbit of his eponymous comet. Halley was a polymath, with expertise in astronomy, mathematics, physics, meteorology, geophysics, and cartography.
In 1693, Halley analyzed mortality data that had been compiled by Caspar Neumann in Breslau, Germany. Like Kepler's work with Brahe's data 90 years earlier, Halley's analysis led to new knowledge. His published results allowed the British government to sell life annuities at the appropriate price, based on the age of the annuitant.
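The idea behind pricing an annuity by age can be sketched in a few lines. A life annuity pays a fixed amount each year for as long as the annuitant lives, so a fair price is the sum, over future years, of the probability of surviving to that year multiplied by the discounted value of that year's payment. The code below is only an illustration under stated assumptions: the class and method names are hypothetical, the survivor counts are invented (they are not Neumann's Breslau data), the interest rate of 6% is assumed, and the table is cut off after five years to keep the example short.

```java
public class LifeAnnuity {

    /**
     * Expected present value of paying 1 unit at the end of each year
     * for as long as the annuitant is alive.
     *
     * @param alive alive[t] is the number of people still alive t years
     *              after purchase, so alive[0] is the starting group
     * @param rate  annual interest rate, for example 0.06
     */
    public static double price(int[] alive, double rate) {
        double value = 0.0;
        double discount = 1.0;
        for (int t = 1; t < alive.length; t++) {
            discount /= (1.0 + rate);                        // discount factor for year t
            double survival = (double) alive[t] / alive[0];  // chance of collecting payment t
            value += survival * discount;
        }
        return value;
    }

    public static void main(String[] args) {
        // Hypothetical survivor counts, invented for illustration only.
        int[] younger = {100, 98, 96, 94, 92, 90};
        int[] older = {100, 90, 80, 70, 60, 50};
        System.out.printf("Younger annuitant: %.2f%n", price(younger, 0.06));
        System.out.printf("Older annuitant:   %.2f%n", price(older, 0.06));
    }
}
```

Because the older group's survival probabilities fall faster, its annuity is worth less, which is the age dependence the paragraph above describes.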
Most data today is still numeric. But most of the algorithms we will be studying apply to a much broader range of possible values, including text, images, audio and video files, and even complete web pages on the internet.
In 1821, a young Cambridge student named Charles Babbage was poring over some trigonometric and logarithmic tables that had been recently computed by hand. When he realized how many errors they had, he exclaimed, "I wish to God these calculations had been executed by steam." He was suggesting that the tables could have been computed automatically by some mechanism that would be powered by a steam engine.
Babbage was a mathematician by vocation, holding the same Lucasian Chair of Mathematics at Cambridge University that Isaac Newton had held 150 years earlier and that Stephen Hawking would hold 150 years later. However, he spent a large part of his life working on automatic computing. Having invented the idea of a programmable computer, he is generally regarded as the first computer scientist. His assistant, Lady Ada Lovelace, has been recognized as the first computer programmer.
Babbage's goal was to build a machine that could analyze data to obtain useful information, the central step of data analysis. By automating that step, it could be carried out on much larger datasets and much more rapidly. His interest in trigonometric and logarithmic tables was related to his objective of improving methods of navigation, which was critical to the expanding British Empire.
In 1854, cholera broke out among the poor in London. The epidemic spread quickly, partly because nobody knew the source of the problem. But a physician named John Snow suspected it was caused by contaminated water. At that time, most Londoners drew their water from public wells that were supplied directly from the River Thames. The following figure shows the map that Snow drew, with black rectangles indicating the frequencies of cholera occurrences:
If you look closely, you can also see the locations of nine public water pumps, marked as black dots and labeled PUMP. From this data, we can easily see that the pump at the corner of Broad Street and Cambridge Street is in the middle of the epidemic. This data analysis led Snow to investigate the water supply at that pump, discovering that raw sewage was leaking into it through a break in the pipe.
By also locating the public pumps on the map, Snow demonstrated that the source was probably the pump at the corner of Broad Street and Cambridge Street. This was one of the first great examples of the successful application of data analysis to public health (for more information, see https://www1.udel.edu/johnmack/frec682/cholera/cholera2.html). President James K. Polk and composer Pyotr Ilyich Tchaikovsky were among the millions who died from cholera in the nineteenth century. But even today the disease is still a pandemic, killing around 100,000 people per year worldwide.
The decennial United States Census was mandated by the U.S. Constitution in 1789 for the purposes of apportioning representatives and taxes. The first census was taken in 1790 when the U.S. population was under four million. It simply counted free men. But by 1880, the country had grown to over 50 million, and the census itself had become much more complicated, recording dependents, parents, places of birth, property, and income.
The 1880 census took over eight years to compile. The United States Census Bureau realized that some sort of automation would be required to complete the 1890 census. They hired a young engineer named Herman Hollerith, who had proposed a system of electromechanical tabulating machines that would use punched cards to record the data.
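The kind of tally such a machine produced is easy to express in modern code: each card holds one person's record, and a counter is advanced for every card that carries a given value in a given column. The sketch below is purely illustrative. The class and method names are hypothetical, and the records are invented rather than taken from any census.

```java
import java.util.Map;
import java.util.TreeMap;

public class CensusTabulator {

    /** Counts how many cards carry each distinct value in the given column. */
    public static Map<String, Integer> tally(String[][] cards, int column) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] card : cards) {
            // Add 1 to the counter for this value, starting from 1 if it is new.
            counts.merge(card[column], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented records: {state of residence, place of birth, occupation}.
        String[][] cards = {
            {"NY", "Ireland", "laborer"},
            {"NY", "New York", "clerk"},
            {"PA", "Germany", "farmer"},
            {"NY", "Germany", "laborer"}
        };
        System.out.println("By state:      " + tally(cards, 0));
        System.out.println("By occupation: " + tally(cards, 2));
    }
}
```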
Hollerith was awarded a Ph.D. from Columbia University for his achievement. The company he founded to market his machines merged with several others in 1911 to form the Computing-Tabulating-Recording Company, which became the International Business Machines Corporation (IBM) in 1924. Recently IBM built the supercomputer Watson, which was probably the most successful commercial application of data mining and artificial intelligence yet produced.
During World War II, the U.S. Navy had battleships with guns that could shoot 2700-pound projectiles 24 miles. At that range, a projectile spent almost 90 seconds in flight. In addition to the guns' elevation, angle of amplitude, and initial speed of propulsion, those trajectories were also affected by the motion of the ship, the weather conditions, and even the earth's rotation. Accurate calculations of those trajectories posed great problems.
To solve these computational problems, the U.S. Army contracted an engineering team at the University of Pennsylvania to build the Electronic Numerical Integrator and Computer (ENIAC), the first complete electronic programmable digital computer. Although not completed until after the war was over, it was a huge success.
It was also enormous, occupying a large room and requiring a staff of engineers and programmers to operate. The input and output data for the computer were recorded on Hollerith cards. These could be read automatically by other machines that could then print their contents.
ENIAC played an important role in the development of the hydrogen bomb. Instead of artillery tables, it was used to simulate the first test run for the project. That involved over a million cards.
In 1979, Harvard student Dan Bricklin was watching his professor correct entries in a table of finance data on a chalkboard. After correcting a mistake in one entry, the professor proceeded to correct the corresponding marginal entries. Bricklin realized that such tedious work could be done much more easily and accurately on his new Apple II microcomputer. This resulted in his invention of VisiCalc, the first spreadsheet computer program for microcomputers. Many agree that that innovation transformed the microcomputer from a hobbyist's game platform to a serious business tool.
The consequence of Bricklin's VisiCalc was a paradigm shift in commercial computing. Spreadsheet calculations, an essential form of commercial data processing, had until then required very large and expensive mainframe computing centers. Now they could be done by a single person on a personal computer. When the IBM PC was released two years later, VisiCalc was regarded as essential software for business and accounting.
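The bookkeeping that a spreadsheet automates can be sketched briefly: when one entry in a table is corrected, every marginal total that depends on it is recomputed without further effort from the user. The following toy class is an assumption made for illustration only. Its name and methods are hypothetical and say nothing about how VisiCalc itself was implemented. It simply recomputes row and column totals from the current entries whenever they are requested.

```java
public class MiniSheet {

    private final double[][] cells;

    public MiniSheet(int rows, int columns) {
        cells = new double[rows][columns];
    }

    /** Stores a value in one cell, replacing whatever was there. */
    public void set(int row, int column, double value) {
        cells[row][column] = value;
    }

    /** Marginal total of one row, recomputed from the current entries. */
    public double rowTotal(int row) {
        double sum = 0.0;
        for (double v : cells[row]) {
            sum += v;
        }
        return sum;
    }

    /** Marginal total of one column, recomputed from the current entries. */
    public double columnTotal(int column) {
        double sum = 0.0;
        for (double[] row : cells) {
            sum += row[column];
        }
        return sum;
    }

    public static void main(String[] args) {
        MiniSheet sheet = new MiniSheet(2, 2);
        sheet.set(0, 0, 100);
        sheet.set(0, 1, 250);
        sheet.set(1, 0, 300);
        sheet.set(1, 1, 400);
        System.out.println("Row 0 total before correction: " + sheet.rowTotal(0));
        sheet.set(0, 1, 205);  // correct a mistyped entry
        System.out.println("Row 0 total after correction:  " + sheet.rowTotal(0));
        System.out.println("Column 1 total after:          " + sheet.columnTotal(1));
    }
}
```

Only the single `set` call is needed to make the correction; the totals follow from it, which is the tedious step the professor had to carry out by hand.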
The 1854 cholera epidemic case is a good example for understanding the differences between data, information, and knowledge. The data that Dr. Snow used, the locations of cholera outbreaks and water pumps, was already available. But the connection between them had not yet been discovered. By plotting both datasets on the same city map, he was able to determine that the pump at Broad Street and Cambridge Street was the source of the contamination. That connection was new information. That finally led to the new knowledge that the disease is transmitted by foul water, and thus the new knowledge of how to prevent the disease.
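The step from data to information described above can be imitated in code. Snow made the connection by eye on his map; the sketch below swaps in a simple nearest-neighbor count instead, assigning each recorded case to the closest pump and tallying the results. Everything specific in it is an assumption for illustration: the class and method names are hypothetical, the grid coordinates are invented rather than read from Snow's map, and "Pump B" and "Pump C" are placeholder names.

```java
public class NearestPump {

    /** Returns, for each pump, the number of cases that lie closest to it. */
    public static int[] casesPerPump(double[][] pumps, double[][] cases) {
        int[] counts = new int[pumps.length];
        for (double[] c : cases) {
            int nearest = 0;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < pumps.length; i++) {
                // Straight-line distance from this case to pump i.
                double d = Math.hypot(c[0] - pumps[i][0], c[1] - pumps[i][1]);
                if (d < best) {
                    best = d;
                    nearest = i;
                }
            }
            counts[nearest]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented grid coordinates, not taken from Snow's map.
        String[] names = {"Broad Street", "Pump B", "Pump C"};
        double[][] pumps = {{5, 5}, {1, 9}, {9, 1}};
        double[][] cases = {{5, 6}, {4, 5}, {6, 4}, {5, 4}, {2, 8}, {6, 6}, {8, 2}};
        int[] counts = casesPerPump(pumps, cases);
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + ": " + counts[i] + " cases");
        }
    }
}
```

The two input arrays are the data. The count per pump is the information. Concluding that contaminated water transmits the disease is the knowledge, and that step still required Snow's investigation of the pump itself.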
Java runs the same way on all computers
It supports the object-oriented programming (OOP) paradigm
Its Javadoc documentation is easy to access and use
Much open-source software is written in Java, including much of what is used for data analysis
Java was released in 1995 by a team led by James Gosling at Sun Microsystems. In 2010, the Oracle Corporation bought Sun for $7.4 billion and has supported Java since then. The current version is Java 8, released in 2014. But by the time you buy this book, Java 9 should be available; it is scheduled to be released in late 2017.
As the title of this book suggests, we will be using Java in all our examples.
The popular Java integrated development environments (IDEs) are quite similar in how they work, so once you have used one, it's easy to switch to another.
Although all the Java examples in this book can be run at the command line, we will instead show them running on NetBeans. This has several advantages, including:
Code listings include line numbers
Standard indentation rules are followed automatically
Code syntax is colored automatically
Here is the standard Hello World program in NetBeans:
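For readers following along without the IDE, a minimal version of such a program might look like the following. The class name `HelloWorld` and the exact greeting text are assumptions; any names would do.

```java
public class HelloWorld {
    public static void main(String[] args) {
        // Print a greeting to standard output
        System.out.println("Hello, World!");
    }
}
```

It can also be compiled and run at the command line with `javac HelloWorld.java` followed by `java HelloWorld`.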
When you run this program in NetBeans, you will see some of its syntax coloring: gray for comments, blue for reserved words, green for objects, and orange for strings.
Or, sometimes we'll show just the main() method, like this:
Nevertheless, all the complete source code files are available for download at the Packt Publishing website.
Here is the output from the Hello World program:
The first part of this chapter described some important historical events that have led to the development of data analysis: ancient commercial record keeping, royal compilations of land and property, and accurate mathematical models in astronomy, physics, and navigation. It was this activity that led Babbage to invent the computer. Data analysis was born of necessity in the advance of civilization, from the identification of the source of cholera, to the management of economic data, and the modern processing of massive datasets.
This chapter also briefly explained our choice of the Java programming language for the implementation of the data analysis algorithms to be studied in this book. And finally, it introduced the NetBeans IDE, which we will also use throughout the book.