The activity of analyzing data is as old as human culture. The earliest known form of writing is not an epic poem or religious text, but data. The Ishango bone is an engraved fibula of a baboon which was carved in central Africa 20,000 years ago. Some scholars hypothesize that the carvings represent an early number system, as they list several prime numbers, while others believe it to be a calendar. Some researchers dismiss these ideas and believe the markings merely improve grip when using the bone as a club. Whatever their purpose, the groupings of the markings are distinctly mathematical, as shown in Figure 1.1. (Pletser, V. (2012). Does the Ishango Bone Indicate Knowledge of the Base 12? An Interpretation of a Prehistoric Discovery, the First Mathematical Tool of Humankind. Eprint—https://arxiv.org/abs/1204.1019)
The following image shows the markings on the Ishango Bone:
Figure 1.1: Markings on the Ishango Bone
Ancient cultures around the world collected data by observing nature and the stars to predict when they needed to move camp, start sowing crops, or hunt seasonal animals, and to obtain whatever other knowledge they required for survival. These proto-scientific methods were the first attempts at science, as these early researchers collected data to explain the world in logical terms. These primitive forms of science helped these people to understand their world and control their destiny, which is precisely what contemporary science seeks to achieve.
Mathematics was an integral part of ancient civilizations. Sumeria, Egypt, Rome, and other advanced ancient civilizations used mathematics to manage their societies and build their elaborate cities. The origins of civilization as we now know it lie in Mesopotamia, current-day Iraq. Archaeologists have excavated thousands of clay tablets that record the day-to-day activities of its inhabitants, such as land sales, delivery of goods, and other commercial transactions. Around that same time in Pharaonic Egypt, the first census took place, recording demographic data about its inhabitants. (Kelleher, J.D., & Tierney, B. (2018). Data science. Cambridge, Massachusetts: The MIT Press) These examples show that collecting data and using it to control and improve our world is an ancient human activity.
This time was also a period of the first significant mathematical discoveries and inventions. Mathematics was, however, more than a language to model the world. To the great Ancient Greek mathematician Pythagoras, numbers possessed meaning beyond their ability to describe quantity. In these early days of intellectual exploration, divination was the most popular method to predict the future. Astrologers mapped the skies or studied the entrails of a bird to find a relationship between these patterns and their world. In these divination systems, mathematics was practiced as a tool to manage society through engineering and bookkeeping, not as a tool to describe the world.
The scientific revolution of the seventeenth century replaced divination with a mathematical approach to understanding the world. Since the work of René Descartes, mathematics has taken the form of a method to describe the world and to predict its future. (Davis, P.J., & Hersh, R. (1990). Descartes' Dream: The World According to Mathematics. London: Penguin) This revolution in how we perceive the world mathematically is what enabled the industrial revolution. Early technology enhanced our physical capabilities with machines, while modern technology improves our minds with computers. Machines make us stronger and faster, and their development revolutionized society during the first industrial revolution. Computers enhance aspects of our mental abilities, and we are in the middle of a second industrial revolution, which is not fueled by oil and coal, but by data.
The idea that data can be used to understand the world is thus almost as old as humanity itself and has gradually evolved into what we now call data science. We can use some basic data science to review the development of this term over time.
Figure 1.2: Frequency of the bi-gram 'data science' in literature and in Google searches, each shown relative to its maximum
The combination of the words data and science might seem relatively new, but the Google N-gram Viewer shows that this bi-gram has been in use since the middle of the last century. An n-gram is a sequence of n consecutive words, with a bi-gram being any pair of adjacent words. Google's n-gram viewer is a searchable database of millions of scanned books published before 2008. This database is a source for predictive text algorithms, as it contains a fantastic amount of knowledge about how people use various languages. (Google Books Ngram Viewer—https://books.google.com/ngrams/graph?content=data+science&year_start=1900&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cdata%20science%3B%2Cc0 Retrieved 25 January 2019)
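To make the idea of an n-gram concrete, the following minimal Python sketch splits a sentence into bi-grams and counts them. It is only an illustration under simple assumptions (lower-cased text, words separated by whitespace); the function name ngrams and the example sentence are invented for this purpose and say nothing about how Google builds its database.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return the n-grams in text as a list of tuples of consecutive words."""
    # Assumption: words are separated by whitespace and case is ignored.
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical example sentence, used only for illustration.
sentence = "data science is the science of using data"

# Count how often each bi-gram (n = 2) occurs in the sentence.
bigram_counts = Counter(ngrams(sentence, n=2))

for bigram, count in bigram_counts.items():
    print(" ".join(bigram), count)
```

Counting the same bi-gram across all books published in a given year, instead of a single sentence, gives the kind of frequency over time that an n-gram viewer reports.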
The n-gram database shows that the term data science emerged in the middle of the last century, when electronic computation became a topic of study. In those days, the discipline was a science about storing and manipulating data. The current definition has drifted away from this initial academic activity to a business activity.
Judging by another Google database, data science started its journey from obscurity to becoming the latest business fad only a few years ago. The Google Trends database shows the frequency of search terms over time. Google Trends reveals a steady increase in the popularity of data science as a search term, starting in 2013. (Google Trends—https://trends.google.com/trends/explore?date=all&q=data%20science Retrieved 25 January 2019)
Figure 1.2 visualizes these two trends. The horizontal axis shows the years from 1960 until recently. The vertical axis shows the relative number of occurrences compared to the maximum, which is how Google reports search numbers. In an absolute sense, the number of occurrences in books was much lower than current search volumes. While attention has risen steeply since 2012, the term started its journey toward being a business buzzword in the 1960s. Although we can speak of the recent hype, the use of the bi-gram data science shows a slow evolution, with a recent spike in interest.
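Scaling each series relative to its own maximum is what allows two sources with very different absolute volumes to share one vertical axis. The Python sketch below shows one way this scaling could be done. The numbers are invented purely for illustration and are not the values behind Figure 1.2, and the function name scale_to_peak is hypothetical.

```python
def scale_to_peak(counts):
    """Express each value as a percentage of the largest value in the series."""
    peak = max(counts)
    if peak == 0:
        # Avoid dividing by zero when the term never occurs.
        return [0.0 for _ in counts]
    return [100 * value / peak for value in counts]


# Hypothetical yearly counts, invented purely for illustration.
book_mentions = [2, 5, 10, 40]
search_volume = [1000, 4000, 20000, 50000]

# After scaling, both series peak at 100 and can share one axis.
print(scale_to_peak(book_mentions))
print(scale_to_peak(search_volume))
```

The scaled series show the shape of each trend, but they hide the difference in absolute volume, which is why the text notes that occurrences in books were much lower than search volumes.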
The expectations of the benefits of data science are very high. Business authors position data science and its natural partner, 'big data', as a panacea for all societal problems and a means to increase business profits. (Clegg, B. (2017). Big Data: How the Information Revolution Is Transforming Our Lives. Icon Books). In a 2012 article in Harvard Business Review, Davenport and Patil even proclaimed the role of data scientist as the "sexiest job of the 21st century". (Davenport, T.H., & Patil, D.J. (2012). Data scientist: The sexiest job of the 21st century—https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century Harvard Business Review, 90(10), 70–76) Who would not want to be part of a new profession with such enticing career prospects? It is not a stretch to hypothesize that their article was one of the causes of the increased search volume reported by Google.
The recent popularity of data science as a business activity suggests that it is just a fancy way of saying business analysis. In my talks about data science in Australia and New Zealand, I regularly meet fellow managers who are skeptical about the proclaimed wizard-like abilities of data scientists and the unbounded promises of machine learning. Much of the data science promise relates to the success stories of internet corporations such as Google and Facebook and many other smaller players in the digital economy. For these organizations, data science is a core competency, as their value proposition is centered around data.
For organizations that deliver physical products or non-digital services, data science is about improving how they collect, store, and analyze data to extract more value from this resource. The objective of data science is not the data itself but is closely intertwined with the strategic goals of the organization. These objectives broadly range from increasing the return to shareholders to providing benefit to society overall. Whatever the kind of organization you are in, the purpose of data science is to assist managers to change reality into a more desirable state. A data scientist achieves this by measuring the current and past states of reality and using mathematical tools to predict a future state.
The term data science is unfortunate in the way it is now used, because it is paradoxically not a science of data. A data scientist is not somebody who researches the properties of data. Other definitions see data scientists as mathematicians and computer scientists who invent new ways of analyzing data. More commonly, data science is closely related to business outcomes.
Data science is a systematic and strategic approach to using data to solve practical problems. The problems of the data scientist are practical because pure science has a different objective to business. A data scientist in an organization is less interested in a generalized solution to a problem and focuses on improving how the organization achieves its goals. Perhaps the combination of the words data and science should be reserved for academics.
There are some signals that the excitement of the past few years is waning. Data science blogger Matt Tucker has declared the death of the data scientist. (Tucker, M. (2018). The Death of the Data Scientist. Retrieved 9 February 2019 from Data Science Central—https://www.datasciencecentral.com/profiles/blogs/the-death-of-the-data-scientist) For many business problems, the hardcore methods of machine learning result in over-analyzing the issue. Tucker provides an anecdote of a group of data scientists who spent a lot of time fine-tuning a complex neural network. The data scientists gave up when a young graduate with expertise in the subject matter used a linear regression that was more accurate than their neural network.
The negative chatter on the internet about data science as a business discipline might imply that the hype is receding. We should, however, not throw the baby out with the bathwater. The recent interest in analyzing data has raised the stakes in how organizations use this valuable resource. Even after the inflated expectations recede, data science as a profession has a useful contribution to make. All data, big and small, is a resource to improve how organizations perform.
This book looks at data science as the strategic and systematic approach to the fine art of analyzing data to solve business problems. This conceptualization of data science is not a complete definition. Computational analysis of data is also practiced as a science and as a scientific method for research in many areas. This book is written from a business perspective, and these other uses of data science are not further considered.
The previous section showed that data science is not necessarily just hype, but a strategic and systematic approach to using data. Using data in organizations is also called business analytics or evidence-based management. There are also specific approaches, such as Six-Sigma, that use statistical analysis to improve business processes. Many advocates of data science claim that the old and new approaches are different. Most definitions of data science focus on pattern recognition using large sets of data through machine learning. (Kelleher & Tierney (2018)). How does data science relate to its predecessor buzzwords? To understand this difference, we need to explore the early history of using data in business.
The idea that management can be a science is just over a century old. Frederick Taylor was an American engineer who was dissatisfied with how factories were managed. He was a hands-on engineer who spent much time on the factory floor. Taylor noticed how workers used rules of thumb, instead of analyzing problems systematically. He writes, in The Principles of Scientific Management (1911), how he improved the process of manually loading massive lumps of iron at the Bethlehem Steel Company by measuring processes and analyzing the data. (Taylor, F.W. (1997). The Principles of Scientific Management. Mineola, N.Y: Dover Publications)
Although Taylor revolutionized the way we manage organizations, he despised laborers. Taylor believed that it "would be possible to train an intelligent gorilla to become more efficient" than a factory worker. His quest for scientific management was driven by an urge to remove power from the workforce and look at business processes in an abstract mathematical sense. His work was controversial in his own time, as it was the subject of a formal government inquiry. This background about Taylor is not just a bit of trivia, but a valuable lesson about the need to include a human dimension in what we analyze. The positive legacy of Taylor is that he planted the seed for a scientific approach to managing an organization. All methods share his ideal of using data to prevent biases in management.
Managers are faced with deciding what to do next in uncertain environments and often use their experience and intuition to determine the next course of action, instead of data and logic. While experience and intuition are highly valuable, our minds are prone to biases and non-rational thinking. Our rationality is not unlimited but is bounded by factors outside of our control. The amount of information, the time available to solve a problem, and our mental capacity are all limited. Our brains are wired to quickly recognize patterns in nature because it helps us in our daily lives. Mental shortcuts, the rules of thumb despised by Taylor, help us to make fast decisions in emergencies, but they can also lead to sub-optimal outcomes.
The world of business is not something we have necessarily evolved to navigate, and we are thus not very good at interpreting large amounts of abstract data. Because our minds are programmed to recognize patterns, we often see regularities where there are none, which psychologists call pareidolia. This condition causes us to recognize animals in clouds or see the image of Jesus in a piece of toast, or a face on the surface of Mars. Pareidolia serves us well because it enables graphical communication, but it becomes a hindrance when analyzing large sets of data. Interestingly, neural networks can also be trained to experience pareidolia. When manipulating the settings of image-recognition software, computers can be taught to recognize images in random data and effectively hallucinate a new reality. (Mordvintsev, A., Olah, C. & Tyka, M. (2015). Inceptionism: Going Deeper into Neural Networks—http://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html Retrieved 15 February 2019)
Besides inherent biases through the limits of our rationality, social circumstances can also prevent us from optimizing decisions. Groupthink and office politics are often strong drivers of decisions in organizations. Social belonging is a strong motivator for our behavior and is one of the major driving forces behind advertising. Asch's conformity experiment illustrates how strong these social biases can be. Solomon Asch demonstrated that even when people are fully aware of the rational answer to a simple question, they will often yield their opinion to match that of the group, even when it is clearly the wrong choice. (Asch Conformity Experiment (YouTube—https://www.youtube.com/watch?v=TYIh4MkcfJA). Downloaded 14 February 2019)
One of the greatest revolutions in human thinking is the 16th-century Copernican twist. From our limited perspective, the earth seems flat and the sun and moon revolve around us. Copernicus used mathematics to propose a model in which the earth revolves around the sun, and when Galileo later looked through a telescope to amplify his naked-eye observations, a new reality emerged, and with it, a better model of our solar system. What we learned from Copernicus and Galileo is that we need to enhance our perception and thinking skills with technology to draw correct conclusions. Data science is to business what the telescope is to astronomy. Sound analysis of data helps us to remove our natural biases and replace our rules of thumb with logic.
Just like a long-enough lever can make us physically strong enough to lift the world, the tools of data science make us mentally stronger in understanding and controlling the world. While the uncertainties of the realities of business can never be eliminated, evidence-based management ensures that managers make decisions based on the best available data. Data science is the toolkit that assists managers to base their decisions on evidence. Using the principles of data science will improve the way managers decide between alternative courses of action.
Using a scientific approach to data is, however, not a simple road to success. Data science is a human activity and therefore carries all the biases and limitations of the people who practice it. The results of data science are also not ethically neutral and require a moral perspective to ensure that no harm is done. The key to minimizing these biases is to use a systematic approach.
The Data Revolution
Since Taylor's first writings, businesses and non-profit organizations have sought to become driven by evidence to reduce unconscious bias in their decisions. Although data science is merely a new term for something that has existed for decades, some recent developments have created a watershed between the old and new ways of doing business. The difference between traditional business analysis and the new world of data science is threefold.
Firstly, businesses have much more data available than ever before. The move to electronic transactions means that almost every process leaves a digital footprint. Collecting and storing this data has become exponentially cheaper than in the days of pencil and paper. Many organizations collect this data without maximizing the value they extract from it. After the data is used for its intended purpose, it becomes 'dark data', stored on servers but languishing in obscurity. This data provides opportunities to optimize how an organization operates by recycling and analyzing it to learn about the past to create a better future.
Secondly, the computing power that is now available in a tablet was not long ago the domain of supercomputers. Piotr Luszczek showed that an iPad 2 matches the performance of the world's fastest computer in 1985. (Larabel, M. (2012). Apple iPad 2 As Fast As The Cray-2 Supercomputer. Retrieved 4 February 2019 from Phoronix—https://www.phoronix.com/scan.php?page=news_item&px=MTE4NjU) The affordability of vast computing power enables even small organizations to reap the benefits of advanced analytics.
Lastly, complex machine learning algorithms are freely available as open source software, and a laptop is all that is needed to implement sophisticated mathematical analyses. The R language for statistical computing, and Python, are both potent tools that can undertake a vast array of data science tasks such as complex visualizations and machine learning. These languages are 'Swiss army chainsaws' that can tackle any business analysis problem. Part of their power lies in the healthy communities that support each other in their journey to mastering these languages.
These three changes have caused a revolution in how we create value from data. The barriers to entry for even small organizations to leverage information technology are very low. The only hurdle is to make sense of the fast-moving developments and follow a strategic approach instead of chasing the hype.
This revolution is not necessarily only about powerful machine learning algorithms, but about a more scientific way of solving business problems. The definition of data science in this book is not restricted to machine learning, big data, and artificial intelligence. These developments are essential aspects of data science, but they do not define the field.
The Elements of Data Science
Now that we have defined data science within the context of managing a business, we can start describing the elements of data science. The best way to unpack the art and craft of data science is Drew Conway's often-cited Venn diagram, as shown in Figure 1.3. (Conway, D. (2010). (The data science Venn diagram—http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram). Downloaded 27 January 2019)
Conway defines three competencies that a data scientist, or a data science team as a collective, needs to possess. The diagram positions data science as an interdisciplinary activity with three dimensions: domain knowledge, mathematics, and computer science. A data scientist is somebody who understands the subject matter under consideration in mathematical terms and writes computer code to solve problems.
Figure 1.3: Conway's data science Venn diagram
The most significant skill within a data science function is domain knowledge. While the results of advanced applied mathematics such as machine learning are impressive, without understanding the reality that these models describe, they are devoid of meaning and can cause more harm than good. Anyone analyzing a problem needs to understand the context of the issues and the potential solutions. The subject of data science is not the data itself, but the reality this data describes. Data science is about things and people in the real world, not about numbers and algorithms.
A domain expert understands the impact of any confounding variables on the outcomes. An experienced subject-matter expert can quickly perform a sanity check on the process and results of the analysis. Domain knowledge is essential because each area of expertise uses a different paradigm to understand the world.
Each domain of human enquiry or activity has different methodologies to collect and analyze data. Analyzing objective engineering data follows a different approach to subjective data about people or unstructured data in a corpus of text. The analyst needs to be familiar with the tools of the trade within the problem domain. The example of a graduate professional beating a team of machine learning experts with a linear regression shows the importance of domain knowledge.
Domain expertise can also become a source of bias and prevent innovative ways of looking at information. Solutions developed through systematic research can contradict long-held beliefs about a specific topic that are sometimes hard to shift. Implementing data science is thus as much a cultural process as it is a scientific one, which is the topic of Chapter 4, The Data-Driven Organization.
The analyst uses mathematical skills to convert data into actionable insights. Mathematics consists of pure mathematics as a science, and applied mathematics that helps us to solve problems. The scope of applied mathematics is broad, and data science is opportunistic in choosing the most suitable method. Various types of regression models, graph theory, k-means clustering, decision trees, and so on, are some of the favorite tools of a data scientist. The creative application of complex applied mathematics is one of the two distinguishing factors between traditional business analysis and data science.
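As an illustration of the simplest tool in this list, the Python sketch below fits a straight line to a handful of observations with ordinary least squares and uses it to estimate the next value. The data, the variable names, and the function name fit_line are all hypothetical, and a real analysis would also check whether a linear model suits the problem at hand.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so no line can be fitted")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly demand figures, invented for illustration.
months = [1, 2, 3, 4, 5]
demand = [12.0, 15.0, 15.0, 19.0, 20.0]

slope, intercept = fit_line(months, demand)

# Extrapolate the fitted line one month ahead.
forecast = slope * 6 + intercept
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```

The arithmetic is the easy part. Deciding whether demand can sensibly be extrapolated along a straight line is a question that only knowledge of the domain can answer.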
Combining subject-matter expertise with mathematical skills is the domain of traditional research and analysis. The notion of conventional research is, however, evolving toward the principles of data science, with researchers writing reproducible computer code and sharing their source data through websites such as FigShare (https://figshare.com/).
Numbers are the foundations of mathematics, and the craft of quantitative science is to describe our analogue reality in a model that we can manipulate to predict the future. Not all mathematical skills are necessarily about numbers but can also revolve around logical relationships between words and concepts. Contemporary numerical methods help us to understand relationships between people, the logical structure of a text, and many other aspects beyond the realm of traditional numeric analysis.
Not that long ago, most of the information collected by an organization was stored on paper and archived in copious volumes of arch lever files. Analyzing this information was an arduous task that involved many hours of transcribing information into a format that is useful for analysis.
In the twenty-first century, almost all data is an electronic resource. To create value from this resource, data engineers extract it from a database, combine it with other sources, and clean the data before analysts can make sense of it. This requirement implies that a data scientist needs to have computing skills. Conway uses the term hacking skills, which many people interpret as negative. Conway is, however, not referring to a hacker in the sense of somebody who nefariously uses computers, but in the original meaning of the word as a developer with creative computing skills. The core competency of a hacker, developer, coder, or whatever other term might be preferable, is algorithmic thinking and understanding the logic of data structures. These competencies are vital in extracting and cleaning data to prepare it for the next step of the data science process.
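The extract, combine, and clean steps described above can be sketched in a few lines of Python using the sqlite3 module from the standard library. Everything in this example is an assumption made for illustration: the in-memory database, the customers and orders tables, and the rule that unreadable amounts are dropped. Real data engineering involves far messier sources and more careful decisions about what to discard.

```python
import sqlite3

# Assumption: a small in-memory database stands in for a corporate system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount TEXT);
    INSERT INTO customers VALUES (1, ' Alice '), (2, 'Bob');
    INSERT INTO orders VALUES (1, '10.50'), (1, 'n/a'), (2, '7');
""")

# Extract and combine: join each order with the name of its customer.
rows = conn.execute("""
    SELECT c.name, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.rowid
""").fetchall()
conn.close()


def clean(records):
    """Trim names, convert amounts to numbers, and drop unreadable amounts."""
    cleaned = []
    for name, amount in records:
        try:
            value = float(amount)
        except ValueError:
            # Assumption for this sketch: unparseable amounts are discarded.
            continue
        cleaned.append((name.strip(), value))
    return cleaned


tidy = clean(rows)
print(tidy)
```

Because each step is written as code, the same extraction and cleaning can be repeated and reviewed whenever the source data changes.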
The importance of hacking skills for a data scientist implies that we should move away from point-and-click systems and spreadsheets and instead write code in a suitable programming language. The flexibility and power of a programming language far exceed the capabilities of graphical user interfaces and lead to reproducible analysis, as discussed in Chapter 2, Good Data Science.
The mathematical interpretation of reality needs to be translated into computer code. One of the factors that spearheaded data science into popularity is that the available toolkit has grown substantially in the past ten years. Open source computing languages such as R and Python can implement complex algorithms that were previously the domain of specialized software and supercomputers. Open source software has accelerated innovation in how we analyze data and has placed complex machine learning within reach of anyone who is willing to try to learn the skills.
Conway defines the danger zone as the area where domain knowledge and computing skills combine, without a good grounding in mathematics. Somebody might have enough computing skills to be pushing buttons on a business intelligence platform or spreadsheet. The user-friendliness of some analysis platforms can be detrimental to the outcomes of the analysis because they create the illusion of accuracy. Point-and-click analysis hides the inner workings from the user, creating a black-box result. Although the data might be perfectly structured, valid and reliable, a wrongly applied analytical method leads to useless outcomes.
The Unicorn Data Scientist?
Conway's diagram is often cited in the literature on data science. His simple model helped to define the craft of data science. Other data scientists have proposed more complex models, but they all originate with Conway's basic idea.
The diagram illustrates that the difference between data science and traditional research or business analytics lies in the ability to understand and write code. A data scientist understands the problem they seek to resolve, they have the mathematical expertise to analyze the problem, and they possess the computing skills to convert this knowledge into outcomes.
It could be argued that the so-called soft skills are missing from this picture. However, communication, managing people, facilitating change and so on, are competencies that belong to every professional who works in a complex environment, not just the data scientist.
Some critics of this idea point out that data scientists who possess all these skills are unicorns – that is, mythical employees that don't exist in the real world. Most data scientists start from either mathematics or computer science, after which it is hard to become a domain expert. This book is written from the point of view that we can breed unicorns by teaching domain experts how to write code and, where required, enhance their mathematical skills.
The Purpose of Data Science
In summary, the promises of data science within organizations have gained a lot of popularity over the past six years. The downside of this popularity is that self-proclaimed futurists have exaggerated the benefits of a strategic and systematic approach to analyzing data. To obtain value from this new approach to using data requires a pragmatic approach beyond the hype. For most organizations, data science will look very different from the digital utopia portrayed in popular publications.
This chapter defines data science as the strategic and systematic use of data to create value for organizations or society overall. The purpose of using data to improve how organizations perform is to reduce bias in decisions. The original objections that Frederick Taylor held against rules of thumb more than a century ago still stand. Computational analysis of data is a valuable tool to achieve this reduced bias in deciding about future courses of action.
Data science is an interdisciplinary activity that combines domain knowledge with competencies in mathematics and computer science. The data revolution of the past decades has caused an exponential increase in available data, computing capabilities and open source software. Data science is paradoxically not a science about data but a scientific way to use data to influence reality positively. Expertise about the reality under consideration, or domain knowledge, drives data science. Mathematics and computer science are the tools that enable a deeper understanding of our reality and help us to optimize our decisions.
Now that we have an idea of what data science is and what it consists of, we need to define what good data science looks like. The following chapter expands on this description of data science by presenting a normative model of data science. This model defines best practice as the useful, sound and aesthetic analysis of data.