In this chapter, we introduce readers to the concept of social media mining. We discuss sentiment analysis, the nature of contemporary online communication, and the facets of Big Data that allow social media mining to be such a powerful tool. Additionally, we discuss some of the potential pitfalls of socially generated data and argue for a quantitative approach to social media mining.
People are highly opinionated. We hold opinions about everything from international politics to pizza delivery. Sentiment analysis, synonymously referred to as opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions through written language. Practically speaking, this field allows us to measure, and thus harness, opinions. Up until the last 40 years or so, opinion mining hardly existed. This is because opinions were elicited in surveys rather than in text documents, computers were not powerful enough to store or sort a large amount of information, and algorithms did not exist to extract opinion information from written language.
The explosion of sentiment-laden content on the Internet, the increase in computing power, and advances in data mining techniques have turned social data mining into a thriving academic field and crucial commercial domain. Professor Richard Hamming famously pushed researchers to ask themselves, "What are the important problems in my field?" Researchers in the broad area of natural language processing (NLP) cannot help but list sentiment analysis as one such pressing problem. Sentiment analysis is not only a prominent and challenging research area, but also a powerful tool currently being employed in almost every business and social domain. This prominence is due, at least in part, to the centrality of opinions as both measures and causes of human behavior.
This book is an introduction to social data mining. For us, social data refers to data generated by people or by their interactions. More specifically, social data for the purposes of this book will usually refer to data in text form produced by people for other people's consumption. Data mining is a set of tools and techniques used to describe and make inferences about data. We approach social data mining with a potent mix of applied statistics and social science theory. As for tools, we utilize and provide an introduction to the statistical programming language R.
The book covers important topics and the latest developments in the field of social data mining, with many references and resources for continued learning. We hope it will be of interest to an audience with a wide array of substantive interests from fields such as marketing, sociology, politics, and sales. We have striven to make it accessible enough to be useful for beginners while simultaneously directing researchers and practitioners already active in the field towards resources for further learning. Code and additional material will be available online at http://socialmediaminingr.com as well as on the authors' GitHub account, https://github.com/SocialMediaMininginR.
The state of communication section describes the fundamentally altered modes of social communication fostered by the Internet. The interconnected, social, rapid, and public exchange of information detailed here underlies the power of social data mining. Now more than ever before, information can go viral, a phrase cited as early as 2004.
By changing the manner in which we connect with each other, the Internet changed the way we interact—communication is now bi-directional and many-to-many. Networks are now self-organized, and information travels along every dimension, varying systematically depending on direction and purpose. This new economy with ideas as currency has impacted nearly every person. More than ever, people rely on context and information before making decisions or purchases, and by extension, more and more on peer effects and interactions rather than centralized sources.
The traditional modes of communication are represented mainly by radio and television, which are isotropic and one-to-many. It took 38 years for radio broadcasters and 13 years for television to reach an audience of 50 million, but the Internet did it in just four years (Gallup).
Not only has the nature of communication changed, but also its scale. There were 50 pages on the World Wide Web (WWW) in 1993. Today, the full impact and scope of the WWW is difficult to measure, but we can get a rough sense of its size: the Indexed Web contains at least 1.7 billion pages as of February 2014 (World Wide Web size). The WWW is the largest, most widely used source of information, with nearly 2.4 billion users (Wikipedia). 70 percent of these users use it daily to both contribute and receive information in order to learn about the world around them and to influence that same world—constantly organizing information around pieces that reflect their desires.
In today's connected world, many of us are members of at least one social networking service. The influence and reach of social media enterprises such as Facebook is staggering. Facebook has 1.11 billion monthly active users and 751 million monthly active users of its mobile products (Facebook key facts). Twitter has more than 200 million active users (Twitter blog). As communication tools, they offer a global reach to huge multinational audiences, delivering messages almost instantaneously.
Connectedness and social media have altered the way we organize our communications. Today we have dramatically more friends and more friends of friends, and we can communicate with these higher order connections faster and more frequently than ever before. It is difficult to ignore the abundance of mimicry (that is, copying or reposting) and repeated social interactions in our social networks. This mimicry is a result of virtual social interactions organized into reaffirming or oppositional feedback loops. We self-organize these interactions via (often preferential) attachments that form organic, shifting networks. There is little question that social media has already impacted your life and changed the manner in which you communicate. Our beliefs and perceptions of reality, as well as the choices we make, are largely conditioned by our neighbors in virtual and physical networks. When we need to make a decision, we seek out the opinions of others, and more and more of those opinions are provided by virtual networks.
Information bounce is the resonance of content within and between social networks, often powered by social media such as customer reviews, forums, blogs, microblogs, and other user-generated content. This notion represents a significant change when compared to how information has traveled throughout history; individuals no longer need to rely exclusively on close ties within their physical social networks. Social media has made both our close ties closer and our weak ties exponentially more numerous. Beyond our denser and larger social networks is a general eagerness to incorporate information from other networks with similar interests and desires. The increased access to networks of various types has, in fact, conditioned us to seek even more information; after all, ignoring available information would constitute irrational behavior.
These fundamental changes to the nature and scope of communication are crucial due to the importance of ideas in today's economic and social interactions. Today, and in the future, ideas will be of central importance, especially those ideas that bounce and go viral. Ideas that go viral are those that resonate and spur on social movements, which may have political and social purposes or reshape businesses and allow companies such as Nike and Apple to produce outsized returns on capital. This book introduces readers to the tools necessary to measure ideas and opinions derived from social data at scale. Along the way, we'll describe strategies for dealing with Big Data.
People create 2.5 quintillion bytes (2.5 × 10^18) of data, or nearly 2.3 million terabytes, every day, so much that 90 percent of the data in the world today has been created in the last two years alone. Furthermore, rather than being a large collection of disparate data, much of this data flow consists of data on similar things, generating huge data-sets with billions upon billions of observations. Big Data refers not only to the deluge of data being generated, but also to the astronomical size of data-sets themselves. Both factors create challenges and opportunities for data scientists.
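The conversion behind these figures can be checked with a few lines of arithmetic. The sketch below, written in Python purely for illustration, assumes that the "nearly 2.3 million terabytes" figure counts binary terabytes of 2^40 bytes each. That reading is an assumption on our part: with decimal terabytes of 10^12 bytes, the same daily volume comes to 2.5 million.

```python
# Daily data volume quoted in the text: 2.5 quintillion bytes.
BYTES_PER_DAY = 25 * 10**17  # 2.5 * 10^18, kept as an exact integer

# Two common definitions of a terabyte. Which one the quoted figure
# uses is an assumption; the binary one matches "nearly 2.3 million".
BINARY_TB = 2**40    # 1,099,511,627,776 bytes (also called a tebibyte)
DECIMAL_TB = 10**12  # 1,000,000,000,000 bytes

binary_tb_per_day = BYTES_PER_DAY / BINARY_TB
decimal_tb_per_day = BYTES_PER_DAY / DECIMAL_TB

print(f"Binary terabytes per day:  {binary_tb_per_day:,.0f}")
print(f"Decimal terabytes per day: {decimal_tb_per_day:,.0f}")
```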
This data comes from everywhere: physical sensors used to gather information, human sensors such as the social web, transaction records, and cell phone GPS signals, to name a few. This data is not only big but is growing at an increasing rate. The data used in this book, namely, Twitter data, is no exception. Twitter was launched on March 21, 2006, and it took 3 years, 2 months, and 1 day to reach 1 billion tweets. Twitter users now send 1 billion tweets every 2.5 days.
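To get a feel for how much faster tweets now accumulate, the two figures above can be turned into a ratio. The sketch below is only a back-of-the-envelope calculation in Python: the date of the billionth tweet is not given in the text and is derived here by adding 3 years, 2 months, and 1 day to the launch date, and both rates are treated as simple averages.

```python
from datetime import date

LAUNCH = date(2006, 3, 21)
# Launch date plus 3 years, 2 months, and 1 day (derived, not quoted).
FIRST_BILLION = date(2009, 5, 22)

days_for_first_billion = (FIRST_BILLION - LAUNCH).days
days_per_billion_now = 2.5  # figure quoted in the text

# How many times faster a billion tweets now accumulates, on average.
speedup = days_for_first_billion / days_per_billion_now

print(f"Days to first billion tweets: {days_for_first_billion}")
print(f"Average rate increase: roughly {speedup:.0f}x")
```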
What proportion of data is Big Data? It turns out that most data-sets are (relatively) small. This may come as a surprise in light of the contemporary excitement surrounding Big Data. The reason for the large number of small data-sets is that data that is not socially generated and publicly displayed is time consuming and expensive to collect. As such, academics, businesses, and other organizations with data needs tend to collect only the minimum amount of information necessary to gain purchase on their questions. These data-sets are usually small and focused and are curated by the organizations that use them, which usually do not plan on updating them or adding fresh data. Poor management of this data often leads to its misplacement, thereby generating dark data: data that is suspected to exist or ought to exist but is difficult or impossible to find. The problem of dark data is real and prevalent in the myriad of small, locally collected data-sets. The utter lack of central management of data in the tail of the data size distribution invariably causes these sets of data to be forgotten. In spite of the fact that most data is not big, it is primarily the Big Data sets that exhibit exponential growth, driving the number of bytes created by humans upwards every day.
Big Data differs substantially from other data not only in its size and velocity, but also in its scope and density. Big Data is large in scope, that is, it is created by everyone and by itself and thus is informative about a wide audience. This characteristic makes it very useful for studying populations, as the inferences we can make generalize to large groups of people. Compare that with, say, opinions gleaned from a focus group or small survey. These opinions, while highly accurate and easy to obtain, may or may not be reflective of the views of the wider public. Thus, Big Data's scope is a real benefit, at least in terms of generalizing evidence to wide populations.
However, Big Data's density is fairly low. By density, we mean the degree to which Big Data, and especially social data, is directly applicable to questions we want to answer. Again, a comparison to small data is useful. Prior to the explosion of Big Data and the proliferation of tools used to harness it, companies or political campaigns largely used focus groups or surveys to obtain information about public sentiments relevant to their endeavors. The focus groups and surveys furnished organizations with data that was directly applicable to their purpose, and often this data would already be measured in meaningful units. For instance, respondents would describe how much they liked or disliked a new product, or rate a political candidate's TV appearances from 1 to 5. Compare that with social data, where opinion-laden text is buried among terabytes of unrelated information and comes in a form that must be subjected to analysis just to generate a measure of the opinion. Thus, the low density of big social data presents unique challenges to organizations trying to utilize opinion data.
The size and scope of Big Data help us overcome some of the hurdles caused by its low density. For instance, even though each unique piece of social data may have little applicability to our particular task, these small bits of information quickly become useful as we aggregate them across thousands or millions of people. Like the proverbial bundle of sticks, none of which could support inferences alone, these small bits of information can, when tied together, be a powerful tool for understanding the opinions of the online populace.
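The bundle-of-sticks argument can be illustrated with a small simulation. The sketch below, in Python, is a toy model under stated assumptions rather than a method from this book: each piece of social data is treated as an independent, very noisy reading of one underlying opinion score, and the score (0.2), the noise level (1.0), and the function name are all hypothetical.

```python
import random
import statistics

def mean_of_noisy_signals(n, true_opinion=0.2, noise_sd=1.0, seed=42):
    """Average n independent noisy readings of one underlying opinion score."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    readings = [rng.gauss(true_opinion, noise_sd) for _ in range(n)]
    return statistics.fmean(readings)

TRUE_OPINION = 0.2  # hypothetical "true" sentiment on an arbitrary scale

# Each reading alone is dominated by noise; the average of many readings
# typically lands much closer to the underlying score.
for n in (10, 1_000, 100_000):
    estimate = mean_of_noisy_signals(n, TRUE_OPINION)
    print(f"n = {n:>7,}: estimate = {estimate:+.3f}, "
          f"error = {abs(estimate - TRUE_OPINION):.3f}")
```

The independence of the readings does the real work here. If the noise were shared across observations, for instance because many posts repeat the same mistaken claim, averaging would not remove it.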
The sheer scope of Big Data has other benefits as well. The size and coverage of many social data-sets create overlaps in time, space, and topic. This allows analysts to cross-reference socially generated sets against one another or against small-scale data-sets designed to examine niche questions. This type of cross-coverage can generate consilience (Osborne), the principle that evidence from independent, unrelated sources can converge to form strong conclusions. That is, when multiple sources of evidence are in agreement, the conclusion can be very strong even when none of the individual sources of evidence are very strong on their own. A crucial characteristic of socially generated data is that it is opinionated. This point underpins the usefulness of big social data for sentiment analysis, and is novel. For the first time in history, interested parties can put a finger on the pulse of the masses because the masses are frequently opining about what is important to them. They opine with and for each other and anyone else who cares to listen. In sum, opinionated data is the great enabler of opinion-based research.
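The consilience principle described above can be made concrete with a small calculation. The sketch below, in Python, uses Bayes' rule in odds form, which is one way to formalize the idea and is our choice of illustration, not something the text prescribes. The numbers are hypothetical: each source is assumed to be only 1.5 times as likely to appear if the conclusion is true as if it is false, and the sources are assumed to be fully independent.

```python
def combined_probability(prior_probability, likelihood_ratios):
    """Update a prior with independent evidence using Bayes' rule in odds form."""
    odds = prior_probability / (1.0 - prior_probability)
    for ratio in likelihood_ratios:
        odds *= ratio  # multiplying is valid only for independent sources
    return odds / (1.0 + odds)

# Hypothetical: start undecided (0.5) and add weak sources, each with a
# likelihood ratio of 1.5.
one_source = combined_probability(0.5, [1.5])
ten_sources = combined_probability(0.5, [1.5] * 10)

print(f"One weak source:  {one_source:.2f}")
print(f"Ten weak sources: {ten_sources:.2f}")
```

No single source moves the probability far from even odds, yet ten of them in agreement leave little doubt. The result depends entirely on independence; ten reposts of the same claim count as one source, not ten.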
Opinion data generated by humans in real time presents tremendous opportunities. However, big social data will only prove useful to the extent that it is valid. This section tackles head-on the extent to which socially generated data can be used to accurately measure individual and group-level opinions.
One potential indicator of the validity of socially generated data is the extent of its consumption for factual content. Online media has expanded significantly over the past 20 years. For example, online news is displacing print and broadcast. More and more Americans distrust mainstream media, with a majority (60 percent) now having little to no faith in traditional media to report news fully, accurately, and fairly. Instead, people are increasingly turning to the Internet to research, connect, and share opinions and views. This was especially evident during the 2012 election where social media played a large role in information transmission (Gallup).
Politics is not the only realm affected by social Big Data. People are increasingly relying on the opinions of others to inform their consumption preferences. Consider the following figures:
91 percent of people report having gone into a store because of an online experience
89 percent of consumers conduct research using search engines
62 percent of consumers end up making a purchase in a store after researching it online
72 percent of consumers trust online reviews as much as personal recommendations
78 percent of consumers say that posts made by companies on social media influence their purchases
If individuals are willing to use social data as a touchstone for decision making in their own lives, perhaps this is prima facie evidence of its validity. Other Big Data thinkers point out that much of what people do online constitutes their genuine actions and intentions. The breadcrumbs left when people execute online transactions, send messages, or spend time on web pages constitute what Alex Pentland of MIT calls honest signals. These signals are honest insofar as they are actions taken by people with no subtext or secondary intent. Specifically, he writes the following:
"Those breadcrumbs tell the story of your life. It tells what you've chosen to do. That's very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy."
To paraphrase, Pentland finds some web-based data to be valid measures of people's attitudes when that data is without subtext or secondary intent, what he calls data exhaust. In other words, actions are harder to fake than words. He cautions against taking people's online statements at face value, because they may be nothing more than cheap talk.
Anthony Stefanidis of George Mason University also advocates for the use of social data mining. He speaks favorably about its reliability, noting that its size inherently creates a preponderance of evidence. This book takes neither Pentland's strong position on honest signals nor Stefanidis's on the preponderance of evidence. Instead, we advocate a blended approach of curiosity and creativity as well as some healthy skepticism.
Generally, we follow the attitude of Charles Handy (The Empty Raincoat, 1994), who described the steps to measurement during the Vietnam War as follows:
"The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can't be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can't be measured easily really isn't important. This is blindness. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide."
The social web may not consist of perfect data, but its value is tremendous if used properly and analyzed with care. Forty years ago, a social science study containing millions of observations was unheard of due to the time and cost associated with collecting that much information. The most successful efforts in social data mining will be by those who "measure (all) what is measurable, and make measurable (all) what is not so" (Rasinkinski, 2008).
Ultimately, we feel that the size and scope of big social data, the fact that some of it consists of honest signals, and the fact that some of it can be validated with other data lend it validity. In another sense, the "proof is in the pudding". Businesses, governments, and organizations are already using social media mining to good effect; thus, the data being mined must be at least moderately useful.
Another defining characteristic of big social data is the speed with which it is generated, especially when considered against traditional media channels. Social media platforms such as Twitter, but also the web generally, spread news in near-instant bursts. From the perspective of social media mining, this speed may be a blessing or a curse. On the one hand, analysts can keep up with the very fast-moving trends and patterns, if necessary. On the other hand, fast-moving information is subject to mistakes or even abuse.
Following the tragic bombings in Boston, Massachusetts (April 15, 2013), Twitter was instrumental in citizen reporting and provided insight into the events as they unfolded. Law enforcement asked for and received help from the general public, facilitated by social media. For example, Reddit saw an overall peak in traffic when reports came in that the second suspect was captured. Google Analytics reports that there were about 272,000 users on the site with 85,000 in the news update thread alone. This was the only time in Reddit's history other than the Obama AMA that a thread beat the front page in the ratings (Reddit).
The downside of this fast-paced, highly visible public search is that the masses can be incorrect. This is exactly what happened: users began to look at the details and photos posted and pieced together their own investigation, but as it turned out, the information was incorrect. This was a charged event and created an atmosphere that ultimately undermined the good intentions of many. Other efforts, such as those by governments (Wikipedia) and companies (Forbes) to post messages favorable to their position, are less than well intentioned. Overall, we should be skeptical of tactical (that is, very real time) uses of social media. However, as evidence and information are aggregated by social media, we expect certain types of data, especially opinions and sentiments, to converge towards the truth (subject to the caveats set out in Chapter 4, Potentials and Pitfalls of Social Media Data).
In this research, we aim to mine and summarize online opinions in reviews, tweets, blogs, forum discussions, and so on. Our approach is highly quantitative (that is, mathematical and/or statistical) as opposed to qualitative (that is, involving close study of a few instances). In social sciences, these two approaches are sometimes at odds, or at least their practitioners are. In this section, we will lay out the rationale for a quantitative approach to understanding online opinions. Our use of quantitative approaches is entirely pragmatic rather than dogmatic. We do, however, find that Bill James' famous words on the tension between quantitative and qualitative approaches resonate with our pragmatic voice:
"The alternative to good statistics is not "no statistics", it's bad statistics. People who argue against statistical reasoning often end up backing up their arguments with whatever numbers they have at their command, over- or under-adjusting in their eagerness to avoid anything systematic."
One traditional rationale for using qualitative approaches to sentiment analysis, such as focus groups, is lack of available data. Looking closely at what a handful of consumers think about a product is a viable way to generate opinion data if none, or very little, exists. However, in the era of big social data, analysts are awash in opinion-laden text and online actions. In fact, the use of statistical approaches is often necessary to handle the sheer volume of data generated by the social web. Furthermore, the explosion of data is obviating traditional hypothesis-testing concerns about sampling, as samples converge in size towards the population of interest.
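One way to see why sampling concerns fade as a sample approaches the population is the finite population correction from survey statistics. The text does not name this correction; it is used here only as an illustration. The Python sketch below assumes a simple random sample drawn without replacement, and the population size and spread of opinion scores are hypothetical.

```python
import math

def standard_error(sd, n, population_size):
    """Standard error of a sample mean under simple random sampling
    without replacement, including the finite population correction."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return sd / math.sqrt(n) * fpc

POPULATION = 1_000_000  # hypothetical number of people of interest
SD = 1.0                # hypothetical spread of opinion scores

# As n approaches the population size, the sampling error shrinks to zero.
for n in (1_000, 100_000, 900_000, 1_000_000):
    se = standard_error(SD, n, POPULATION)
    print(f"n = {n:>9,}: standard error = {se:.5f}")
```

Only sampling error vanishes in this way. The calculation says nothing about whether the people who post online resemble the population of interest, which is a question of validity rather than sample size.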
The exploration of large sets of opinion data is what Openshaw (1988) would call a data-rich but theory-poor environment. Often, qualitative methods are well suited for inductively deriving theories from small numbers of test cases. However, our aim as sentiment analyzers is usually less theoretical and more descriptive; that is, we want to measure opinions rather than understand the process by which they are generated. As such, this book covers important quantitative methods that reflect the state of the discipline and that allow data to have a voice. This type of analysis accomplishes what Gould (1981) refers to as "letting the data speak for itself."
Perhaps the strongest reason to choose quantitative methods over qualitative ones is the ability of quantitative methods, when coupled with large and valid data-sets, to generate accurate measures in the face of analyst biases. Qualitative methods, even when applied correctly, put researchers at risk of a plethora of inferential problems. Foremost is apophenia, the human tendency to discover patterns where there are none: a Type I error of sorts, dubbed patternicity by Michael Shermer (2008). A second pitfall of qualitative work is the atomistic fallacy, that is, the problem of generalizing based on an insufficient number of individual observations. The atomistic fallacy is real. Most people rely on advice from only a few sources, over-weighting information from within their networks rather than third parties such as Consumer Reports. Allowing an individual observation (for example, an opinion) to influence our actions or decisions is unreliable compared with relying on the sensible samples behind Consumer Reports.
The natural sciences benefited from the invention and proliferation of a host of new measurement tools during the twentieth century. For example, advances in microscopes led to a range of discoveries. The advent of the social web, with its seemingly endless amounts of opinionated data, together with new measurement tools such as the ones covered in this book, calls for a set of new discoveries. This book introduces readers to tools that will assist in that pursuit.
In this chapter, we introduced readers to the concepts of social media, sentiment analysis, and Big Data. We described how social media has changed the nature of interpersonal communication and the opportunities it presents for analysts of social data. This chapter also made a case for the use of quantitative approaches to measure all that is measurable, and to make measurable what is not so.
In the next chapter, we will introduce R, which is the main tool through which we will illustrate techniques for harvesting, analyzing, and visualizing social media data.