Mastering Social Media Mining with Python

By Marco Bonzanini

About this book

Your social media is filled with a wealth of hidden data – unlock it with the power of Python. Transform your understanding of your clients and customers when you use Python to solve the problems of understanding consumer behavior and turning raw data into actionable customer insights. This book will help you acquire and analyze data from leading social media sites. It will show you how to employ scientific Python tools to mine popular social websites such as Facebook, Twitter, Quora, and more. Explore the Python libraries used for social media mining, and get the tips, tricks, and insider insight you need to make the most of them. Discover how to develop data mining tools that use a social media API, and how to create your own data analysis projects using Python for clear insight from your social data.

Publication date: July 2016
Publisher: Packt
Pages: 338
ISBN: 9781783552016

Chapter 1. Social Media, Social Data, and Python

This book is about applying data mining techniques to social media using Python. The three key terms in the previous sentence (data mining, social media, and Python) help us define the intended audience of this book: any developer, engineer, analyst, researcher, or student who is interested in exploring the area where the three topics meet.

In this chapter, we will cover the following topics:

Social media and social data
The overall process of data mining from social media
Setting up the Python development environment
Python tools for data science
Processing data in Python

Getting started

In the second quarter of 2015, Facebook reported nearly 1.5 billion monthly active users. In 2013, Twitter had reported a volume of 500+ million tweets per day. On a smaller scale, but certainly of interest for the readers of this book, in 2015, Stack Overflow announced that more than 10 million programming questions had been asked on their platform since the website opened.

These numbers are just the tip of the iceberg when describing how the popularity of social media has grown exponentially with more users sharing more and more information through different platforms. This wealth of data provides unique opportunities for data mining practitioners. The purpose of this book is to guide the reader through the use of social media APIs to collect data that can be analyzed with Python tools in order to produce interesting insights on how users interact on social media.

This chapter lays the ground for an initial discussion on challenges and opportunities in social media mining and introduces some Python tools that will be used in the following chapters.

Social media - challenges and opportunities

In traditional media, users are typically just consumers. Information flows in one direction: from the publisher to the users. Social media breaks this model, allowing every user to be a consumer and publisher at the same time. Many academic publications have been written on this topic with the purpose of defining what the term social media really means (for example, Users of the world, unite! The challenges and opportunities of Social Media, Andreas M. Kaplan and Michael Haenlein, 2010). The aspects that are most commonly shared across different social media platforms are as follows:

Internet-based applications
User-generated content
Networking

Social media are Internet-based applications. It is clear that the advances in Internet and mobile technologies have promoted the expansion of social media. Through your mobile, you can, in fact, immediately connect to a social media platform, publish your content, or catch up with the latest news.

Social media platforms are driven by user-generated content. As opposed to the traditional media model, every user is a potential publisher. More importantly, any user can interact with every other user by sharing content, commenting, or expressing positive appraisal via the like button (sometimes referred to as upvote, or thumbs up).

Social media is about networking. As described, social media is about the users interacting with other users. Being connected is the central concept for most social media platforms, and the content you consume via your news feed or timeline is driven by your connections.

With these main features being central across several platforms, social media is used for a variety of purposes:

Staying in touch with friends and family (for example, via Facebook)
Microblogging and catching up with the latest news (for example, via Twitter)
Staying in touch with your professional network (for example, via LinkedIn)
Sharing multimedia content (for example, via Instagram, YouTube, Vimeo, and Flickr)
Finding answers to your questions (for example, via Stack Overflow, Stack Exchange, and Quora)
Finding and organizing items of interest (for example, via Pinterest)

This book aims to answer one central question: how can we extract useful knowledge from the data coming from social media? Taking one step back, we need to define what knowledge is and what makes it useful.

Traditional definitions of knowledge come from information science. The concept of knowledge is usually pictured as part of a pyramid, sometimes referred to as knowledge hierarchy, which has data as its foundation, information as the middle layer, and knowledge at the top. This knowledge hierarchy is represented in the following diagram:

Figure 1.1: From raw data to semantic knowledge

Climbing the pyramid means refining knowledge from raw data. The journey from raw data to distilled knowledge goes through the integration of context and meaning. As we climb up the pyramid, the technology we build gains a deeper understanding of the original data, and more importantly, of the users who generate such data. In other words, it becomes more useful.

In this context, useful knowledge means actionable knowledge, that is, knowledge that enables a decision maker to implement a business strategy. As a reader of this book, you'll understand the key principles to extract value from social data. Understanding how users interact through social media platforms is one of the key aspects in this journey.

The following sections lay down some of the challenges and opportunities of mining data from social media platforms.

Opportunities

The key opportunity of developing data mining systems is to extract useful insights from data. The aim of the process is to answer interesting (and sometimes difficult) questions using data mining techniques to enrich our knowledge about a particular domain. For example, an online retail store can apply data mining to understand how their customers shop. Through this analysis, they are able to recommend products to their customers, depending on their shopping habits (for example, users who buy item A, also buy item B). This, in general, will lead to a better customer experience and satisfaction, which in return can produce better sales.
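The "users who buy item A also buy item B" idea can be sketched as a simple co-occurrence count. The baskets below are made-up data and the function names are hypothetical; this is only a minimal illustration, and real recommendation systems rely on considerably more refined techniques.

```python
from collections import Counter
from itertools import combinations

# Made-up purchase history: one set of items per customer basket
baskets = [
    {"laptop", "mouse", "keyboard"},
    {"laptop", "mouse"},
    {"mouse", "mouse pad"},
    {"laptop", "keyboard"},
]

def count_co_purchases(baskets):
    """Count how often each pair of items appears in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a canonical order: (A, B), never (B, A)
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    return pair_counts

def also_bought(item, pair_counts):
    """Rank the items most frequently bought together with the given item."""
    related = Counter()
    for (first, second), count in pair_counts.items():
        if first == item:
            related[second] += count
        elif second == item:
            related[first] += count
    return related.most_common()

pair_counts = count_co_purchases(baskets)
print(also_bought("laptop", pair_counts))
```

Counting pairs rather than single items is what lets the store answer the question from the point of view of any given product.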

Many organizations in different domains can apply data mining techniques to improve their business. Some examples include the following:

Banking:
- Identifying loyal customers to offer them exclusive promotions
- Recognizing patterns of fraudulent transactions to reduce costs
Medicine:
- Understanding patient behavior to forecast surgery visits
- Supporting doctors in identifying successful treatments depending on the patient's history

Retail:
- Understanding shopping patterns to improve customer experience
- Improving the effectiveness of marketing campaigns with better targeting
- Analyzing real-time traffic data to find the quickest route for food delivery

So how does this translate to the realm of social media? The core of the matter lies in how users share their data through social media platforms. Organizations are no longer limited to analyzing the data they collect directly, as they now have access to much more data.

The solution for this data collection happens through well-engineered language-agnostic APIs. A common practice among social media platforms is, in fact, to offer a Web API to developers who want to integrate their applications with particular social media functionalities.

Note

Application Programming Interface

An Application Programming Interface (API) is a set of procedure definitions and protocols that describe the behavior of a software component, such as a library or remote service, in terms of its allowed operations, inputs, and outputs. When using a third-party API, developers don't need to worry about the internals of the component, but only about how they can use it.

With the term Web API, we refer to a web service that exposes a number of URIs to the public, possibly behind an authentication layer, to access the data. A common architectural approach for designing this kind of API is called Representational State Transfer (REST). An API that implements the REST architecture is called a RESTful API. We still prefer the generic term Web API, as many of the existing APIs do not strictly follow the REST principles. For the purpose of this book, a deep understanding of the REST architecture is not required.
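From the client's point of view, using a Web API usually boils down to building a URI for the desired resource and decoding the response. The following is a minimal sketch under stated assumptions: the endpoint `https://api.example.com/v1/posts`, its parameters, and the response body are all hypothetical, and no network request is actually performed.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used for illustration only
BASE_URL = "https://api.example.com/v1/posts"

def build_request_url(base_url, **params):
    """Attach query parameters to the URI of an API resource."""
    return "{}?{}".format(base_url, urlencode(params))

def parse_response(body):
    """Decode a JSON response body into Python data structures."""
    return json.loads(body)

url = build_request_url(BASE_URL, user="alice", count=2)
print(url)

# A canned response standing in for what a remote service might return
fake_body = '{"posts": [{"id": 1, "text": "Hello"}, {"id": 2, "text": "World"}]}'
data = parse_response(fake_body)
for post in data["posts"]:
    print(post["id"], post["text"])
```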

Challenges

Some of the challenges of social media mining are inherited from the broader field of data mining.

When dealing with social data, we're often dealing with big data. To understand the meaning of big data and the challenges it entails, we can go back to the traditional definition (3D Data Management: Controlling Data Volume, Velocity and Variety, Doug Laney, 2001) that is also known as the three Vs of big data: volume, variety, and velocity. Over the years, this definition has also been expanded by adding more Vs, most notably value, as providing value to an organization is one of the main purposes of exploiting big data. Regarding the original three Vs, volume means dealing with data that spans more than one machine. This, of course, requires a different infrastructure from small data processing (for example, in-memory). Moreover, volume is also associated with velocity in the sense that data is growing so fast that the concept of big becomes a moving target. Finally, variety concerns how data is present in different formats and structures, often incompatible with one another and with different semantics. Data from social media can tick all three Vs.

The rise of big data has pushed the development of new approaches to database technologies towards a family of systems called NoSQL. The term is an umbrella for multiple database paradigms that share the common trait of moving away from traditional relational data, promoting dynamic schema design. While this book is not about database technologies, from this field, we can still appreciate the need for dealing with a mixture of well-structured, unstructured, and semi-structured data. The phrase structured data refers to information that is well organized and typically presented in a tabular form. For this reason, the connection with relational databases is immediate. The following table shows an example of structured data that represents books sold by a bookshop:

Title	Genre	Price
1984	Political fiction	12
War and Peace	War novel	10

This kind of data is structured as each represented item has a precise organization, specifically, three attributes called title, genre, and price.
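As a minimal sketch of what this precise organization means in practice, the table can be loaded with Python's standard `csv` module. We assume here that the table is stored as tab-separated text; the variable names are chosen for illustration only.

```python
import csv
import io

# The book table from the text, in tab-separated form
raw_table = (
    "Title\tGenre\tPrice\n"
    "1984\tPolitical fiction\t12\n"
    "War and Peace\tWar novel\t10\n"
)

reader = csv.DictReader(io.StringIO(raw_table), delimiter="\t")
books = list(reader)

# Every record has exactly the same three attributes
for book in books:
    print(book["Title"], book["Genre"], book["Price"])

# Values are read as strings, so convert before doing arithmetic
total_price = sum(int(book["Price"]) for book in books)
print(total_price)
```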

The opposite of structured data is unstructured data, which is information without a predefined data model, or simply not organized according to a predefined data model. Unstructured data is typically in the form of textual data, for example, e-mails, documents, social media posts, and so on. Techniques presented throughout this book can be used to extract patterns in unstructured data to provide some structure.
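As a small taste of what providing some structure can look like, the following sketch pulls hashtags and mentions out of a made-up post using regular expressions. The patterns are deliberately simplified and the function name is hypothetical; real platforms have more detailed rules about what counts as a valid hashtag or username.

```python
import re

# A made-up social media post
post = "Loving the new #Python release! Thanks @guido and @pycon #programming"

# Simplified patterns: a marker character followed by word characters
HASHTAG_PATTERN = re.compile(r"#(\w+)")
MENTION_PATTERN = re.compile(r"@(\w+)")

def extract_entities(text):
    """Turn free text into a small structured record."""
    return {
        "hashtags": HASHTAG_PATTERN.findall(text),
        "mentions": MENTION_PATTERN.findall(text),
        "length": len(text),
    }

print(extract_entities(post))
```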

Between structured and unstructured data, we can find semi-structured data. In this case, the structure is either flexible or not fully predefined. It is sometimes also referred to as a self-describing structure. A typical example of data format that is semi-structured is JSON. As the name suggests, JSON borrows its notation from the programming language JavaScript. This data format has become extremely popular due to its wide use as a way to exchange data between client and server in a web application. The following snippet shows an example of the JSON representation that extends the previous book data:

[
  {
    "title": "1984",
    "price": 12,
    "author": "George Orwell",
    "genre": ["Political fiction", "Social science fiction"]
  },
  {
    "title": "War and Peace",
    "price": 10,
    "genre": ["Historical", "Romance", "War novel"]
  }
]

What we can observe from this example is that the first book has the author attribute, whereas this attribute is not present in the second book. Moreover, the genre attribute is here presented as a list, with a variable number of values. Both these aspects are usually avoided in a well-structured (relational) data format, but are perfectly fine in JSON and, more generally, when dealing with semi-structured data.
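In Python, this flexibility means that code reading such data should not assume that every attribute is present. A minimal sketch using the standard `json` module follows; the default value `"unknown"` is our own choice for illustration.

```python
import json

raw = """
[
  {"title": "1984", "price": 12, "author": "George Orwell",
   "genre": ["Political fiction", "Social science fiction"]},
  {"title": "War and Peace", "price": 10,
   "genre": ["Historical", "Romance", "War novel"]}
]
"""

books = json.loads(raw)

for book in books:
    # dict.get() supplies a default when an attribute is absent
    author = book.get("author", "unknown")
    # The number of genres varies from record to record
    print(book["title"], author, len(book["genre"]))
```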

The discussion on structured and unstructured data translates into handling different data formats and approaching data integrity in different ways. The phrase data integrity is used to capture the combination of challenges coming from the presence of dirty, inconsistent, or incomplete data.

The case of inconsistent and incomplete data is very common when analyzing user-generated content, and it calls for attention, especially with data from social media. It is very rare to observe users who share their data methodically, almost in a formal fashion. On the contrary, social media often consists of informal environments, with some contradictions. For example, if a user wants to complain about a product on the company's Facebook page, the user first needs to like the page itself, which is quite the opposite of being upset with a company due to the poor quality of their product. Understanding how users interact on social media platforms is crucial to design a good analysis.

Developing data mining applications also requires us to consider issues related to data access, particularly when company policies translate into the lack of data to analyze. In other words, data is not always openly available. As discussed earlier, in social media mining this is a little less of an issue compared to other corporate environments, as most social media platforms offer well-engineered language-agnostic APIs that allow us to access the data we need. The availability of such data is, of course, still dependent on how users share their data and how they grant us access. For example, Facebook users can decide the level of detail that can be shown in their public profile and the details that can be shown only to their friends. Profile information, such as birthday, current location, and work history (as well as many more), can all be individually flagged as private or public. Similarly, when we try to access such data through the Facebook API, the users who sign up to our application have the opportunity to grant us access only to a limited subset of the data we are asking for.

One last general challenge of data mining lies in understanding the data mining process itself and being able to explain it. In other words, coming up with the right question before we start analyzing the data is not always straightforward. More often than not, research and development (R&D) processes are driven by exploratory analysis, in the sense that in order to understand how to tackle the problem, we first need to start tinkering with it. A related concept in statistics is described by the phrase correlation does not imply causation. Many statistical tests can be used to establish correlation between two variables, that is, two events occurring together, but this is not sufficient to establish a cause-effect relationship in either direction. Funny examples of bizarre correlations can be found all over the Web. A popular case was published in the New England Journal of Medicine, one of the most reputable medical journals, showing an interesting correlation between the amount of chocolate consumed per capita in a country and the number of Nobel Prizes awarded (Chocolate Consumption, Cognitive Function, and Nobel Laureates, Franz H. Messerli, 2012).

When performing an exploratory analysis, it is important to keep in mind that correlation (two events occurring together) is a bidirectional relationship, while causation (event A has caused event B) is a unidirectional one. Does chocolate make you smarter or do smart people like chocolate more than the average person? Do the two events occur together just by random chance? Is there a third, yet unseen, variable that plays some role in the correlation? Simply observing a correlation is not sufficient to describe causality, but it is often an interesting starting point to ask important questions about the data we are observing.
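The bidirectional nature of correlation can be checked numerically. The following sketch computes the Pearson correlation coefficient, one common measure of correlation, in both directions. The numbers are made up for illustration and are not taken from the cited study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up values, for illustration only
chocolate = [1.0, 2.0, 3.0, 4.0, 5.0]
laureates = [2.0, 4.1, 5.9, 8.2, 9.8]

r_forward = pearson(chocolate, laureates)
r_backward = pearson(laureates, chocolate)
print(r_forward, r_backward)
```

Swapping the two variables leaves the coefficient unchanged, so the number alone cannot tell us which variable, if any, influences the other.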

The following section generalizes the way our application interacts with a social media API and performs the desired analysis.

Social media mining techniques

This section briefly discusses the overall process for building a social media mining application, before digging into the details in the next chapters.

The process can be summarized in the following steps:

Authentication
Data collection
Data cleaning and pre-processing
Modeling and analysis
Result presentation

Figure 1.2 shows an overview of the process:

Figure 1.2: The overall process of social media mining
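The five steps can be sketched as a chain of functions, each feeding the next. Every function below is a hypothetical stand-in: the authentication is faked, the collected posts are canned, and the analysis is a trivial word count. The point is only to show how the stages fit together.

```python
from collections import Counter

def authenticate(api_key, api_secret):
    """Step 1: stand-in for the real authentication exchange."""
    return {"api_key": api_key, "authorized": True}

def collect(session):
    """Step 2: stand-in for API calls; returns canned posts."""
    if not session["authorized"]:
        raise PermissionError("Not authorized to collect data")
    return ["  Python is GREAT ", "I love Python", ""]

def clean(posts):
    """Step 3: normalize whitespace and case, drop empty posts."""
    return [post.strip().lower() for post in posts if post.strip()]

def analyze(posts):
    """Step 4: a trivial analysis, counting word frequencies."""
    return Counter(word for post in posts for word in post.split())

def present(word_counts, top=3):
    """Step 5: format the most frequent words for display."""
    return ["{}: {}".format(word, count)
            for word, count in word_counts.most_common(top)]

session = authenticate("my-key", "my-secret")
report = present(analyze(clean(collect(session))))
print("\n".join(report))
```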

The authentication step is typically performed using the industry standard called Open Authorization (OAuth). The process is three-legged, meaning that it involves three actors: a user, a consumer (our application), and a resource provider (the social media platform). The steps in the process are as follows:

1. The user agrees with the consumer to grant access to the social media platform.
2. As the user doesn't give their social media password directly to the consumer, the consumer has an initial exchange with the resource provider to generate a token and a secret. These are used to sign each request and prevent forgery.
3. The user is then redirected with the token to the resource provider, which will ask to confirm authorizing the consumer to access the user's data.
4. Depending on the nature of the social media platform, it will also ask to confirm whether the consumer can perform any action on the user's behalf, for example, post an update, share a link, and so on.
5. The resource provider issues a valid token for the consumer.
6. The token can then go back to the user confirming the access.

Figure 1.3 shows the OAuth process with references to each of the steps described earlier. The key point to remember is that the exchange of credentials (username/password) only happens between the user and the resource provider, in steps 3 and 4. All other exchanges are driven by tokens:

Figure 1.3: The OAuth process
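To give an intuition of how a token and a secret can be used to sign each request and prevent forgery, the following sketch signs a request with a keyed hash (HMAC) and verifies it on the other side. This is a deliberately simplified illustration and not the actual OAuth signature algorithm: the message format, the choice of SHA-256, the URL, and all the values are assumptions made for this example. In practice, the libraries used in this book take care of these details.

```python
import hashlib
import hmac

def sign_request(method, url, params, secret):
    """Sign a request with a shared secret (simplified, not the full OAuth scheme)."""
    # Sort the parameters so both sides build exactly the same message
    param_string = "&".join(
        "{}={}".format(key, value) for key, value in sorted(params.items())
    )
    message = "&".join([method.upper(), url, param_string])
    return hmac.new(
        secret.encode("utf-8"), message.encode("utf-8"), hashlib.sha256
    ).hexdigest()

def verify_request(method, url, params, secret, signature):
    """Recompute the signature on the provider side and compare safely."""
    expected = sign_request(method, url, params, secret)
    return hmac.compare_digest(expected, signature)

# Hypothetical values, for illustration only
secret = "token-secret"
params = {"token": "abc123", "status": "hello"}
url = "https://api.example.com/update"

signature = sign_request("POST", url, params, secret)
print(verify_request("POST", url, params, secret, signature))

# A tampered request no longer matches the signature
tampered = dict(params, status="goodbye")
print(verify_request("POST", url, tampered, secret, signature))
```

Since the secret itself never travels with the request, a third party who intercepts the request cannot alter it and produce a matching signature.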

From the user's perspective, this apparently complex process happens when the user is visiting our web app and hits the Login with Facebook (or Twitter, Google+, and so on) button. The user then has to confirm that they are granting privileges to our app, and everything else happens behind the scenes.

From a developer's perspective, the nice part is that the Python ecosystem has already well-established libraries for most social media platforms, which come with an implementation of the authentication process. As a developer, once you have registered your application with the target service, the platform will provide the necessary authorization tokens for your app. Figure 1.4 shows a screenshot of a custom Twitter app called Intro to Text Mining. On the Keys and Access Tokens configuration page, the developer can find the API key and secret, as well as the access token and access token secret. We'll discuss the details of the authorization for each social media platform in the relevant chapters:

Figure 1.4: Configuration page for a Twitter app called Intro to Text Mining. The page contains all the authorization tokens for the developers to use in their app.

The data collection, cleaning, and pre-processing steps are also dependent on the social media platform we are dealing with. In particular, the data collection step is tied to the initial authorization as we can only download data that we have been granted access to. Cleaning and pre-processing, on the other hand, are functional to the type of data modeling and analysis that we decide to employ to produce insights on the data.

Back to Figure 1.2, the modeling and analysis is performed by the component labeled ANALYTICS ENGINE. Typical data processing tasks that we'll encounter throughout this book are text mining and graph mining.

Text mining (also referred to as text analytics) is the process of deriving structured information from unstructured textual data. Text mining is applicable to most social media platforms, as the users are allowed to publish content in the form of posts or comments.

Some examples of text mining applications include the following:

Document classification: This is the task of assigning a document to one or more categories
Document clustering: This is the task of grouping documents into subsets (called clusters) that are coherent and distinct from one another (for example, by topic or sub-topic)
Document summarization: This is the task of creating a shortened version of the document in order to reduce the information overload to the user, while still retaining the most important aspects described in the original source
Entity extraction: This is the task of locating and classifying entity references from a text into some desired categories such as persons, locations, or organizations
Sentiment analysis: This is the task of identifying and categorizing sentiments and opinions expressed in a text in order to understand the attitude towards a particular product, topic, service, and so on

Not all these applications are tailored for social media, but the growing amount of textual data available through these platforms makes social media a natural playground for text mining.
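To make one of these tasks concrete, the following is a toy sketch of sentiment analysis based on counting positive and negative words. The word lists, the sample posts, and the scoring rule are all made up for illustration; this naive approach ignores negation, sarcasm, and context, and it is not the method used later in this book.

```python
import re

# Tiny hand-made lexicons, for illustration only
POSITIVE_WORDS = {"good", "great", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Positive minus negative word count; above zero suggests a positive tone."""
    tokens = tokenize(text)
    positive = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negative = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return positive - negative

posts = [
    "I love this phone, the camera is great",
    "Terrible battery life and poor support",
    "It arrived on Tuesday",
]
for post in posts:
    print(sentiment_score(post), post)
```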

Graph mining is also focused on the structure of the data. Graphs are a simple-to-understand, yet powerful, data structure that is generic enough to be applied to many different data representations. In graphs, there are two main components to consider: nodes, which represent entities or objects, and edges, which represent relationships or connections between nodes. In the context of social media, the obvious use of a graph is to represent the social relationships of our users. More generally, in social sciences, the graph structure used to represent social relationships is also referred to as a social network.

In terms of using such a data structure within social media, we can naturally represent users as nodes, and their relationships (such as friends of or followers) as edges. In this way, information such as friends of friends who like Python becomes easily accessible just by traversing the graph (that is, walking from one node to the other by following the edges). Graph theory and graph mining offer more options to discover deeper insights that are not as clearly visible as the previous example.
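The friends of friends who like Python query can be sketched with plain dictionaries and sets. The users, their connections, and their interests are made up, and the function names are hypothetical; dedicated graph libraries offer much richer functionality.

```python
# Made-up social network: each user maps to the set of their friends
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "erin"},
    "dave": {"bob"},
    "erin": {"carol"},
}

# Made-up interests for each user
likes = {
    "alice": {"python"},
    "bob": {"java"},
    "carol": {"python"},
    "dave": {"python"},
    "erin": {"rust"},
}

def friends_of_friends(user, graph):
    """Users two edges away, excluding the user and their direct friends."""
    result = set()
    for friend in graph[user]:
        # Walk one more edge from each direct friend
        result |= graph[friend]
    return result - graph[user] - {user}

def friends_of_friends_who_like(user, topic, graph, interests):
    """Filter friends of friends by an interest."""
    return {
        person
        for person in friends_of_friends(user, graph)
        if topic in interests[person]
    }

print(sorted(friends_of_friends("alice", friends)))
print(friends_of_friends_who_like("alice", "python", friends, likes))
```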

After a high-level discussion on social media mining, the following section will introduce some of the useful Python tools that are commonly used in data mining projects.

Python tools for data science

Until now, we've been using the term data mining when referring to problems and techniques that we're going to apply throughout this book. The title of this section, in fact, mentions the term data science. The use of this term has exploded in recent years, especially in business environments, while many academics and journalists have also criticized its use as a buzzword. Meanwhile, other academic institutions started offering courses on data science, and many books and articles have been published on the subject. Rather than having a strong opinion on where we should draw the border between different disciplines, we limit ourselves to observing that, nowadays, there is a general interest in multiple fields, including data science, data mining, data analysis, statistics, machine learning, artificial intelligence, data visualization, and more. The topics we're discussing are interdisciplinary by their very nature, and they all borrow from each other from time to time. This is certainly an amazing time to be working in any of these fields, with a lot of interest from the public and a constant buzz with new advances in interesting projects.

The purpose of this section is to introduce Python as a tool for data science, and to describe part of the Python ecosystem that we're going to use in the next chapters.

Python is one of the most interesting languages for data analytics projects. The following are some of the reasons that make it fit for purpose:

Declarative and intuitive syntax
Rich ecosystem for data processing
Efficiency

Python has a shallow learning curve due to its elegant syntax. Being a dynamic and interpreted language, it facilitates rapid development and interactive exploration. The ecosystem for data processing is partially described in the following sections, which will introduce the main packages we'll use in this book.

In terms of efficiency, interpreted and high-level languages are not famous for being furiously fast. Tools such as NumPy achieve efficiency by hooking to low-level libraries under the hood, and exposing a friendly Python interface. Moreover, many projects make use of Cython, a superset of Python that enriches the language by allowing us, among other features, to declare static variable types and compile to C. Many other projects in the Python world are in the process of tackling efficiency issues with the overall goal of making pure Python implementations faster. In this book, we won't dig into Cython or any of these promising projects, but we'll make use of NumPy (especially through other libraries that employ NumPy) for data analysis.

Python development environment setup

When this book was started, Python 3.5 had just been released and received some attention for some of its latest features, such as improved support for asynchronous programming and semantic definition of type hints. In terms of usage, Python 3.5 is probably not widely used yet, but it represents the current line of development of the language.

Note

The examples in this book are compatible with Python 3, particularly with versions 3.4+ and 3.5+.

In the never-ending discussion about choosing between Python 2 and Python 3, one of the points to keep in mind is that the support for Python 2 will be discontinued in a few years (at the time of writing, the sunset date is 2020). New features are not developed in Python 2, as this branch is only for bug fixes. On the other hand, many libraries are still developed for Python 2 first, and then the support for Python 3 is added later. For this reason, from time to time, there could be a minor hiccup in terms of compatibility of some library, which is usually resolved by the community quite quickly. In general, if there is no strong reason against this choice, the preference should go to Python 3, especially for new green-field projects.

pip and virtualenv

In order to keep the development environment clean, and ease the transition from prototype to production, the suggestion is to use virtualenv to manage a virtual environment and install dependencies. virtualenv is a tool for creating and managing isolated Python environments. By using an isolated virtual environment, developers avoid polluting the global Python environment with libraries that could be incompatible with each other. The tool allows us to maintain multiple projects that require different configurations and easily switch from one to the other. Moreover, the virtual environment can be installed in a local folder that is accessible to users without administrative privileges.

To install virtualenv in the global Python environment in order to make it available to all users, we can use pip from a terminal (Linux/Unix) or command prompt (Windows):

$ [sudo] pip install virtualenv

The sudo command might be necessary on Linux/Unix or macOS if our current user doesn't have administrator privileges on the system.

If a package is already installed, it can be upgraded to the latest version:

$ pip install --upgrade [package name]

Note

Since Python 3.4, the pip tool is shipped with Python. Previous versions require a separate installation of pip as explained on the project page (https://github.com/pypa/pip). The tool can also be used to upgrade itself to the latest version:

$ pip install --upgrade pip

Once virtualenv is globally available, for each project, we can define a separate Python environment where dependencies are installed in isolation, without tampering with the global environment. In this way, tracking the required dependencies of a single project is extremely easy.

In order to set up a virtual environment, follow these steps:

$ mkdir my_new_project # create new project folder
$ cd my_new_project # enter project folder
$ virtualenv my_env # setup custom virtual environment

This will create a my_env subfolder, which is also the name of the virtual environment we're creating, in the current directory. Inside this subfolder, we have all the necessary tools to create the isolated Python environment, including the Python binaries and the standard library. In order to activate the environment, we can type the following command:

$ source my_env/bin/activate

Once the environment is active, the following will be visible on the prompt:

(my_env)$

Python packages can be installed for this particular environment using pip:

(my_env)$ pip install [package-name]

All the new Python libraries installed with pip when the environment is active will be installed into my_env/lib/python{VERSION}/site-packages. Notice that being a local folder, we won't need administrative access to perform this command.
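To double-check from inside Python which environment is in use, a small sketch such as the following may help. The function name in_virtual_environment is our own, and the check relies on the assumption that the environment tool either sets sys.real_prefix (older virtualenv releases) or makes sys.prefix differ from sys.base_prefix (the built-in venv module and more recent virtualenv releases).

```python
import sys
import sysconfig


def in_virtual_environment():
    """Best-effort check: True if the interpreter runs inside a virtual environment."""
    # Fall back to sys.prefix if base_prefix is not available
    base_prefix = getattr(sys, 'base_prefix', sys.prefix)
    return hasattr(sys, 'real_prefix') or sys.prefix != base_prefix


# Folder where pure-Python packages are installed for the current interpreter
site_packages = sysconfig.get_paths()['purelib']

print("Virtual environment active: {}".format(in_virtual_environment()))
print("Packages are installed into: {}".format(site_packages))
```

When my_env is active, the second line should point to a folder inside my_env rather than to the global Python installation.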

When we want to deactivate the virtual environment, we can simply type the following command:

$ deactivate

The process described earlier should work for the official Python distributions that are shipped (or available for download) with your operating system.

Conda, Anaconda, and Miniconda

There is also one more option to consider, called conda (http://conda.pydata.org/), which is gaining some traction in the scientific community as it makes the dependency management quite easy. Conda is an open source package manager and environment manager for installing multiple versions of software packages (and related dependencies), which makes it easy to switch from one version to the other. It supports Linux, macOS, and Windows, and while it was initially created for Python, it can be used to package and distribute any software.

There are mainly two distributions that ship with conda: the batteries-included version, Anaconda, which comes with approximately 100 packages for scientific computing already installed, and the lightweight version, Miniconda, which simply comes with Python and the conda installer, without external libraries.

If you're new to Python, have some time for the bigger download and disk space to spare, and don't want to install all the packages manually, you can get started with Anaconda. For Windows and macOS, Anaconda is available with either a graphical or command-line installer. Figure 1.5 shows a screen capture of the installation procedure on macOS. For Linux, only the command-line installer is available. In all cases, it's possible to choose between Python 2 and Python 3. If you prefer to have full control of your system, Miniconda will probably be your favorite option:

Figure 1.5: Screen capture of the Anaconda installation

Once you've installed your version of conda, in order to create a new conda environment, you can use the following command:

$ conda create --name my_env python=3.4 # or favorite version

The environment can be activated with the following command:

$ conda activate my_env

Similar to what happens with virtualenv, the environment name will be visible in the prompt:

(my_env)$

New packages can be installed for this environment with the following command:

$ conda install [package-name]

Finally, you can deactivate an environment by typing the following command:

$ conda deactivate

Another nice feature of conda is the ability to install packages from pip as well, so if a particular library is not available via conda install, or it's not been updated to the latest version we need, we can always fall back to the traditional Python package manager while using a conda environment.

If not specified otherwise, by default, conda will look up packages on https://anaconda.org, while pip makes use of the Python Package Index (PyPI in short, also known as CheeseShop) at https://pypi.python.org/pypi. Both installers can also be instructed to install packages from the local filesystem or a private repository.

The following section will use pip to install the required packages, but you can easily switch to conda if you prefer to use this alternative.

Efficient data analysis

This section introduces two of the foundational packages for scientific Python: NumPy and pandas.

NumPy (Numerical Python) offers fast and efficient processing of array-like data structures. For numerical data, storing and manipulating data with the Python built-ins (for example, lists or dictionaries) is much slower than using a NumPy array. Moreover, NumPy arrays are often used by other libraries as containers for input and output of different algorithms that require vectorized operations.

To install NumPy with pip/virtualenv, use the following command:

$ pip install numpy

When using the batteries-included Anaconda distribution, developers will find both NumPy and pandas preinstalled, so the preceding installation step is not necessary.

The core data structure of this library is the multi-dimensional array called ndarray.

The following snippet, run from the interactive interpreter, showcases the creation of a simple array with NumPy:

>>> import numpy as np
>>> data = [1, 2, 3] # a list of int
>>> my_arr = np.array(data)
>>> my_arr
array([1, 2, 3])
>>> my_arr.shape
(3,)
>>> my_arr.dtype
dtype('int64')
>>> my_arr.ndim
1

The example shows that our data are represented by a one-dimensional array (the ndim attribute) with three elements as we expect. The data type of the array is int64 as all our inputs are integers.
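Since ndarray is a multi-dimensional structure, the same attributes can be inspected on a two-dimensional array. The following is a minimal sketch with made-up values; the integer type reported by dtype may differ between platforms, so it's not shown here.

```python
import numpy as np

# A nested list produces a two-dimensional array (2 rows, 3 columns)
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.shape)  # (2, 3): rows first, then columns
print(matrix.ndim)   # 2
print(matrix.size)   # 6: total number of elements
```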

We can observe the speed of the NumPy array by profiling a simple operation, such as the sum of a list, using the timeit module:

# Chap01/demo_numpy.py 
from timeit import timeit 
import numpy as np 
 
if __name__ == '__main__': 
  setup_sum = 'data = list(range(10000))' 
  setup_np = 'import numpy as np;' 
  setup_np += 'data_np = np.array(list(range(10000)))' 
   
  run_sum = 'result = sum(data)' 
  run_np = 'result = np.sum(data_np)' 
 
  time_sum = timeit(run_sum, setup=setup_sum, number=10000) 
  time_np = timeit(run_np, setup=setup_np, number=10000) 
 
  print("Time for built-in sum(): {}".format(time_sum)) 
  print("Time for np.sum(): {}".format(time_np))

The timeit() function takes a piece of code as the first parameter and runs it a number of times, producing the time required for the run as output. In order to focus on the specific piece of code that we're analyzing, the initial data setup and the required imports are moved to the setup parameter that will be run only once and will not be included in the profiling. The last parameter, number, limits the number of iterations to 10,000 instead of the default, which is 1 million. The output you observe should look as follows:

Time for built-in sum(): 0.9970562970265746
Time for np.sum(): 0.07551316602621228

The built-in sum() function is more than 10 times slower than the NumPy sum() function. For more complex pieces of code, we can easily observe differences of a greater order of magnitude.
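A single timing can be affected by whatever else the machine is doing. As a hedged variation on the preceding script, the repeat() function from the same timeit module runs the whole measurement several times, so that we can keep the best run. The statement, the data size, and the repeat and number values below are arbitrary choices for illustration.

```python
from timeit import repeat

setup_sum = 'data = list(range(10000))'
run_sum = 'result = sum(data)'

# Three independent measurements, each executing the statement 200 times
timings = repeat(run_sum, setup=setup_sum, repeat=3, number=200)

# The minimum is usually the least disturbed measurement
best = min(timings)
print("Best of {} runs: {}".format(len(timings), best))
```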

Tip

Naming conventions

The Python community has converged on some de-facto standards to import some popular libraries. NumPy and pandas are two well-known examples, as they are usually imported with an alias, for example: import numpy as np

In this way, NumPy functionalities can be accessed with np.function_name() as illustrated in the preceding examples. Similarly, the pandas library is aliased to pd. In principle, importing the whole library namespace with from numpy import * is considered bad practice because it pollutes the current namespace.

Some of the characteristics of a NumPy array that we want to keep in mind are detailed as follows:

The size of a NumPy array is fixed at creation, unlike, for example, Python lists that can be changed dynamically, so operations that change the size of the array are really creating a new one and deleting the original.
The data type for each element of the array must be the same (with the exception of having arrays of objects, hence potentially of different memory sizes).
NumPy promotes the use of operations with vectors, producing a more compact and readable code.
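The three characteristics above can be observed with a short sketch. The values are made up, and np.append() is used here only as one example of an operation that changes the size of an array.

```python
import numpy as np

arr = np.array([1, 2, 3])

# "Resizing" returns a new array; the original one keeps its size
bigger = np.append(arr, 4)
print(arr.shape, bigger.shape)  # (3,) (4,)

# Mixing integers and floats yields a single common type for all elements
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# Vectorized operations work on the whole array, with no explicit loop
doubled = arr * 2
total = arr + np.array([10, 20, 30])
print(doubled.tolist(), total.tolist())  # [2, 4, 6] [11, 22, 33]
```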

The second library introduced in this section is pandas. It is built on top of NumPy, so it also provides fast computation, and it offers convenient data structures, called Series and DataFrame, which allow us to perform data manipulation in a flexible and concise way.

Some of the nice features of pandas include the following:

Fast and efficient objects for data manipulation
Tools to read and write data between different formats such as CSV, text files, MS Excel spreadsheets, or SQL data structures
Intelligent handling of missing data and related data alignment
Label-based slicing and dicing of large datasets
SQL-like data aggregations and data transformation
Support for time series functionalities
Integrated plotting functionalities

We can install pandas from the CheeseShop with the usual procedure:

$ pip install pandas

Let's consider the following example, run from the Python interactive interpreter, using a small made-up toy example of user data:

>>> import pandas as pd
>>> data = {'user_id': [1, 2, 3, 4], 'age': [25, 35, 31, 19]}
>>> frame = pd.DataFrame(data, columns=['user_id', 'age'])
>>> frame.head()
   user_id  age
0        1   25
1        2   35
2        3   31
3        4   19

The initial data layout is based on a dictionary, where the keys are attributes of the users (a user ID and age represented as number of years). The values in the dictionary are lists, and for each user, the corresponding attributes are aligned depending on the position. Once we create the DataFrame with these data, the alignment of the data becomes immediately clear. The head() function prints the data in a tabular form, truncating to the first five rows by default if the data is bigger than that.

We can now augment the DataFrame by adding one more column:

>>> frame['over_thirty'] = frame['age'] > 30
>>> frame.head()
   user_id  age over_thirty
0        1   25       False
1        2   35        True
2        3   31        True
3        4   19       False

Using the declarative syntax of pandas, we don't need to iterate through the whole column in order to access its data, but we can apply a SQL-like operation as shown in the preceding example. This operation has used the existing data to create a column of Booleans. We can also augment the DataFrame by adding new data:

>>> frame['likes_python'] = pd.Series([True, False, True, True], index=frame.index)
>>> frame.head()
   user_id  age over_thirty likes_python
0        1   25       False         True
1        2   35        True        False
2        3   31        True         True
3        4   19       False         True

We can observe some basic descriptive statistics using the describe() method:

>>> frame.describe()
        user_id   age over_thirty likes_python
count  4.000000   4.0           4            4
mean   2.500000  27.5         0.5         0.75
std    1.290994   7.0     0.57735          0.5
min    1.000000  19.0       False        False
25%    1.750000  23.5           0         0.75
50%    2.500000  28.0         0.5            1
75%    3.250000  32.0           1            1
max    4.000000  35.0        True         True

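So, for example, the mean row tells us that 50% of our users are over 30, and 75% of them like Python, because True counts as 1 and False as 0. Depending on the pandas version, describe() may summarize only the numeric columns by default, in which case the two Boolean columns are missing from the output.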
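The same proportions can also be computed directly on the Boolean columns, without going through describe(). The following sketch rebuilds the toy DataFrame used in this section; the variable names share_over_thirty, share_likes_python, and selected_ids are our own.

```python
import pandas as pd

data = {'user_id': [1, 2, 3, 4], 'age': [25, 35, 31, 19]}
frame = pd.DataFrame(data, columns=['user_id', 'age'])
frame['over_thirty'] = frame['age'] > 30
frame['likes_python'] = [True, False, True, True]

# The mean of a Boolean column is the fraction of True values
share_over_thirty = frame['over_thirty'].mean()
share_likes_python = frame['likes_python'].mean()
print(share_over_thirty, share_likes_python)  # 0.5 0.75

# Boolean columns can also be combined to select rows
mask = frame['over_thirty'] & frame['likes_python']
selected_ids = frame.loc[mask, 'user_id'].tolist()
print(selected_ids)  # [3]
```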

Note

Downloading the example code

Detailed steps to download the code bundle are mentioned in the Preface of this book. Please have a look.

The code bundle for the book is also hosted on GitHub at https://github.com/bonzanini/Book-SocialMediaMiningPython.

Machine learning

Machine learning is the discipline that studies and develops algorithms to learn from, and make predictions on, data. It is strongly related to data mining, and sometimes, the names of the two fields are used interchangeably. A common distinction between the two fields is roughly given as follows: machine learning focuses on predictions based on known properties of the data, while data mining focuses on discovery based on unknown properties of the data. Both fields borrow algorithms and techniques from their counterpart. One of the goals of this book is to be practical, so we acknowledge that academically the two fields, despite the big overlap, often have distinct goals and assumptions, but we will not worry too much about it.

Some examples of machine learning applications include the following:

Deciding whether an incoming e-mail is spam or not
Choosing the topic of a news article from a list of known subjects such as sport, finance, or politics
Analyzing bank transactions to identify fraud attempts
Deciding, from the apple query, whether a user is interested in fruit or in computers

Some of the most popular methodologies can be categorized into supervised and unsupervised learning approaches, described in the following section. This is an oversimplification that doesn't describe the whole breadth and depth of the machine learning field, but it's a good starting point to appreciate some of its technicalities.

Supervised learning approaches can be employed to solve problems such as classification, in which the data comes with the additional attributes that we want to predict, for example, the label of a class. In this case, the classifier can associate each input object with the desired output. By inferring from the features of the input objects, the classifier can then predict the desired label for the new unseen inputs. Common techniques include Naive Bayes (NB), Support Vector Machine (SVM) and models that belong to the Neural Networks (NN) family, such as perceptrons or multi-layer perceptrons.

The sample inputs used by the learning algorithm to build the mathematical model are called training data, while the unseen inputs that we want to obtain a prediction on are called test data. Inputs of a machine learning algorithm are typically in the form of a vector with each element of the vector representing a feature of the input. For supervised learning approaches, the desired output to assign to each of the unseen inputs is typically called label or target.
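To make this terminology concrete, here is a deliberately tiny sketch in plain Python. It uses a nearest-neighbor rule, which is not one of the techniques listed above and is chosen only because it fits in a few lines; the feature vectors, the labels, and the function names are all made up for illustration.

```python
# Hypothetical training data: pairs of (feature vector, label)
training_data = [
    ([1.0, 1.1], 'small'),
    ([1.2, 0.9], 'small'),
    ([5.0, 5.2], 'large'),
    ([4.8, 5.1], 'large'),
]

# Hypothetical test data: unseen feature vectors without a label
test_data = [[1.1, 1.0], [5.1, 5.0]]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(features, training):
    """Return the label of the training sample closest to the input."""
    _, nearest_label = min(
        training,
        key=lambda pair: squared_distance(features, pair[0])
    )
    return nearest_label


predictions = [predict(sample, training_data) for sample in test_data]
print(predictions)  # ['small', 'large']
```

Each prediction is simply the label (or target) of the closest training sample, so the quality of the result depends entirely on how well the features separate the classes.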

Unsupervised learning approaches are instead applied to problems in which the data come without a corresponding output value. A typical example of this kind of problem is clustering. In this case, an algorithm tries to find hidden structures in the data in order to group similar items into clusters. Another application consists of identifying items that don't appear to belong to a particular group (for example, outlier detection). An example of a common clustering algorithm is k-means.
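The idea behind k-means can be sketched in plain Python on one-dimensional data. This is a simplified illustration, not a production implementation: the points are made up, the starting centroids are fixed by hand rather than chosen at random, and the function name kmeans_1d is our own.

```python
def kmeans_1d(points, centroids, n_iter=10):
    """Simplified k-means on a list of numbers, with given starting centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to the closest centroid
        labels = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Hypothetical data with two visible groups, around 1 and around 5
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels, centroids = kmeans_1d(points, centroids=[1.0, 5.0])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Notice that no label is given as input: the groups emerge only from the distances between the points.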

The main Python package for machine learning is scikit-learn. It's an open source collection of machine learning algorithms that includes tools to access and preprocess data, evaluate the output of an algorithm, and visualize the results.

You can install scikit-learn with the common procedure via the CheeseShop:

$ pip install scikit-learn

Without digging into the details of the techniques, we will now walk through an application of scikit-learn to solve a clustering problem.

As we don't have social data yet, we can employ one of the datasets that is shipped together with scikit-learn.

The data that we're using is called Fisher's Iris dataset, also referred to as the Iris flower dataset. It was introduced in the 1930s by Ronald Fisher and it's today one of the classic datasets: given its small size, it's often used in the literature for toy examples. The dataset contains 50 samples from each of the three species of Iris, and for each sample four features are reported: the length and width of petals and sepals.

The dataset is commonly used as a showcase example for classification as the data comes with the correct labels for each sample, while its application for clustering is less common, mainly because there are just two well-visible clusters with a rather obvious separation. Given its small size and simple structure, it makes the case for a gentle introduction to data analysis with scikit-learn. If you want to run the example, including the data visualization part, you also need to install the matplotlib library with pip install matplotlib. More details on data visualization with Python are discussed later in this chapter.

Let's take a look at the following sample code:

# Chap01/demo_sklearn.py 
from sklearn import datasets 
from sklearn.cluster import KMeans 
import matplotlib.pyplot as plt 
 
if __name__ == '__main__': 
  # Load the data 
  iris = datasets.load_iris() 
  X = iris.data 
  petal_length = X[:, 2] 
  petal_width = X[:, 3] 
  true_labels = iris.target 
  # Apply KMeans clustering 
  estimator = KMeans(n_clusters=3) 
  estimator.fit(X) 
  predicted_labels = estimator.labels_ 
  # Color scheme definition: red, yellow and blue 
  color_scheme = ['r', 'y', 'b'] 
  # Markers definition: circle, "x" and "plus" 
  marker_list = ['o', 'x', '+'] 
  # Assign colors/markers to the predicted labels
  colors_predicted_labels = [color_scheme[lab] for lab in
                             predicted_labels]
  markers_predicted = [marker_list[lab] for lab in
                       predicted_labels] 
  # Assign colors/markers to the true labels 
  colors_true_labels = [color_scheme[lab] for lab in true_labels] 
  markers_true = [marker_list[lab] for lab in true_labels] 
  # Plot and save the two scatter plots 
  for x, y, c, m in zip(petal_width, 
                        petal_length, 
                        colors_predicted_labels, 
                        markers_predicted): 
    plt.scatter(x, y, c=c, marker=m) 
  plt.savefig('iris_clusters.png') 
  plt.clf()  # clear the figure so the second plot starts empty
  for x, y, c, m in zip(petal_width, 
                        petal_length, 
                        colors_true_labels, 
                        markers_true): 
    plt.scatter(x, y, c=c, marker=m) 
  plt.savefig('iris_true_labels.png') 
 
  print(iris.target_names)

Firstly, we will load the dataset into the iris variable, which is an object containing both the data and information about the data. In particular, iris.data contains the data itself, in the form of a NumPy array of arrays, while iris.target contains a numeric label that represents the class a sample belongs to. In each sample vector, the four values represent, respectively, sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. Using the slicing notation for the NumPy array, we extract the third and fourth element of each sample into petal_length and petal_width, respectively. These will be used to plot the samples in a two-dimensional representation, even though the vectors have four dimensions.
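The slicing notation can be tried in isolation on a small array that follows the same column layout. The three rows below are illustrative values, not taken from the dataset.

```python
import numpy as np

# Columns: sepal length, sepal width, petal length, petal width (in cm)
samples = np.array([
    [5.0, 3.4, 1.5, 0.2],
    [6.8, 3.1, 4.5, 1.4],
    [6.4, 3.2, 6.0, 2.4],
])

# [:, 2] reads as "all the rows, column with index 2"
petal_length = samples[:, 2]
petal_width = samples[:, 3]

print(petal_length.tolist())  # [1.5, 4.5, 6.0]
print(petal_width.tolist())   # [0.2, 1.4, 2.4]
```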

The clustering process consists of two lines of code: one to create an instance of the KMeans algorithm and the second to fit() the data to the model. The simplicity of this interface is one of the characteristics of scikit-learn, which in most cases, allows you to apply a learning algorithm with just a few lines of code. For the application of the k-means algorithm, we choose the number of clusters to be three, as this is given by the data. Keep in mind that knowing the appropriate number of clusters in advance is not something that usually happens. Determining the correct (or the most interesting) number of clusters is a challenge in itself, distinct from the application of a clustering algorithm per se. As the purpose of this example is to briefly introduce scikit-learn and the simplicity of its interface, we take this shortcut. Normally, more effort is put into preparing the data in a format that is understood by scikit-learn.

The second half of the example serves the purpose of visualizing the data using matplotlib. Firstly, we will define a color scheme to visually differentiate the three clusters, using red, yellow, and blue defined in the color_scheme list. Secondly, we will exploit the fact that both the real labels and cluster associations for each sample are given as integers, starting from 0, so they can be used as indexes to match one of the colors.

Notice that while the numbers for the real labels are associated with the particular meaning of the labels, that is, a class name, the cluster numbers are simply used to clarify that a given sample belongs to a cluster, but there is no information on the meaning of the cluster. Specifically, the three classes for the real labels are setosa, versicolor, and virginica, respectively: the three species of Iris represented in the dataset.
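Because cluster numbers are arbitrary, comparing them with the real labels requires an extra step. One possible approach, sketched here with made-up label lists rather than the actual Iris output, is to map each cluster to the real label that occurs most often among its members. This is only feasible when the real labels are known.

```python
from collections import Counter, defaultdict

# Hypothetical real labels and cluster numbers for nine samples
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_ids = [1, 1, 1, 0, 0, 2, 2, 2, 2]

# Collect the real labels of the samples that fall in each cluster
members = defaultdict(list)
for cluster, label in zip(cluster_ids, true_labels):
    members[cluster].append(label)

# Map each cluster number to its most frequent real label
cluster_to_label = {
    cluster: Counter(labels).most_common(1)[0][0]
    for cluster, labels in members.items()
}

# Translate the cluster numbers and count the disagreements
mapped = [cluster_to_label[cluster] for cluster in cluster_ids]
mistakes = sum(1 for m, t in zip(mapped, true_labels) if m != t)

print(cluster_to_label)  # cluster 0 -> label 1, cluster 1 -> label 0, cluster 2 -> label 2
print(mistakes)          # 1
```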

The last lines of the example produce two scatterplots of the data, one for the real labels and another for the cluster association, using the petal length and width as two dimensions. The two plots are represented in Figure 1.6. The position of the items in the two plots is, of course, the same, but what we can observe is how the algorithm has split the three groups. In particular, the cluster at the bottom left is clearly separated from the other two, and the algorithm can easily identify it without doubt. Instead, the other two clusters are more difficult to distinguish as some of the elements overlap, so the algorithm makes some mistakes in this context.

Once again, it's worth mentioning that here we can spot the mistakes because we know the real class of each sample. The algorithm has simply created an association based on the features given to it as input:

Figure 1.6: A 2-D representation of the Iris data, colored according to the real labels (left) and clustering results (right)

Natural language processing

Natural language processing (NLP) is the discipline related to the study of methods and techniques for automatic analysis, understanding, and generation of natural language, that is, the language as written or spoken naturally by humans.

Academically, it's been an active field of study for many decades, as its early days are generally attributed to Alan Turing, one of the fathers of computer science, who proposed a test to evaluate machine intelligence in 1950. The concept is fairly straightforward: if a human judge is having a written conversation with two agents, a human and a machine, can the machine fool the judge into thinking it's not a machine? If this happens, the machine passes the test and shows signs of intelligence.

The test is nowadays known as the Turing Test, and after being common knowledge only in computer science circles for a long time, it's been recently brought to a wider audience by pop media. The movie The Imitation Game (2014), for example, is loosely based on the biography of Alan Turing, and its title is a clear reference to the test itself. Another movie that mentions the Turing Test is Ex Machina (2015), with a stronger emphasis on the development of an artificial intelligence that can lie and fool humans for its own benefit. In this movie, the Turing Test is played between a human judge and the human-looking robot, Ava, with direct verbal interaction. Without spoiling the end of the movie for those who haven't watched it, the story develops with the artificial intelligence proving to be much smarter, in a shady and artful way, than the human. Interestingly, the futuristic robot was trained using search engine logs, to understand and mimic how humans ask questions.

This little detour between past and hypothetical future of artificial intelligence is to highlight how mastering human language will be central for development of an advanced artificial intelligence (AI). Despite all the recent improvements, we're not quite there yet, and NLP is still a hot topic at the moment.

In the context of social media, the obvious opportunity for us is that there's a huge amount of natural language waiting to be mined. The amount of textual data available via social media is continuously increasing, and for many questions, the answer has probably been already written; but transforming raw text into information is not an easy task. Conversations happen on social media all the time: users ask technical questions on web forums and find answers, and customers describe their experiences with a particular product through comments or reviews on social media. Knowing the topics of these conversations, finding the most expert users who answer these questions, and understanding the opinion of the customers who write these reviews, are all tasks that can be achieved with a fair level of accuracy by means of NLP.

Moving on to Python, one of the most popular packages for NLP is Natural Language Toolkit (NLTK). The toolkit provides a friendly interface for many of the common NLP tasks, as well as lexical resources and linguistic data.

Some of the tasks that we can easily perform with the NLTK include the following:

Tokenization of words and sentences, that is, the process of breaking a stream of text down to individual tokens
Tagging words for part-of-speech, that is, assigning words to categories according to their syntactic function, such as nouns, adjectives, verbs, and so on
Identifying named entities, for example, identifying and classifying references of persons, locations, organizations, and so on
Applying machine learning techniques (for example, classification) to text
In general, extracting information from raw text

The installation of NLTK follows the common procedure via the CheeseShop:

$ pip install nltk

A difference between NLTK and many other packages is that this framework also comes with linguistic resources (that is, data) for specific tasks. Given their size, such data is not included in the default installation, but has to be downloaded separately.

The installation procedure is fully documented at http://www.nltk.org/data.html and it's strongly recommended that you read this official guide for all the details.

In short, from a Python interpreter, you can type the following code:

>>> import nltk
>>> nltk.download()

If you are in a desktop environment, this will open a new window that allows you to browse the available data. If a desktop environment is not available, you'll see a textual interface in the terminal. You can select individual packages to download, or even download all the data (that will take approximately 2.2 GB of disk space).

The downloader will try to save the file at a central location (C:\nltk_data on Windows and /usr/share/nltk_data on Unix and Mac) if you are working from an administrator account, or at your home folder (for example, ~/nltk_data) if you are a regular user. You can also choose a custom folder, but in this case, NLTK will look for the $NLTK_DATA environment variable to know where to find its data, so you'll need to set it accordingly.

If disk space is not a problem, installing all the data is probably the most convenient option, as you do it once and you can forget about it. On the other hand, downloading everything doesn't give you a clear understanding of what resources are needed. If you prefer to have full control on what you install, you can download the packages you need one by one. In this case, from time to time during your NLTK development, you'll find a little bump on the road in the form of LookupError, meaning that a resource you're trying to use is missing and you have to download it.

For example, after a fresh NLTK installation, if we try to tokenize some text from the Python interpreter, we can type the following code:

>>> from nltk import word_tokenize
>>> word_tokenize('Some sample text')
Traceback (most recent call last):
  # some long traceback here
LookupError:
*********************************************************************
  Resource 'tokenizers/punkt/PY3/english.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:  >>>
nltk.download()
Searched in:
    - '/Users/marcob/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
*********************************************************************

This error tells us that the punkt resource, responsible for the tokenization, is not found in any of the conventional folders, so we'll have to go back to the NLTK downloader and solve the issue by getting this package.
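As a minimal sketch of the one-by-one approach, the helper below checks for a resource with nltk.data.find() and calls nltk.download() only when the lookup fails. The name ensure_resource is hypothetical (it is not part of NLTK), and the resource path and package identifier are supplied by the caller. The values 'tokenizers/punkt' and 'punkt' match the error shown above, but other NLTK releases may look for a differently named package, so treat them as assumptions.

```python
def ensure_resource(resource_path, package_id):
    """Download an NLTK package only if its resource is missing.

    Hypothetical helper for illustration. Returns True if a download
    was attempted, False if the resource was already available.
    """
    import nltk  # imported lazily, so defining the helper has no side effects

    try:
        # nltk.data.find() raises LookupError when the resource is not installed
        nltk.data.find(resource_path)
        return False
    except LookupError:
        nltk.download(package_id)
        return True
```

For the tokenizer example, the call would be ensure_resource('tokenizers/punkt', 'punkt'), placed before the first use of word_tokenize().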

Assuming that we now have a fully working NLTK installation, we can go back to the previous example and discuss tokenization with a few more details.

In the context of NLP, tokenization (also called segmentation) is the process of breaking a piece of text into smaller units called tokens or segments. While tokens can be interpreted in many different ways, typically we are interested in words and sentences. A simple example using word_tokenize() is as follows:

>>> from nltk import word_tokenize
>>> text = "The quick brown fox jumped over the lazy dog"
>>> words = word_tokenize(text)
>>> print(words)
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

The output of word_tokenize() is a list of strings, with each string representing a word. The word boundaries in this example are given by white spaces. In a similar fashion, sent_tokenize() returns a list of strings, with each string representing a sentence, bounded by punctuation. An example that involves both functions:

>>> from nltk import word_tokenize, sent_tokenize
>>> text = "The quick brown fox jumped! Where? Over the lazy dog."
>>> sentences = sent_tokenize(text)
>>> print(sentences)
# ['The quick brown fox jumped!', 'Where?', 'Over the lazy dog.']
>>> for sentence in sentences:
...     words = word_tokenize(sentence)
...     print(words)
# ['The', 'quick', 'brown', 'fox', 'jumped', '!']
# ['Where', '?']
# ['Over', 'the', 'lazy', 'dog', '.']

As you can see, the punctuation symbols are considered tokens on their own, and as such, are included in the output of word_tokenize(). This raises a question that we haven't really asked so far: how do we define a token? The word_tokenize() function implements an algorithm designed for standard English. As the focus of this book is on data from social media, it's fair to investigate whether the rules for standard English also apply in our context. Let's consider a fictitious example for Twitter data:

>>> tweet = '@marcobonzanini: an example! :D http://example.com #NLP'
>>> print(word_tokenize(tweet))
# ['@', 'marcobonzanini', ':', 'an', 'example', '!', ':', 'D', 'http', ':', '//example.com', '#', 'NLP']

The sample tweet introduces some peculiarities that break the standard tokenization:

Usernames are prefixed with an @ symbol, so @marcobonzanini is split into two tokens with @ being recognized as punctuation
Emoticons such as :D are common slang on chats, text messages, and of course, social media, but they're not officially part of standard English, hence the split
URLs are frequently used to share articles or pictures, but once again, they're not part of standard English, so they are broken down into components
Hashtags such as #NLP are strings prefixed by # and are used to define the topic of the post, so that other users can easily search for a topic or follow the conversation, but here the # is split off as punctuation
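To make these cases concrete, here is a rough sketch of a tweet-aware tokenizer built on a single hand-written regular expression. It is an illustration only: the pattern and the simple_tweet_tokenize() name are assumptions made for this example, and the emoticon rule covers just a handful of cases. The key idea is that the more specific patterns (URLs, mentions, hashtags, emoticons) are tried before the generic ones (words, punctuation).

```python
import re

# Order matters: the more specific alternatives are tried first
TOKEN_PATTERN = re.compile(r"""
    https?://\S+          # URLs are kept whole
  | [@#]\w+               # @mentions and #hashtags keep their prefix
  | [:;=][-']?[)(DPp]     # a few common emoticons, such as :D or ;-)
  | \w+                   # plain words
  | [^\w\s]               # any other single non-space symbol
""", re.VERBOSE)


def simple_tweet_tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)


tweet = '@marcobonzanini: an example! :D http://example.com #NLP'
tokens = simple_tweet_tokenize(tweet)
print(tokens)
# ['@marcobonzanini', ':', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']
```

A pattern like this quickly becomes hard to maintain as more edge cases appear, which is a good reason to prefer a tested library implementation.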

The previous example shows that an apparently straightforward problem such as tokenization can hide many tricky edge cases that will require something smarter than the initial intuition to be solved. Fortunately, NLTK offers the following off-the-shelf solution:

>>> from nltk.tokenize import TweetTokenizer
>>> tokenizer = TweetTokenizer()
>>> tweet = '@marcobonzanini: an example! :D http://example.com #NLP'
>>> print(tokenizer.tokenize(tweet))
# ['@marcobonzanini', ':', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']

The previous examples should give a taste of NLTK's simple interface. We will be using this framework on different occasions throughout the book.

With increasing interest in NLP applications, the Python-for-NLP ecosystem has grown dramatically in recent years, with a whole lot of interesting projects getting more and more attention. In particular, Gensim, dubbed topic modeling for humans, is an open-source library that focuses on semantic analysis. Gensim shares with NLTK the tendency to offer an easy-to-use interface, hence the for humans part of the tagline. Another aspect that pushed its popularity is efficiency, as the library is highly optimized for speed, has options for distributed computing, and can process large datasets without the need to hold all the data in memory.

The simple installation of Gensim follows the usual procedure:

$ pip install gensim

The main dependencies are NumPy and SciPy, although if you want to take advantage of the distributed computing capabilities of Gensim, you'll also need to install PYthon Remote Objects (Pyro4):

$ pip install Pyro4

In order to showcase Gensim's simple interface, we can take a look at the text summarization module:

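# Chap01/demo_gensim.py
# Requires Gensim 3.x: the summarization module was removed in Gensim 4.0
import sys
from gensim.summarization import summarize

fname = sys.argv[1]

# Read the whole text file to summarize
with open(fname, 'r') as f:
  content = f.read()

# split=True returns a list of sentences rather than a single string
summary = summarize(content, split=True)
for i, sentence in enumerate(summary):
  print("%d) %s" % (i+1, sentence))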

The demo_gensim.py script takes one command-line parameter, which is the name of a text file to summarize. In order to test the script, I took a piece of text from the Wikipedia page about The Lord of the Rings, the paragraph with the plot of the first volume, The Fellowship of the Ring, in particular. The script can be invoked with the following command:

$ python demo_gensim.py lord_of_the_rings.txt

This produces the following output:

1) They nearly encounter the Nazgûl while still in the Shire, but shake off pursuit by cutting through the Old Forest, where they are aided by the enigmatic Tom Bombadil, who alone is unaffected by the Ring's corrupting influence.
2) Aragorn leads the hobbits toward the Elven refuge of Rivendell, while Frodo gradually succumbs to the wound.
3) The Council of Elrond reveals much significant history about Sauron and the Ring, as well as the news that Sauron has corrupted Gandalf's fellow wizard, Saruman.
4) Frodo volunteers to take on this daunting task, and a "Fellowship of the Ring" is formed to aid him: Sam, Merry, Pippin, Aragorn, Gandalf, Gimli the Dwarf, Legolas the Elf, and the Man Boromir, son of the Ruling Steward Denethor of the realm of Gondor.

The summarize() function in Gensim is based on the classic TextRank algorithm. The algorithm ranks sentences according to their importance and selects the highest-ranked ones to produce the output summary. It's worth noting that this approach is an extractive summarization technique, meaning that the output only contains sentences selected from the input as they are, that is, there is no text transformation, rephrasing, and so on. By default, the output size is approximately 20% of the original text. This can be controlled with the optional ratio argument for a proportional size, or word_count for a fixed number of words. In both cases, the output will only contain full sentences, that is, sentences will not be broken down to respect the desired output size.
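To give an intuition of how a graph-based extractive summarizer works, the following is a simplified, pure-Python sketch in the spirit of TextRank. It is not Gensim's implementation: the function names are hypothetical, the sentence splitter is deliberately naive, and the similarity measure is plain word overlap normalized by sentence length. Sentences are nodes, similarities are edge weights, and a PageRank-style iteration assigns each sentence a score.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def similarity(a, b):
    # Shared words, normalized by the (log) lengths of the two sentences
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def rank_sentences(sentences, damping=0.85, iterations=30):
    n = len(sentences)
    weights = [[similarity(a, b) if i != j else 0.0
                for j, b in enumerate(sentences)]
               for i, a in enumerate(sentences)]
    totals = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of the score of its neighbours
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / totals[j] * scores[j]
                for j in range(n) if totals[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize_sketch(text, ratio):
    """Return the top-ranked sentences, unchanged and in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    scores = rank_sentences(sentences)
    keep = max(1, round(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]


text = ("Frodo inherits the Ring from Bilbo. "
        "Gandalf tells Frodo that the Ring belongs to Sauron. "
        "Frodo leaves the Shire with the Ring. "
        "The weather is nice today.")
summary = summarize_sketch(text, ratio=0.5)
for i, sentence in enumerate(summary):
    print("%d) %s" % (i + 1, sentence))
```

Because sentences are only selected and never rewritten, every line of the output appears verbatim in the input, which is exactly what makes the approach extractive.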

Social network analysis

Network theory, part of graph theory, is the study of graphs as a representation of relationships between discrete objects. Its application to social media takes the form of social network analysis (SNA), which is a strategy to investigate social structures such as friendships or acquaintances.

NetworkX is one of the main Python libraries for the creation, manipulation, and study of complex network structures. It provides data structures for graphs, as well as many well-known standard graph algorithms.

For the installation from the CheeseShop, we follow the usual procedure:

$ pip install networkx

The following example shows how to create a simple graph with a few nodes, representing users, and a few edges between nodes, representing social relationships between users:

# Chap01/demo_networkx.py
import networkx as nx
from datetime import datetime

if __name__ == '__main__':
  g = nx.Graph()
  # Node attributes are passed as keyword arguments
  g.add_node("John", name='John', age=25)
  g.add_node("Peter", name='Peter', age=35)
  g.add_node("Mary", name='Mary', age=31)
  g.add_node("Lucy", name='Lucy', age=19)

  # Edge attributes work in the same way
  g.add_edge("John", "Mary", since=datetime.today())
  g.add_edge("John", "Peter", since=datetime(1990, 7, 30))
  g.add_edge("Mary", "Lucy", since=datetime(2010, 8, 10))

  # list() gives a plain list regardless of the NetworkX version
  print(list(g.nodes()))
  print(list(g.edges()))
  print(g.has_edge("Lucy", "Mary"))
  # ['John', 'Peter', 'Mary', 'Lucy']
  # [('John', 'Mary'), ('John', 'Peter'), ('Mary', 'Lucy')]
  # True

Both nodes and edges can carry additional attributes in the form of a Python dictionary, which help in describing the semantics of the network.

The Graph class is used to represent an undirected graph, meaning that the direction of the edges is not considered. This is clear from the use of the has_edge() function, which checks whether an edge between Lucy and Mary exists. The edge was inserted between Mary and Lucy, but the function shows that the direction is ignored. Further edges between the same nodes will also be ignored, that is, only one edge per node pair is considered. Self-loops are allowed by the Graph class, although in our example, they are not needed.

Other types of graphs supported by NetworkX are DiGraph for directed graphs (the direction of the edges matters) and their counterparts for multiple (parallel) edges between nodes, MultiGraph and MultiDiGraph, respectively.
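The difference between undirected and directed graphs can be illustrated with a small sketch that uses plain dictionaries. The SimpleGraph and SimpleDiGraph classes below are hypothetical and written only for this example; they are not NetworkX classes and they ignore most of what a real graph library provides (node attributes, parallel edges, algorithms, and so on).

```python
class SimpleGraph:
    """Undirected graph stored as a dictionary of neighbour dictionaries."""

    def __init__(self):
        self.adj = {}

    def add_edge(self, u, v, **attrs):
        # Store the edge in both directions, so lookups are symmetric.
        # Adding the same pair again overwrites the existing entry.
        self.adj.setdefault(u, {})[v] = attrs
        self.adj.setdefault(v, {})[u] = attrs

    def has_edge(self, u, v):
        return v in self.adj.get(u, {})


class SimpleDiGraph(SimpleGraph):
    """Directed variant: the edge is stored only from source to target."""

    def add_edge(self, u, v, **attrs):
        self.adj.setdefault(u, {})[v] = attrs
        self.adj.setdefault(v, {})  # make sure the target node exists


g = SimpleGraph()
g.add_edge("Mary", "Lucy", since=2010)
g.add_edge("Mary", "Lucy", since=2012)  # same pair: no second edge is created
print(g.has_edge("Lucy", "Mary"))   # True
print(len(g.adj["Mary"]))           # 1

d = SimpleDiGraph()
d.add_edge("Mary", "Lucy")
print(d.has_edge("Mary", "Lucy"))   # True
print(d.has_edge("Lucy", "Mary"))   # False
```

In a social media context, an undirected edge is a natural fit for a mutual relationship such as a friendship, while a directed edge suits an asymmetric one such as following another user.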

Data visualization

Data visualization (or data viz) is a cross-field discipline that deals with the visual representation of data. Visual representations are powerful tools that offer the opportunity of understanding complex data and are efficient ways to present and communicate the results of a data analysis process in general. Through data visualizations, people can see aspects of the data that are not immediately clear. After all, if a picture is worth a thousand words, a good data visualization allows the reader to absorb complex concepts with the help of a simple picture. For example, data visualization can be used by data scientists during the exploratory data analysis steps in order to understand the data. Moreover, data scientists can also use data visualization to communicate with nonexperts and explain to them what is interesting about the data.

Python offers a number of tools for data visualization, for example, the matplotlib library briefly used in the Machine learning section of this chapter. To install the library, use the following command:

$ pip install matplotlib

Matplotlib produces publication quality figures in a variety of formats. The philosophy behind this library is that a developer should be able to create simple plots with a small number of lines of code. Matplotlib plots can be saved into different file formats, for example, Portable Network Graphics (PNG) or Portable Document Format (PDF).

Let's consider a simple example that plots some two-dimensional data:

# Chap01/demo_matplotlib.py 
import matplotlib.pyplot as plt 
import numpy as np 
 
if __name__ == '__main__': 
  # plot y = x^2 with red dots 
  x = np.array([1, 2, 3, 4, 5]) 
  y = x * x 
  plt.plot(x, y, 'ro') 
  plt.axis([0, 6, 0, 30]) 
  plt.savefig('demo_plot.png')

The output of this code is shown in the following diagram:

Figure 1.7: A plot created with matplotlib

Aliasing pyplot to plt is a common naming convention, as discussed earlier for other packages. The plot() function takes two sequence-like parameters containing the coordinates for x and y, respectively. In this example, these coordinates are created as NumPy arrays, but they could be Python lists. The axis() function defines the visible range for the axes. As we're plotting the numbers 1 to 5 squared, our range is 0-6 for the x axis and 0-30 for the y axis. Finally, the savefig() function produces an image file with the output visualized in Figure 1.7, guessing the image format from the file extension.

Matplotlib produces excellent images for publication, but sometimes there's a need for some interactivity to allow the user to explore the data by zooming into the details of a visualization dynamically. This kind of interactivity is more in the realm of other programming languages, for example, JavaScript (especially through the popular D3.js library at https://d3js.org), that allow building interactive web-based data visualizations. While this is not the central topic of this book, it is worth mentioning that Python doesn't fall short in this domain, thanks to the tools that translate Python objects into the Vega grammar, a declarative format based on JSON that allows creating interactive visualizations.

A particularly interesting situation where Python and JavaScript can cooperate well is the case of geographical data. Most social media platforms are accessible through mobile devices. This offers the opportunity of tracking the user's location to also include the geographical aspect of data analysis. A common data format used to encode and exchange a variety of geographical data structures (such as points or polygons) is GeoJSON (http://geojson.org). As the name suggests, this format is a JSON-based grammar.
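As an illustration of what GeoJSON looks like, the following sketch builds a small collection of point features with the standard json module. The make_point_feature() helper, the user names, and the approximate coordinates are made up for this example. Note that GeoJSON lists the longitude before the latitude.

```python
import json


def make_point_feature(longitude, latitude, **properties):
    """Build a GeoJSON Feature with a Point geometry (hypothetical helper)."""
    return {
        'type': 'Feature',
        'geometry': {
            'type': 'Point',
            'coordinates': [longitude, latitude],  # longitude first
        },
        'properties': properties,
    }


features = [
    make_point_feature(-0.1278, 51.5074, user='user_a', text='Hello from London'),
    make_point_feature(12.4964, 41.9028, user='user_b', text='Ciao da Roma'),
]

# A FeatureCollection groups several features in a single document
collection = {'type': 'FeatureCollection', 'features': features}
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

The properties dictionary is free-form, so it's a convenient place to attach the social media content (for example, the text of a post) to its location.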

A popular JavaScript library for plotting interactive maps is Leaflet (http://leafletjs.com). The bridge between JavaScript and Python is provided by folium, a Python library that makes it easy to visualize geographical data, handled with Python via, for example, GeoJSON, over a Leaflet.js map.

It's also worth mentioning that third-party services such as Plotly (https://plot.ly) offer support for the automatic generation of data visualizations, off-loading the burden of creating interactive components to their services. Specifically, Plotly offers ample support for creating bespoke data visualizations using their Python client (https://plot.ly/python). The graphs are hosted online by Plotly and linked to a user account (free for public hosting, while private graphs have paid plans).

Processing data in Python

After introducing some of the most important Python packages for data analytics, we take a small step back to describe some of the tools of interest to load and manipulate data from different formats with Python.

Most social media APIs provide data in JSON or XML. Python comes well equipped, from this point of view, with packages to support these formats that are part of the standard library.
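As a quick illustration of the XML side, the following sketch parses a small document with the xml.etree.ElementTree module from the standard library. The XML snippet is made up for this example and doesn't come from any specific API.

```python
import xml.etree.ElementTree as ET

user_xml = (
    '<user id="1">'
    '<name>Marco</name>'
    '<likes><like>Python</like><like>Data Mining</like></likes>'
    '</user>'
)

root = ET.fromstring(user_xml)

# Attributes, child elements and repeated elements need different calls
user_data = {
    'user_id': root.get('id'),
    'name': root.findtext('name'),
    'likes': [like.text for like in root.iter('like')],
}
print(user_data)
# {'user_id': '1', 'name': 'Marco', 'likes': ['Python', 'Data Mining']}
```

The mapping from XML to a dictionary has to be written by hand, because XML distinguishes between attributes, text, and nested elements.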

For convenience, we will focus on JSON, as this format maps nicely onto Python dictionaries and is easier to read and understand. The interface of the json library is pretty straightforward: you can either load or dump data, converting from JSON to Python dictionaries and back.

Let's consider the following snippet:

# Chap01/demo_json.py 
import json 
 
if __name__ == '__main__': 
  user_json = '{"user_id": "1", "name": "Marco"}' 
  user_data = json.loads(user_json) 
  print(user_data['name']) 
  # Marco 
 
  user_data['likes'] = ['Python', 'Data Mining'] 
  user_json = json.dumps(user_data, indent=4) 
  print(user_json) 
  # { 
  #     "user_id": "1", 
  #     "name": "Marco", 
  #     "likes": [ 
  #         "Python", 
  #         "Data Mining" 
  #     ] 
  # }

The json.loads() and json.dumps() functions manage the conversion from JSON strings to Python dictionaries and back. There are also two counterparts, json.load() and json.dump(), which operate with file pointers, in case you want to load or save JSON data from/to files.
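The file-based counterparts can be sketched as follows. The example writes a dictionary to a file with json.dump() and reads it back with json.load(). The file name user.json and the temporary folder are arbitrary choices made for this illustration.

```python
import json
import os
import tempfile

user_data = {'user_id': '1', 'name': 'Marco', 'likes': ['Python', 'Data Mining']}

# A temporary folder keeps the sketch from touching real files
path = os.path.join(tempfile.mkdtemp(), 'user.json')

# json.dump() writes to an open file object rather than returning a string
with open(path, 'w') as f:
    json.dump(user_data, f)

# json.load() reads from an open file object
with open(path, 'r') as f:
    loaded = json.load(f)

print(loaded == user_data)
# True
```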

The json.dumps() function also takes an optional parameter, indent, to specify the number of characters of the indentation, which is useful for pretty printing.

When manually analyzing more complex JSON files, it's probably convenient to use an external JSON viewer that performs pretty printing within the browser, allowing the users to collapse and expand the structure as they wish.

There are several free tools for this, and some of them are web-based services, such as JSON Viewer (http://jsonviewer.stack.hu). The user simply needs to paste a piece of JSON, or pass a URL that serves a piece of JSON, and the viewer will load it and display it in a user-friendly format.

The following image shows how the JSON document from the previous example is shown in JSON Viewer:

Figure 1.8: An example of pretty-printed JSON on JSON Viewer

As we can see in Figure 1.8, the likes field is a list that can be collapsed to hide its elements and ease the visualization. While this example is minimal, this feature becomes extremely handy to inspect complex documents with several nested layers.

Tip

When using a web-based service or browser extension, loading large JSON documents for pretty printing can clog up your browser and slow your system down.

Building complex data pipelines

As soon as the data tools that we're building grow into something bigger than a simple script, it's useful to split data pre-processing tasks into small units, in order to map all the steps and dependencies of the data pipeline.

With the term data pipeline, we mean a sequence of data processing operations, which cleans, augments, and manipulates the original data, transforming it into something digestible by the analytics engine. Any non-trivial data analytics project will require a data pipeline that is composed of a number of steps.

In the prototyping phase, it is common to split these steps into different scripts, which are then run individually, for example:

$ python download_some_data.py
$ python clean_some_data.py
$ python augment_some_data.py

Each script in this example produces the output for the following script, so there are dependencies between the different steps. We can refactor data processing scripts into a large script that does everything, and then run it in one go:

$ python do_everything.py

The content of such a script might look similar to the following code:

if __name__ == '__main__': 
  download_some_data() 
  clean_some_data() 
  augment_some_data()

Each of the preceding functions will contain the main logic of the initial individual scripts. The problem with this approach is that errors can occur in the data pipeline, so we should also include a lot of boilerplate code with try and except to have control over the exceptions that might occur. Moreover, parameterizing this kind of code might feel a little clumsy.
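To give an idea of the boilerplate involved, here is a minimal sketch of a runner that wraps each step in a try/except block and stops at the first failure. The step functions are placeholders with the same names used above, the run_pipeline() helper is hypothetical, and the failure in clean_some_data() is simulated on purpose.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('pipeline')


def download_some_data():
    logger.info('Downloading...')


def clean_some_data():
    # Simulated failure, to show what the wrapper does with it
    raise ValueError('Malformed record')


def augment_some_data():
    logger.info('Augmenting...')


def run_pipeline(steps):
    """Run each step in order and stop at the first failure.

    Returns the names of the steps that completed.
    """
    completed = []
    for step in steps:
        try:
            step()
        except Exception:
            # logger.exception() records the message together with the traceback
            logger.exception('Step %s failed, stopping the pipeline', step.__name__)
            break
        completed.append(step.__name__)
    return completed


completed = run_pipeline([download_some_data, clean_some_data, augment_some_data])
print(completed)
# ['download_some_data']
```

Even in this small form, the runner knows nothing about what each step produces, so restarting after a failure means running everything again from the beginning.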

In general, when moving from prototyping to something more stable, it's worth thinking about the use of a data orchestrator, also called a workflow manager. A good example of this kind of tool in Python is given by Luigi, an open source project introduced by Spotify. The advantages of using a data orchestrator such as Luigi include the following:

Task templates: Each data task is defined as a class with a few methods that define how the task runs, its dependencies, and its output
Dependency graph: Visual tools assist the data engineer to visualize and understand the dependencies between tasks
Recovery from intermediate failure: If the data pipeline fails halfway through the tasks, it's possible to restart it from the last consistent state
Integration with command-line interface, as well as system job schedulers such as cron
Customizable error reporting
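The ideas of task templates and recovery from intermediate failure can be sketched in plain Python. To be clear, the Task class and the build() function below are hypothetical and written only for this illustration; they mimic the general shape of a workflow manager, but they are not Luigi's API. Each task declares its dependencies and its output file, and a task whose output already exists is skipped.

```python
import os
import tempfile


class Task:
    """Hypothetical base class: subclasses define requires(), output() and run()."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task is considered done when its output file exists
        return os.path.exists(self.output())


def build(task):
    """Run the dependencies first, then the task, skipping completed work."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency)
    task.run()


class DownloadData(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return os.path.join(self.workdir, 'raw.txt')

    def run(self):
        with open(self.output(), 'w') as f:
            f.write('Some Raw Data\n')


class CleanData(Task):
    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return [DownloadData(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, 'clean.txt')

    def run(self):
        source = self.requires()[0].output()
        with open(source) as src, open(self.output(), 'w') as dst:
            dst.write(src.read().strip().lower())


workdir = tempfile.mkdtemp()
build(CleanData(workdir))
with open(os.path.join(workdir, 'clean.txt')) as f:
    result = f.read()
print(result)
# some raw data
```

If the cleaning step fails, the raw file is still on disk, so a second call to build() doesn't repeat the download. A real orchestrator adds much more on top of this, for example, atomic writes, scheduling, and visualization of the dependency graph.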

We won't dig into all the features of Luigi, as a detailed discussion would go beyond the scope of this book, but the readers are encouraged to take a look at this tool and use it to produce a more elegant, reproducible, and easily maintainable and expandable data pipeline.

Summary

This chapter introduced the different aspects of a multifaceted topic such as data mining applied to social media using Python. We have gone through some of the challenges and opportunities that make this topic interesting to study and valuable to businesses that want to gather meaningful insights from social media data.

After introducing the topic, we also discussed the overall process of social media mining, including aspects such as authentication with OAuth. We also analyzed the details of the Python tools that should be part of the data toolbox of any data mining practitioner. Depending on the social media platform we're analyzing, and the type of insights that we are concerned about, Python offers robust and well-established packages for machine learning, NLP, and SNA.

We recommend that you set up a Python development environment with virtualenv as described in the pip and virtualenv section of this chapter, as this allows you to keep the global development environment clean.

The next chapter will focus on Twitter, particularly discussing how to get access to Twitter data via the Twitter API and how to slice and dice such data in order to produce interesting information.

About the Author

Marco Bonzanini

Marco Bonzanini is a data scientist based in London, United Kingdom. He holds a Ph.D. in information retrieval from the Queen Mary University of London. He specializes in text analytics and search applications, and over the years, he has enjoyed working on a variety of information management and data science problems. He maintains a personal blog at https://marcobonzanini.com, where he discusses different technical topics, mainly around Python, text analytics, and data science. When not working on Python projects, he likes to engage with the community at PyData conferences and meetups, and he also enjoys brewing homemade beer.
