
Learning Social Media Analytics with R

By Dipanjan Sarkar, Karthik Ganapathy, Raghav Bali, and 1 more
About this book
The Internet has grown enormously, especially with the rise of various forms of social media over the last decade, which give users a platform to express themselves and to communicate and collaborate with each other. This book will help you understand the current social media landscape and learn how analytics can be leveraged to derive insights from it. Social data can be analyzed to gain valuable insights into the behavior and engagement of users, organizations, businesses, and brands. The book will help you frame business problems and solve them using social data. It also covers several practical real-world use cases on social media using R and its advanced packages, applying data science methodologies such as sentiment analysis, topic modeling, text summarization, recommendation systems, social network analysis, classification, and clustering. You will learn hands-on approaches to obtaining data from diverse social media sources such as Twitter and Facebook, and see how to establish detailed workflows to process, visualize, and analyze data, transforming social data into actionable insights.
Publication date:
May 2017
Publisher
Packt
Pages
394
ISBN
9781787127524

 

Chapter 1. Getting Started with R and Social Media Analytics

The invention of computers, digital electronics, social media, and the Internet has truly ushered us from the industrial age into the information age. The Internet, and more specifically the invention of the World Wide Web in the early 1990s, helped people build an interconnected universal platform where information can be stored, shared, and consumed by anyone with an electronic device capable of connecting to the Web. This has led to the creation of vast amounts of information, ideas, and opinions which people, brands, organizations, and businesses want to share with everyone around the world. Thus social media was born, providing interactive platforms to post content and share ideas, messages, and opinions about everything under the sun.

This book will take you on a journey through understanding various popular social media platforms, analyzing the rich data generated by them, and gaining valuable insights. We will focus on social media that cater to audiences in different forms, such as micro-blogging, social networking, software collaboration, and news and media sharing platforms. The main objective is to use standardized data access and retrieval techniques, via social media application programming interfaces (APIs), to gather data from these websites and to apply various data mining, statistical and machine learning, and natural language processing techniques to the data by leveraging the R programming language. This book will provide you with the tools, techniques, and approaches to help you achieve this. This introductory chapter covers several important concepts to give you a jumpstart on social media analytics:

  • Social media – significance and pitfalls

  • Social media analytics – opportunities and challenges

  • Getting started with R

  • Data analytics

  • Machine learning

  • Text analytics

We will look at social media, the various forms of social media that exist today, and how it has impacted our society. This will help us understand the entire scope of social media analytics and the opportunity it presents, which is valuable for consumers as well as businesses and brands. Concepts related to analytics, machine learning, and text analytics, coupled with hands-on examples depicting the various features of the R programming language, will help you get a grip on the essentials necessary for the rest of this book. Without further delay, let's get started!

 

Understanding social media


The Internet and the information age have been responsible for revolutionizing the way we humans interact with each other in the 21st century. Almost everyone uses some form of electronic communication, be it a laptop, tablet, smartphone, or personal computer. Social media is built upon the concept of platforms where people use computer-mediated communication (CMC) methods to communicate with others. This can range from instant messaging, emails, and chat rooms to social forums and social networking. To understand social media, you need to understand the origins of legacy or traditional media, which gradually evolved into social media. Television, newspapers, radio, movies, books, and magazines are all ways of sharing and consuming information, ideas, and opinions. It's important to remember that social media has not replaced legacy media; they co-exist peacefully as we use and consume both in our day-to-day lives.

Legacy media typically follow a one-way communication system. For instance, I can always read a magazine, watch a show on television, or get updated about the news from newspapers, but I cannot voice my opinions or share my ideas through the same media instantly. The communication mechanism in the various forms of social media is a two-way street, where audiences can share information and ideas, and others can consume them, voice their own ideas, opinions, and feedback, and even share their own content based on what they see. Legacy media, like radio or television, now use social media to provide a two-way communication mechanism to support their communications, but it's much more seamless in social media, where anyone and everyone can share content, communicate with others, and freely voice their ideas and opinions on a huge scale.

We can now formally define social media as interactive applications or platforms based on the principles of Web 2.0 and computer-mediated communication, which enable users to be publishers as well as consumers, to create and share ideas, opinions, information, emotions and expressions in various forms. While different and diverse forms of social media exist, they have several key features in common which are mentioned briefly as follows:

  • Web 2.0 Internet based applications or platforms

  • Content is created as well as consumed by users

  • Profiles give users their own distinct and unique identity

  • Social networks help connect different users, similar to communities

Indeed, social media give users their own unique identity and the freedom to express themselves in their own user profiles. These profiles are maintained as accounts by social media companies. Features like what you see is what you get (WYSIWYG) editors, emoticons, photos, and videos help users create and share rich content. Social networking capabilities enable users to add other users to their own friend or contact lists and to create groups and forums where they can share and talk about like-minded interests. The following figure shows some of the popular social media used today across the globe:

I am sure you recognize several of these popular social media from their logos, which you must have seen on your own smartphone or on the web. Social media is used in various ways, and platforms can be grouped into distinct buckets by the nature of their usage and features. We mention several popular social media in the following points, some of which we will be analyzing in future chapters:

  • Micro-blogging platforms, like Twitter and Tumblr

  • Blogging platforms, like WordPress, Blogger and Medium

  • Instant messaging applications, like WhatsApp and Hangouts

  • Networking platforms, like Facebook and LinkedIn

  • Software collaboration platforms, like GitHub and StackOverflow

  • Audio collaboration platforms, like SoundCloud

  • Photo sharing platforms, like Instagram and Flickr

  • Video sharing platforms, like YouTube and Vimeo

This is not an exhaustive list of social media because there are so many applications and platforms out there. We apologize in advance if we missed your favorite social media! The list should clarify the different forms of communication and content sharing mechanisms available to users, any of which they can leverage to share content and connect with other users. We will now discuss some of the key advantages and significance of social media.

Advantages and significance

Social media has gained such immense popularity and importance that today almost no one can stay away from it. Not only is social media a medium for people to express their views, but it is also a very powerful tool which can be used by businesses to target new and existing customers and increase revenue. We will discuss some of the main advantages of social media as follows:

  • Cost savings: One of the main challenges for businesses is to reach out to their customers or clients through advertising on traditional and legacy based media, which can be expensive. However, social media allows businesses to have branded pages and to post sponsored content and advertisements for a fraction of the cost, thus helping them save costs in the process of increasing visibility.

  • Networking: With social media, you can build your social as well as professional network with people across the globe. This has opened up a myriad of possibilities where people from different continents and countries work together on cutting-edge innovations, share news, talk about their personal experiences, offer advice and share interesting opportunities, which can help develop personalities, careers, and skills.

  • Ease of use: It is quite easy to get started with social media. All you need is to create an account by registering your details in the application or website and within minutes you are ready to go! Besides this, it is quite easy to navigate through any social media website or application without any sophisticated technical skills. You just need an Internet connection and an electronic device, like a smartphone or a computer. Perhaps this could be the reason that a lot of parents and grandparents are now taking to social media to share their moments and connect with their long lost friends.

  • Global audience: With social media, you can also make your content reach out to a global audience across the world. The reason is quite simple: because social media applications are available openly on the web, users all across the world use it. Businesses that engage with customers in different parts of the world have a key advantage to push their promotions and new products and services.

  • Prompt feedback: Businesses and organizations can get prompt feedback on their new product launches and services directly from the users. There is far less need to call people up and ask about their satisfaction levels. Tweets, posts, videos, comments, and many more features exist to give instant feedback to organizations, either by posting publicly or by conversing with them directly on their official social media channels.

  • Grievance redressal: One of the great advantages of social media is that users can now express any sort of grievances or inconveniences, like electricity, water supply or security issues. Most governments and organizations, including law enforcement have public social media channels which can be used for instant notification of grievances.

  • Entertainment: This is perhaps the most popular advantage used to the maximum by most users. Social media provides an unlimited source of entertainment where you can spend your time playing interactive games, watching videos, and participating in competitions with users across the world. Indeed the possibilities of entertainment from social media are endless.

  • Visibility: Anyone can leverage their social media profile to gain visibility in the world. Professional networking platforms like LinkedIn are an excellent way for people to get noticed by recruiters and also for companies to recruit great talent. Even small startups or individuals can develop inventions, build products, or announce discoveries and leverage social media to go viral and gain the necessary visibility which can propel them to the next level.

The significance and importance of social media is quite evident from the preceding points. In today's interconnected world, social media has almost become indispensable and although it might have a lot of disadvantages, including distractions, if we use it for the right reasons, it can indeed be a very important tool or medium to help us achieve great things.

Disadvantages and pitfalls

Even though we have been blowing the trumpet about social media and its significance, I'm sure you are already thinking about pitfalls and disadvantages, which are directly or indirectly caused by social media. We want to cover all aspects of social media, including the good and the bad, so let's look at some negative aspects:

  • Privacy concerns: Perhaps one of the biggest concerns with regard to using social media is the lack of privacy. All our personal data, even though often guaranteed to be secure by the social media organizations that host it, is at risk of being illegally accessed. Further, many social media platforms have been accused, time and again, of selling or using users' personal data without their consent.

  • Security issues: Users often enter personal information on their social media profiles, which can be used by hackers and other harmful entities to gain insights into their personal lives and exploit for their own gain. You may have heard, several times in the past, of social media websites being hacked and personal information from user accounts being leaked. There are other issues too, such as users' bank accounts being compromised, and even theft and other harmful actions happening as a result of sensitive information obtained from social media.

  • Addiction: This is relevant to a large percentage of people using social media. Social media addiction is indeed real and a serious concern, especially among the millennials. There are so many forms of social media and you can really get engrossed in playing games, trying to keep up with what everyone is doing, or just sharing moments from your life every other minute. A lot of us tend to check social media websites every now and then, which can be a distraction, especially if you are trying to meet deadlines. There are even a few stories of people accessing social media whilst driving, with fatal results.

  • Negativity: Social media allows you to express yourself freely, and this is often misused by people, terrorists, and other extremist groups to spread hateful propaganda and negativity. People often post sarcastic and negative reactions based on their opinions and feelings, which can lead to trolling and racism. Even though there are ways to report such behavior, it is often not enough because it is impossible to monitor a vast social network all the time.

  • Risks: There are several potential risks of leveraging social media for your personal use or business promotions and campaigns. One wrong post can potentially prove to be very costly. Besides this, there is the constant risk of hackers, fraud, security attacks and unwanted spam. Continuous usage of social media and addiction to it also poses a potential health risk. Organizations should have proper social media use policies to ensure that their employees do not end up being unproductive by wasting too much time on social media, and do not leak trade secrets or confidential information on social media.

We have discussed several pitfalls attached to using social media and some of them are very serious concerns. Proper social media usage guidelines and policies should be borne in mind by everyone because social media is like a magnifying glass: anything you post can be used against you or can potentially prove harmful later. Be it extremely sensitive personal information, or confidential information, like design plans for your next product launch, always think carefully before sharing anything with the rest of the world.

However, if you know what you are doing, social media can definitely be used as a proper tool for your personal as well as professional gain.

 

Social media analytics


We now have a detailed overview of social media, its significance, pitfalls, and various facets. We will now discuss social media analytics and the benefits it offers for data analysts, scientists and businesses in general looking to gather useful insights from social media. Social media analytics, also known as social media mining or social media intelligence, can be defined as the process of gathering data (usually unstructured) from social media platforms and analyzing the data using diverse analytical techniques to extract vital insights, which can be used to make data-driven business decisions. There are lots of opportunities and challenges involved in social media analytics, which we will be discussing in further detail in later sections. An important thing to remember is that the processes involved in social media analytics are usually domain-agnostic and you can apply them on data belonging to any organization or business in any domain.

The most important step in going forward with any social media analytics workflow or process is to determine the business goals or objectives and the insights that we want to gather from our analyses. These goals are usually in the form of key performance indicators (KPIs). For instance, the total number of followers, or the number of likes and shares, can be KPIs for measuring brand engagement with customers via social media. Sometimes data is not structured and the end objectives are not very concrete. Techniques like natural language processing and text analytics can be leveraged in such cases to extract insights from noisy, unstructured text data, such as understanding the sentiment or mood of customers toward a particular service or product, or identifying the key trends and themes in customer tweets or posts at any point in time.
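As a small illustration, engagement KPIs such as total likes and shares per platform can be computed directly from a data frame of collected posts. The sketch below uses entirely invented toy data and column names:

```r
# Toy data frame standing in for collected posts (all values invented)
posts <- data.frame(
  platform = c("twitter", "twitter", "facebook", "facebook"),
  likes    = c(120, 45, 300, 80),
  shares   = c(30, 10, 95, 12)
)

# Simple engagement KPIs aggregated per platform, using base R
kpis <- aggregate(cbind(likes, shares) ~ platform, data = posts, FUN = sum)
kpis$engagement <- kpis$likes + kpis$shares

print(kpis)
```

In a real workflow, `posts` would be populated from an API data pull, and the KPI definitions would follow from the business objectives agreed beforehand.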

A typical social media analytics workflow

We will be analyzing data from diverse social media applications and platforms throughout the course of this book. However, it is essential to have a good grasp of the essential concepts behind any typical analytics process or workflow. While we will be expanding more on data analytics and mining processes later, let us look at a typical social media analytics workflow in the following figure:

From the preceding diagram, we can broadly classify the main steps involved in the analytics workflow as follows:

  • Data access

  • Data processing and normalization

  • Data analysis

  • Insights

We will now briefly expand upon each of these four processes since we will be using them extensively in future chapters.

Data access

Social media data can usually be retrieved using standard data retrieval methods in two ways.

The first technique is to use official APIs provided by the social media platform or organization itself.

The second technique is to use unofficial mechanisms, like web crawling and scraping. An important point to remember is that crawling and scraping social media websites and using that data for commercial purposes, like selling the data to other organizations, is usually against their terms of service. We will therefore not be using such methods in our book. Besides this, we will be following the necessary politeness policies while accessing social media data using their APIs, so that we do not overload them with too many requests. The data we'll obtain is the raw data which can be further processed and normalized as needed.
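The politeness policy mentioned above can be sketched as a retrieval loop that pauses between successive API calls. In the sketch below, `fetch_page` is a hypothetical stand-in for a real API call (which in practice would come from a package wrapping the platform's official API); the pagination and delay logic is plain base R:

```r
# Hypothetical page-fetching function: a mock standing in for a real
# official-API call. It pretends the API exposes three pages of records.
fetch_page <- function(page_num) {
  if (page_num > 3) return(NULL)
  paste("record", ((page_num - 1) * 2 + 1):(page_num * 2))
}

# Polite retrieval loop: wait between requests to respect rate limits
collect_all <- function(delay_secs = 1) {
  results <- list()
  page <- 1
  repeat {
    batch <- fetch_page(page)
    if (is.null(batch)) break           # no more pages available
    results[[page]] <- batch
    page <- page + 1
    Sys.sleep(delay_secs)               # politeness pause between calls
  }
  unlist(results)
}

raw_data <- collect_all(delay_secs = 0) # no delay needed for the mock
```

Real APIs publish their own rate limits, so the delay (and any retry logic) should be tuned to the platform's documented limits rather than the fixed value shown here.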

Data processing and normalization

The raw data obtained from data retrieval using social media APIs may not be structured and clean. In fact, most of the data obtained from social media is noisy and unstructured, and often contains unnecessary tokens such as Hyper Text Markup Language (HTML) tags and other metadata. Usually, data streams from social media APIs return JavaScript Object Notation (JSON) response objects, which consist of key-value pairs just like the example shown in the following snippet:

{
"user": {
                    "profile_sidebar_fill_color": "DDFFCC",
                    "profile_sidebar_border_color": "BDDCAD",
                    "profile_background_tile": true,
                    "name": "J'onn J'onzz",
                    "profile_image_url": "http://test.com/img/prof.jpg",
                    "created_at": "Tue Apr 07 19:05:07 +0000 2009",
                    "location": "Ox City, UK",
                    "follow_request_sent": null,
                    "profile_link_color": "0084B4",
                    "is_translator": false,
                    "id_str": "2921138"
},
"followers_count": 2452,
"statuses_count": 7311,
"friends_count": 427
}

The preceding JSON object consists of a typical response from the Twitter API showing details of a user profile. Some APIs might return data in other formats, such as Extensible Markup Language (XML) or Comma Separated Values (CSV), and each format needs to be handled properly.
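In R, a JSON response like the one above can be converted into nested lists for easy access. The sketch below assumes the jsonlite package is installed (it is not part of base R) and uses a trimmed-down version of the response shown above:

```r
library(jsonlite)  # assumed installed: install.packages("jsonlite")

# A trimmed version of the JSON response shown above, as a string
# (apostrophes are escaped for the single-quoted R string)
raw_json <- '{
  "user": {
    "name": "J\'onn J\'onzz",
    "location": "Ox City, UK",
    "id_str": "2921138"
  },
  "followers_count": 2452,
  "statuses_count": 7311,
  "friends_count": 427
}'

# fromJSON() converts the key-value pairs into nested R lists
profile <- fromJSON(raw_json)
profile$user$location    # "Ox City, UK"
profile$followers_count  # 2452
```

The same approach applies to real API responses; nested fields simply become nested list elements that can be flattened into data frames for analysis.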

Social media data often contains unstructured textual content which needs additional text pre-processing and normalization before it can be fed into any standard data mining or machine learning algorithm. Text normalization is usually done using several techniques to clean and standardize the text, some of which are:

  • Text tokenization

  • Removing special characters and symbols

  • Spelling corrections

  • Contraction expansions

  • Stemming

  • Lemmatization

More advanced processing can insert additional metadata to describe the text better, such as adding parts of speech (POS) tags, phrase tags, named entity tags, and so on.
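A few of the normalization steps listed above can be sketched with base R string functions alone. The sample tweet is invented, and the contraction expansion uses a tiny hand-made lookup rather than a full dictionary:

```r
# An invented, noisy sample tweet
tweet <- "Can't wait for the new phone!!! It's AWESOME :) #launch"

# Contraction expansion (tiny illustrative lookup, not exhaustive)
tweet <- gsub("can't", "cannot", tweet, ignore.case = TRUE)
tweet <- gsub("it's", "it is", tweet, ignore.case = TRUE)

# Lowercase, then remove special characters and symbols
tweet <- tolower(tweet)
tweet <- gsub("[^a-z ]", "", tweet)       # keep only letters and spaces
tweet <- gsub("\\s+", " ", trimws(tweet)) # collapse repeated whitespace

# Tokenize on whitespace
tokens <- strsplit(tweet, " ")[[1]]
print(tokens)
```

Steps such as spelling correction, stemming, and lemmatization need dedicated packages rather than base R, but the overall clean-then-tokenize pipeline looks much like this sketch.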

Data analysis

This is the core of the whole workflow, where we apply various techniques to analyze the data: this could be the raw native data itself, or the processed and curated data. Usually the techniques used in analysis can be broadly classified into three areas:

  • Data mining or analytics

  • Machine learning

  • Natural language processing and text analytics

Data mining and machine learning have several overlapping concepts, including the fact that both use statistical techniques and try to find patterns in underlying data. Data mining is more about finding key patterns or insights from data; machine learning is more about using mathematics, statistics, and even some of these data mining algorithms to build models that predict or forecast outcomes. While both of these techniques need structured and numeric data to work with, more complex analyses of unstructured textual data are usually handled in the separate realm of text analytics, leveraging natural language processing, which enables us to use several tools, techniques, and algorithms to analyze free-flowing unstructured text. We will be using techniques from these three areas to analyze data from various social media platforms throughout this book. We will briefly cover important concepts from data analytics and text analytics towards the end of this chapter.
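As a small taste of the machine learning side, the snippet below clusters a toy set of entirely invented user-activity measurements into two groups using base R's kmeans(); the column names and values are illustrative only:

```r
set.seed(42)  # make the random initialization reproducible

# Invented user-activity data: posts per week and average likes received
activity <- data.frame(
  posts_per_week = c(1, 2, 2, 20, 22, 25),
  avg_likes      = c(5, 8, 6, 150, 160, 170)
)

# Scale the features, then partition users into 2 clusters
# (roughly: casual vs heavy users); nstart tries several starts
model <- kmeans(scale(activity), centers = 2, nstart = 10)
model$cluster  # cluster assignment for each of the 6 users
```

With real social media data, the features would come from the processed data of the earlier workflow steps, and choosing the number of clusters would itself be part of the analysis.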

Insights

The end results of our workflow are the actual insights, which act as facts or concrete data points toward achieving the objective of the analysis. These can be anything from a business intelligence report to visualizations such as bar graphs, histograms, or even word or phrase clouds. Insights should be crisp, clear, and actionable so that businesses can easily make valuable and timely decisions by leveraging them.

Opportunities

Based on the advantages of social media, we can derive plentiful opportunities which lie within the scope of social media analytics. You can save a lot of cost involved in targeted advertising and promotions by analyzing your social media traffic patterns. You can see how users engage with your brand or business using social media, for instance, when it is the perfect time to share something interesting, such as a new service, product, or even an interesting anecdote about your company. Based on traffic from different geographies, you can analyze and understand the preferences of users from different parts of the world. Users love it if you publish promotions in their local language, and businesses are already leveraging such capabilities from social media platforms such as Facebook to target users in specific countries based on localized content.

The social media analytics landscape is still young and emerging and has a lot of untapped potential.

Let us understand the potential of social media analytics better by taking a real-world example.

Consider that you are running a profitable business with active engagement on various social media channels. How can you use the data generated from social media to know how you are doing and how your competitors are doing? Live data streams from Twitter could be continuously analyzed to get the real-time mood, sentiment, emotion, and reactions of people to your products and services. You could even analyze the same for your competitors to see when they are launching their products and how users are reacting to them. With Facebook, you can do the same, and even push localized promotions and advertisements to see if they help generate better revenue. News portals can give you live feeds of trending news articles and insights into the current state of the economy and current events, helping you decide whether these are favorable times for a thriving business or whether you should be preparing for hard times. Sentiment analysis, concept mining, topic models, clustering, and inference are just a few examples of applying analytics to social media. The opportunities are huge; you just need to have a clear objective in mind so that you can use analytics effectively to meet it.

Challenges

Before we delve into the challenges associated with social media analytics, let us look at the following interesting facts:

  • There are over 300 million active Twitter users

  • Facebook has over 1.8 billion active users

  • Facebook generates 600-700+ terabytes of data daily (and it could be more now)

  • Twitter generates 8-10+ terabytes of data daily

  • Facebook generates over 4 to 5 million posts per minute

  • Instagram generates over 2 million likes per minute

These statistics give you a rough idea about the massive scale of data being generated and consumed in these social media platforms. This leads to some challenges:

  • Big data: Due to the massive amount of data produced by social media platforms, it is often difficult to analyze a complete dataset using traditional analytical methods, since it may never fit in memory. Other approaches and tools, such as Hadoop and Spark, need to be leveraged.

  • Accessibility issues: Social media platforms generate a lot of data, but getting direct access to it is not always easy. Their official APIs enforce rate limits, and it is rare to be able to access and store complete datasets. Besides this, each platform has its own terms and conditions, which must be adhered to when accessing its data.

  • Unstructured and noisy data: Most of the data from social media APIs is unstructured, noisy, and full of junk. Cleaning and processing it becomes really cumbersome, and analysts and data scientists often end up spending 70% of their time and effort cleaning and curating the data for analysis.

These are perhaps the most prevalent challenges, among many others, that you might face in your social media analytics journey. Let's now get acquainted with the R programming language, which will be useful when we perform our analyses.

 

Getting started with R


This section will help you get started with setting up your analysis and development environment and also acquaint you with the syntax, data structures, constructs, and other important concepts related to the R programming language. Feel free to skim through this section if you consider yourself to be a master of R! We will be mainly focusing our attention on the following topics:

  • Environment setup

  • Data types

  • Data structures

  • Functions

  • Controlling code flow

  • Advanced operations

  • Visualizing data

  • Next steps

We will be explaining each construct or concept with hands-on examples and code so that it is easier to understand and you can also learn by doing. Before we dive into further details, let us briefly get to know more about R. R is a scripting language that is used extensively for statistical modeling and analysis. The roots of R lie in the S language, a statistical programming language developed at AT&T. R is a community-driven language and has grown by leaps and bounds over the years. It now has a vast arsenal of tools, frameworks, and packages for processing, analyzing, and visualizing any type of data. Because it is open source, the community constantly contributes improvements to the base R language and introduces extremely powerful packages capable of performing complex analyses and visualizations.

R and Python are perhaps the two most popular languages for statistical analysis, and R is often preferred by statisticians, mathematicians, and data scientists because it has more capabilities related to statistical modeling, learning, and algorithms. R's releases are distributed through the Comprehensive R Archive Network (CRAN), which hosts all the latest and past versions, binaries, and source code for R and its packages for different operating systems. Capabilities also exist to connect and interface R with other frameworks, including big data frameworks such as Hadoop and Spark; computing platforms and languages such as Python, Matlab, and SPSS; and data interfaces to almost any source, such as social media platforms, news portals, Internet of Things device data, web traffic data, and so on.

Environment setup

We will be discussing the necessary steps for setting up a proper analysis environment by installing the necessary dependencies around the R ecosystem, along with the code snippets, functions, and modules which we will be using across all the chapters. You can refer to any code snippet used in any chapter from the code files provided for each chapter along with this book. Besides that, you can also access our GitHub repository at https://github.com/dipanjanS/learning-social-media-analytics-with-r for the code modules, snippets, and functions used in the book, and adapt them for your own analyses!

The R language is free and open source, as we mentioned earlier, and is available for all major operating systems. At the time of writing this book, the latest version of R is 3.3.1 (code named Bug in Your Hair) and is available for download at https://www.r-project.org/. This link includes detailed steps, but the direct download page can be accessed at https://cloud.r-project.org/ if you are interested. Download the binary distribution for your operating system of choice; on Windows, run the executable installer and follow the instructions. If you are using Unix or any *nix-like environment, you can also install it directly from the terminal if needed.

Once R is installed, you can fire up the R interpreter directly; on some platforms it ships with a graphical user interface (GUI) containing an editor where you can write your code and then execute it. However, we recommend using an Integrated Development Environment (IDE) instead, which eases development, helps maintain code in a more structured way, and adds capabilities such as generating R Markdown documents, R notebooks, and Shiny web applications. RStudio provides such a user-friendly interface for working with R; you can download and install it from https://www.rstudio.com/products/rstudio/download3/, which contains installers for various operating systems.

Once installed, you can start RStudio and use R directly from the IDE itself. It usually contains a code editor window at the top and the R interactive interpreter at the bottom. The interactive interpreter is often called a Read-Evaluate-Print Loop (REPL): it asks for input, evaluates it, and instantly prints any output in the interpreter window itself. The interface usually shows the > symbol when waiting for input, and shows the + prompt when the code you enter spans multiple lines. Almost everything in R is a vector, and outputs are printed with an index in square brackets, such as [1], indicating the position of the first element on that line. Comments are used to describe functions or sections of code; we specify them using the # symbol followed by text. A sample execution in the R interpreter is shown in the following code for convenience:

> 10 + 5
[1] 15
> c(1,2,3,4)
[1] 1 2 3 4
> 5 == 5
[1] TRUE
> fruit = 'apple'
> if (fruit == 'apple'){
+     print('Apple')
+ }else{
+     print('Orange')
+ }
[1] "Apple"

You can see various operations being performed in the R interpreter in the preceding code snippet, including some conditional evaluations and basic arithmetic. We will now delve deeper into the various constructs of R.

 

Data types


There are several basic data types in R for handling different types of data and values:

  • numeric: The numeric data type is used to store real or decimal vectors and is identical to the double data type.

  • double: This data type can store and represent double precision vectors.

  • integer: This data type is used for representing 32-bit integer vectors.

  • character: This data type is used to represent character vectors, where each element can be a string of type character.

  • logical: The reserved words TRUE and FALSE are logical constants in the R language, while T and F are global variables initially set to those values. All four are logical vectors.

  • complex: This data type is used to store and represent complex numbers.

  • factor: This type is used to represent nominal or categorical variables by storing the nominal values in a vector of integers ranging from 1 to n, where n is the number of distinct values of the variable. A vector of character strings containing the actual variable values is then mapped to this vector of integers.

  • Miscellaneous: There are several other types, including NA, which denotes missing values in data; NaN, which denotes not a number; and ordered factors, which are used for representing ordinal variables.
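One practical consequence of TRUE and FALSE being reserved words while T and F are only pre-defined global variables (a small supplementary sketch, not from the text above): T and F can be reassigned, which is why spelling out TRUE and FALSE is the safer habit:

```r
# TRUE is a reserved constant; T is just a pre-defined global variable
isTRUE(T)       # TRUE, the default binding

T <- 0          # legal, since T is an ordinary variable
isTRUE(T == 0)  # TRUE; T no longer behaves like the logical constant

rm(T)           # remove our binding to restore the default
isTRUE(T)       # TRUE again
```

Trying the same assignment with TRUE itself (TRUE <- 0) raises an error, since TRUE is a reserved word.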

Common function families for each data type are as and is, which are used for converting between data types (typecasting) and checking a data type, respectively.

For example, as.numeric(…) would typecast the data or vector indicated by the ellipses into numeric type and is.numeric(…) would check if the data is of numeric type.

Let us look at a few more examples for the various data types in the following code snippet to understand them better:

# typecasting and checking data types
> n <- c(3.5, 0.0, 1.7, 0.0)
> typeof(n)
[1] "double"
> is.numeric(n)
[1] TRUE
> is.double(n)
[1] TRUE
> is.integer(n)
[1] FALSE
> as.integer(n)
[1] 3 0 1 0
> as.logical(n)
[1]  TRUE FALSE  TRUE FALSE

# complex numbers
> comp <- 3 + 4i
> typeof(comp)
[1] "complex"

# factoring nominal variables
> size <- c(rep('large', 5), rep('small', 5), rep('medium', 3))
> size
 [1] "large"  "large"  "large"  "large"  "large"  "small"  "small"  "small"  "small"  "small" 
[11] "medium" "medium" "medium"
> size <- factor(size)
> size
 [1] large  large  large  large  large  small  small  small  small  small  medium medium medium
Levels: large medium small
> summary(size)
 large medium  small 
     5      3      5

The preceding examples should make the concepts clearer. Notice that non-zero numeric values always typecast to logical TRUE and zero values to FALSE, as we can see from typecasting our numeric vector to logical. We will now dive into the various data structures in R.

Data structures

The base R system has several important core data structures which are extensively used in handling, processing, manipulating, and analyzing data. We will be talking about five important data structures, which can be classified according to the type of data which can be stored and its dimensionality. The classification is depicted in the following table:

Content type      Dimensionality      Data structure

Homogeneous       One-dimensional     Vector

Homogeneous       N-dimensional       Array

Homogeneous       Two-dimensional     Matrix

Heterogeneous     One-dimensional     List

Heterogeneous     Two-dimensional     DataFrame

The content type in the table depicts whether the data stored in the structure belongs to a single data type (homogeneous) or can contain data of different data types (heterogeneous). The dimensionality of each data structure is straightforward and self-explanatory. We will now examine each data structure in further detail.

Vectors

The vector is the most basic data structure in R, and here vectors indicate atomic vectors. They can be used to represent any data in R, including input and output data. Vectors are usually created using the c(…) function, which is short for combine. Vectors can also be created in other ways, such as using the : operator or the seq(…) family of functions. Vectors are homogeneous (all elements always belong to a single data type), and a vector by itself is a one-dimensional structure. The following snippet shows some vector representations:

> 1:5
[1] 1 2 3 4 5
> c(1,2,3,4,5)
[1] 1 2 3 4 5
> seq(1,5)
[1] 1 2 3 4 5
> seq_len(5)
[1] 1 2 3 4 5

You can also assign vectors to variables and perform different operations on them, including data manipulation, mathematical operations, transformations, and so on. We depict a few such examples in the following snippet:

# assigning two vectors to variables
> x <- 1:5
> y <- c(6,7,8,9,10)
> x
[1] 1 2 3 4 5
> y
[1]  6  7  8  9 10

# operating on vectors
> x + y
[1]  7  9 11 13 15
> sum(x)
[1] 15
> mean(x)
[1] 3
> x * y
[1]  6 14 24 36 50
> sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068

# indexing and slicing
> y[2:4]
[1] 7 8 9
> y[c(2,3,4)]
[1] 7 8 9

# naming vector elements
> names(x) <- c("one", "two", "three", "four", "five")
> x
  one   two three  four  five 
    1     2     3     4     5

The preceding snippet should give you a good flavor of what we can do with vectors. Try playing around with vectors and transforming and manipulating data with them!

Arrays

From the table of data structures we mentioned earlier, arrays store homogeneous data and, unlike vectors, are N-dimensional data structures. Matrices are a special case of arrays with two dimensions, but more on that later. It is difficult to display data with more than two dimensions on the screen, but R handles this by printing an array as a series of two-dimensional slices. The following example creates a three-dimensional 2x2x3 array:

# create a three-dimensional array
three.dim.array <- array(
    1:12,    # input data
    dim = c(2, 2, 3),   # dimensions
    dimnames = list(    # names of dimensions
        c("row1", "row2"),
        c("col1", "col2"),
        c("first.set", "second.set", "third.set")
    )
)

# view the array
> three.dim.array
, , first.set

     col1 col2
row1    1    3
row2    2    4

, , second.set

     col1 col2
row1    5    7
row2    6    8

, , third.set

     col1 col2
row1    9   11
row2   10   12

From the preceding output, you can see that R filled the data into the three-dimensional array in column-major order, filling each column before moving to the next. We will now look at matrices in the following section.

Matrices

We've briefly mentioned that matrices are a special case of arrays with two dimensions, the rows and the columns. Just as we used the array(…) function in the previous section to create an array, we will be using the matrix(…) function to create matrices.

The following snippet creates a 4x3 matrix:

# create a matrix
mat <- matrix(
    1:12,   # data
    nrow = 4,  # num of rows
    ncol = 3,  # num of columns
    byrow = TRUE  # fill the elements row-wise
)

# view the matrix
> mat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12

Thus you can see from the preceding output that we have a 4x3 matrix with 4 rows and 3 columns, filled with the data in row-wise fashion by setting the byrow parameter in the matrix(…) function.
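As a small supplement (standard R indexing, though not shown in the text above), individual elements, rows, and columns of a matrix can be accessed with the [row, column] notation:

```r
# recreate the 4x3 row-wise matrix from above
mat <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)

mat[2, 3]   # single element: row 2, column 3
mat[2, ]    # entire second row
mat[, 3]    # entire third column
```

Leaving one of the two indices empty selects the whole row or column, which returns a plain vector.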

The following snippet shows some mathematical operations with matrices which you should be familiar with:

# initialize matrices
m1 <- matrix(
    1:9,   # data
    nrow = 3,  # num of rows
    ncol = 3,  # num of columns
    byrow = TRUE  # fill the elements row-wise
)
m2 <- matrix(
    10:18,   # data
    nrow = 3,  # num of rows
    ncol = 3,  # num of columns
    byrow = TRUE  # fill the elements row-wise
)

# matrix addition
> m1 + m2
     [,1] [,2] [,3]
[1,]   11   13   15
[2,]   17   19   21
[3,]   23   25   27

# matrix transpose
> t(m1)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

# matrix product
> m1 %*% m2
     [,1] [,2] [,3]
[1,]   84   90   96
[2,]  201  216  231
[3,]  318  342  366

We encourage you to try out more complex operations using matrices. See if you can find the inverse of a matrix.
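As a starting point for that exercise, here is a minimal sketch: base R's solve(…) function, when called with a single square matrix, returns its inverse:

```r
# an invertible 2x2 matrix
m <- matrix(c(2, 0, 0, 4), nrow = 2)

# solve() with a single argument returns the matrix inverse
m.inv <- solve(m)
m.inv

# multiplying a matrix by its inverse gives the identity matrix
m %*% m.inv
```

Note that solve(…) raises an error for singular (non-invertible) matrices, so it doubles as a quick invertibility check.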

Lists

Lists are the other type of vector in R besides the atomic vectors we discussed earlier. The difference is that lists are heterogeneous and can hold different types of data, such that each element of a list can itself be a list, an atomic vector, an array, a matrix, or even a function. The following snippet shows us how to create lists:

# create sample list
list.sample <- list(
    nums = seq.int(1,5),
    languages = c("R", "Python", "Julia", "Java"),
    sin.func = sin
)

# view the list
> list.sample
$nums
[1] 1 2 3 4 5

$languages
[1] "R"      "Python" "Julia"  "Java"  

$sin.func
function (x)  .Primitive("sin")

# accessing individual list elements
> list.sample$languages
[1] "R"      "Python" "Julia"  "Java"  
> list.sample$sin.func(1.5708)
[1] 1

You can see from the preceding snippet that lists can hold different types of elements and accessing them is really easy.

Let us look at a few more operations with lists in the following snippet:

# initializing two lists
l1 <- list(nums = 1:5)
l2 <- list(
    languages = c("R", "Python", "Julia"),
    months = c("Jan", "Feb", "Mar")
)

# check lists and their type
> l1
$nums
[1] 1 2 3 4 5

> typeof(l1)
[1] "list"
> l2
$languages
[1] "R"      "Python" "Julia" 

$months
[1] "Jan" "Feb" "Mar"

> typeof(l2)
[1] "list"

# concatenating lists
> l3 <- c(l1, l2)
> l3
$nums
[1] 1 2 3 4 5

$languages
[1] "R"      "Python" "Julia" 

$months
[1] "Jan" "Feb" "Mar"

# converting list back to a vector
> v1 <- unlist(l1)
> v1
nums1 nums2 nums3 nums4 nums5 
    1     2     3     4     5 
> typeof(v1)
[1] "integer"

Now that we know how lists work, we will be moving on to the last and perhaps most widely used data structure in data processing and analysis, the DataFrame.

DataFrames

The DataFrame is a special data structure used to handle heterogeneous, tabular data in two dimensions. It is used to handle data tables with several observations, samples, or data points represented by rows, and the attributes of each sample represented by columns. Each column can be thought of as a dimension of the dataset, or as a vector. The DataFrame is very popular since it works naturally with tabular data, notably spreadsheets.

The following snippet shows us how we can create DataFrames and examine their properties:

# create data frame
df <- data.frame(
  name =  c("Wade", "Steve", "Slade", "Bruce"),
  age = c(28, 85, 55, 45),
  job = c("IT", "HR", "HR", "CS")
)

# view the data frame
> df
   name age job
1  Wade  28  IT
2 Steve  85  HR
3 Slade  55  HR
4 Bruce  45  CS

# examine data frame properties
> class(df)
[1] "data.frame"
> str(df)
'data.frame':      4 obs. of  3 variables:
 $ name: Factor w/ 4 levels "Bruce","Slade",..: 4 3 2 1
 $ age : num  28 85 55 45
 $ job : Factor w/ 3 levels "CS","HR","IT": 3 2 2 1
> rownames(df)
[1] "1" "2" "3" "4"
> colnames(df)
[1] "name" "age" "job"
> dim(df)
[1] 4 3

You can see from the preceding snippet how DataFrames can represent tabular data, where each attribute is a dimension or column. You can also perform multiple operations on DataFrames, such as merging, concatenating, binding, subsetting, and so on. We depict some of these operations in the following snippet:

# initialize two data frames
emp.details <- data.frame(
    empid = c('e001', 'e002', 'e003', 'e004'),
    name = c("Wade", "Steve", "Slade", "Bruce"),
    age = c(28, 85, 55, 45)
)
job.details <- data.frame(
    empid = c('e001', 'e002', 'e003', 'e004'),
    job = c("IT", "HR", "HR", "CS")
)

# view data frames
> emp.details
  empid  name age
1  e001  Wade  28
2  e002 Steve  85
3  e003 Slade  55
4  e004 Bruce  45
> job.details
  empid job
1  e001  IT
2  e002  HR
3  e003  HR
4  e004  CS

# binding and merging data frames
> cbind(emp.details, job.details)
  empid  name age empid job
1  e001  Wade  28  e001  IT
2  e002 Steve  85  e002  HR
3  e003 Slade  55  e003  HR
4  e004 Bruce  45  e004  CS
> merge(emp.details, job.details, by='empid')
  empid  name age job
1  e001  Wade  28  IT
2  e002 Steve  85  HR
3  e003 Slade  55  HR
4  e004 Bruce  45  CS

# subsetting data frame
> subset(emp.details, age > 50)
  empid  name age
2  e002 Steve  85
3  e003 Slade  55

Now that we have a good grasp of data structures, we will look at concepts related to functions in R in the next section.

Functions

So far we have dealt with various variables, data types, and structures for storing data. Functions are just another data type or object in R, albeit a special one which allows us to operate on data and perform actions on it. Functions are useful for modularizing code and separating concerns: each specific action or operation gets its own dedicated function, with the logic needed for that action implemented inside it. We will be talking about two types of functions in this section: built-in functions and user-defined functions.

Built-in functions

There are several functions which come with the base installation of R and its core packages. You can access these built-in functions directly using the function name and you can get more functions as you install newer packages. We depict operations using a few built-in functions in the following snippet:

> sqrt(7)
[1] 2.645751
> mean(1:5)
[1] 3
> sum(1:5)
[1] 15
> sqrt(1:5)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
> runif(5)
[1] 0.8880760 0.2925848 0.9240165 0.6535002 0.1891892
> rnorm(5)
[1]  1.90901035 -1.55611066 -0.40784306 -1.88185230  0.02035915

You can see from the previous examples that functions such as sqrt(…), mean(…), and sum(…) are built-in and pre-implemented. They can be used anytime in R without the need to define these functions explicitly or load other packages.

User-defined functions

While built-in functions are good, often you need to incorporate your own algorithms, logic, and processes for solving a problem. That is where you need to build your own functions. Typically, a function has three main components, as follows:

  • The environment(…) which contains the location map of the defined function and its variables

  • The formals(…) which depict the list of arguments which are used to call the function

  • The body(…) which is used to depict the code inside the function which contains the core logic of the function

Some depictions of user-defined functions are shown in the following code snippet:

# define the function
square <- function(data){
  return (data^2)
}

# inspect function components
> environment(square)
<environment: R_GlobalEnv>

> formals(square)
$data


> body(square)
{
    return(data^2)
}

# execute the function on data
> square(1:5)
[1]  1  4  9 16 25
> square(12)
[1] 144

We can see how user-defined functions can be defined using function(…) and we can also examine the various components of the function, as we discussed earlier, and also use them to operate on data.

Controlling code flow

When writing complete applications and scripts using R, the flow and execution of code is very important. The flow of code is driven by the statements, functions, variables, and data structures used, which in turn reflect the algorithms, business logic, and other rules for solving the problem at hand. There are several constructs which can be used to control the flow of code, and we will be discussing primarily the following two:

  • Looping constructs

  • Conditional constructs

We will start with looking at various looping constructs for executing the same sections of code multiple times.

Looping constructs

Looping constructs execute code blocks or sections repeatedly as needed. Usually the loop keeps executing the code block in its scope until some specific condition is met or a conditional statement breaks out of it. There are three main types of loops in R:

  • for

  • while

  • repeat

We will explore all the three constructs with examples in the following code snippet:

# for loop
> for (i in 1:5) {
+     cat(paste(i," "))
+ }
1  2  3  4  5  

> sum <- 0
> for (i in 1:10){
+     sum <- sum + i
+ }

> sum
[1] 55

# while loop
> n <- 1
> while (n <= 5){
+     cat(paste(n, " "))
+     n <- n + 1
+ }
1  2  3  4  5  

# repeat loop
> i <- 1
> repeat{
+     cat(paste(i, " "))
+     if (i >= 5){
+         break  # break out of the infinite loop
+     }
+     i <- i + 1
+ }
1  2  3  4  5  

An important point to remember here is that, with larger amounts of data, vectorization-based constructs are more optimized than loops and we will cover some of them in the Advanced operations section later.
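To make that point concrete, here is a small illustration (exact timings will vary by machine, so we only show the equivalence): the loop-based sum from above can be replaced by a single call to the vectorized sum(…) function:

```r
x <- 1:10000

# loop-based sum, as shown earlier
total <- 0
for (i in x) {
    total <- total + i
}
total

# vectorized sum: one call, no explicit loop, and much faster on large vectors
sum(x)
```

Both expressions evaluate to the same value, but the vectorized version delegates the looping to optimized C code inside R.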

Conditional constructs

There are several conditional constructs which help us execute and control the flow of code based on user-defined rules and conditions. This is very useful when we do not want to execute all the code blocks in a script sequentially, but only specific blocks when certain conditions are, or are not, met.

There are mainly four constructs which are used frequently in R:

  • if or if…else

  • if…else if…else

  • ifelse(…)

  • switch(…)

The bottom two are functions, unlike the others, which are statements using the if, if…else, and if…else if…else syntax. We will look at them in the following code snippet with examples:

# using if
> num = 10
> if (num == 10){
+     cat('The number was 10')
+ }
The number was 10

# using if-else
> num = 5
> if (num == 10){
+     cat('The number was 10')
+ } else{
+     cat('The number was not 10')
+ }
The number was not 10

# using if-else if-else
> if (num == 10){
+     cat('The number was 10')
+ } else if (num == 5){
+     cat('The number was 5')
+ } else{
+     cat('No match found')
+ }
The number was 5

# using ifelse(...) function
> ifelse(num == 10, "Number was 10", "Number was not 10")
[1] "Number was not 10"

# using switch(...) function
> for (num in c("5","10","15")){
+     cat(
+         switch(num,
+             "5" = "five",
+             "7" = "seven",
+             "10" = "ten",
+             "No match found"
+     ), "\n")
+ }
five 
ten 
No match found

From the preceding snippet, we can see that switch(…) has a default option which can return a user-defined value when no match is found when evaluating the condition.

Advanced operations

We can perform several advanced vectorized operations in R, which is useful when dealing with large amounts of data and improves execution time. Some advanced constructs in the apply family of functions will be covered in this section, as follows:

  • apply: Evaluates a function on the boundaries or margins of an array

  • lapply: Loops over a list and evaluates a function on each element

  • sapply: A simplified version of the lapply(…) function that returns a vector or matrix where possible

  • tapply: Evaluates a function over subsets of a vector

  • mapply: A multivariate version of the lapply(…) function

Let's look at how each of these functions work in further detail.

apply

As we mentioned earlier, the apply(…) function is used mainly to evaluate any defined function over the margins or boundaries of any array or matrix.

An important point to note here is that there are dedicated aggregation functions, rowSums(…), rowMeans(…), colSums(…), and colMeans(…), which are equivalent to the corresponding apply calls but are more optimized and useful when operating on large arrays.

The following snippet depicts some aggregation functions being applied to a matrix:

# creating a 4x4 matrix
> mat <- matrix(1:16, nrow=4, ncol=4)

# view the matrix
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

# row sums
> apply(mat, 1, sum)
[1] 28 32 36 40
> rowSums(mat)
[1] 28 32 36 40

# row means
> apply(mat, 1, mean)
[1]  7  8  9 10
> rowMeans(mat)
[1]  7  8  9 10

# col sums
> apply(mat, 2, sum)
[1] 10 26 42 58
> colSums(mat)
[1] 10 26 42 58
 
# col means
> apply(mat, 2, mean)
[1]  2.5  6.5 10.5 14.5
> colMeans(mat)
[1]  2.5  6.5 10.5 14.5
 
# row quantiles
> apply(mat, 1, quantile, probs=c(0.25, 0.5, 0.75))
    [,1] [,2] [,3] [,4]
25%    4    5    6    7
50%    7    8    9   10
75%   10   11   12   13

You can see aggregations taking place without the need of extra looping constructs.

lapply

The lapply(…) function takes a list and a function as input parameters. Then it evaluates that function over each element of the list. If the input is not a list, it is coerced to one using the as.list(…) function before the final output is returned. All operations are vectorized, and we will see an example in the following snippet:

# create and view a list of elements
> l <- list(nums=1:10, even=seq(2,10,2), odd=seq(1,10,2))
> l
$nums
 [1]  1  2  3  4  5  6  7  8  9 10

$even
[1]  2  4  6  8 10

$odd
[1] 1 3 5 7 9

# use lapply on the list
> lapply(l, sum)
$nums
[1] 55

$even
[1] 30

$odd
[1] 25

sapply

The sapply(…) function is quite similar to the lapply(…) function, except that it always tries to simplify the final results of the computation. If every element of the result has length 1, sapply(…) returns a vector; if every element has the same length greater than 1, a matrix is returned; and if it is not able to simplify the results, we end up with the same result as lapply(…). The following example will make things clearer:

# create and view a sample list
> l <- list(nums=1:10, even=seq(2,10,2), odd=seq(1,10,2))
> l
$nums
 [1]  1  2  3  4  5  6  7  8  9 10

$even
[1]  2  4  6  8 10

$odd
[1] 1 3 5 7 9

# observe differences between lapply and sapply
> lapply(l, mean)
$nums
[1] 5.5

$even
[1] 6

$odd
[1] 5
> typeof(lapply(l, mean))
[1] "list"

> sapply(l, mean)
nums even  odd 
 5.5  6.0  5.0
> typeof(sapply(l, mean))
[1] "double"

tapply

The tapply(…) function is used to evaluate a function over specific subsets of input vectors, where the subsets are defined by a grouping factor.

The following example depicts the same:

> data <- 1:30
> data
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
> groups <- gl(3, 10)
> groups
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
> tapply(data, groups, sum)
  1   2   3 
 55 155 255 
> tapply(data, groups, sum, simplify = FALSE)
$'1'
[1] 55

$'2'
[1] 155

$'3'
[1] 255

mapply

The mapply(…) function is used to apply a function element-wise over multiple sets of arguments. It is basically a multivariate version of the lapply(…) function.

The following example shows how we can build a list of vectors easily with mapply(…) as compared to using the rep(…) function multiple times otherwise:

> list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

Visualizing data

One of the most important aspects of data analytics is depicting meaningful insights with crisp and concise visualizations. Data visualization is one of the most important parts of exploratory data analysis, as well as an important medium for presenting the results of any analyses. There are three main popular plotting systems in R:

  • The base plotting system which comes with the basic installation of R

  • The lattice plotting system which produces better looking plots than the base plotting system

  • The ggplot2 package which is based on the grammar of graphics and produces beautiful, publication quality visualizations

The following snippet depicts visualizations using all three plotting systems on the popular iris dataset:

# load the data
> data(iris)

# base plotting system
> boxplot(Sepal.Length~Species,data=iris, 
+         xlab="Species", ylab="Sepal Length", main="Iris Boxplot")

This gives us a set of boxplots using the base plotting system as depicted in the following plot:

# lattice plotting system
> library(lattice)
> bwplot(Sepal.Length~Species, data=iris, xlab="Species", ylab="Sepal Length", main="Iris Boxplot")

This snippet helps us in creating a set of boxplots for the various species using the lattice plotting system:

# ggplot2 plotting system
> library(ggplot2)
> ggplot(data=iris, aes(x=Species, y=Sepal.Length)) + geom_boxplot(aes(fill=Species)) + 
+     ylab("Sepal Length") + ggtitle("Iris Boxplot") +
+     stat_summary(fun.y=mean, geom="point", shape=5, size=4) + theme_bw()

This code snippet gives us the following boxplots using the ggplot2 plotting system:

You can see from the preceding visualizations how each plotting system works and compare the plots across each system. Feel free to experiment with each system and visualize your own data!

Next steps

This quick refresher in getting started with R should gear you up for the upcoming chapters and also make you more comfortable with the R ecosystem, its syntax and features. It's important to remember that you can always get help with any aspect of R in a variety of ways. Besides that, packages in R are perhaps going to be the most important tool in your arsenal for analyzing data. We will briefly list some important commands for each of these two aspects which may come in handy in the future.

Getting help

R has thousands of packages, functions, constructs and structures! Hence it is impossible for anyone to keep track of and remember them all. Luckily, help in R is readily available and you can always get detailed information and documentation with regard to R, its utilities, features and packages using the following commands:

  • help(<any_R_object>) or ?<any_R_object>: This provides help on any R object including functions, packages, data types, and so on

  • example(<function_name>): This provides a quick example for the mentioned function

  • apropos(<any_string>): This lists all functions whose names contain the given string
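
For instance, the following short sketch shows these help utilities in action (mean, sd, and the search string "read" are purely illustrative choices):

```r
# open the documentation page for the mean() function
help(mean)        # equivalent shorthand: ?mean
# run the examples bundled with the documentation of sd()
example(sd)
# list all visible functions whose names contain "read"
apropos("read")
```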

Managing packages

R, being open-source and a community-driven language, has a wealth of packages to help you with your analyses, all of which are free to download, install and use.

In the R community, the term library is often used interchangeably with the term package.

R provides the following utilities to manage packages:

  • install.packages(…): This installs a package from the Comprehensive R Archive Network (CRAN). CRAN helps maintain and distribute the various versions and documentation of R

  • .libPaths(…): This gets or sets the library paths where R looks for installed packages

  • installed.packages(lib.loc=): This lists installed packages

  • update.packages(lib.loc=): This updates outdated packages

  • remove.packages(…): This removes a package

  • path.package(…): This returns the file system paths of the packages loaded in the session

  • library(…): This loads a package in a script to use its functions or utilities

  • library(help=): This lists the functions in a package
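
As a quick sketch of a few of these utilities (using the built-in stats package here so that nothing needs to be downloaded; install.packages(…) is shown commented out since it requires network access):

```r
# install a package from CRAN (commented out; needs network access)
# install.packages("ggplot2")

# attach a package that ships with R so its functions can be used
library(stats)
# the library paths R searches for installed packages
print(.libPaths())
# names of a few of the packages installed on this system
print(head(rownames(installed.packages())))
```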

This brings us to the end of our section on getting started with R and we will now look at some aspects of analytics, machine learning and text analytics.

 

Data analytics


Data analytics is basically a structured process of using statistical modeling, machine learning, knowledge discovery, predictive modeling and data mining to discover and interpret meaningful patterns in the data, and to communicate them effectively as actionable insights which help in driving business decisions. Data analytics, data mining and machine learning are often said to cover similar concepts and methodologies. While machine learning, being a branch or subset of artificial intelligence, is more focused on model building, evaluation and learning patterns, the end goal of all three processes is the same: to generate meaningful insights from the data. In the next section, we will briefly discuss the industry standard process for analytics which is rigorously followed by organizations.

Analytics workflow

Analyzing data is a science and an art. Any analytics process usually has a defined set of steps, which are generally executed in sequence. More often than not, analytics is an iterative process, which leads to several of these steps being repeated many times over if necessary. There is an industry standard that is widely followed for data analysis, known as CRISP-DM, which stands for Cross Industry Standard Process for Data Mining. This is a standard data analysis and mining process workflow that describes how to break up any particular data analysis problem into six major stages.

The main stages in the CRISP-DM model are as follows:

  • Business understanding: This is the initial stage that focuses on the business context of the problem, objective or goal which we are trying to solve. Domain and business knowledge are essential here along with valuable insights from subject matter experts in the business for planning out the key objectives and end results that are intended from the data analysis workflow.

  • Data acquisition and understanding: This stage's main focus is to acquire data of interest and understand the meaning and semantics of the various data points and attributes that are present in the data. Some initial exploration of the data may also be done at this stage using various exploratory data analysis techniques.

  • Data preparation: This stage usually involves data munging, cleaning, and transformation. Extract-Transform-Load (ETL) processes often come in handy at this stage. Data quality issues are also dealt with in this stage. The final dataset produced here is usually the one used for analysis and modeling.

  • Modeling and analysis: This stage mainly focuses on analyzing the data and building models using specific techniques from data mining and machine learning. Often, we need to apply further data transformations that are based on different modeling algorithms.

  • Evaluation: This is perhaps one of the most crucial stages. Building models is an iterative process. In this stage, we evaluate the results that are obtained from multiple iterations of various models and techniques, and then we select the best possible method or analysis, which gives us the insights that we need based on our business requirements. Often, this stage involves reiterating through the previous two steps to reach a final agreement based on the results.

  • Deployment: This is the final stage, where decision systems that are based on analysis are deployed so that end users can start consuming the results and utilizing them. This deployed system can be as complex as a real-time prediction system or as simple as an ad-hoc report.

The following figure shows the complete flow with the various stages in the CRISP-DM model:

The CRISP-DM model. Source: www.wikipedia.org

In principle, the CRISP-DM model is very clear and concise with straightforward requirements at each step. As a result, this workflow has been widely adopted across the industry and has become an industry standard.

 

Machine learning


Machine learning is one of the trending buzzwords in the technology domain today, along with big data. Although a lot of it is over-hyped, machine learning (ML) has proved to be really effective in solving tough problems and has been significant in propelling the rise of artificial intelligence (AI). Machine learning is basically an intersection of elements from the fields of computer science, statistics, and mathematics. This includes a combination of concepts from knowledge mining and discovery, artificial intelligence, pattern detection, optimization, and learning theory to develop algorithms and techniques which can learn from and make predictions on data without being explicitly programmed.

The learning here refers to the ability to make computers or machines intelligent, based on the data and algorithms which we provide to them, also known as training the model, so that they start detecting patterns and insights from new data. This can be better understood from the definition of machine learning by the well-known professor Tom Mitchell who said, "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."

Let's say the task (T) of the system is to predict the sales of an organization for the next year. To perform such a task, it needs to rely upon historical sales information. We shall call this experience (E). Its performance (P) is measured by how well it predicts the sales in any given year. Thus, we can generalize that a system has successfully learned how to predict the sales (or task T) if it gets better at predicting it (or improves its performance P), utilizing the past information (or experience E).

Machine learning techniques

Machine-learning techniques are basically algorithms which can work on data and extract insights from it, which can include discovering, forecasting, or predicting trends and patterns. The idea is to build a model using a combination of data and algorithms which can then be used to work on new, previously unseen data and derive actionable insights.

Each and every technique depends on what type of data it can work on and the objective of the problem we are trying to solve. People often get tempted to learn a couple of algorithms and then try to apply them to every problem. An important point to remember here is that there is no universal machine-learning algorithm which fits all problems. The main inputs to machine-learning algorithms are features which are extracted from data using a process known as feature extraction, which is often coupled with another process called feature engineering (or building new features from existing features). Each feature can be described as an attribute of the dataset, such as your location, age, number of posts, shares and so on, if we were dealing with data related to social media user profiles. Machine-learning techniques can be classified into two major types namely supervised learning and unsupervised learning.

Supervised learning

Supervised learning techniques are a subset of the family of machine-learning algorithms which are mainly used in predictive modeling and forecasting. A predictive model is basically a model which can be constructed using a supervised learning algorithm on features or attributes from training data (available data used to train or build the model) such that we can predict using this model on newer, previously unseen data points. Supervised learning algorithms try to model relationships and dependencies between the target prediction output and the input features such that we can predict the output values for new data based on those relationships which the model learned from the dataset used during training.

There are two main types of supervised learning techniques:

  • Classification: These algorithms build predictive models from training data where the response variable to be predicted is categorical. These predictive models use the features learnt from training data on new, previously unseen data to predict their class or category labels. The output classes belong to discrete categories. Types of classification algorithms include decision trees, support vector machines, random forests and many more.

  • Regression: These algorithms are used to build predictive models on data such that the response variable to be predicted is numerical. The algorithm builds a model based on input features and output response values of the training data and this model is used to predict values for new data. The output values in this case are continuous numeric values and not discrete categories. Types of regression algorithms include linear regression, multiple regression, ridge regression and lasso regression, among many others.
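
The following minimal sketch illustrates both flavors using only base R and the built-in iris dataset (the choice of features, and the use of a two-class subset for the logistic regression classifier, are purely illustrative):

```r
data(iris)

# regression: the response (Sepal.Length) is numeric
reg_model <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris)
print(predict(reg_model, newdata = iris[1:3, ]))  # numeric predictions

# classification: the response (Species) is categorical; here we fit
# a logistic regression on a two-class subset of the data
binary_iris <- subset(iris, Species != "setosa")
binary_iris$Species <- droplevels(binary_iris$Species)
clf_model <- glm(Species ~ Petal.Length + Petal.Width,
                 data = binary_iris, family = binomial)
print(head(predict(clf_model, type = "response")))  # class probabilities
```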

Unsupervised learning

Unsupervised learning techniques are a subset of the family of machine-learning algorithms which are mainly used in pattern detection, dimension reduction and descriptive modeling. A descriptive model is basically a model constructed from an unsupervised machine learning algorithm and features from input data similar to the supervised learning process. However, the output response variables are not present in this case. These algorithms try to use techniques on the input data to mine for rules, detect patterns, and summarize and group the data points which help in deriving meaningful insights and describe the data better to the users. There is no specific concept of training or testing data here since we do not have any specific relationship mapping between input features and output response variables (which do not exist in this case). There are three main types of unsupervised learning techniques:

  • Clustering: The main objective of clustering algorithms is to cluster or group input data points into different classes or categories using features derived from the input data alone and no other external information. Unlike classification, the output response labels are not known beforehand in clustering. There are different approaches to build clustering models, such as by using centroid based approaches, hierarchical approaches, and many more. Some popular clustering algorithms include k-means, k-medoids, and hierarchical clustering.

  • Association rule mining: These algorithms are used to mine and extract rules and patterns of significance from data. These rules explain relationships between different variables and attributes, and also depict frequent item sets and patterns which occur in the data.

  • Dimensionality reduction: These algorithms help in the process of reducing the number of features or variables in the dataset by computing a set of principal representative variables. These algorithms are mainly used for feature selection.
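
As a minimal clustering sketch in base R (the choice of k = 3 is illustrative, guided by our knowledge that iris contains three species), note that the algorithm sees only the numeric features and never the labels:

```r
data(iris)
features <- iris[, 1:4]   # numeric features only; the Species labels
                          # are deliberately NOT given to the algorithm
set.seed(42)              # k-means uses random starting centroids
clusters <- kmeans(features, centers = 3)
print(table(clusters$cluster))   # sizes of the three discovered groups
```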

 

Text analytics


Text analytics is also often called text mining. This is basically the process of extracting and deriving meaningful patterns from textual data which can in turn be translated into actionable knowledge and insights. Text analytics consists of a collection of machine learning, natural language processing, linguistic, and statistical methods that can be leveraged to analyze text data. Machine-learning algorithms are generally built to work on numeric data, so extra processing plus feature extraction and engineering is needed in text analytics to make regular machine learning and statistical methods work on unstructured data.
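
As a tiny illustration of that extra processing (base R only; the two sample sentences are made up for the example), raw text can be converted into a numeric document-term matrix of term frequencies that standard algorithms can consume:

```r
docs <- c("social media data is big data",
          "analytics turns data into insights")
tokens <- strsplit(tolower(docs), "\\s+")      # tokenize on whitespace
vocab <- sort(unique(unlist(tokens)))          # build the vocabulary
# document-term matrix: one row per document, one column per term
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
print(dtm)
```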

Natural language processing, popularly known as NLP, aids in doing this. NLP is a specialized field at the intersection of computer science, engineering, and artificial intelligence, with its roots and origins in computational linguistics. Concepts and techniques from NLP are extremely useful in building applications and systems that enable interaction between machines and humans with the aid of natural language, which is indeed a daunting task. Some of the main applications of NLP are:

  • Question-answering systems

  • Speech recognition

  • Machine translation

  • Text categorization and classification

  • Text summarization

We will be using several concepts from these when we analyze unstructured textual data from social media in the upcoming chapters.

 

Summary


We've covered a lot in this chapter so I would like to commend your efforts for staying with us till the very end! We kicked off this chapter with a detailed look into social media, its scope, variants, significance and pitfalls. We also covered the basics of social media analytics, as well as the opportunities and challenges involved, to whet your appetite for social media analytics and to get us geared up for the journey we'll be taking throughout the course of this book. A complete refresher of the R programming language was also covered in detail, especially with regard to setting up a proper analytics environment, core structures, constructs and features of R. Finally, we took a quick glance at the basic concepts of data analytics, the industry standard process for analytics, and covered the core essentials of machine learning, text analytics and natural language processing.

We will be looking at analyzing data from various popular social media platforms in future chapters so get ready to do some serious analysis on social media!

About the Authors
  • Dipanjan Sarkar

    Dipanjan (DJ) Sarkar is a Data Scientist at Intel, leveraging data science, machine learning, and deep learning to build large-scale intelligent systems. He holds a master of technology degree with specializations in Data Science and Software Engineering. He has been an analytics practitioner for several years now, specializing in machine learning, NLP, statistical methods, and deep learning. He is passionate about education and also acts as a Data Science Mentor at various organizations like Springboard, helping people learn data science. He is also a key contributor and editor for Towards Data Science, a leading online journal on AI and Data Science. He has also authored several books on R, Python, machine learning, NLP, and deep learning.

  • Karthik Ganapathy

  • Raghav Bali
  • Tushar Sharma

    Tushar Sharma has a master's degree specializing in data science from the International Institute of Information Technology, Bangalore. He works as a data scientist with Intel. In his previous job, he worked as a research engineer for a financial consultancy firm. His work involves handling big data at scale generated by the massive infrastructure at Intel. He engineers and delivers end-to-end solutions on this data using the latest machine learning tools and frameworks. He is proficient in R, Python, Spark, and the mathematical aspects of machine learning, among other things. Tushar has a keen interest in everything related to technology. He likes to read a wide array of books ranging from history to philosophy and beyond. He is a running enthusiast and likes to play badminton and tennis.
