Responsive Visualizations Using D3.js and Bootstrap

Packt
01 Mar 2016
21 min read
In this article by Christoph Körner, the author of the book Learning Responsive Data Visualization, we will design and implement a responsive data visualization using Bootstrap and Media Queries based on real data. We will cover the following topics:

- Absolute and relative units in the browser
- Drawing charts with percentage values
- Adapting charts using JavaScript event listeners
- Learning to adapt the resolution of the data
- Using Bootstrap's Media Queries
- Understanding how to use Media Queries in CSS, LESS, and JavaScript
- Learning how to use Bootstrap's grid system

(For more resources related to this topic, see here.)

First, we will discuss the most important absolute and relative units that are available in modern browsers. You will learn the difference between absolute pixels, relative percentages, em, rem, and many more. In the next section, we will take a look at what is really needed for a chart to be responsive. Adapting the width to the parent element is one of the requirements, and you will learn about two different ways to implement this. After this section, you will know when to use percentage values and when to use JavaScript event listeners. We will also take a look at adapting the data resolution, which is another important property of responsive visualizations.

In the next section, we will explore Media Queries and understand how we can use them to make viewport-dependent responsive charts. We will take advantage of Bootstrap's definitions of Media Queries for the most common device resolutions and integrate them into our responsive chart using CSS or LESS. Finally, we will also see how to include Media Queries in JavaScript. In the last section, we will take a look at Bootstrap's grid system and learn how to seamlessly integrate it with the charts. This will give us not only great flexibility but will also make it easier to combine multiple charts into one big dashboard application.

Units and lengths in the browser

Creating a responsive design, be it a website or graphics, depends strongly on the units and lengths that a browser can interpret. We can easily create an element that fills the entire width of a container using percentage values that are relative to the parent container, whereas achieving the same result with absolute values can be very tricky. Thus, mastering responsive graphics also means knowing all the absolute and relative units that are available in the browser.

Units for absolute lengths

The most convenient and popular way in web design and development is to define and measure lengths and dimensions in absolute units, usually in pixels. The reason for this is that designers and developers often want to specify the exact dimensions of an object. The pixel unit, px, was introduced as a visual unit based on a physical measurement, intended to be read on a device at a distance of approximately one arm's length; however, all modern browsers also allow definitions of lengths based on physical units. The following list shows the most common absolute units and their relations to each other:

- cm: centimeters (1 cm = 96 px/2.54)
- mm: millimeters (1 mm = 1/10th of 1 cm)
- in: inches (1 in = 2.54 cm = 96 px)
- pt: points (1 pt = 1/72nd of 1 in)
- px: pixels (1 px = 1/96th of 1 in)

More information on the origin and meaning of the pixel unit can be found in the CSS3 Specification available at http://www.w3.org/TR/css3-values/#viewport-relative-lengths.
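As a quick sanity check of these fixed ratios, the following CSS rules (the class names are purely illustrative and not from the original text) should all render at exactly the same width in a conforming browser:

/* All of these resolve to the same length: 96 CSS pixels. */
.box-px { width: 96px; }
.box-in { width: 1in; }      /* 1 in = 96 px */
.box-pt { width: 72pt; }     /* 72 pt = 1 in = 96 px */
.box-cm { width: 2.54cm; }   /* 2.54 cm = 1 in = 96 px */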
Units for relative lengths

In addition to absolute lengths, relative lengths expressed as a percentage of the width or height of a parent element have also been a common technique for styling dynamic elements of web pages. Traditionally, the % unit has always been the unit of choice for this reason. However, with CSS3, a couple of additional relative units have found their way into the browsers, for example, to define a length relative to the font size of the element. Here is a list of the relative length units that have been specified in the CSS3 specification and will soon be available in modern browsers:

- %: a percentage of the width/height of the absolute container
- em: a factor of the font size of the element
- rem: a factor of the font size of the root element
- vw: 1% of the viewport's width
- vh: 1% of the viewport's height
- vmin: 1% of the viewport's smaller dimension (either vw or vh)
- vmax: 1% of the viewport's larger dimension (either vw or vh)

I am aware that as web developers, we cannot really take advantage of any technology that will be supported soon; however, I want to point out one unit that will play an important role for future web developers: the rem unit. The rem unit defines the length of an element based on the font size of the root node (the html node in this case) rather than the font size of the current element, as em does. The rem unit is very powerful if we use it to define the lengths and spacings of a layout because the layout can then also adapt when the user increases the font size of the browser (that is, for readability). I want to mention that Bootstrap 4 will replace all absolute pixel units for Media Queries with rem units for this very reason. If we look at the following figure, we also see that rem units are already supported in all major browsers. I recommend that you start replacing all the absolute pixel units of your layout and spacing with rem units:

Cross-browser compatibility of rem units

However, percentage units are not dead; we can still use them when they are appropriate. We will use them later to draw SVG elements with dimensions based on their parent elements' dimensions.

Units for resolution

To round up this section on relative and absolute units, I want to mention that we can also express resolutions in different units. These resolution units can be used in Media Queries together with the min-resolution or max-resolution attribute:

- dpi: dots per inch
- dpcm: dots per centimeter
- dppx: dots per px unit

Mathematical expressions

We often have the problem of dealing with rational numbers or expressions in CSS; just imagine defining a 3-column grid with a width of 33% per column, or imagine the need to compute a simple expression in the CSS file. CSS3 provides a simple solution for this:

- calc(exp): This computes the mathematical expression exp, which can consist of lengths, values, and the operators +, -, /, and *

Note that the + and - operators must be surrounded by whitespace. Otherwise, they will be interpreted as the sign of the second number rather than as an operator. The other two operators, * and /, don't require whitespace, but I encourage you to add it for consistency. We can use these expressions in the following snippets.
.col-4 {
  width: calc(100%/3);
}
.col-sp-2 {
  width: calc(50% - 2em);
}

The preceding examples look great; however, as we can see in the following figure, we need to take care of the limitations of browser compatibility:

Cross-browser compatibility of the calc() expression

Responsive charts

Now that we know some basics about absolute and relative units, we can start to define, design, and implement responsive charts. A responsive chart is a chart that automatically adapts its look and feel to the resolution of the user's device; thus, responsive charts need to adapt the following properties:

- The dimensions (width and height)
- The resolution of the data points
- The interactions and interaction areas

Adapting the dimensions is the most obvious. The chart should always scale and adapt to the width of its parent element. In the previous section, you learned about relative and absolute lengths, so one might think that simply using relative values for the chart's dimensions would be enough. However, there are multiple ways, each with advantages and disadvantages, to achieve this; in this section, we will discuss two of them.

Adapting the resolution of the data is a little less obvious and often neglected. The resolution of the data points (the number of data points per pixel) should adapt so that we can see more points on a device with a higher resolution and fewer points on a low-resolution screen. In this section, we will see that this can only be achieved using JavaScript event listeners and by redrawing/updating the whole chart manually.

Adapting interactions and interaction areas is important not just for different screen resolutions but also for different devices. We interact differently with a TV than with a computer, and we use different input devices on a desktop and on a mobile phone. The chart should therefore allow interactions and interaction areas that are appropriate for a given device and screen resolution.

Using relative lengths in SVG

The first and most obvious solution for adapting the dimensions of a chart to its parent container is the use of relative values for lengths and coordinates. This means that we define the chart once with relative values, and the browser takes care of recomputing all the values when the dimensions of the parent container change; no manual redrawing of the chart is required.

First, we will add some CSS styles to scale the SVG element to the full width of the parent container:

.chart {
  height: 16rem;
  position: relative;
}
.chart > svg {
  width: 100%;
  height: 100%;
}

Next, we modify all our scales to work on a range of [0, 100] and subtract a padding from both sides:

var xScale = d3.scale.ordinal()
  .domain(flatData.map(xKey))
  .rangeBands([padding, 100 - 2*padding]);

var yScale = d3.scale.linear()
  .domain([0, d3.max(flatData, yKey)])
  .range([100 - 2*padding, padding]);

Finally, we can draw the chart as before, simply adding percentage signs (%) at the end of the attributes to indicate the use of percentage units:

$$bars
  .attr('x', function(d) { return (xScale(d.x) + i*barWidth) + '%'; })
  .attr('y', function(d) { return yScale(d.y) + '%'; })
  .attr('height', function(d) { return (yScale(0) - yScale(d.y)) + '%'; })
  .attr('width', barWidth + '%')
  .attr('fill', colors(data.key));

Observe that we only slightly modified the code that plots the bars of the bar chart in order to use percentage values as coordinates for the attributes. However, the effect of this small change is enormous.
In the following figure, we can see the result of the chart in a browser window:

Bar chart using relative lengths

If we now increase the size of the browser, the bar chart scales nicely to the full width of the parent container. We can see the scaled chart in the following figure:

Scaled bar chart using relative lengths

If you are not impressed by it now, you'd better be. This is awesome in my opinion because it leaves all the hard work of recomputing the SVG element dimensions to the browser. We don't have to care about them, and these native computations give us maximal performance.

The previous example shows how to use percentage values to create a simple bar chart. However, what we haven't explained so far is why we didn't add any axes or labels to the chart. Despite the fact that we can exploit the native rescaling of the browser, we need to face the limitations of this technique. Relative values are only allowed in standard attributes, such as width, height, x, y, cx, and cy, but not in SVG paths or transform functions.

Conclusion about using relative lengths

While this sounds like an excellent solution (and indeed it is wonderful for certain use cases), it has two major drawbacks:

- Percentage values are not accepted in the SVG transform attributes or in the d attribute of path elements, only in standard attributes
- Because the browser recomputes all the values automatically, we cannot adapt the resolution of the data of the charts

The first point is the biggest drawback, which means we can only position elements using the standard attributes width, height, x, y, cx, cy, and so on. However, we can still draw a bar chart that seamlessly adapts to its parent element without the use of JavaScript event listeners. The second point doesn't play a huge role compared to the first one, and it can be circumvented using additional JavaScript event listeners, but I am sure you get the point.

Using the JavaScript resize event

The second option is to use JavaScript event handlers and redraw the chart manually when the dimensions of the parent container change. Using this technique, we can always measure the width of the parent container (in absolute units) and use this length to update and redraw the chart accordingly. This gives us great flexibility over the data resolution, and we can also adapt the chart to a different aspect ratio when needed.

The native resize event

Theoretically, this solution sounds brilliant: we simply watch the parent container, or even the SVG container itself (if it uses a width of 100%), for resize events, and then redraw the chart when the dimensions of the element change. However, there is no native resize event on the div or svg elements; modern browsers only support the resize event on the window element. Hence, it triggers only if the dimensions of the browser window change. This also means that we need to clean up listeners once we remove a chart from the page. Although this is a limitation, in most cases, we can still use the window resize event to adapt the chart to its parent container; we just have to keep this in mind.
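One practical caveat worth noting before we wire this up (the helper below is an assumption of mine, not part of the original example): the window resize event fires many times per second while the user drags the window edge, so redrawing a whole chart on every event can be expensive. A common mitigation is to debounce the handler so that the redraw only runs once resizing has settled:

// Minimal debounce helper: postpones fn until `delay` ms
// have passed without another call.
function debounce(fn, delay) {
  var timer = null;
  return function () {
    clearTimeout(timer);
    timer = setTimeout(fn, delay);
  };
}

// Hypothetical usage with the redraw function defined below:
// window.addEventListener('resize', debounce(redraw, 150));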
Let's always use the parent container's absolute dimensions for drawing and redrawing the chart. We need to define the following inside a redraw function:

var width = chart.clientWidth;
var height = width / aspectRatio;

Now, we can add a resize event listener to the window element and call the redraw function whenever the window dimensions change:

window.addEventListener('resize', function(event) {
  redraw();
});

The benefit of this solution is that we can do everything we want in the redraw function, for example, modifying the aspect ratio, adapting the labels of the axes, or modifying the number of displayed elements. The following figure shows a resized version of the previous chart; we observe that this time, the axis ticks adapt nicely and don't overlap anymore. Moreover, the axis ticks now take up the full available space:

Resized chart with adapted axis ticks

Adapting the resolution of the data

There is another problem that can be nicely solved using these types of manual redraws: the problem of data resolution. How much data should be displayed in a small chart, and how much in a bigger chart?

Small chart with high data resolution

I think you will agree that in the preceding figure, we display too much data for the size of the graphic. This is bad and makes the chart useless. We should really adapt the resolution of the data to the small viewport in the redrawing process. Let's implement a function that returns only every i-th element of an array:

function adaptResolution(data, resolution) {
  resolution = resolution ? Math.ceil(resolution) : 1;
  return data.filter(function(d, i) {
    return i % resolution === 0;
  });
}

Great; now let's define a width-dependent data resolution and filter the data accordingly:

var pixelsPerData = 20;
var resolution = pixelsPerData * (flatData.length) / width;

With the preceding code, we can now define the minimum number of pixels that one data point should occupy and reduce the number of values accordingly by calling the following:

var flatDataRes = adaptResolution(flatData, resolution);

The following image shows a small chart with a low number of values, which is perfectly readable even though it is very small:

Small chart with a proper data resolution

In the next figure, we can see the same chart, based on the same data, drawn in a bigger container. We immediately observe that the data resolution adapts accordingly; and again, the chart looks nice:

Big chart with a proper data resolution

Conclusion on using resize events

This is the most flexible solution, and therefore, in many situations, it is the solution of choice. However, you need to be aware that this solution also has drawbacks:

- There is no easy way to listen for resize events of the parent container
- We need to add event listeners
- We need to make sure that event listeners are removed properly
- We need to manually redraw the chart

Using Bootstrap's Media Queries

Bootstrap is an awesome library that gets you started quickly with new projects. It not only includes a huge number of useful HTML components but also normalized and standardized CSS styles. One particular style is the implementation of Media Queries for four typical device types (five types in Bootstrap 4). In this section, we will take a look at how to make use of these Media Queries in our styles and scripts.
The great thing about Bootstrap is that it successfully standardizes typical device dimensions for web developers; thus, beginners can simply use them without wondering over and over which pixel width might be the most common one for tablets.

Media Queries in CSS

The quickest way to use Bootstrap's Media Queries is to simply copy them from the compiled source code. Here are the queries:

/* Extra small devices (phones, etc. less than 768px) */
/* No media query since this is the default in Bootstrap */

/* Small devices (tablets, etc.) */
@media (min-width: 768px) { ... }

/* Medium devices (desktops, 992px and up) */
@media (min-width: 992px) { ... }

/* Large devices (large desktops, 1200px and up) */
@media (min-width: 1200px) { ... }

We can easily add these queries to our CSS styles and define certain properties and styles for our visualizations, such as four predefined widths, aspect ratios, spacings, and so on, in order to adapt the chart's appearance to the device type of the user.

Bootstrap 4 is currently in alpha; however, I think you can already start using its predefined device types in your CSS. The reason I am strongly arguing for Bootstrap 4 is its shift towards em units instead of pixels:

// Extra small devices (portrait phones, etc.)
// No media query since this is the default in Bootstrap

// Small devices (landscape phones, etc.)
@media (min-width: 34em) { ... }

// Medium devices (tablets, etc.)
@media (min-width: 48em) { ... }

// Large devices (desktops, etc.)
@media (min-width: 62em) { ... }

// Extra large devices (large desktops, etc.)
@media (min-width: 75em) { ... }

Once again, the huge benefit of this is that the layout can adapt when the user increases the font size of the browser, for example, to enhance readability.

Media Queries in LESS/SASS

In Bootstrap 3, you can include Media Query mixins in your LESS file, which then gets compiled to plain CSS. To use these mixins, you have to create a LESS file instead of a CSS file and import Bootstrap's variables.less file. In this file, Bootstrap defines all its dimensions, colors, and other variables. Let's create a style.less file and import variables.less:

// style.less
@import "bower_components/bootstrap/less/variables.less";

Perfect, that's all. Now, we can go ahead and start using Bootstrap's device types in our LESS file:

/* Extra small devices (phones, etc. less than 768px) */
/* No media query since this is the default in Bootstrap */

/* Small devices (tablets, etc.) */
@media (min-width: @screen-sm-min) { ... }

/* Medium devices (desktops, etc.) */
@media (min-width: @screen-md-min) { ... }

/* Large devices (large desktops, etc.) */
@media (min-width: @screen-lg-min) { ... }

Finally, we need to use a LESS compiler to transform our style.less file into plain CSS. To achieve this, we run the following command from the terminal:

lessc style.less style.css

As we can see, the command requires the LESS compiler, lessc, to be installed. If it's not yet installed on your system, go ahead and install it using the following command:

npm install -g less

If you are new to LESS, I recommend that you read through the LESS documentation at http://lesscss.org/. Once you are comfortable with LESS, you can also look at the very similar SASS format, which is favored by Bootstrap 4. You can find the SASS documentation at http://sass-lang.com/. We can use the Bootstrap 4 Media Queries in a SASS file via the following mixins:

@include media-breakpoint-up(xs) { ... }
@include media-breakpoint-up(sm) { ... }
@include media-breakpoint-up(md) { ... }
@include media-breakpoint-up(lg) { ... }
@include media-breakpoint-up(xl) { ... }

In my opinion, including Bootstrap's LESS/SASS mixins in the styles of your visualization is the cleanest solution because you always compile your CSS from the latest Bootstrap source, and you don't have to copy CSS into your project.

Media Queries in JavaScript

Another great possibility for using Bootstrap's Media Queries to adapt your visualization to the user's device is to use them directly in JavaScript. The native window.matchMedia(mediaQuery) function gives you the same control over your JavaScript as Media Queries give you over CSS. Here is a little example of how to use it:

if (window.matchMedia("(min-width: 1200px)").matches) {
  /* the viewport is at least 1200 pixels wide */
} else {
  /* the viewport is less than 1200 pixels wide */
}

In the preceding code, we see that this function is quite easy to use and adds almost infinite customization possibilities to our visualization. More information about the matchMedia function can be found on the Mozilla website at https://developer.mozilla.org/de/docs/Web/API/Window/matchMedia.

However, apart from using the matchMedia function directly, we could also use a wrapper around the native API call. I can really recommend the enquire.js library by Nick Williams, which allows you to declare event listeners for viewport changes. It can be installed via the package manager bower by running the following command from the terminal:

bower install enquire

Then, we need to add enquire.js to the website and use the following snippet:

enquire.register("screen and (min-width:1200px)", {
  // triggers when the media query matches
  match: function() {
    /* the viewport is at least 1200 pixels wide */
  },
  // optional; triggers when the media query transitions
  unmatch: function() {
    /* the viewport is less than 1200 pixels wide */
  }
});

In the preceding code, we see that we can now add the match and unmatch listeners in almost the same way as we listen for resize events, just much more flexibly. More information about enquire.js can be found on the GitHub page of the project at https://github.com/WickyNilliams/enquire.js. If we would like to use the Bootstrap device types, we could easily implement them (as needed) with enquire.js and trigger events for each device type. However, I prefer staying very flexible and using the bare wrapper.

Using Bootstrap's Grid System

Another great and quick way of making your charts responsive and having them play nicely with Bootstrap is to integrate them into Bootstrap's grid system. The best and cleanest integration, however, is to separate concerns and make the visualization as general and adaptive as possible. Let's take our bar chart example with the custom resize events and integrate it into a simple grid layout.
As usual, you can find the full source code of the example in the code examples:

<div class="container">
  <div class="row">
    <div class="col-md-8">
      <div class="chart" data-url="…" …></div>
    </div>
    <div class="col-md-4">
      <h2>My Dashboard</h2>
      <p>This is a simple dashboard</p>
    </div>
  </div>
  <div class="row">
    <div class="col-md-4">
      <div class="chart" data-url="…" …></div>
    </div>
    <div class="col-md-4">
      <div class="chart" data-url="…" …></div>
    </div>
    <div class="col-md-4">
      <div class="chart" data-url="…" …></div>
    </div>
  </div>
  <div class="row">
    <div class="col-md-6">
      <div class="chart" data-url="…" …></div>
    </div>
    <div class="col-md-6">
      <div class="chart" data-url="…" …></div>
    </div>
  </div>
</div>

We observe that by making use of the parent containers' width, we can simply add the charts as div elements in the columns of the grid. This is the preferred integration, where two components play together nicely but are not dependent on each other. In the following figure, we can see a screenshot of the simple dashboard that we just built. We observe that the visualizations already fit nicely into our grid layout, which makes it easy to compose them together:

A simple dashboard using Bootstrap's grid layout

Summary

In this article, you learned the essentials about the absolute and relative units that define lengths in a browser. Remember that the em and rem units play an important role because they allow a layout to adapt when a user increases the font size of the website. Then, you learned how to use relative units and JavaScript resize events to adapt the chart size and the data resolution to the current container size. We looked into Media Queries in CSS, LESS, and JavaScript. Finally, we saw how to integrate charts with Bootstrap's grid system and implemented a simple Google Analytics-like dashboard with multiple charts.
Make Things Pretty with ggplot2

Packt
24 Feb 2016
30 min read
The objective of this article is to provide you with a general overview of the plotting environments in R and of the most efficient ways of coding your graphs in them. We will go through the most important Integrated Development Environments (IDEs) available for R as well as the most important packages available for plotting data; this will help you get an overview of what is available in R and how those packages compare with ggplot2. Finally, we will dig deeper into the grammar of graphics, which represents the basic concepts on which ggplot2 was designed. But first, let's make sure that you have a working version of R on your computer.

(For more resources related to this topic, see here.)

Getting ggplot2 up and running

You can download the most up-to-date version of R from the R project website (http://www.r-project.org/). There, you will find a direct connection to the Comprehensive R Archive Network (CRAN), a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. In addition to access to the CRAN servers, on the website of the R project, you may also find information about R, a few technical manuals, the R journal, and details about the packages developed for R and stored in the CRAN repositories. At the time of writing, the current version of R is 3.1.2. If you have already installed R on your computer, you can check the actual version with the R.Version() code, or, for a more concise result, you can use the R.version.string code, which recalls only part of the output of the previous function.

Packages in R

In the next few pages of this article, we will quickly go through the most important visualization packages available in R, so in order to try the code, you will also need to have additional packages as well as ggplot2 up and running in your R installation. In the basic R installation, you will already have the graphics package available and loaded in the session; the lattice package is already available among the standard packages delivered with the basic installation, but it is not loaded by default. ggplot2, on the other hand, will need to be installed. You can install and load a package with the following code:

> install.packages("ggplot2")
> library(ggplot2)

Keep in mind that every time R is started, you will need to load the packages you need with the library(name_of_the_package) command to be able to use the functions contained in them. In order to get a list of all the packages installed on your computer, you can use a call to the library() function without arguments. If, on the other hand, you would like to have a list of the packages currently loaded in the workspace, you can use the search() command. One more function that can turn out to be useful when managing your library of packages is .libPaths(), which provides you with the location of your R libraries. This function is very useful for tracing back the package libraries you are currently using, if any, in addition to the standard library of packages, which on Windows is located by default in a path of the kind C:/Program Files/R/R-3.1.2/library.
The following list is a short recap of the functions just discussed:

.libPaths()   # get library location
library()     # see all the packages installed
search()      # see the packages currently loaded

Integrated Development Environment (IDE)

You will definitely be able to run the code and the examples explained in this article directly from the standard R Graphical User Interface (GUI). However, if you frequently work with R on more complex projects, or simply if you like to keep an eye on the different components of your code, such as scripts, plots, and help pages, you may well consider using an IDE. The number of specific IDEs that integrate with R is still limited, but some of them are quite efficient, well designed, and open source.

RStudio

RStudio (http://www.rstudio.com/) is a very nice and advanced programming environment developed specifically for R, and it would be my recommended choice of IDE as the R programming environment in most cases. It is available for all the major platforms (Windows, Linux, and Mac OS X), and it can be run on a local machine, such as your computer, or even over the Web, using RStudio Server. With RStudio Server, you can connect a browser-based interface (the RStudio IDE) to a version of R running on a remote Linux server. RStudio integrates several useful functionalities, in particular if you use R for more complex projects. The way the software interface is organized allows you to keep an eye on the different activities you very often deal with in R, such as working on different scripts and overviewing the installed packages, as well as having easy access to the help pages and the plots generated. This last feature is particularly interesting for ggplot2 since, in RStudio, you will be able to easily access the history of the plots created instead of visualizing only the last created plot, as is the case in the default R GUI.

One other very useful feature of RStudio is code completion. You can, in fact, start typing a command, and upon pressing the Tab key, the interface will provide you with functions matching what you have written. This feature will turn out to be very useful in ggplot2, so you will not necessarily need to remember all the functions, and you will also have guidance for the arguments of the functions. In Figure 1.1, you can see a screenshot from the current version of RStudio (v 0.98.1091):

Figure 1.1: This is a screenshot of RStudio on Windows 8

The environment is composed of four different areas:

- Scripting area: In this area, you can open, create, and write scripts.
- Console area: This area is the actual R console, in which the commands are executed. It is possible to type commands directly here in the console or write them in a script and then run them on the console (I would recommend the latter option).
- Workspace/History area: In this area, you can find a practical summary of all the objects created in the workspace in which you are working and the history of the typed commands.
- Visualization area: Here, you can easily load packages, open R help files, and, even more importantly, visualize plots.

The RStudio website provides a lot of material on how to use the program, such as manuals, tutorials, and videos, so if you are interested, refer to the website for more details.

Eclipse and StatET

Eclipse (http://www.eclipse.org/) is a very powerful IDE that was mainly developed in Java and initially intended for Java programming.
Subsequently, several extension packages were also developed to optimize the programming environment for other programming languages, such as C++ and Python. Thanks to its original objective of being a tool for advanced programming, this IDE is particularly suited to dealing with very complex programming projects, for instance, if you are working on a big project folder with many different scripts. In these circumstances, Eclipse can help you keep your programming scripts in order and have easy access to them. One drawback of such a development environment is probably its big size (around 200 MB) and a slightly slow-starting environment.

Eclipse does not support interaction with R natively, so in order to be able to write your code and execute it directly in the R console, you need to add StatET to your basic Eclipse installation. StatET (http://www.walware.de/goto/statet) is a plugin for the Eclipse IDE, and it offers a set of tools for R coding and package building. More detailed information on how to install Eclipse and StatET and how to configure the connection between R and Eclipse/StatET can be found on the websites of the related projects.

Emacs and ESS

Emacs (http://www.gnu.org/software/emacs/) is a customizable text editor and is very popular, particularly in the Linux environment. Although this text editor appears with a very simple GUI, it is an extremely powerful environment, particularly thanks to the numerous keyboard shortcuts that allow very efficient interaction with the environment after getting some practice. Even though the user interface of a typical IDE, such as RStudio, is more sophisticated and advanced, Emacs may be useful if you need to work with R on systems with a poor graphical interface, such as servers and terminal windows. Like Eclipse, Emacs does not support interfacing with R by default, so you will need to install an add-on package on your Emacs that enables such a connection: Emacs Speaks Statistics (ESS). ESS (http://ess.r-project.org/) is designed to support the editing of scripts and interaction with various statistical analysis programs, including R. The objective of the ESS project is to provide efficient text editor support to statistical software, which in some cases comes with a more or less defined GUI, but for which the real power of the language is only accessible through the original scripting language.

The plotting environments in R

R provides a complete series of options for producing graphics, which makes it quite advanced with regard to data visualization. Over the next few sections of this article, we will go through the most important R packages for data visualization, quickly discussing some high-level differences and analogies. If you already have some experience with other R packages for data visualization, in particular graphics or lattice, the following sections will provide you with some references and examples of how the code used in those packages compares with that used in ggplot2. Moreover, you will also get an idea of the typical layout of the plots created with a certain package, so you will be able to identify the tool used to create the plots you come across. The core of graphics visualization in R is within the grDevices package, which provides the basic structure of data plotting, such as the colors and fonts used in the plots.
Such a graphics engine was then used as the starting point in the development of more advanced and sophisticated packages for data visualization, the most commonly used being graphics and grid. The graphics package is often referred to as the base or traditional graphics environment since, historically, it was the first package for data visualization available in R, and it provides functions that allow the generation of complete plots. The grid package, on the other hand, provides an alternative set of graphics tools. This package does not directly provide functions that generate complete plots, so it is not frequently used directly to generate graphics, but it is used in the development of advanced data visualization packages. Among the grid-based packages, the most widely used are lattice and ggplot2, although they are built by implementing different visualization approaches: Trellis plots in the case of lattice and the grammar of graphics in the case of ggplot2. We will describe these principles in more detail in the coming sections.

A diagram representing the connections between the tools just mentioned is shown in Figure 1.2. Just keep in mind that this is not a complete overview of the packages available but simply a small snapshot of the packages we will discuss. Many other packages are built on top of the tools just mentioned, but in the following sections, we will focus on the most relevant packages used in data visualization, namely graphics, lattice, and, of course, ggplot2. If you would like to get a more complete overview of the graphics tools available in R, have a look at the web page of the R project summarizing such tools: http://cran.r-project.org/web/views/Graphics.html.

Figure 1.2: This is an overview of the most widely used R packages for graphics

In order to see some examples of plots in graphics, lattice, and ggplot2, we will go through a few examples of different plots over the following pages. The objective of providing these examples is not to do an exhaustive comparison of the three packages but simply to provide you with a simple comparison of how the different code as well as the default plot layouts appear for these different plotting tools. For these examples, we will use the Orange dataset available in R; to load it into the workspace, simply write the following code:

> data(Orange)

This dataset contains records of the growth of orange trees. You can have a look at the data by recalling its first lines with the following code:

> head(Orange)

You will see that the dataset contains three columns. The first one, Tree, is an ID number indicating the tree on which the measurement was taken, while age and circumference refer to the age in days and the size of the tree in millimeters, respectively. If you want more information about this data, you can have a look at the help page of the dataset by typing the following code:

?Orange

Here, you will find the reference of the data as well as a more detailed description of the variables included.

Standard graphics and grid-based graphics

The existence of these two different graphics environments brings a question to most users' minds: which package should be used, and under which circumstances? For simple and basic plots, where the data simply needs to be represented in a standard plot type (such as a scatter plot, histogram, or boxplot) without any additional manipulation, all the plotting environments are fairly equivalent.
In fact, it would probably be possible to produce the same type of plot with graphics as well as with lattice or ggplot2. Nevertheless, in general, the default graphic output of ggplot2 or lattice will most likely be superior to that of graphics since both of these packages are designed with the principles of human perception deeply in mind, to make the evaluation of the data contained in plots easier. When more complex data needs to be analyzed, the grid-based packages, lattice and ggplot2, present more sophisticated support for the analysis of multivariate data. On the other hand, these tools require greater effort to become proficient in because of their flexibility and advanced functionalities. In both cases, lattice and ggplot2 provide a full set of tools for data visualization, so you will not need to use grid directly in most cases; you will be able to do all your work directly with one of those packages.

Graphics and standard plots

The graphics package was originally developed based on the experience of the graphics environment in R. The approach implemented in this package is based on the principle of the pen-on-paper model, where the plot is drawn in the first function call, and once content is added, it cannot be deleted or modified. In general, the functions available in this package can be divided into high-level and low-level functions. High-level functions are capable of drawing the actual plot, while low-level functions are used to add content to a graph that was already created with a high-level function.

Let's assume that we would like to have a look at how age is related to the circumference of the trees in our Orange dataset; we could simply plot the data on a scatter plot using the high-level function plot(), as shown in the following code:

plot(age~circumference, data=Orange)

This code creates the graph in Figure 1.3. As you will have noticed, we obtained the graph directly with a call to a function that contains the variables to plot in the form y~x, and the dataset in which to locate them. As an alternative, instead of using a formula expression, you can use a direct reference to x and y, using code in the form plot(x,y). In this case, you will have to use a direct reference to the data instead of using the data argument of the function. Type in the following code:

plot(Orange$circumference, Orange$age)

The preceding code results in the following output:

Figure 1.3: Simple scatterplot of the dataset Orange using graphics

For the time being, we are not interested in the plot's details, such as the title or the axes, but we will simply focus on how to add elements to the plot we just created. For instance, if we want to include a regression line as well as a smooth line to get an idea of the relation between the data, we should use low-level functions to add the additional lines to the plot; this is done with the abline() and lines() functions:

plot(age~circumference, data=Orange)   ### Create basic plot
abline(lm(Orange$age~Orange$circumference), col="blue")
lines(loess.smooth(Orange$circumference, Orange$age), col="red")

The graph generated as the output of this code is shown in Figure 1.4:

Figure 1.4: This is a scatterplot of the Orange data with a regression line (in blue) and a smooth line (in red) realized with graphics

As illustrated, with this package, we have built a graph by first calling one function, which draws the main plot frame, and then additional elements were included using other functions.
With graphics, only additional elements can be included in the graph without changing the overall plot frame defined by the plot() function. This ability to add several graphical elements together to create a complex plot is one of the fundamental elements of R, and you will notice how all the different graphical packages rely on this principle. If you are interested in other code examples of plots in graphics, there is also some demo code available in R for this package; it can be visualized with demo(graphics).

In the coming sections, you will find a quick reference to how you can generate a similar plot using graphics and ggplot2. As will be described in more detail later on, in ggplot2, there are two main functions for realizing plots, ggplot() and qplot(). The qplot() function is a wrapper function designed to easily create basic plots with ggplot2, and its code is similar to that of the plot() function of graphics. Due to its simplicity, this function is the easiest way to start working with ggplot2, so we will use it in the examples in the following sections. The code in these sections also uses our example dataset Orange; in this way, you can run the code directly on your console and see the resulting output.

Scatterplot with individual data points

To generate the plot using graphics, use the following code:

plot(age~circumference, data=Orange)

The preceding code results in the following output:

To generate the plot using ggplot2, use the following code:

qplot(circumference, age, data=Orange)

The preceding code results in the following output:

Scatterplot with the line of one tree

To generate the plot using graphics, use the following code:

plot(age~circumference, data=Orange[Orange$Tree==1,], type="l")

The preceding code results in the following output:

To generate the plot using ggplot2, use the following code:

qplot(circumference, age, data=Orange[Orange$Tree==1,], geom="line")

The preceding code results in the following output:

Scatterplot with the line and points of one tree

To generate the plot using graphics, use the following code:

plot(age~circumference, data=Orange[Orange$Tree==1,], type="b")

The preceding code results in the following output:

To generate the plot using ggplot2, use the following code:

qplot(circumference, age, data=Orange[Orange$Tree==1,], geom=c("line","point"))

The preceding code results in the following output:

Boxplot of the Orange dataset

To generate the plot using graphics, use the following code:

boxplot(circumference~Tree, data=Orange)

The preceding code results in the following output:

To generate the plot using ggplot2, use the following code:

qplot(Tree, circumference, data=Orange, geom="boxplot")

The preceding code results in the following output:

Boxplot with individual observations

To generate the plot using graphics, use the following code:

boxplot(circumference~Tree, data=Orange)
points(circumference~Tree, data=Orange)

The preceding code results in the following output:

To generate the plot using ggplot2, use the following code:

qplot(Tree, circumference, data=Orange, geom=c("boxplot","point"))

The preceding code results in the following output:

Histogram of the Orange dataset

To generate the plot using graphics, use the following code:

hist(Orange$circumference)

The preceding code results in the following output:

To generate the plot using ggplot2, use the following code:

qplot(circumference, data=Orange, geom="histogram")

The preceding code results in the following output:

Histogram with a reference line at the median value in red

To generate the plot using graphics, use the following code:

hist(Orange$circumference)
abline(v=median(Orange$circumference), col="red")

The preceding code results in the following output:

To generate the plot using ggplot2, use the following code:

qplot(circumference, data=Orange, geom="histogram") +
  geom_vline(xintercept = median(Orange$circumference), colour="red")

The preceding code results in the following output:

Lattice and the Trellis plots

Along with graphics, the base R installation also includes the lattice package. This package implements a family of techniques known as Trellis graphics, proposed by William Cleveland to visualize complex datasets with multiple variables. The objective of those design principles was to ensure the accurate and faithful communication of data information. These principles are embedded in the package and are already evident in the default plot design settings. One interesting feature of Trellis plots is the option of multipanel conditioning, which creates multiple plots by splitting the data on the basis of one variable. A similar option is also available in ggplot2, but in that case, it is called faceting.

In lattice, we also have functions that are able to generate a plot with one single call, but once the plot is drawn, it is already final. Consequently, plot details, as well as additional elements that need to be included in the graph, must already be specified within the call to the main function. This is done by including all the specifications in the panel function argument. These specifications can be included directly in the main body of the function or specified in an independent function, which is then called; this last option usually generates more readable code, so it is the approach used in the following examples. For instance, if we want to draw the same plot we just generated in the previous section with graphics, containing the age and circumference of trees and also the regression and smooth lines, we need to specify such elements within the function call. You may see an example of the code here; remember that lattice needs to be loaded in the workspace:

require(lattice)                   ## Load lattice if needed
myPanel <- function(x,y){
  panel.xyplot(x,y)                # Add the observations
  panel.lmline(x,y, col="blue")    # Add the regression line
  panel.loess(x,y, col="red")      # Add the smooth line
}
xyplot(age~circumference, data=Orange, panel=myPanel)

This code produces the plot in Figure 1.5:

Figure 1.5: This is a scatter plot of the Orange data with the regression line (in blue) and the smooth line (in red) realized with lattice

As you will have noticed, code differences aside, the plot generated does not look very different from the one obtained with graphics. This is because we are not using any special visualization feature of lattice. As mentioned earlier, with this package, we have the option of multipanel conditioning, so let's take a look at that. Let's assume that we want to have the same plot but for the different trees in the dataset. Of course, in this case, you would not need the regression or the smooth line, since there will only be one tree in each plot window, but it could be nice to have the different observations connected.
This is shown in the following code:

myPanel <- function(x,y){
  panel.xyplot(x,y, type="b")   # Add the connected observations
}
xyplot(age~circumference | Tree, data=Orange, panel=myPanel)

This code generates the graph shown in Figure 1.6:

Figure 1.6: This is a scatterplot of the Orange data realized with lattice, with one subpanel representing the individual data of each tree. The number of the tree in each panel is reported in the upper part of the plot area

As illustrated, using the vertical bar |, we are able to obtain the plot conditional on the value of the variable Tree. In the upper part of the panels, you will notice the reference to the value of the conditional variable, which, in this case, is the column Tree. As mentioned before, ggplot2 offers this option too; we will see one example of it in the next section. In the next section, you will also find a quick reference to how to convert a typical plot type from lattice to ggplot2. In this case, the examples are adapted to the typical plotting style of lattice plots.

Scatterplot with individual observations

To plot the graph using lattice, use the following code:

xyplot(age~circumference, data=Orange)

The preceding code results in the following output:

To plot the graph using ggplot2, use the following code:

qplot(circumference, age, data=Orange)

The preceding code results in the following output:

Scatterplot of the Orange dataset with faceting

To plot the graph using lattice, use the following code:

xyplot(age~circumference|Tree, data=Orange)

The preceding code results in the following output:

To plot the graph using ggplot2, use the following code:

qplot(circumference, age, data=Orange, facets=~Tree)

The preceding code results in the following output:

Faceting scatterplot with line and points

To plot the graph using lattice, use the following code:

xyplot(age~circumference|Tree, data=Orange, type="b")

The preceding code results in the following output:

To plot the graph using ggplot2, use the following code:

qplot(circumference, age, data=Orange, geom=c("line","point"), facets=~Tree)

The preceding code results in the following output:

Scatterplot with grouping data

To plot the graph using lattice, use the following code:

xyplot(age~circumference, data=Orange, groups=Tree, type="b")

The preceding code results in the following output:

To plot the graph using ggplot2, use the following code:

qplot(circumference, age, data=Orange, color=Tree, geom=c("line","point"))

The preceding code results in the following output:

Boxplot of the Orange dataset

To plot the graph using lattice, use the following code:

bwplot(circumference~Tree, data=Orange)

The preceding code results in the following output:

To plot the graph using ggplot2, use the following code:

qplot(Tree, circumference, data=Orange, geom="boxplot")

The preceding code results in the following output:

Histogram of the Orange dataset

To plot the graph using lattice, use the following code:

histogram(Orange$circumference, type="count")

The preceding code results in the following output:

To plot the graph using ggplot2, use the following code:

qplot(circumference, data=Orange, geom="histogram")

The preceding code results in the following output:

Histogram with a reference line at the median value in red

To plot the graph using lattice, use the following code:

histogram(~circumference, data=Orange, type="count",
  panel=function(x, ...){
    panel.histogram(x, ...)
    panel.abline(v=median(x), col="red")
  })

The preceding code results in the following output:

To plot the graph using ggplot2, use the following code:

qplot(circumference, data=Orange, geom="histogram") +
  geom_vline(xintercept = median(Orange$circumference), colour="red")

The preceding code results in the following output:

ggplot2 and the grammar of graphics

The ggplot2 package was developed by Hadley Wickham by implementing a completely different approach to statistical plots. As is the case with lattice, this package is also based on grid, and it provides a series of high-level functions that allow the creation of complete plots. Its approach implements the principles described in The Grammar of Graphics by Leland Wilkinson. Briefly, the grammar of graphics assumes that a statistical graphic is a mapping of data to the aesthetic attributes and geometric objects used to represent the data, such as points, lines, bars, and so on. Besides the aesthetic attributes, the plot can also contain statistical transformations or groupings of the data. As in lattice, in ggplot2, we have the possibility of splitting data by a certain variable, obtaining a representation of each subset of data in an independent subplot; such a representation in ggplot2 is called faceting.

In a more formal way, the main components of the grammar of graphics are the data and its mapping, aesthetics, geometric objects, statistical transformations, scales, coordinates, and faceting:

- The data that must be visualized is mapped to aesthetic attributes, which define how the data should be perceived
- Geometric objects describe what is actually displayed on the plot, such as lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw
- Statistical transformations are applied to the data to group it; examples of statistical transformations are the smooth line or the regression lines of the previous examples or the binning of the histograms
- Scales represent the connection between the aesthetic spaces and the actual values that should be represented; scales may also be used to draw legends
- Coordinates represent the coordinate system in which the data is drawn
- Faceting, which we have already mentioned, is the grouping of data into subsets defined by a value of one variable

In ggplot2, there are two main high-level functions capable of directly creating a plot: qplot() and ggplot(). qplot() stands for quick plot, and it is a simple function that serves a purpose similar to that served by the plot() function in graphics. The ggplot() function, on the other hand, is a much more advanced function that allows the user to have more control over the plot layout and details. In our journey into the world of ggplot2, we will see some examples of qplot(), in particular when we go through the different kinds of graphs, but we will dig a lot deeper into ggplot() since this last function is more suited to advanced examples. If you have a look at the different forums on R programming, there is quite a bit of discussion as to which of these two functions is more convenient to use. My general recommendation is that it depends on the type of graph you draw more frequently. For simple and standard plots, where only the data should be represented and only minor modifications of standard layouts are required, the qplot() function will do the job. On the other hand, if you need to apply particular transformations to the data or if you would just like to keep the freedom of controlling and defining the different details of the plot layout, I recommend that you focus on ggplot().
As you will see, the code for these two functions is not completely different since they are both based on the same underlying philosophy, but the way in which the options are set is quite different, so if you want to adapt a plot from one function to the other, you will essentially need to rewrite your code. If you just want to focus on learning only one of them, I would definitely recommend that you learn ggplot().

In the following code, you will see an example of a plot realized with ggplot2, where you can identify some of the components of the grammar of graphics. The example is realized with the ggplot() function, which allows a more direct comparison with the grammar of graphics, but immediately after it, you will also find the corresponding qplot() code. Both versions generate the graph depicted in Figure 1.7:

require(ggplot2)                              ## Load ggplot2
data(Orange)                                  ## Load the data

ggplot(data=Orange,                           ## Data used
    aes(x=circumference, y=age, color=Tree))+ ## Aesthetics
  geom_point()+                               ## Geometry
  stat_smooth(method="lm", se=FALSE)          ## Statistics

### Corresponding code with qplot()
qplot(circumference, age, data=Orange,        ## Data used
  color=Tree,                                 ## Aesthetic mapping
  geom=c("point","smooth"), method="lm", se=FALSE)

This simple example can give you an idea of the role of each portion of code in a ggplot2 graph; you have seen how the main function body creates the connection between the data and the aesthetics we are interested in representing, and how, on top of this, you add the components of the plot, as in this case, where we added the geometry element of points and the statistical element of regression. You can also notice how the components that need to be added to the main function call are included using the + sign.

One more thing worth mentioning at this point is that if you run just the main body of the ggplot() function, you will get an error message. This is because that call alone is not able to generate an actual plot. The plot is actually created when you include the geometric attribute, which, in this case, is geom_point(). This is perfectly in line with the grammar of graphics since, as we have seen, the geometry represents the actual connection between the data and what is represented on the plot. This is the stage where we specify that the data should be represented as points; before that, nothing was specified about which plot we were interested in drawing.

Figure 1.7: This is an example of plotting the Orange dataset with ggplot2

Summary

To learn more about similar technologies, the following books and videos published by Packt Publishing (https://www.packtpub.com/) are recommended:

- ggplot2 Essentials (https://www.packtpub.com/big-data-and-business-intelligence/ggplot2-essentials)
- Video: Building Interactive Graphs with ggplot2 and Shiny (https://www.packtpub.com/big-data-and-business-intelligence/building-interactive-graphs-ggplot2-and-shiny-video)
Building a Recommendation Engine with Spark

Packt
24 Feb 2016
44 min read
In this article, we will explore individual machine learning models in detail, starting with recommendation engines.

(For more resources related to this topic, see here.)

Recommendation engines are probably among the best types of machine learning model known to the general public. Even if people do not know exactly what a recommendation engine is, they have most likely experienced one through the use of popular websites such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. Recommendations are a core part of all these businesses, and in some cases, they drive significant percentages of their revenue.

The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid in the discovery process (in this way, it is similar and, in fact, often complementary to search engines, which also play a role in discovery). However, unlike search engines, recommendation engines try to present people with relevant content that they did not necessarily search for or that they might not even have heard of.

Typically, a recommendation engine tries to model the connections between users and some type of item. If we can do a good job of showing our users movies related to a given movie, we could aid in discovery and navigation on our site, again improving our users' experience, engagement, and the relevance of our content to them. However, recommendation engines are not limited to movies, books, or products. The techniques we will explore in this article can be applied to just about any user-to-item relationship as well as user-to-user connections, such as those found on social networks, allowing us to make recommendations such as people you may know or who to follow.

Recommendation engines are most effective in two general scenarios (which are not mutually exclusive). They are explained here:

Large number of available options for users: When there are a very large number of available items, it becomes increasingly difficult for the user to find something they want. Searching can help when the user knows what they are looking for, but often, the right item might be something previously unknown to them. In this case, being recommended relevant items that the user may not already know about can help them discover new items.

A significant degree of personal taste involved: When personal taste plays a large role in selection, recommendation models, which often utilize a wisdom of the crowd approach, can be helpful in discovering items based on the behavior of others that have similar taste profiles.

In this article, we will:

Introduce the various types of recommendation engines
Build a recommendation model using data about user preferences
Use the trained model to compute recommendations for a given user as well as compute similar items for a given item (that is, related items)
Apply standard evaluation metrics to the model that we created to measure how well it performs in terms of predictive capability

Types of recommendation models

Recommender systems are widely studied, and there are many approaches used, but there are two that are probably most prevalent: content-based filtering and collaborative filtering. Recently, other approaches such as ranking models have also gained in popularity. In practice, many approaches are hybrids, incorporating elements of many different methods into a model or combination of models.
Content-based filtering

Content-based methods try to use the content or attributes of an item, together with some notion of similarity between two pieces of content, to generate items similar to a given item. These attributes are often textual content (such as titles, names, tags, and other metadata attached to an item), or in the case of media, they could include other features of the item, such as attributes extracted from audio and video content.

In a similar manner, user recommendations can be generated based on attributes of users or user profiles, which are then matched to item attributes using the same measure of similarity. For example, a user can be represented by the combined attributes of the items they have interacted with. This becomes their user profile, which is then compared to item attributes to find items that match the user profile.

Collaborative filtering

Collaborative filtering is a form of wisdom of the crowd approach where the set of preferences of many users with respect to items is used to generate estimated preferences of users for items with which they have not yet interacted. The idea behind this is the notion of similarity.

In a user-based approach, if two users have exhibited similar preferences (that is, patterns of interacting with the same items in broadly the same way), then we would assume that they are similar to each other in terms of taste. To generate recommendations for unknown items for a given user, we can use the known preferences of other users that exhibit similar behavior. We can do this by selecting a set of similar users and computing some form of combined score based on the items they have shown a preference for. The overall logic is that if other users with similar tastes have shown a preference for a set of items, these items would tend to be good candidates for recommendation.

We can also take an item-based approach that computes some measure of similarity between items. This is usually based on the existing user-item preferences or ratings. Items that tend to be rated the same by similar users will be classed as similar under this approach. Once we have these similarities, we can represent a user in terms of the items they have interacted with and find items that are similar to these known items, which we can then recommend to the user. Again, a set of items similar to the known items is used to generate a combined score that estimates the user's preference for an unknown item.

The user- and item-based approaches are usually referred to as nearest-neighbor models, since the estimated scores are computed based on the set of most similar users or items (that is, their neighbors).

Finally, there are many model-based methods that attempt to model the user-item preferences themselves so that new preferences can be estimated directly by applying the model to unknown user-item combinations.

Matrix factorization

Since Spark's recommendation models currently only include an implementation of matrix factorization, we will focus our attention on this class of models. This focus is with good reason, as these types of models have consistently been shown to perform extremely well in collaborative filtering and were among the best models in well-known competitions such as the Netflix prize.

For more information on and a brief overview of the performance of the best algorithms for the Netflix prize, see http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html.
Explicit matrix factorization

When we deal with data that consists of preferences of users that are provided by the users themselves, we refer to explicit preference data. This includes, for example, ratings, thumbs up, likes, and so on that are given by users to items.

We can take these ratings and form a two-dimensional matrix with users as rows and items as columns. Each entry represents a rating given by a user to a certain item. Since in most cases, each user has only interacted with a relatively small set of items, this matrix has only a few non-zero entries (that is, it is very sparse).

As a simple example, let's assume that we have the following user ratings for a set of movies:

Tom, Star Wars, 5
Jane, Titanic, 4
Bill, Batman, 3
Jane, Star Wars, 2
Bill, Titanic, 3

We will form the following ratings matrix:

A simple movie-rating matrix

Matrix factorization (or matrix completion) attempts to directly model this user-item matrix by representing it as a product of two smaller matrices of lower dimension. Thus, it is a dimensionality-reduction technique. If we have U users and I items, then our user-item matrix is of dimension U x I and might look something like the one shown in the following diagram:

A sparse ratings matrix

If we want to find a lower dimension (low-rank) approximation to our user-item matrix with the dimension k, we would end up with two matrices: one for users of size U x k and one for items of size I x k. These are known as factor matrices. If we multiply these two factor matrices, we would reconstruct an approximate version of the original ratings matrix. Note that while the original ratings matrix is typically very sparse, each factor matrix is dense, as shown in the following diagram:

The user- and item-factor matrices

These models are often also called latent feature models, as we are trying to discover some form of hidden features (which are represented by the factor matrices) that account for the structure of behavior inherent in the user-item rating matrix. While the latent features or factors are not directly interpretable, they might, perhaps, represent things such as the tendency of a user to like movies from a certain director, genre, style, or group of actors, for example.

As we are directly modeling the user-item matrix, the prediction in these models is relatively straightforward: to compute a predicted rating for a user and item, we compute the vector dot product between the relevant row of the user-factor matrix (that is, the user's factor vector) and the relevant row of the item-factor matrix (that is, the item's factor vector). This is illustrated with the highlighted vectors in the following diagram:

Computing recommendations from user- and item-factor vectors

To find out the similarity between two items, we can use the same measures of similarity as we would use in the nearest-neighbor models, except that we can use the factor vectors directly by computing the similarity between two item-factor vectors, as illustrated in the following diagram:

Computing similarity with item-factor vectors
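To make these two computations concrete, here is a minimal sketch in plain Scala with hypothetical toy factor vectors (the values and names are our own illustration, not MLlib code):

// Hypothetical factor vectors with k = 3 latent features
val userFactor = Array(0.2, -1.1, 0.4)
val itemFactor = Array(1.5, -0.3, 0.9)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// The predicted rating is simply the dot product of the two vectors:
// 0.2*1.5 + (-1.1)*(-0.3) + 0.4*0.9 = 0.99
val predictedRating = dot(userFactor, itemFactor)

// Item-to-item similarity: cosine of the angle between two item-factor vectors
def cosine(a: Array[Double], b: Array[Double]): Double =
  dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))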
The benefit of factorization models is the relative ease of computing recommendations once the model is created. However, for very large user and item sets, this can become a challenge, as it requires storage and computation across potentially many millions of user- and item-factor vectors. Another advantage, as mentioned earlier, is that they tend to offer very good performance.

Projects such as Oryx (https://github.com/OryxProject/oryx) and Prediction.io (https://github.com/PredictionIO/PredictionIO) focus on model serving for large-scale models, including recommenders based on matrix factorization.

On the downside, factorization models are relatively more complex to understand and interpret compared to nearest-neighbor models and are often more computationally intensive during the model's training phase.

Implicit matrix factorization

So far, we have dealt with explicit preferences such as ratings. However, much of the preference data that we might be able to collect is implicit feedback, where the preferences between a user and item are not given to us, but are, instead, implied from the interactions they might have with an item. Examples include binary data (such as whether a user viewed a movie, whether they purchased a product, and so on) as well as count data (such as the number of times a user watched a movie).

There are many different approaches to deal with implicit data. MLlib implements a particular approach that treats the input rating matrix as two matrices: a binary preference matrix, P, and a matrix of confidence weights, C.

For example, let's assume that the user-movie ratings we saw previously were, in fact, the number of times each user had viewed that movie. The two matrices would look something like the ones shown in the following screenshot. Here, the matrix P informs us that a movie was viewed by a user, and the matrix C represents the confidence weighting, in the form of the view counts; generally, the more a user has watched a movie, the higher the confidence that they actually like it.

Representation of an implicit preference and confidence matrix

The implicit model still creates a user- and item-factor matrix. In this case, however, the matrix that the model is attempting to approximate is not the overall ratings matrix but the preference matrix P. If we compute a recommendation by calculating the dot product of a user- and item-factor vector, the score will not be an estimate of a rating directly. It will rather be an estimate of the preference of a user for an item (though not strictly between 0 and 1, these scores will generally be fairly close to a scale of 0 to 1).

Alternating least squares

Alternating Least Squares (ALS) is an optimization technique to solve matrix factorization problems; this technique is powerful, achieves good performance, and has proven to be relatively easy to implement in a parallel fashion. Hence, it is well suited for platforms such as Spark. At the time of writing this, it is the only recommendation model implemented in MLlib.

ALS works by iteratively solving a series of least squares regression problems. In each iteration, one of the user- or item-factor matrices is treated as fixed, while the other one is updated using the fixed factor and the rating data. Then, the factor matrix that was solved for is, in turn, treated as fixed, while the other one is updated. This process continues until the model has converged (or for a fixed number of iterations).

Spark's documentation for collaborative filtering contains references to the papers that underlie the ALS algorithms implemented for the explicit and implicit data cases. You can view the documentation at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html.
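To illustrate the alternating structure, the following is a minimal sketch of a single ALS half-step (updating only the user factors) on a tiny dense toy matrix, using the jblas library that ships as a dependency of MLlib. The toy data, the value of k, and all names are our own, and this is an illustration of the idea rather than Spark's actual distributed implementation:

import org.jblas.{DoubleMatrix, Solve}

// Toy 3-user x 3-item ratings matrix; 0.0 marks an unobserved entry
val R = new DoubleMatrix(Array(
  Array(5.0, 0.0, 3.0),
  Array(0.0, 4.0, 2.0),
  Array(3.0, 3.0, 0.0)))

val k = 2                            // number of latent factors
val lambda = 0.1                     // regularization parameter
val U = DoubleMatrix.rand(3, k)      // user-factor matrix (users x k)
val V = DoubleMatrix.rand(3, k)      // item-factor matrix (items x k)

// One half-step: hold V fixed and solve for each user's factor vector via
// the regularized normal equations u_i = (Vi' Vi + lambda I)^-1 Vi' r_i,
// where Vi holds only the factor rows of the items that user i has rated
for (i <- 0 until R.rows) {
  val rated = (0 until R.columns).filter(j => R.get(i, j) > 0)
  val Vi = new DoubleMatrix(rated.map(j => V.getRow(j).toArray).toArray)
  val ri = new DoubleMatrix(rated.map(j => R.get(i, j)).toArray)
  val A = Vi.transpose.mmul(Vi).add(DoubleMatrix.eye(k).mul(lambda))
  val b = Vi.transpose.mmul(ri)
  U.putRow(i, Solve.solve(A, b).transpose)
}

The other half-step would hold U fixed and update each item-factor row of V in exactly the same way; alternating these two steps until convergence is the whole of ALS.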
Extracting the right features from your data

In this section, we will use explicit rating data, without additional user or item metadata or other information related to the user-item interactions. Hence, the features that we need as inputs are simply the user IDs, movie IDs, and the ratings assigned to each user and movie pair.

Extracting features from the MovieLens 100k dataset

Start the Spark shell in the Spark base directory, ensuring that you provide enough memory via the --driver-memory option:

>./bin/spark-shell --driver-memory 4g

In this example, we will use the MovieLens dataset. Use the directory in which you placed the MovieLens 100k dataset as the input path in the following code. First, let's inspect the raw ratings dataset:

val rawData = sc.textFile("/PATH/ml-100k/u.data")
rawData.first()

You will see output similar to these lines of code:

14/03/30 11:42:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/30 11:42:41 WARN LoadSnappy: Snappy native library not loaded
14/03/30 11:42:41 INFO FileInputFormat: Total input paths to process : 1
14/03/30 11:42:41 INFO SparkContext: Starting job: first at <console>:15
14/03/30 11:42:41 INFO DAGScheduler: Got job 0 (first at <console>:15) with 1 output partitions (allowLocal=true)
14/03/30 11:42:41 INFO DAGScheduler: Final stage: Stage 0 (first at <console>:15)
14/03/30 11:42:41 INFO DAGScheduler: Parents of final stage: List()
14/03/30 11:42:41 INFO DAGScheduler: Missing parents: List()
14/03/30 11:42:41 INFO DAGScheduler: Computing the requested partition locally
14/03/30 11:42:41 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 11:42:41 INFO SparkContext: Job finished: first at <console>:15, took 0.030533 s
res0: String = 196  242  3  881250949

Recall that this dataset consists of the user id, movie id, rating, and timestamp fields, separated by a tab ("\t") character. We don't need the time when the rating was made to train our model, so let's simply extract the first three fields:

val rawRatings = rawData.map(_.split("\t").take(3))

We will first split each record on the "\t" character, which gives us an Array[String] array. We will then use Scala's take function to keep only the first 3 elements of the array, which correspond to user id, movie id, and rating, respectively.

We can inspect the first record of our new RDD by calling rawRatings.first(), which collects just the first record of the RDD back to the driver program. This will result in the following output:

14/03/30 12:24:00 INFO SparkContext: Starting job: first at <console>:21
14/03/30 12:24:00 INFO DAGScheduler: Got job 1 (first at <console>:21) with 1 output partitions (allowLocal=true)
14/03/30 12:24:00 INFO DAGScheduler: Final stage: Stage 1 (first at <console>:21)
14/03/30 12:24:00 INFO DAGScheduler: Parents of final stage: List()
14/03/30 12:24:00 INFO DAGScheduler: Missing parents: List()
14/03/30 12:24:00 INFO DAGScheduler: Computing the requested partition locally
14/03/30 12:24:00 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 12:24:00 INFO SparkContext: Job finished: first at <console>:21, took 0.00391 s
res6: Array[String] = Array(196, 242, 3)

We will use Spark's MLlib library to train our model. Let's take a look at what methods are available for us to use and what input is required.
First, import the ALS model from MLlib:

import org.apache.spark.mllib.recommendation.ALS

On the console, we can inspect the available methods on the ALS object using tab completion. Type in ALS. (note the dot) and then press the Tab key. You should see the autocompletion of the methods:

ALS.
asInstanceOf    isInstanceOf    main            toString        train           trainImplicit

The method we want to use is train. If we type ALS.train and hit Enter, we will get an error. However, this error will tell us what the method signature looks like:

ALS.train
<console>:12: error: ambiguous reference to overloaded definition,
both method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int)org.apache.spark.mllib.recommendation.MatrixFactorizationModel
and  method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int, lambda: Double)org.apache.spark.mllib.recommendation.MatrixFactorizationModel
match expected type ?
              ALS.train
                  ^

So, we can see that at a minimum, we need to provide the input arguments ratings, rank, and iterations. The second method also requires an argument called lambda. We'll cover these three shortly, but let's first take a look at the ratings argument. Let's import the Rating class that it references and use a similar approach to find out what an instance of Rating requires, by typing in Rating() and hitting Enter:

import org.apache.spark.mllib.recommendation.Rating
Rating()
<console>:13: error: not enough arguments for method apply: (user: Int, product: Int, rating: Double)org.apache.spark.mllib.recommendation.Rating in object Rating. Unspecified value parameters user, product, rating.
              Rating()
                    ^

As we can see from the preceding output, we need to provide the ALS model with an RDD that consists of Rating records. A Rating class, in turn, is just a wrapper around the user id, movie id (called product here), and the actual rating arguments. We'll create our rating dataset using the map method, transforming the array of IDs and ratings into a Rating object:

val ratings = rawRatings.map { case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }

Notice that we need to use toInt or toDouble to convert the raw rating data (which was extracted as Strings from the text file) to Int or Double numeric inputs. Also, note the use of a case statement that allows us to extract the relevant variable names and use them directly (this saves us from having to use something like val user = ratings(0)).

For more on Scala case statements and pattern matching as used here, take a look at http://docs.scala-lang.org/tutorials/tour/pattern-matching.html.
We now have an RDD[Rating] that we can verify by calling:

ratings.first()
14/03/30 12:32:48 INFO SparkContext: Starting job: first at <console>:24
14/03/30 12:32:48 INFO DAGScheduler: Got job 2 (first at <console>:24) with 1 output partitions (allowLocal=true)
14/03/30 12:32:48 INFO DAGScheduler: Final stage: Stage 2 (first at <console>:24)
14/03/30 12:32:48 INFO DAGScheduler: Parents of final stage: List()
14/03/30 12:32:48 INFO DAGScheduler: Missing parents: List()
14/03/30 12:32:48 INFO DAGScheduler: Computing the requested partition locally
14/03/30 12:32:48 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173
14/03/30 12:32:48 INFO SparkContext: Job finished: first at <console>:24, took 0.003752 s
res8: org.apache.spark.mllib.recommendation.Rating = Rating(196,242,3.0)

Training the recommendation model

Once we have extracted these simple features from our raw data, we are ready to proceed with model training; MLlib takes care of this for us. All we have to do is provide the correctly parsed input RDD we just created as well as our chosen model parameters.

Training a model on the MovieLens 100k dataset

We're now ready to train our model! The other inputs required for our model are as follows:

rank: This refers to the number of factors in our ALS model, that is, the number of hidden features in our low-rank approximation matrices. Generally, the greater the number of factors, the better, but this has a direct impact on memory usage, both for computation and to store models for serving, particularly for large numbers of users or items. Hence, this is often a trade-off in real-world use cases. A rank in the range of 10 to 200 is usually reasonable.

iterations: This refers to the number of iterations to run. While each iteration in ALS is guaranteed to decrease the reconstruction error of the ratings matrix, ALS models will converge to a reasonably good solution after relatively few iterations. So, we don't need to run for too many iterations in most cases (around 10 is often a good default).

lambda: This parameter controls the regularization of our model and, thus, controls overfitting. The higher the value of lambda, the more regularization is applied. What constitutes a sensible value is very dependent on the size, nature, and sparsity of the underlying data, and as with almost all machine learning models, the regularization parameter is something that should be tuned using out-of-sample test data and cross-validation approaches.

We'll use a rank of 50, 10 iterations, and a lambda parameter of 0.01 to illustrate how to train our model:

val model = ALS.train(ratings, 50, 10, 0.01)

This returns a MatrixFactorizationModel object, which contains the user and item factors in the form of an RDD of (id, factor) pairs. These are called userFeatures and productFeatures, respectively. For example:

model.userFeatures
res14: org.apache.spark.rdd.RDD[(Int, Array[Double])] = FlatMappedRDD[659] at flatMap at ALS.scala:231

We can see that the factors are in the form of an Array[Double]. Note that the operations used in MLlib's ALS implementation are lazy transformations, so the actual computation will only be performed once we call some sort of action on the resulting RDDs of the user and item factors.
We can force the computation using a Spark action such as count:

model.userFeatures.count

This will trigger the computation, and we will see quite a bit of output text similar to the following lines of code:

14/03/30 13:10:40 INFO SparkContext: Starting job: count at <console>:26
14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 665 (map at ALS.scala:147)
14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 664 (map at ALS.scala:146)
14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 674 (mapPartitionsWithIndex at ALS.scala:164)
...
14/03/30 13:10:45 INFO SparkContext: Job finished: count at <console>:26, took 5.068255 s
res16: Long = 943

If we call count for the movie factors, we will see the following output:

model.productFeatures.count
14/03/30 13:15:21 INFO SparkContext: Starting job: count at <console>:26
14/03/30 13:15:21 INFO DAGScheduler: Got job 10 (count at <console>:26) with 1 output partitions (allowLocal=false)
14/03/30 13:15:21 INFO DAGScheduler: Final stage: Stage 165 (count at <console>:26)
14/03/30 13:15:21 INFO DAGScheduler: Parents of final stage: List(Stage 169, Stage 166)
14/03/30 13:15:21 INFO DAGScheduler: Missing parents: List()
14/03/30 13:15:21 INFO DAGScheduler: Submitting Stage 165 (FlatMappedRDD[883] at flatMap at ALS.scala:231), which has no missing parents
14/03/30 13:15:21 INFO DAGScheduler: Submitting 1 missing tasks from Stage 165 (FlatMappedRDD[883] at flatMap at ALS.scala:231)
...
14/03/30 13:15:21 INFO SparkContext: Job finished: count at <console>:26, took 0.030044 s
res21: Long = 1682

As expected, we have a factor vector for each user (943 users) and each movie (1682 movies).

Training a model using implicit feedback data

The standard matrix factorization approach in MLlib deals with explicit ratings. To work with implicit data, you can use the trainImplicit method. It is called in a manner similar to the standard train method. There is an additional parameter, alpha, that can be set (and in the same way, the regularization parameter, lambda, should be selected via testing and cross-validation methods). The alpha parameter controls the baseline level of confidence weighting applied. A higher level of alpha tends to make the model more confident about the fact that missing data equates to no preference for the relevant user-item pair.

As an exercise, try to take the existing MovieLens dataset and convert it into an implicit dataset. One possible approach is to convert it to binary feedback (0s and 1s) by applying a threshold on the ratings at some level. Another approach could be to convert the ratings' values into confidence weights (for example, perhaps, low ratings could imply zero weights, or even negative weights, which are supported by MLlib's implementation). Train a model on this dataset and compare the results of the following section with those generated by your implicit model.
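For instance, if we treated the rating values as view counts, an implicit model could be trained along the following lines (the parameter values here, including alpha, are purely illustrative and should be tuned as described above):

// rank = 50, iterations = 10, lambda = 0.01, alpha = 1.0 (illustrative values)
val implicitModel = ALS.trainImplicit(ratings, 50, 10, 0.01, 1.0)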
Using the recommendation model

Now that we have our trained model, we're ready to use it to make predictions. These predictions typically take one of two forms: recommendations for a given user and related or similar items for a given item.

User recommendations

In this case, we would like to generate recommended items for a given user. This usually takes the form of a top-K list, that is, the K items that our model predicts will have the highest probability of the user liking them. This is done by computing the predicted score for each item and ranking the list based on this score. The exact method to perform this computation depends on the model involved. For example, in user-based approaches, the ratings of similar users on items are used to compute the recommendations for a user, while in an item-based approach, the computation is based on the similarity of items the user has rated to the candidate items. In matrix factorization, because we are modeling the ratings matrix directly, the predicted score can be computed as the vector dot product between a user-factor vector and an item-factor vector.

Generating movie recommendations from the MovieLens 100k dataset

As MLlib's recommendation model is based on matrix factorization, we can use the factor matrices computed by our model to compute predicted scores (or ratings) for a user. We will focus on the explicit rating case using the MovieLens data; however, the approach is the same when using the implicit model.

The MatrixFactorizationModel class has a convenient predict method that will compute a predicted score for a given user and item combination:

val predictedRating = model.predict(789, 123)
14/03/30 16:10:10 INFO SparkContext: Starting job: lookup at MatrixFactorizationModel.scala:45
14/03/30 16:10:10 INFO DAGScheduler: Got job 30 (lookup at MatrixFactorizationModel.scala:45) with 1 output partitions (allowLocal=false)
...
14/03/30 16:10:10 INFO SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.023077 s
predictedRating: Double = 3.128545693368485

As we can see, this model predicts a rating of 3.12 for user 789 and movie 123. Note that you might see different results than those shown in this section because the ALS model is initialized randomly, so different runs of the model will lead to different solutions.

The predict method can also take an RDD of (user, item) IDs as the input and will generate predictions for each of these. We can use this method to make predictions for many users and items at the same time.

To generate the top-K recommended items for a user, MatrixFactorizationModel provides a convenience method called recommendProducts. This takes two arguments: user and num, where user is the user ID, and num is the number of items to recommend. It returns the top num items ranked in the order of the predicted score. Here, the scores are computed as the dot product between the user-factor vector and each item-factor vector.

Let's generate the top 10 recommended items for user 789:

val userId = 789
val K = 10
val topKRecs = model.recommendProducts(userId, K)

We now have a set of predicted ratings for each movie for user 789. If we print this out, we can inspect the top 10 recommendations for this user:

println(topKRecs.mkString("\n"))

You should see the following output on your console:

Rating(789,715,5.931851273771102)
Rating(789,12,5.582301095666215)
Rating(789,959,5.516272981542168)
Rating(789,42,5.458065302395629)
Rating(789,584,5.449949837103569)
Rating(789,750,5.348768847643657)
Rating(789,663,5.30832117499004)
Rating(789,134,5.278933936827717)
Rating(789,156,5.250959077906759)
Rating(789,432,5.169863417126231)

Inspecting the recommendations

We can give these recommendations a sense check by taking a quick look at the titles of the movies a user has rated and the recommended movies. First, we need to load the movie data.
We'll collect this data as a Map[Int, String] mapping the movie ID to the title:

val movies = sc.textFile("/PATH/ml-100k/u.item")
val titles = movies.map(line => line.split("\\|").take(2)).map(array => (array(0).toInt, array(1))).collectAsMap()
titles(123)
res68: String = Frighteners, The (1996)

For our user 789, we can find out what movies they have rated, take the 10 movies with the highest rating, and then check the titles. We will do this now by first using the keyBy Spark function to create an RDD of key-value pairs from our ratings RDD, where the key will be the user ID. We will then use the lookup function to return just the ratings for this key (that is, that particular user ID) to the driver:

val moviesForUser = ratings.keyBy(_.user).lookup(789)

Let's see how many movies this user has rated. This will be the size of the moviesForUser collection:

println(moviesForUser.size)

We will see that this user has rated 33 movies.

Next, we will take the 10 movies with the highest ratings by sorting the moviesForUser collection using the rating field of the Rating object. We will then extract the movie title for the relevant product ID attached to the Rating class from our mapping of movie titles and print out the top 10 titles with their ratings:

moviesForUser.sortBy(-_.rating).take(10).map(rating => (titles(rating.product), rating.rating)).foreach(println)

You will see the following output displayed:

(Godfather, The (1972),5.0)
(Trainspotting (1996),5.0)
(Dead Man Walking (1995),5.0)
(Star Wars (1977),5.0)
(Swingers (1996),5.0)
(Leaving Las Vegas (1995),5.0)
(Bound (1996),5.0)
(Fargo (1996),5.0)
(Last Supper, The (1995),5.0)
(Private Parts (1997),4.0)

Now, let's take a look at the top 10 recommendations for this user and see what the titles are using the same approach as the one we used earlier (note that the recommendations are already sorted):

topKRecs.map(rating => (titles(rating.product), rating.rating)).foreach(println)

(To Die For (1995),5.931851273771102)
(Usual Suspects, The (1995),5.582301095666215)
(Dazed and Confused (1993),5.516272981542168)
(Clerks (1994),5.458065302395629)
(Secret Garden, The (1993),5.449949837103569)
(Amistad (1997),5.348768847643657)
(Being There (1979),5.30832117499004)
(Citizen Kane (1941),5.278933936827717)
(Reservoir Dogs (1992),5.250959077906759)
(Fantasia (1940),5.169863417126231)

We leave it to you to decide whether these recommendations make sense.

Item recommendations

Item recommendations are about answering the following question: for a certain item, what are the items most similar to it? Here, the precise definition of similarity is dependent on the model involved. In most cases, similarity is computed by comparing the vector representation of two items using some similarity measure. Common similarity measures include the Pearson correlation and cosine similarity for real-valued vectors and the Jaccard similarity for binary vectors.

Generating similar movies for the MovieLens 100k dataset

The current MatrixFactorizationModel API does not directly support item-to-item similarity computations. Therefore, we will need to create our own code to do this. We will use the cosine similarity metric, and we will use the jblas linear algebra library (a dependency of MLlib) to compute the required vector dot products. This is similar to how the existing predict and recommendProducts methods work, except that we will use cosine similarity as opposed to just the dot product.
We would like to compare the factor vector of our chosen item with each of the other items, using our similarity metric. In order to perform linear algebra computations, we will first need to create a vector object out of the factor vectors, which are in the form of an Array[Double]. The jblas class DoubleMatrix takes an Array[Double] as the constructor argument as follows:

import org.jblas.DoubleMatrix
val aMatrix = new DoubleMatrix(Array(1.0, 2.0, 3.0))
aMatrix: org.jblas.DoubleMatrix = [1.000000; 2.000000; 3.000000]

Note that using jblas, vectors are represented as a one-dimensional DoubleMatrix class, while matrices are a two-dimensional DoubleMatrix class.

We will need a method to compute the cosine similarity between two vectors. Cosine similarity is a measure of the angle between two vectors in an n-dimensional space. It is computed by first calculating the dot product between the vectors and then dividing the result by a denominator, which is the norm (or length) of each vector multiplied together (specifically, the L2-norm is used in cosine similarity). In this way, cosine similarity is a normalized dot product.

The cosine similarity measure takes on values between -1 and 1. A value of 1 implies completely similar, while a value of 0 implies independence (that is, no similarity). This measure is useful because it also captures negative similarity, that is, a value of -1 implies that not only are the vectors not similar, but they are also completely dissimilar.

Let's create our cosineSimilarity function here:

def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {
  vec1.dot(vec2) / (vec1.norm2() * vec2.norm2())
}

Note that we defined a return type for this function of Double. We are not required to do this, since Scala features type inference. However, it can often be useful to document return types for Scala functions.

Let's try it out on one of our item factors for item 567. We will need to collect an item factor from our model; we will do this using the lookup method in a similar way that we did earlier to collect the ratings for a specific user. In the following lines of code, we also use the head function, since lookup returns an array of values, and we only need the first value (in fact, there will only be one value, which is the factor vector for this item). Since this will be an Array[Double], we will then need to create a DoubleMatrix object from it and compute the cosine similarity with itself:

val itemId = 567
val itemFactor = model.productFeatures.lookup(itemId).head
val itemVector = new DoubleMatrix(itemFactor)
cosineSimilarity(itemVector, itemVector)

A similarity metric should measure how close, in some sense, two vectors are to each other.
Here, we can see that our cosine similarity metric tells us that this item vector is identical to itself, which is what we would expect:

res113: Double = 1.0

Now, we are ready to apply our similarity metric to each item:

val sims = model.productFeatures.map{ case (id, factor) =>
  val factorVector = new DoubleMatrix(factor)
  val sim = cosineSimilarity(factorVector, itemVector)
  (id, sim)
}

Next, we can compute the top 10 most similar items by sorting the items by their similarity scores:

// recall we defined K = 10 earlier
val sortedSims = sims.top(K)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity })

In the preceding code snippet, we used Spark's top function, which is an efficient way to compute top-K results in a distributed fashion, instead of using collect to return all the data to the driver and sorting it locally (remember that we could be dealing with millions of users and items in the case of recommendation models).

We need to tell Spark how to sort the (item id, similarity score) pairs in the sims RDD. To do this, we will pass an extra argument to top, which is a Scala Ordering object that tells Spark that it should sort by the value in the key-value pair (that is, sort by similarity).

Finally, we can print the 10 items with the highest computed similarity metric to our given item:

println(sortedSims.take(10).mkString("\n"))

You will see output like the following one:

(567,1.0000000000000002)
(1471,0.6932331537649621)
(670,0.6898690594544726)
(201,0.6897964975027041)
(343,0.6891221044611473)
(563,0.6864214133620066)
(294,0.6812075443259535)
(413,0.6754663844488256)
(184,0.6702643811753909)
(109,0.6594872765176396)

Not surprisingly, we can see that the top-ranked similar item is our chosen item itself. The rest are the other items in our set of items, ranked in order of our similarity metric.

Inspecting the similar items

Let's see what the title of our chosen movie is:

println(titles(itemId))
Wes Craven's New Nightmare (1994)

As we did for user recommendations, we can sense check our item-to-item similarity computations and take a look at the titles of the most similar movies. This time, we will take the top 11 so that we can exclude our given movie. So, we will take the numbers 1 to 11 in the list:

val sortedSims2 = sims.top(K + 1)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity })
sortedSims2.slice(1, 11).map{ case (id, sim) => (titles(id), sim) }.mkString("\n")

You will see the movie titles and scores displayed similar to this output:

(Hideaway (1995),0.6932331537649621)
(Body Snatchers (1993),0.6898690594544726)
(Evil Dead II (1987),0.6897964975027041)
(Alien: Resurrection (1997),0.6891221044611473)
(Stephen King's The Langoliers (1995),0.6864214133620066)
(Liar Liar (1997),0.6812075443259535)
(Tales from the Crypt Presents: Bordello of Blood (1996),0.6754663844488256)
(Army of Darkness (1993),0.6702643811753909)
(Mystery Science Theater 3000: The Movie (1996),0.6594872765176396)
(Scream (1996),0.6538249646863378)

Once again, note that you might see quite different results due to random model initialization.

Now that you have computed similar items using cosine similarity, see if you can do the same with the user-factor vectors to compute similar users for a given user.

Evaluating the performance of recommendation models

How do we know whether the model we have trained is a good model? We need to be able to evaluate its predictive performance in some way. Evaluation metrics are measures of a model's predictive capability or accuracy.
Some are direct measures of how well a model predicts the model's target variable (such as Mean Squared Error), while others are concerned with how well the model performs at predicting things that might not be directly optimized in the model but are often closer to what we care about in the real world (such as Mean Average Precision). Evaluation metrics provide a standardized way of comparing the performance of the same model with different parameter settings and of comparing performance across different models. Using these metrics, we can perform model selection to choose the best-performing model from the set of models we wish to evaluate.

Here, we will show you how to calculate two common evaluation metrics used in recommender systems and collaborative filtering models: Mean Squared Error and Mean Average Precision at K.

Mean Squared Error

The Mean Squared Error (MSE) is a direct measure of the reconstruction error of the user-item rating matrix. It is also the objective function being minimized in certain models, specifically many matrix-factorization techniques, including ALS. As such, it is commonly used in explicit ratings settings. It is defined as the sum of the squared errors divided by the number of observations. The squared error, in turn, is the square of the difference between the predicted rating for a given user-item pair and the actual rating.
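Written out explicitly, with N rated user-item pairs, actual ratings r_ui, and predicted ratings r̂_ui (the notation here is ours, added for clarity):

MSE = (1/N) Σ (r_ui - r̂_ui)², where the sum runs over all rated (u, i) pairs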
We will use our user 789 as an example. Let's take the first rating for this user from the moviesForUser set of Ratings that we previously computed:

val actualRating = moviesForUser.take(1)(0)
actualRating: org.apache.spark.mllib.recommendation.Rating = Rating(789,1012,4.0)

We will see that the rating for this user-item combination is 4. Next, we will compute the model's predicted rating:

val predictedRating = model.predict(789, actualRating.product)
...
14/04/13 13:01:15 INFO SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.025404 s
predictedRating: Double = 4.001005374200248

We will see that the predicted rating is about 4, very close to the actual rating. Finally, we will compute the squared error between the actual rating and the predicted rating:

val squaredError = math.pow(predictedRating - actualRating.rating, 2.0)
squaredError: Double = 1.010777282523947E-6

So, in order to compute the overall MSE for the dataset, we need to compute this squared error for each (user, movie, actual rating, predicted rating) entry, sum them up, and divide them by the number of ratings. We will do this in the following code snippet.

Note that the following code is adapted from the Apache Spark programming guide for ALS at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html.

First, we will extract the user and product IDs from the ratings RDD and make predictions for each user-item pair using model.predict. We will use the user-item pair as the key and the predicted rating as the value:

val usersProducts = ratings.map{ case Rating(user, product, rating) => (user, product) }
val predictions = model.predict(usersProducts).map{
    case Rating(user, product, rating) => ((user, product), rating)
}

Next, we extract the actual ratings and also map the ratings RDD so that the user-item pair is the key and the actual rating is the value. Now that we have two RDDs with the same form of key, we can join them together to create a new RDD with the actual and predicted ratings for each user-item combination:

val ratingsAndPredictions = ratings.map{
  case Rating(user, product, rating) => ((user, product), rating)
}.join(predictions)

Finally, we will compute the MSE by summing up the squared errors using reduce and dividing by the count of the number of records:

val MSE = ratingsAndPredictions.map{
    case ((user, product), (actual, predicted)) => math.pow((actual - predicted), 2)
}.reduce(_ + _) / ratingsAndPredictions.count
println("Mean Squared Error = " + MSE)
Mean Squared Error = 0.08231947642632852

It is common to use the Root Mean Squared Error (RMSE), which is just the square root of the MSE metric. This is somewhat more interpretable, as it is in the same units as the underlying data (that is, the ratings in this case). It is equivalent to the standard deviation of the differences between the predicted and actual ratings. We can compute it simply as follows:

val RMSE = math.sqrt(MSE)
println("Root Mean Squared Error = " + RMSE)
Root Mean Squared Error = 0.2869137090247319

Mean Average Precision at K

Mean Average Precision at K (MAPK) is the mean of the Average Precision at K (APK) metric across all instances in the dataset. APK is a metric commonly used in information retrieval. APK is a measure of the average relevance scores of a set of the top-K documents presented in response to a query. For each query instance, we will compare the set of top-K results with the set of actual relevant documents (that is, a ground truth set of relevant documents for the query).

In the APK metric, the order of the result set matters, in that the APK score would be higher if the result documents are both relevant and presented higher in the results. It is, thus, a good metric for recommender systems in that typically we would compute the top-K recommended items for each user and present these to the user. Of course, we prefer models where the items with the highest predicted scores (which are presented at the top of the list of recommendations) are, in fact, the most relevant items for the user. APK and other ranking-based metrics are also more appropriate evaluation measures for implicit datasets; here, MSE makes less sense.

In order to evaluate our model, we can use APK, where each user is the equivalent of a query, and the set of top-K recommended items is the document result set. The relevant documents (that is, the ground truth) in this case is the set of items that a user interacted with. Hence, APK attempts to measure how good our model is at predicting items that a user will find relevant and choose to interact with.

The code for the following average precision computation is based on https://github.com/benhamner/Metrics. More information on MAPK can be found at https://www.kaggle.com/wiki/MeanAveragePrecision.
Our function to compute the APK is shown here:

def avgPrecisionK(actual: Seq[Int], predicted: Seq[Int], k: Int): Double = {
  val predK = predicted.take(k)
  var score = 0.0
  var numHits = 0.0
  for ((p, i) <- predK.zipWithIndex) {
    if (actual.contains(p)) {
      numHits += 1.0
      score += numHits / (i.toDouble + 1.0)
    }
  }
  if (actual.isEmpty) {
    1.0
  } else {
    score / scala.math.min(actual.size, k).toDouble
  }
}

As you can see, this takes as input a list of actual item IDs that are associated with the user and another list of predicted item IDs, so that we can estimate how relevant our predictions are for the user.

We can compute the APK metric for our example user 789 as follows. First, we will extract the actual movie IDs for the user:

val actualMovies = moviesForUser.map(_.product)
actualMovies: Seq[Int] = ArrayBuffer(1012, 127, 475, 93, 1161, 286, 293, 9, 50, 294, 181, 1, 1008, 508, 284, 1017, 137, 111, 742, 248, 249, 1007, 591, 150, 276, 151, 129, 100, 741, 288, 762, 628, 124)

We will then use the movie recommendations we made previously to compute the APK score using K = 10:

val predictedMovies = topKRecs.map(_.product)
predictedMovies: Array[Int] = Array(27, 497, 633, 827, 602, 849, 401, 584, 1035, 1014)
val apk10 = avgPrecisionK(actualMovies, predictedMovies, 10)
apk10: Double = 0.0

In this case, we can see that our model is not doing a very good job of predicting relevant movies for this user, as the APK score is 0.

In order to compute the APK for each user and average them to compute the overall MAPK, we will need to generate the list of recommendations for each user in our dataset. While this can be fairly intensive on a large scale, we can distribute the computation using our Spark functionality. However, one limitation is that each worker must have the full item-factor matrix available so that it can compute the dot product between the relevant user vector and all item vectors. This can be a problem when the number of items is extremely high, as the item matrix must fit in the memory of one machine.

There is actually no easy way around this limitation. One possible approach is to only compute recommendations for a subset of items from the total item set, using approximate techniques such as Locality Sensitive Hashing (http://en.wikipedia.org/wiki/Locality-sensitive_hashing).

We will now see how to go about this. First, we will collect the item factors and form a DoubleMatrix object from them:

val itemFactors = model.productFeatures.map { case (id, factor) => factor }.collect()
val itemMatrix = new DoubleMatrix(itemFactors)
println(itemMatrix.rows, itemMatrix.columns)
(1682,50)

This gives us a matrix with 1682 rows and 50 columns, as we would expect from 1682 movies with a factor dimension of 50. Next, we will distribute the item matrix as a broadcast variable so that it is available on each worker node:

val imBroadcast = sc.broadcast(itemMatrix)
14/04/13 21:02:01 INFO MemoryStore: ensureFreeSpace(672960) called with curMem=4006896, maxMem=311387750
14/04/13 21:02:01 INFO MemoryStore: Block broadcast_21 stored as values to memory (estimated size 657.2 KB, free 292.5 MB)
imBroadcast: org.apache.spark.broadcast.Broadcast[org.jblas.DoubleMatrix] = Broadcast(21)

Now we are ready to compute the recommendations for each user. We will do this by applying a map function to each user factor, within which we will perform a matrix multiplication between the user-factor vector and the movie-factor matrix.
The result is a vector (of length 1682, that is, the number of movies we have) with the predicted rating for each movie. We will then sort these predictions in descending order by the predicted rating:

val allRecs = model.userFeatures.map{ case (userId, array) =>
  val userVector = new DoubleMatrix(array)
  val scores = imBroadcast.value.mmul(userVector)
  val sortedWithId = scores.data.zipWithIndex.sortBy(-_._1)
  val recommendedIds = sortedWithId.map(_._2 + 1).toSeq
  (userId, recommendedIds)
}
allRecs: org.apache.spark.rdd.RDD[(Int, Seq[Int])] = MappedRDD[269] at map at <console>:29

As we can see, we now have an RDD that contains a list of movie IDs for each user ID. These movie IDs are sorted in order of the estimated rating. Note that we needed to add 1 to the returned movie IDs (as highlighted in the preceding code snippet), as the item-factor matrix is 0-indexed, while our movie IDs start at 1.

We also need the list of movie IDs for each user to pass into our APK function as the actual argument. We already have the ratings RDD ready, so we can extract just the user and movie IDs from it. If we use Spark's groupBy operator, we will get an RDD that contains a list of (userid, movieid) pairs for each user ID (as the user ID is the key on which we perform the groupBy operation):

val userMovies = ratings.map{ case Rating(user, product, rating) => (user, product) }.groupBy(_._1)
userMovies: org.apache.spark.rdd.RDD[(Int, Seq[(Int, Int)])] = MapPartitionsRDD[277] at groupBy at <console>:21

Finally, we can use Spark's join operator to join these two RDDs together on the user ID key. Then, for each user, we have the list of actual and predicted movie IDs that we can pass to our APK function. In a manner similar to how we computed MSE, we will sum each of these APK scores using a reduce action and divide by the number of users (that is, the count of the allRecs RDD):

val K = 10
val MAPK = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>
  val actual = actualWithIds.map(_._2).toSeq
  avgPrecisionK(actual, predicted, K)
}.reduce(_ + _) / allRecs.count
println("Mean Average Precision at K = " + MAPK)
Mean Average Precision at K = 0.030486963254725705

Our model achieves a fairly low MAPK. However, note that typical values for recommendation tasks are usually relatively low, especially if the item set is extremely large.

Try out a few parameter settings for lambda and rank (and alpha, if you are using the implicit version of ALS) and see whether you can find a model that performs better based on the RMSE and MAPK evaluation metrics.

Using MLlib's built-in evaluation functions

While we have computed MSE, RMSE, and MAPK from scratch here, and it is a useful learning exercise to do so, MLlib provides convenience functions to do this for us in the RegressionMetrics and RankingMetrics classes.

RMSE and MSE

First, we will compute the MSE and RMSE metrics using RegressionMetrics. We will instantiate a RegressionMetrics instance by passing in an RDD of key-value pairs that represent the predicted and true values for each data point, as shown in the following code snippet. Here, we will again use the ratingsAndPredictions RDD we computed in our earlier example:

import org.apache.spark.mllib.evaluation.RegressionMetrics
val predictedAndTrue = ratingsAndPredictions.map { case ((user, product), (predicted, actual)) => (predicted, actual) }
val regressionMetrics = new RegressionMetrics(predictedAndTrue)

We can then access various metrics, including MSE and RMSE.
We will print out these metrics here:

println("Mean Squared Error = " + regressionMetrics.meanSquaredError)
println("Root Mean Squared Error = " + regressionMetrics.rootMeanSquaredError)

You will see that the output for MSE and RMSE is exactly the same as the metrics we computed earlier:

Mean Squared Error = 0.08231947642632852
Root Mean Squared Error = 0.2869137090247319

MAP

As we did for MSE and RMSE, we can compute ranking-based evaluation metrics using MLlib's RankingMetrics class. Similarly to our own average precision function, we need to pass in an RDD of key-value pairs, where the key is an array of predicted item IDs for a user, while the value is an array of actual item IDs.

The implementation of the average precision at K function in RankingMetrics is slightly different from ours, so we will get different results. However, the computation of the overall Mean Average Precision (MAP, which does not use a threshold at K) is the same as our function if we select K to be very high (say, at least as high as the number of items in our item set).

First, we will calculate MAP using RankingMetrics:

import org.apache.spark.mllib.evaluation.RankingMetrics
val predictedAndTrueForRanking = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>
  val actual = actualWithIds.map(_._2)
  (predicted.toArray, actual.toArray)
}
val rankingMetrics = new RankingMetrics(predictedAndTrueForRanking)
println("Mean Average Precision = " + rankingMetrics.meanAveragePrecision)

You will see the following output:

Mean Average Precision = 0.07171412913757183

Next, we will use our own function to compute the MAP in exactly the same way as we did previously, except that we set K to a very high value, say 2000:

val MAPK2000 = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>
  val actual = actualWithIds.map(_._2).toSeq
  avgPrecisionK(actual, predicted, 2000)
}.reduce(_ + _) / allRecs.count
println("Mean Average Precision = " + MAPK2000)

You will see that the MAP from our own function is the same as the one computed using RankingMetrics:

Mean Average Precision = 0.07171412913757186

We will not cover cross-validation in this article. However, note that the same techniques for cross-validation can be used to evaluate recommendation models, using performance metrics such as MSE, RMSE, and MAP, which we covered in this section.

Summary

In this article, we used Spark's MLlib library to train a collaborative filtering recommendation model, and you learned how to use this model to make predictions for the items that a given user might have a preference for. We also used our model to find items that are similar or related to a given item. Finally, we explored common metrics to evaluate the predictive capability of our recommendation model.

To learn more about Spark, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Fast Data Processing with Spark - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/fast-data-processing-spark-second-edition)
Spark Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook)
Probability of R?

Packt
23 Feb 2016
17 min read
It's time for us to put descriptive statistics down for the time being. It was fun for a while, but we're no longer content just determining the properties of observed data; now we want to start making deductions about data we haven't observed. This leads us to the realm of inferential statistics. In data analysis, probability is used to quantify the uncertainty of our deductions about unobserved data. In the land of inferential statistics, probability reigns queen. Many regard her as a harsh mistress, but that's just a rumor.

(For more resources related to this topic, see here.)

Basic probability

Probability measures the likeliness that a particular event will occur. When mathematicians (us, for now!) speak of an event, we are referring to a set of potential outcomes of an experiment, or trial, to which we can assign a probability of occurrence.

Probabilities are expressed as a number between 0 and 1 (or as a percentage out of 100). An event with a probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur.

The canonical example of probability at work is a coin flip. In the coin flip event, there are two outcomes: the coin lands on heads, or the coin lands on tails. Pretending that coins never land on their edge (they almost never do), those two outcomes are the only ones possible. The sample space (the set of all possible outcomes), therefore, is {heads, tails}. Since the entire sample space is covered by these two outcomes, they are said to be collectively exhaustive. The sum of the probabilities of collectively exhaustive events is always 1.

In this example, the probability that the coin flip will yield heads or yield tails is 1; it is certain that the coin will land on one of those. With a fair and correctly balanced coin, each of those two outcomes is equally likely. Therefore, we split the probability equally among the outcomes: in the event of a coin flip, the probability of obtaining heads is 0.5, and the probability of tails is 0.5 as well. This is usually denoted as follows:

P(heads) = 0.5

The probability of a coin flip yielding either heads or tails looks like this:

P(heads ∪ tails) = 1

And the probability of a coin flip yielding both heads and tails is denoted as follows:

P(heads ∩ tails) = 0

The two outcomes, in addition to being collectively exhaustive, are also mutually exclusive. This means that they can never co-occur. This is why the probability of heads and tails is 0; it just can't happen.

The next obligatory application of beginner probability theory is in the case of rolling a standard six-sided die. In the event of a die roll, the sample space is {1, 2, 3, 4, 5, 6}. With every roll of the die, we are sampling from this space. In this event, too, each outcome is equally likely, except now we have to divide the probability across six outcomes. In the following equation, we denote the probability of rolling a 1 as P(1):

P(1) = 1/6

Rolling a 1 or rolling a 2 is not collectively exhaustive (we can still roll a 3, 4, 5, or 6), but these two outcomes are mutually exclusive; we can't roll a 1 and a 2 at the same time. If we want to calculate the probability of either one of two mutually exclusive events occurring, we add the probabilities:

P(1 ∪ 2) = P(1) + P(2) = 1/6 + 1/6 = 1/3

While rolling a 1 and rolling a 2 aren't collectively exhaustive, rolling a 1 and not rolling a 1 are. This is usually denoted in this manner:

P(1 ∪ ¬1) = P(1) + P(¬1) = 1

These two events, and all events that are both collectively exhaustive and mutually exclusive, are called complementary events.
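We can check these die-roll probabilities empirically with a quick simulation in R (this snippet is our own illustration; the seed is arbitrary):

# Estimate P(1) and P(1 or 2) by simulating many fair die rolls
set.seed(1)                       # arbitrary seed, for reproducibility
rolls <- sample(1:6, 100000, replace = TRUE)
mean(rolls == 1)                  # close to 1/6 (about 0.167)
mean(rolls %in% c(1, 2))          # close to 1/3 (about 0.333)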
Our last pedagogical example in basic probability theory is using a deck of cards. Our deck has 52 cards—4 for each number from 2 to 10 and 4 each of Jack, Queen, King, and Ace (no Jokers!). Each of these four cards belongs to one suit: Hearts, Clubs, Spades, or Diamonds. There are, therefore, 13 cards in each suit. Further, every Heart and Diamond card is colored red, and every Spade and Club is black. From this, we can deduce the following probabilities for the outcome of randomly choosing a card:

  P(red) = 26/52 = 1/2
  P(Heart) = 13/52 = 1/4
  P(Ace) = 4/52 = 1/13

What, then, is the probability of getting a black card and an Ace? Well, these events are independent, meaning that the probability of either outcome does not affect the probability of the other. In cases like these, the probability of event A and event B is the product of the probability of A and the probability of B. Therefore:

  P(black and Ace) = P(black) P(Ace) = 1/2 * 1/13 = 1/26

Intuitively, this makes sense, because there are two black Aces out of a possible 52.

What about the probability that we choose a red card and a Heart? These two outcomes are not independent, because knowing that the card is red has a bearing on the likelihood that the card is also a Heart. In cases like these, the probability of event A and B is denoted as follows:

  P(A and B) = P(B) P(A|B)

where P(A|B) means the probability of A given B. For example, if we represent A as drawing a Heart and B as drawing a red card, P(A|B) means what's the probability of drawing a Heart if we know that the card we drew was red?. Since a red card is equally likely to be a Heart or a Diamond, P(A|B) is 0.5. Therefore:

  P(red and Heart) = P(red) P(Heart | red) = 1/2 * 1/2 = 1/4

In the preceding equation, we used the form P(B) P(A|B). Had we used the form P(A) P(B|A), we would have got the same answer:

  P(red and Heart) = P(Heart) P(red | Heart) = 1/4 * 1 = 1/4

So, these two forms are equivalent:

  P(A) P(B|A) = P(B) P(A|B)
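For the skeptical, here is a small sketch of our own that verifies these hand calculations by enumerating the full deck in R:

# enumerate all 52 cards as rank/suit combinations
deck <- expand.grid(rank = c(2:10, "Jack", "Queen", "King", "Ace"),
                    suit = c("Hearts", "Clubs", "Spades", "Diamonds"))
deck$color <- ifelse(deck$suit %in% c("Hearts", "Diamonds"), "red", "black")

# P(black and Ace): two black Aces out of 52 cards
mean(deck$color == "black" & deck$rank == "Ace")    # 1/26

# P(red and Heart): all 13 Hearts are red
mean(deck$color == "red" & deck$suit == "Hearts")   # 1/4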
For kicks, let's divide both sides of that equivalence by P(B). That yields the following:

  P(A|B) = P(A) P(B|A) / P(B)

This equation is known as Bayes' Theorem. It is very easy to derive, but its meaning and influence is profound. In fact, it is one of the most famous equations in all of mathematics. Bayes' Theorem has been applied to and proven useful in an enormous number of different disciplines and contexts. It was used to help crack the German Enigma code during World War II, saving the lives of millions. It was also used recently, and famously, by Nate Silver to help correctly predict the voting patterns of 49 states in the 2008 US presidential election.

At its core, Bayes' Theorem tells us how to update the probability of a hypothesis in light of new evidence. Due to this, the following formulation of Bayes' Theorem is often more intuitive:

  P(H|E) = P(H) P(E|H) / P(E)

where H is the hypothesis and E is the evidence.

Let's see an example of Bayes' Theorem in action! There's a hot new recreational drug on the scene called Allighate (or Ally for short). It's named as such because it makes its users go wild and act like an alligator. Since the effect of the drug is so deleterious, very few people actually take it. In fact, only about 1 in every thousand people (0.1%) take it.

Frightened by fear-mongering late-night news, Daisy Girl, Inc., a technology consulting firm, ordered an Allighate testing kit for all of its 200 employees so that it could offer treatment to any employee who has been using it. Not sparing any expense, they bought the best kit on the market; it had 99% sensitivity and 99% specificity. This means that it correctly identified drug users 99 out of 100 times, and only falsely identified a non-user as a user once in every 100 times.

When the results finally came back, two employees tested positive. Though the two denied using the drug, their supervisor, Ronald, was ready to send them off to get help. Just as Ronald was about to send them off, Shanice, a clever employee from the statistics department, came to their defense. Ronald incorrectly assumed that each of the employees who tested positive was using the drug with 99% certainty and, therefore, that the chance both were using it was 98%. Shanice explained that it was actually far more likely that neither employee was using Allighate.

How so? Let's find out by applying Bayes' Theorem! Let's focus on just one employee right now; let H be the hypothesis that one of the employees is using Ally, and E represent the evidence that the employee tested positive. We want to solve the left side of the equation, so let's plug in values:

  P(Ally user | positive test) = P(positive test | Ally user) P(Ally user) / P(positive test)

The first part of the right side of the equation, P(Positive Test | Ally User), is called the likelihood. The probability of testing positive if you use the drug is 99%; this is what tripped up Ronald—and most other people when they first hear of the problem. The second part, P(Ally User), is called the prior. This is our belief that any one person has used the drug before we receive any evidence. Since we know that only 0.1% of people use Ally, this would be a reasonable choice for a prior. Finally, the denominator of the equation is a normalizing constant, which ensures that the final probabilities over all possible hypotheses add up to one. The value we are trying to solve, P(Ally user | Positive Test), is the posterior. It is the probability of our hypothesis updated to reflect new evidence.

In many practical settings, computing the normalizing factor is very difficult. In this case, because there are only two possible hypotheses, being a user or not, the probability of finding the evidence of a positive test is given as follows:

  P(positive test) = P(positive test | user) P(user) + P(positive test | non-user) P(non-user)

which is:

  (.99 * .001) + (.01 * .999) = 0.01098

Plugging that into the denominator, our final answer is calculated as follows:

  P(Ally user | positive test) = (.99 * .001) / 0.01098 ≈ .09

Note that the new evidence, which favored the hypothesis that the employee was using Ally, shifted our prior belief from .001 to .09. Even so, our prior belief about whether an employee was using Ally was so extraordinarily low, it would take some very, very strong evidence indeed to convince us that an employee was an Ally user. Ignoring the prior probability in cases like these is known as the base-rate fallacy. Shanice assuaged Ronald's embarrassment by assuring him that it was a very common mistake.

Now to extend this to two employees: the prior probability of any two employees both using the drug is .001 squared, or about one in a million. Squaring our new posterior instead, we get .0081. The probability that both employees use Ally, even given their positive results, is less than 1%. So, they are exonerated.

Sally is a different story, though. Her friends noticed her behavior had dramatically changed as of late—she snaps at co-workers and has taken to eating pencils. Her concerned cubicle-mate even followed her after work and saw her crawl into a sewer, not to emerge until the next day to go back to work. Even though Sally passed the drug test, we know that it's likely (almost certain) that she uses Ally. Bayes' Theorem gives us a way to quantify that probability! Our prior is the same, but now our likelihood is pretty much as close to 1 as you can get; after all, how many non-Ally users do you think eat pencils and live in sewers?
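Here is Shanice's arithmetic in a few lines of R (a sketch of our own; the variable names are ours):

prior <- 0.001        # P(Ally user)
sens  <- 0.99         # P(positive test | user)
fpos  <- 0.01         # P(positive test | non-user)

# normalizing constant: P(positive test)
evidence <- sens * prior + fpos * (1 - prior)
evidence                              # 0.01098

# posterior: P(user | positive test)
posterior <- (sens * prior) / evidence
posterior                             # ~0.0902

# probability that both employees are users, given both tested positive
posterior ^ 2                         # ~0.0081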
A tale of two interpretations

Though it may seem strange to hear, there is actually a hot philosophical debate about what probability really is. Though there are others, the two primary camps into which virtually all mathematicians fall are the frequentist camp and the Bayesian camp.

The frequentist interpretation describes probability as the relative likelihood of observing an outcome in an experiment when you repeat the experiment multiple times. Flipping a coin is a perfect example; the probability of heads converges to 50% as the number of times it is flipped goes to infinity. The frequentist interpretation of probability is inherently objective; there is a true probability out there in the world, which we are trying to estimate.

The Bayesian interpretation, however, views probability as our degree of belief about something. Because of this, the Bayesian interpretation is subjective; when evidence is scarce, there are sometimes wildly different degrees of belief among different people. Described in this manner, Bayesianism may scare many people off, but it is actually quite intuitive. For example, when a meteorologist describes the probability of rain as 70%, people rarely bat an eyelash. But this number only really makes sense within a Bayesian framework, because exact meteorological conditions are not repeatable, as is required by frequentist probability.

Not simply a heady academic exercise, these two interpretations lead to different methodologies in solving problems in data analysis. Many times, both approaches lead to similar results. Though practitioners may strongly align themselves with one side over the other, good statisticians know that there's a time and a place for both approaches. And though Bayesianism as a valid way of looking at probability is debated, Bayes' Theorem is a fact about probability and is undisputed and non-controversial.

Sampling from distributions

Observing the outcome of trials that involve a random variable, a variable whose value changes due to chance, can be thought of as sampling from a probability distribution—one that describes the likelihood of each member of the sample space occurring. That sentence probably sounds much scarier than it needs to be. Take a die roll for example.

Figure 4.1: Probability distribution of outcomes of a die roll

Each roll of a die is like sampling from a discrete probability distribution for which each outcome in the sample space has a probability of 0.167 or 1/6. This is an example of a uniform distribution, because all the outcomes are uniformly as likely to occur. Further, there are a finite number of outcomes, so this is a discrete uniform distribution (there also exist continuous uniform distributions). Flipping a coin is like sampling from a uniform distribution with only two outcomes. More specifically, the probability distribution that describes coin-flip events is called a Bernoulli distribution—it's a distribution describing only two events.
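We can make sampling from a distribution concrete with base R's sample function (again, a sketch of ours, not from the original text):

set.seed(1)

# 60,000 draws from the discrete uniform distribution of Figure 4.1
rolls <- sample(1:6, size=60000, replace=TRUE)

# each outcome occurs with relative frequency near 1/6 (~0.167)
table(rolls) / length(rolls)

# rbinom with size=1 samples directly from a Bernoulli distribution
rbinom(10, size=1, prob=0.5)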
Parameters

We use probability distributions to describe the behavior of random variables because they make it easy to compute with and give us a lot of information about how a variable behaves. But before we perform computations with probability distributions, we have to specify the parameters of those distributions. These parameters determine exactly what the distribution looks like and how it will behave. For example, the behavior of both a 6-sided die and a 12-sided die is modeled with a uniform distribution. Even though the behavior of both dice is modeled as a uniform distribution, the behavior of each is a little different.

To further specify the behavior of each distribution, we detail its parameter; in the case of the (discrete) uniform distribution, the parameter is called n. A uniform distribution with parameter n has n equally likely outcomes of probability 1/n. The n for a 6-sided die and a 12-sided die is 6 and 12 respectively.

For a Bernoulli distribution, which describes the probability distribution of an event with only two outcomes, the parameter is p. Outcome 1 occurs with probability p, and the other outcome occurs with probability 1 - p, because they are collectively exhaustive. The flip of a fair coin is modeled as a Bernoulli distribution with p = 0.5.

Imagine a six-sided die with one side labeled 1 and the other five sides labeled 2. The outcome of the die roll trials can be described with a Bernoulli distribution, too! This time, p = 1/6 (about 0.167). Therefore, the probability of not rolling a 1 is 5/6.

The binomial distribution

The binomial distribution is a fun one. Like the uniform distribution described in the previous section, it is discrete. When an event has two possible outcomes, success or failure, this distribution describes the number of successes in a certain number of trials. Its parameters are n, the number of trials, and p, the probability of success. Concretely, a binomial distribution with n=1 and p=0.5 describes the behavior of a single coin flip—if we choose to view heads as successes (we could also choose to view tails as successes). A binomial distribution with n=30 and p=0.5 describes the number of heads we should expect in 30 flips.

Figure 4.2: A binomial distribution (n=30, p=0.5)

On average, of course, we would expect to have 15 heads. However, randomness is the name of the game, and seeing more or fewer heads is totally expected. "How can we use the binomial distribution in practice?", you ask. Well, let's look at an application.

Larry the Untrustworthy Knave—who can only be trusted some of the time—gives us a coin that he alleges is fair. We flip it 30 times and observe 10 heads. It turns out that the probability of getting exactly 10 heads on 30 flips is about 2.8%*. We can use R to tell us the probability of getting 10 or fewer heads using the pbinom function:

> pbinom(10, size=30, prob=.5)
[1] 0.04936857

It appears as if the probability of this occurring, with a correctly balanced coin, is roughly 5%. Do you think we should take Larry at his word?

*If you're interested, the way we determined the probability of getting exactly 10 heads is by using the probability formula for Bernoulli trials. The probability of getting k successes in n trials is equal to:

  P(k) = C(n, k) * p^k * (1 - p)^(n - k)

where p is the probability of getting one success and:

  C(n, k) = n! / (k! (n - k)!)
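R can also confirm the 2.8% figure directly; dbinom gives the probability of exactly k successes, and the Bernoulli-trial formula above gives the same value:

# P(exactly 10 heads in 30 flips of a fair coin)
dbinom(10, size=30, prob=0.5)       # ~0.028

# the same value, straight from the formula
choose(30, 10) * 0.5^10 * 0.5^20    # ~0.028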
The normal distribution

Remember when we described the normal distribution and how ubiquitous it is? The behavior of many random variables in real life is very well described by a normal distribution with certain parameters. The two parameters that uniquely specify a normal distribution are µ (mu) and σ (sigma). µ, the mean, describes where the distribution's peak is located, and σ, the standard deviation, describes how wide or narrow the distribution is.

Figure 4.3: Normal distributions with different parameters

The distribution of heights of American females is approximately normally distributed with parameters µ = 65 inches and σ = 3.5 inches.

Figure 4.4: The distribution of heights of American women (µ = 65, σ = 3.5)

With this information, we can easily answer questions about how probable it is to choose, at random, US women of certain heights. We can't really answer the question What is the probability that we choose a person who is exactly 60 inches?, because virtually no one is exactly 60 inches. Instead, we answer questions about how probable it is that a random person is within a certain range of heights.

What is the probability that a randomly chosen woman is 70 inches or taller? If you recall, the probability of a height within a range is the area under the curve, or the integral over that range. In this case, the range we will integrate looks like this:

Figure 4.5: Area under the curve of the height distribution from 70 inches to positive infinity

> f <- function(x){ dnorm(x, mean=65, sd=3.5) }
> integrate(f, 70, Inf)
0.07656373 with absolute error < 2.2e-06

The preceding R code indicates that there is a 7.66% chance of randomly choosing a woman who is 70 inches or taller. Luckily for us, the normal distribution is so popular and well studied that there is a function built into R, so we don't need to do the integration ourselves:

> pnorm(70, mean=65, sd=3.5)
[1] 0.9234363

The pnorm function tells us the probability of choosing a woman who is shorter than 70 inches. If we want to find P(> 70 inches), we can either subtract this value from 1 (which gives us the complement) or use the optional argument lower.tail=FALSE. If you do this, you'll see that the result matches the 7.66% chance we arrived at earlier.
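Both routes to the complement look like this in R:

# P(height > 70): one minus the lower tail...
1 - pnorm(70, mean=65, sd=3.5)                  # 0.07656373

# ...or ask pnorm for the upper tail directly
pnorm(70, mean=65, sd=3.5, lower.tail=FALSE)    # 0.07656373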

Dealing with a Mess

Packt
23 Feb 2016
59 min read
Analyzing data in the real world often requires some know-how outside of the typical introductory data analysis curriculum. For example, rarely do we get a neatly formatted, tidy dataset with no errors, junk, or missing values. Rather, we often get messy, unwieldy datasets.

What makes a dataset messy? Different people in different roles have different ideas about what constitutes messiness. Some regard any data that invalidates the assumptions of the parametric model as messy. Others see messiness in datasets with a grievously imbalanced number of observations in each category for a categorical variable. Some examples of things that I would consider messy are:

Many missing values (NAs)
Misspelled names in categorical variables
Inconsistent data coding
Numbers in the same column being in different units
Mis-recorded data and data entry mistakes
Extreme outliers

Since there are an infinite number of ways that data can be messy, there's simply no chance of enumerating every example and its respective solution. Instead, we are going to talk about two tools that help combat the bulk of the messiness issues that I cited just now.

(For more resources related to this topic, see here.)

Analysis with missing data

Missing data is another one of those topics that are largely ignored in most introductory texts. Probably, part of the reason why this is the case is that many myths about analysis with missing data still abound. Additionally, some of the research into cutting-edge techniques is still relatively new. A more legitimate reason for its absence in introductory texts is that most of the more principled methodologies are fairly complicated—mathematically speaking. Nevertheless, the incredible ubiquity of problems related to missing data in real-life data analysis necessitates some broaching of the subject. This section serves as a gentle introduction to the subject and to one of the more effective techniques for dealing with it.

A common refrain on the subject is something along the lines of the best way to deal with missing data is not to have any. It's true that missing data is a messy subject, and there are a lot of ways to handle it wrong. It's important not to take this advice to the extreme, though. In order to bypass missing data problems, some have disallowed survey participants, for example, to go on without answering all the questions on a form. You can coerce the participants in a longitudinal study to not drop out, too. Don't do this. Not only is it unethical, it is also prodigiously counter-productive; there are treatments for missing data, but there are no treatments for bad data.

The standard treatment for the problem of missing data is to replace the missing data with non-missing values. This process is called imputation. In most cases, the goal of imputation is not to recreate the lost completed dataset but to allow valid statistical estimates or inferences to be drawn from incomplete data. Because of this, the effectiveness of different imputation techniques can't be evaluated by their ability to most accurately recreate the data from a simulated missing dataset; they must, instead, be judged by their ability to support the same statistical inferences as would be drawn from the analysis on the complete data. In this way, filling in the missing data is only a step towards the real goal—the analysis. The imputed dataset is rarely considered the final goal of imputation.

There are many different ways that missing data is dealt with in practice—some are good, some are not so good.
Some are okay under certain circumstances, but not okay in others. Some involve missing data deletion, while some involve imputation. We will briefly touch on some of the more common methods. The ultimate goal of this article, though, is to get you started on what is often described as the gold standard of imputation techniques: multiple imputation.

Visualizing missing data

In order to demonstrate visualizing the patterns of missing data, we first have to create some missing data. This will also be the same dataset that we perform analysis on later in the article. To showcase how to use multiple imputation in a semi-realistic scenario, we are going to create a version of the mtcars dataset with a few missing values.

Okay, let's set the seed (for deterministic randomness), and create a variable to hold our new marred dataset:

set.seed(2)
miss_mtcars <- mtcars

First, we are going to create seven missing values in drat (about 20 percent), five missing values in the mpg column (about 15 percent), five missing values in the cyl column, three missing values in wt (about 10 percent), and three missing values in vs:

some_rows <- sample(1:nrow(miss_mtcars), 7)
miss_mtcars$drat[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$mpg[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$cyl[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$wt[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$vs[some_rows] <- NA

Now, we are going to create four missing values in qsec, but only for automatic cars:

only_automatic <- which(miss_mtcars$am==0)
some_rows <- sample(only_automatic, 4)
miss_mtcars$qsec[some_rows] <- NA

Now, let's take a look at the dataset:

> miss_mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85    NA 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110   NA 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175   NA 3.440 17.02  0  0    3    2
Valiant             18.1  NA 225.0 105   NA 3.460    NA  1  0    3    1

Great, now let's visualize the missingness. The first way we are going to visualize the pattern of missing data is by using the md.pattern function from the mice package (which is also the package that we are ultimately going to use for imputing our missing data). If you don't have the package already, install it.

> library(mice)
> md.pattern(miss_mtcars)
   disp hp am gear carb wt vs qsec mpg cyl drat
12    1  1  1    1    1  1  1    1   1   1    1  0
 4    1  1  1    1    1  1  1    1   0   1    1  1
 2    1  1  1    1    1  1  1    1   1   0    1  1
 3    1  1  1    1    1  1  1    1   1   1    0  1
 3    1  1  1    1    1  0  1    1   1   1    1  1
 2    1  1  1    1    1  1  1    0   1   1    1  1
 1    1  1  1    1    1  1  1    1   0   1    0  2
 1    1  1  1    1    1  1  1    0   1   0    1  2
 1    1  1  1    1    1  1  0    1   1   0    1  2
 2    1  1  1    1    1  1  0    1   1   1    0  2
 1    1  1  1    1    1  1  1    0   1   0    0  3
      0  0  0    0    0  3  3    4   5   5    7 27

A row-wise missing data pattern refers to the columns that are missing for each row. This function aggregates and counts the number of rows with the same missing data pattern. This function outputs a binary (0 and 1) matrix.
Cells with a 1 represent non-missing data; 0s represent missing data. Since the rows are sorted in increasing order of missingness, the first row always refers to the missing data pattern containing the least amount of missing data. In this case, the missing data pattern with the least amount of missing data is the pattern containing no missing data at all. Because of this, the first row has all 1s in the columns that are named after the columns in the miss_mtcars dataset. The left-most column is a count of the number of rows that display the missing data pattern, and the right-most column is a count of the number of missing data points in that pattern. The last row contains a count of the number of missing data points in each column.

As you can see, 12 of the rows contain no missing data. The next most common missing data pattern is the one with just mpg missing; four rows fit this pattern. There are only six rows that contain more than one missing value. Only one of these rows contains more than two missing values (as shown in the second-to-last row). As far as datasets with missing data go, this particular one doesn't contain much. It is not uncommon for some datasets to have more than 30 percent of their data missing. This dataset doesn't even hit 3 percent.

Now let's visualize the missing data pattern graphically using the VIM package. You will probably have to install this, too.

library(VIM)
aggr(miss_mtcars, numbers=TRUE)

Figure 11.1: The output of VIM's visual aggregation of missing data. The left plot shows the proportion of missing values for each column. The right plot depicts the prevalence of row-wise missing data patterns, like md.pattern

At a glance, this representation shows us, effortlessly, that the drat column accounts for the highest proportion of missingness, column-wise, followed by mpg, cyl, qsec, vs, and wt. The graphic on the right shows us information similar to that of the output of md.pattern. This representation, though, makes it easier to tell whether there is some systematic pattern of missingness. The blue cells represent non-missing data, and the red cells represent missing data. The numbers on the right of the graphic represent the proportion of rows displaying that missing data pattern. 37.5 percent of the rows contain no missing data whatsoever.
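For a quick numeric cross-check of both visualizations, base R can tally the missing cells directly:

# missing values per column; matches the bottom row of md.pattern
colSums(is.na(miss_mtcars))

# total number of missing cells in the dataset
sum(is.na(miss_mtcars))    # 27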
Types of missing data

The VIM package allowed us to visualize the missing data patterns. A related term, the missing data mechanism, describes the process that determines each data point's likelihood of being missing. There are three main categories of missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Discrimination based on the missing data mechanism is crucial, since it informs us about the options for handling the missingness.

The first mechanism, MCAR, occurs when the data's missingness is unrelated to the data. This would occur, for example, if rows were deleted from a database at random, or if a gust of wind took a random sample of a surveyor's survey forms off into the horizon. The mechanism that governs the missingness of drat, mpg, cyl, wt, and vs is MCAR, because we randomly selected elements to go missing. This mechanism, while being the easiest to work with, is seldom tenable in practice.

MNAR, on the other hand, occurs when a variable's missingness is related to the variable itself. For example, suppose the scale that weighed each car had a capacity of only 3,700 pounds, and because of this, the eight cars that weighed more than that were recorded as NA. This is a classic example of the MNAR mechanism—it is the weight of the observation itself that is the cause of its being missing. Another example would be if, during the course of a trial of an anti-depressant drug, participants who were not being helped by the drug became too depressed to continue with the trial. At the end of the trial, when all the participants' levels of depression are assessed and recorded, there would be missing values for participants whose reason for absence is related to their level of depression.

The last mechanism, missing at random, is somewhat unfortunately named. Contrary to what it may sound like, it means there is a systematic relationship between the missingness of an outcome variable and other observed variables, but not the outcome variable itself. This is probably best explained by the following example.

Suppose that in a survey, there is a question about income level that, in its wording, uses a particular colloquialism. Due to this, a large number of the participants in the survey whose native language is not English couldn't interpret the question, and left it blank. If the survey collected just the name, gender, and income, the missing data mechanism of the question on income would be MNAR. If, however, the questionnaire included a question that asked if the participant spoke English as a first language, then the mechanism would be MAR. The inclusion of the Is English your first language? variable means that the missingness of the income question can be completely accounted for. The reason for the moniker missing at random is that when you control for the relationship between the missing variable and the observed variable(s) it is related to (for example, What is your income? and Is English your first language? respectively), the data are missing at random.

As another example, there is a systematic relationship between the am and qsec variables in our simulated missing dataset: qsecs were missing only for automatic cars. But within the group of automatic cars, the qsec variable is missing at random. Therefore, qsec's mechanism is MAR; controlling for transmission type, qsec is missing at random. Bear in mind, though, that if we removed am from our simulated dataset, qsec would become MNAR.

As mentioned earlier, MCAR is the easiest type to work with because of the complete absence of a systematic relationship in the data's missingness. Many unsophisticated techniques for handling missing data rest on the assumption that the data are MCAR. On the other hand, MNAR data is the hardest to work with, since the properties of the missing data that caused its missingness have to be understood quantifiably and included in the imputation model. Though multiple imputation can handle MNAR mechanisms, the procedures involved become more complicated and are far beyond the scope of this text. The MCAR and MAR mechanisms allow us not to worry about the properties and parameters of the missing data. For this reason, you may sometimes find MCAR or MAR missingness referred to as ignorable missingness.

MAR data is not as hard to work with as MNAR data, but it is not as forgiving as MCAR. For this reason, though our simulated dataset contains MCAR and MAR components, the mechanism that describes the whole data is MAR—just one MAR mechanism makes the whole dataset MAR.
So which one is it?

You may have noticed that the place of a particular dataset in the missing data mechanism taxonomy is dependent on the variables that it includes. For example, we know that the mechanism behind qsec is MAR, but if the dataset did not include am, it would be MNAR. Since we are the ones who created the data, we know the procedure that resulted in qsec's missing values. If we weren't the ones creating the data—as happens in the real world—and the dataset did not contain the am column, we would just see a bunch of arbitrarily missing qsec values. This might lead us to believe that the data is MCAR. It isn't, though; just because the variable to which another variable's missingness is systematically related is non-observed, doesn't mean that it doesn't exist.

This raises a critical question: can we ever be sure that our data is not MNAR? The unfortunate answer is no. Since the data that we need to prove or disprove MNAR is ipso facto missing, the MNAR assumption can never be conclusively disconfirmed. It's our job, as critically thinking data analysts, to ask whether there is likely an MNAR mechanism or not.

Unsophisticated methods for dealing with missing data

Here we are going to look at various methods for dealing with missing data.

Complete case analysis

This method, also called list-wise deletion, is a straightforward procedure that simply removes all rows or elements containing missing values prior to the analysis. In the univariate case—taking the mean of the drat column, for example—all elements of drat that are missing would simply be removed:

> mean(miss_mtcars$drat)
[1] NA
> mean(miss_mtcars$drat, na.rm=TRUE)
[1] 3.63

In a multivariate procedure—for example, linear regression predicting mpg from am, wt, and qsec—all rows that have a missing value in any of the columns included in the regression are removed:

listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=miss_mtcars,
                     na.action = na.omit)
## OR
# complete.cases returns a boolean vector
comp <- complete.cases(cbind(miss_mtcars$mpg,
                             miss_mtcars$am,
                             miss_mtcars$wt,
                             miss_mtcars$qsec))
comp_mtcars <- mtcars[comp,]
listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=comp_mtcars)

Under an MCAR mechanism, a complete case analysis produces unbiased estimates of the mean, variance/standard deviation, and regression coefficients, which means that the estimates don't systematically differ from the true values on average, since the included data elements are just a random sampling of the recorded data elements. However, inference-wise, since we lost a number of our samples, we are going to lose statistical power and generate standard errors and confidence intervals that are bigger than they need to be. Additionally, in the multivariate regression case, note that our sample size depends on the variables that we include in the regression; the more variables, the more missing data we open ourselves up to, and the more rows we are liable to lose. This makes comparing results across different models slightly hairy.
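To see the sample-size cost concretely, here is a quick check of our own against the counts we saw in md.pattern:

# only rows with no missing values at all survive full list-wise deletion
sum(complete.cases(miss_mtcars))    # 12 of the 32 rows

# the univariate mean of drat, by contrast, still gets to use 25 values
sum(!is.na(miss_mtcars$drat))       # 25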
Under an MAR or MNAR mechanism, list-wise deletion will produce biased estimates of the mean and variance. For example, if am were highly correlated with qsec, the fact that we are missing qsec only for automatic cars would significantly shift our estimates of the mean of qsec. Surprisingly, though, list-wise deletion produces unbiased estimates of the regression coefficients, even if the data is MNAR or MAR, as long as the relevant variables are included in the regression equations. For this reason, if there are relatively few missing values in a dataset that is to be used in regression analysis, list-wise deletion could be an acceptable alternative to more principled approaches.

Pairwise deletion

Also called available-case analysis, this technique is (somewhat unfortunately) common when estimating covariance or correlation matrices. For each pair of variables, it only uses the cases that are non-missing for both. This often means that the number of elements used will vary from cell to cell of the covariance/correlation matrices. This can result in absurd correlation coefficients that are above 1, making the resulting matrices largely useless to methodologies that depend on them.

Mean substitution

Mean substitution, as the name suggests, replaces all the missing values with the mean of the available cases. For example:

mean_sub <- miss_mtcars
mean_sub$qsec[is.na(mean_sub$qsec)] <- mean(mean_sub$qsec,
                                            na.rm=TRUE)
# etc...

Although this seemingly solves the problem of the loss of sample size in the list-wise deletion procedure, mean substitution has some very unsavory properties of its own. While mean substitution produces unbiased estimates of the mean of a column, it produces biased estimates of the variance, since it removes the natural variability that would have occurred in the missing values had they not been missing. The variance estimates from mean substitution will therefore be, systematically, too small. Additionally, it's not hard to see that mean substitution will result in biased estimates if the data are MAR or MNAR. For these reasons, mean substitution is not recommended under virtually any circumstance.

Hot deck imputation

Hot deck imputation is an intuitively elegant approach that fills in the missing data with donor values from another row in the dataset. In the least sophisticated formulation, a random non-missing element from the same dataset is shared with a missing value. In more sophisticated hot deck approaches, the donor value comes from a row that is similar to the row with the missing data. The multiple imputation technique that we will be using in a later section of this article borrows this idea for one of its imputation methods.

The term hot deck refers to the old practice of storing data in decks of punch cards. The deck that holds the donor value would be hot because it is the one that is currently being processed.

Regression imputation

This approach attempts to fill in the missing data in a column using regression to predict likely values of the missing elements using other columns as predictors. For example, using regression imputation on the drat column would employ a linear regression predicting drat from all the other columns in miss_mtcars. The process would be repeated for all columns containing missing data, until the dataset is complete.

This procedure is intuitively appealing, because it integrates knowledge of the other variables and patterns of the dataset. This creates a set of more informed imputations. As a result, it produces unbiased estimates of the mean and regression coefficients under MCAR and MAR (so long as the relevant variables are included in the regression model). However, this approach is not without its problems.
The predicted values of the missing data lie right on the regression line but, as we know, very few data points lie right on the regression line—there is usually a normally distributed residual (error) term. Due to this, regression imputation underestimates the variability of the missing values. As a result, it will produce biased estimates of the variance and covariance between different columns. However, we're on the right track.

Stochastic regression imputation

As far as unsophisticated approaches go, stochastic regression is fairly evolved. This approach solves some of the issues of regression imputation, and produces unbiased estimates of the mean, variance, covariance, and regression coefficients under MCAR and MAR. It does this by adding a random (stochastic) value to the predictions of regression imputation. This random added value is sampled from the residual (error) distribution of the linear regression—which, if you remember, is assumed to be a normal distribution. This restores the variability in the missing values (that we lost in regression imputation) that those values would have had if they weren't missing.

However, as far as subsequent analysis and inference on the imputed dataset go, stochastic regression results in standard errors and confidence intervals that are smaller than they should be. Since it produces only one imputed dataset, it does not capture the extent to which we are uncertain about the residuals and our coefficient estimates. Nevertheless, stochastic regression forms the basis of still more sophisticated imputation methods.
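Before moving on, it may help to see the univariate stochastic-regression idea in bare R. This is a sketch of our own for the drat column, using three of the fully observed columns as predictors:

obs <- !is.na(miss_mtcars$drat)

# step 1: regression imputation -- predict the missing drats
fit   <- lm(drat ~ disp + hp + am, data=miss_mtcars[obs, ])
preds <- predict(fit, newdata=miss_mtcars[!obs, ])

# step 2: add noise drawn from the estimated residual distribution
stoch_imputations <- preds + rnorm(sum(!obs), mean=0,
                                   sd=summary(fit)$sigma)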
There are two sophisticated, well-founded, and recommended methods of dealing with missing data. One is called the Expectation Maximization (EM) method, which we do not cover here. The second is called multiple imputation, and because it is widely considered the most effective method, it is the one we explore in this article.

Multiple imputation

The big idea behind multiple imputation is that instead of generating one set of imputed data with our best estimation of the missing data, we generate multiple versions of the imputed data where the imputed values are drawn from a distribution. The uncertainty about what the imputed values should be is reflected in the variation between the multiply imputed datasets.

We perform our intended analysis separately on each of these m completed datasets. These analyses will then yield m different parameter estimates (like regression coefficients, and so on). The critical point is that these parameter estimates are different solely due to the variability in the imputed missing values, and hence, our uncertainty about what the imputed values should be. This is how multiple imputation integrates uncertainty, and outperforms more limited imputation methods that produce one imputed dataset, conferring an unwarranted sense of confidence in the filled-in data of our analysis. The following diagram illustrates this idea:

Figure 11.2: Multiple imputation in a nutshell

So how does mice come up with the imputed values? Let's focus on the univariate case—where only one column contains missing data and we use all the other (completed) columns to impute the missing values—before generalizing to the multivariate case. mice actually has a few different imputation methods up its sleeve, each best suited for a particular use case. mice will often choose sensible defaults based on the data type (continuous, binary, non-binary categorical, and so on).

The most important method is what the package calls the norm method. This method is very much like stochastic regression. Each of the m imputations is created by adding a normal "noise" term to the output of a linear regression predicting the missing variable. What makes this slightly different from just stochastic regression repeated m times is that the norm method also integrates uncertainty about the regression coefficients used in the predictive linear model.

Recall that the regression coefficients in a linear regression are just estimates of the population coefficients from a random sample (that's why each regression coefficient has a standard error and confidence interval). Another sample from the population would have yielded slightly different coefficient estimates. If, through all our imputations, we just added a normal residual term from a linear regression equation with the same coefficients, we would be systematically understating our uncertainty regarding what the imputed values should be.

To combat this, in multiple imputation, each imputation of the data contains two steps. The first step performs stochastic linear regression imputation using coefficients for each predictor estimated from the data. The second step chooses slightly different estimates of these regression coefficients, and proceeds into the next imputation. The first step of the next imputation uses the slightly different coefficient estimates to perform stochastic linear regression imputation again. After that, in the second step of the second iteration, still other coefficient estimates are generated to be used in the third imputation. This cycle goes on until we have m multiply imputed datasets.

How do we choose these different coefficient estimates at the second step of each imputation? Traditionally, the approach is Bayesian in nature; these new coefficients are drawn from each of the coefficients' posterior distributions, which describe credible values of the estimates using the observed data and uninformative priors. This is the approach that norm uses. There is an alternate method that chooses these new coefficient estimates from a sampling distribution that is created by taking repeated samples of the data (with replacement) and estimating the regression coefficients of each of these samples. mice calls this method norm.boot.

The multivariate case is a little more hairy, since the imputation for one column depends on the other columns, which may contain missing data of their own. For this reason, we make several passes over all the columns that need imputing, until the imputation of all missing data in a particular column is informed by informed estimates of the missing data in the predictor columns. These passes over all the columns are called iterations.

So that you really understand how this iteration works, let's say we are performing multiple imputation on a subset of miss_mtcars containing only mpg, wt and drat. First, all the missing data in all the columns are set to a placeholder value like the mean or a randomly sampled non-missing value from the column. Then, we visit mpg, where the placeholder values are turned back into missing values. These missing values are predicted using the two-step procedure described in the univariate case. Then we move on to wt; the placeholder values are turned back into missing values, whose new values are imputed with the two-step univariate procedure using mpg and drat as predictors. Then this is repeated with drat. This is one iteration.
On the next iteration, it is not the placeholder values that get turned back into random values and imputed but the imputed values from the previous iteration. As this repeats, we shift away from the starting values and the imputed values begin to stabilize. This usually happens within just a few iterations. The dataset at the completion of the last iteration is the first multiply imputed dataset. Each of the m imputations starts the iteration process all over again.

The default in mice is five iterations. Of course, you can increase this number if you have reason to believe that you need to. We'll discuss how to tell if this is necessary later in the section.

Methods of imputation

The method of imputation that we described for the univariate case, norm, works best for imputed values that follow an unconstrained normal distribution—but it could lead to some nonsensical imputations otherwise. For example, since the weights in wt are so close to 0 (because wt is in units of a thousand pounds), it is possible for the norm method to impute a negative weight. Though this will no doubt balance out over the other m-1 multiply imputed datasets, we can combat this situation by using another method of imputation called predictive mean matching.

Predictive mean matching (mice calls this pmm) works a lot like norm. The difference is that the norm imputations are then used to find the d closest values to the imputed value among the non-missing data in the column. Then, one of these d values is chosen as the final imputed value—d=3 is the default in mice.

This method has a few great properties. For one, the possibility of imputing a negative value for wt is categorically off the table; the imputed value would have to be chosen from the set {1.513, 1.615, 1.835}, since these are the three lowest weights. More generally, any natural constraint in the data (lower or upper bounds, integer count data, numbers rounded to the nearest one-half, and so on) is respected with predictive mean matching, because the imputed values appear among the actual non-missing observed values. In this way, predictive mean matching is like hot deck imputation. Predictive mean matching is the default imputation method in mice for numerical data, though it may be inferior to norm for small datasets and/or datasets with a lot of missing values.

Many of the other imputation methods in mice are specially suited to one particular data type. For example, binary categorical variables use logreg by default; this is like norm but uses logistic regression to impute a binary outcome. Similarly, non-binary categorical data uses multinomial regression—mice calls this method polyreg.

Multiple imputation in practice

There are a few steps to follow and decisions to make when using this powerful imputation technique:

Are the data MAR? And be honest! If the mechanism is likely not MAR, then more complicated measures have to be taken.

Are there any derived terms, redundant variables, or irrelevant variables in the dataset? Any of these types of variables will interfere with the regression process. Irrelevant variables—like unique IDs—will not have any predictive power. Derived terms or redundant variables—like having a column for weight in pounds and grams, or a column for area in addition to a length and width column—will similarly interfere with the regression step.

Convert all categorical variables to factors, otherwise mice will not be able to tell that the variable is supposed to be categorical.

Choose the number of iterations and m: By default, these are both five.
Using five iterations is usually okay—and we'll be able to tell if we need more. Five imputations are usually okay, too, but we can achieve more statistical power from more imputed datasets. I suggest setting m to 20, unless the processing power and time can't be spared.

Choose an imputation method for each variable: You can stick with the defaults as long as you are aware of what they are and think they're the right fit.

Choose the predictors: Let mice use all the available columns as predictors as long as derived terms and redundant/irrelevant columns are removed. Not only does using more predictors result in reduced bias, but it also increases the likelihood that the data is MAR.

Perform the imputations.

Audit the imputations.

Perform analysis with the imputations.

Pool the results of the analyses.

Before we get down to it, let's call the mice function on our data frame with missing data, and use its default arguments, just to see what we shouldn't do and why:

# we are going to set the seed and printFlag to FALSE, but
# everything else will use the default arguments
imp <- mice(miss_mtcars, seed=3, printFlag=FALSE)
print(imp)
------------------------------
Multiply imputed data set
Call: mice(data = miss_mtcars, printFlag = FALSE, seed = 3)
Number of multiple imputations:  5
Missing cells per column:
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
   5    5    0    0    7    3    4    3    0    0    0
Imputation methods:
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
"pmm" "pmm"    ""    "" "pmm" "pmm" "pmm" "pmm"    ""    ""    ""
VisitSequence:
 mpg  cyl drat   wt qsec   vs
   1    2    5    6    7    8
PredictorMatrix:
     mpg cyl disp hp drat wt qsec vs am gear carb
mpg    0   1    1  1    1  1    1  1  1    1    1
cyl    1   0    1  1    1  1    1  1  1    1    1
disp   0   0    0  0    0  0    0  0  0    0    0
 ...
Random generator seed value:  3

The first thing we notice (on line four of the output) is that mice chose to create five multiply imputed datasets, by default. As we discussed, this isn't a bad default, but more imputations can only improve our statistical power (if only marginally); when we impute this dataset in earnest, we will use m=20.

The second thing we notice (on lines 8-10 of the output) is that mice used predictive mean matching as the imputation method for all the columns with missing data. If you recall, predictive mean matching is the default imputation method for numeric columns. However, vs and cyl are binary categorical and non-binary categorical variables, respectively. Because we didn't convert them to factors, mice thinks these are just regular numeric columns. We'll have to fix this.

The last thing we should notice here is the predictor matrix (starting on line 14). Each row and column of the predictor matrix refers to a column in the dataset to impute. If a cell contains a 1, it means that the variable referred to in the column is used as a predictor for the variable in the row. The first row indicates that all available attributes are used to help predict mpg, with the exception of mpg itself. All the values on the diagonal are 0, because mice won't use an attribute to predict itself. Note that the disp, hp, am, gear, and carb rows all contain 0s—this is because these variables are complete, and don't need to use any predictors.

Since we thought carefully about whether there were any attributes that should be removed before we perform the imputation, we can use mice's default predictor matrix for this dataset. If there were any non-predictive attributes (like unique identifiers, redundant variables, and so on), we would have either had to remove them (the easiest option), or instruct mice not to use them as predictors (harder).
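The harder option is only a few lines; we edit the predictor matrix and hand it back to mice. In this sketch of ours, carb stands in for a hypothetical non-predictive column:

# stop every imputation model from using carb as a predictor
pred <- imp$predictorMatrix
pred[, "carb"] <- 0

# re-run the imputation with the modified predictor matrix
imp2 <- mice(miss_mtcars, predictorMatrix=pred,
             seed=3, printFlag=FALSE)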
Let's now correct the issues that we've discussed:

# convert categorical variables into factors
miss_mtcars$vs <- factor(miss_mtcars$vs)
miss_mtcars$cyl <- factor(miss_mtcars$cyl)

imp <- mice(miss_mtcars, m=20, seed=3, printFlag=FALSE)
imp$method
-------------------------------------
      mpg       cyl      disp        hp      drat
    "pmm" "polyreg"        ""        ""     "pmm"
       wt      qsec        vs        am      gear
    "pmm"     "pmm"  "logreg"        ""        ""
     carb
       ""

Now mice has corrected the imputation methods of cyl and vs to their proper defaults. In truth, cyl is a kind of discrete numeric variable called an ordinal variable, which means that yet another imputation method may be optimal for that attribute but, for the sake of simplicity, we'll treat it as a categorical variable.

Before we get to use the imputations in an analysis, we have to check the output. The first thing we need to check is the convergence of the iterations. Recall that for imputing data with missing values in multiple columns, multiple imputation requires iterating over all these columns a few times. At each iteration, mice produces imputations—and samples new parameter estimates from the parameters' posterior distributions—for all columns that need to be imputed. The final imputations, for each multiply imputed dataset m, are the imputed values from the final iteration.

In contrast to MCMC methods, convergence in mice is much faster; it usually occurs in just a few iterations. However, visually checking for convergence is highly recommended. We even check for it similarly; when we call the plot function on the variable that we assign the mice output to, it displays trace plots of the mean and standard deviation of all the variables involved in the imputations. Each line in each plot is one of the m imputations.

plot(imp)

Figure 11.3: A subset of the trace plots produced by plotting an object returned by a mice imputation

As you can see from the preceding trace plots of imp, there are no clear trends and the variables are all overlapping from one iteration to the next. Put another way, the variance within a chain (there are m chains) should be about equal to the variance between the chains. This indicates that convergence was achieved.

If convergence was not achieved, you can increase the number of iterations that mice employs by explicitly specifying the maxit parameter to the mice function. To see an example of non-convergence, take a look at Figures 7 and 8 in the paper that describes the mice package, written by the package authors themselves. It is available at http://www.jstatsoft.org/article/view/v045i03
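A call like the following would do it; the maxit value of 10 here is an arbitrary choice for illustration:

# re-run the imputations with more iterations, then recheck the trace plots
imp_more <- mice(miss_mtcars, m=20, maxit=10, seed=3, printFlag=FALSE)
plot(imp_more)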
The next step is to make sure the imputed values are reasonable. In general, whenever we quickly review the results of something to see if they make sense, it is called a sanity test or sanity check. With the following line, we're going to display the imputed values for the five missing mpgs for the first six imputations:

imp$imp$mpg[,1:6]
------------------------------------
                      1    2    3    4    5    6
Duster 360         19.2 16.4 17.3 15.5 15.0 19.2
Cadillac Fleetwood 15.2 13.3 15.0 13.3 10.4 17.3
Chrysler Imperial  10.4 15.0 15.0 16.4 10.4 10.4
Porsche 914-2      27.3 22.8 21.4 22.8 21.4 15.5
Ferrari Dino       19.2 21.4 19.2 15.2 18.1 19.2

These sure look reasonable. A better method for sanity checking is to call densityplot on the variable that we assign the mice output to:

densityplot(imp)

Figure 11.4: Density plots of all the imputed values for mpg, drat, wt, and qsec. Each imputation has its own density curve in each quadrant

This displays, for every attribute imputed, a density plot of the actual non-missing values (the thick line) and the imputed values (the thin lines). We are looking to see that the distributions are similar. Note that the density curves of the imputed values extend much higher than the observed values' density curve in this case. This is partly because we imputed so few variables that there weren't enough data points to properly smooth the density approximation. Height and non-smoothness notwithstanding, these density plots indicate no outlandish behavior among the imputed variables.

We are now ready for the analysis phase. We are going to perform linear regression on each imputed dataset and attempt to model mpg as a function of am, wt, and qsec. Instead of repeating the analyses on each dataset manually, we can apply an expression to all the datasets at one time with the with function, as follows:

imp_models <- with(imp, lm(mpg ~ am + wt + qsec))

We could take a peek at the estimated coefficients from each dataset using lapply on the analyses attribute of the returned object:

lapply(imp_models$analyses, coef)
---------------------------------
[[1]]
(Intercept)          am          wt        qsec
 18.1534095   2.0284014  -4.4054825   0.8637856

[[2]]
(Intercept)          am          wt        qsec
   8.375455    3.336896   -3.520882    1.219775

[[3]]
(Intercept)          am          wt        qsec
   5.254578    3.277198   -3.233096    1.337469
.........

Finally, let's pool the results of the analyses (with the pool function), and call summary on it:

pooled_model <- pool(imp_models)
summary(pooled_model)
----------------------------------
                  est        se         t       df    Pr(>|t|)
(Intercept)  7.049781 9.2254581  0.764166 17.63319 0.454873254
am           3.182049 1.7445444  1.824000 21.36600 0.082171407
wt          -3.413534 0.9983207 -3.419276 14.99816 0.003804876
qsec         1.270712 0.3660131  3.471765 19.93296 0.002416595
                  lo 95     hi 95 nmis       fmi    lambda
(Intercept) -12.3611281 26.460690   NA 0.3459197 0.2757138
am           -0.4421495  6.806247    0 0.2290359 0.1600952
wt           -5.5414268 -1.285641    3 0.4324828 0.3615349
qsec          0.5070570  2.034366    4 0.2736026 0.2042003

Though we could have performed the pooling ourselves using the equations that Donald Rubin outlined in his 1987 classic Multiple Imputation for Nonresponse in Surveys, it is less of a hassle and less error-prone to have the pool function do it for us. Readers who are interested in the pooling rules are encouraged to consult the aforementioned text. As you can see, for each parameter, pool has combined the coefficient estimates and standard errors, and calculated the appropriate degrees of freedom.
These allow us to t-test each coefficient against the null hypothesis that the coefficient is equal to 0, produce p-values for the t-tests, and construct confidence intervals. The standard errors and confidence intervals are wider than those that would have resulted from linear regression on a single imputed dataset, but that's because we are appropriately taking into account our uncertainty regarding what the missing values would have been.

There are, at present, a limited number of analyses that can be automatically pooled by mice, the most important being lm/glm. If you recall, though, the generalized linear model is extremely flexible, and can be used to express a wide array of different analyses. By extension, we could use multiple imputation for not only linear regression but also logistic regression, Poisson regression, t-tests, ANOVA, ANCOVA, and more.

Analysis with unsanitized data

Very often, there will be errors or mistakes in data that can severely complicate analyses, especially with public data or data from outside your organization. For example, say there is a stray comma or punctuation mark in a column that was supposed to be numeric. If we aren't careful, R will read this column as character, and subsequent analysis may, in the best-case scenario, fail; it is also possible, however, that our analysis will silently chug along and return an unexpected result. This will happen, for example, if we try to perform linear regression using the punctuation-containing-but-otherwise-numeric column as a predictor, which will compel R to convert it into a factor, thinking that it is a categorical variable.

In the worst-case scenario, an analysis with unsanitized data may not error out or return nonsensical results, but return results that look plausible but are actually incorrect. For example, it is common (for some reason) to encode missing data with 999 instead of NA; performing a regression analysis with 999s in a numeric column can severely adulterate our linear models, but often not enough to cause clearly inappropriate results. This mistake may then go undetected indefinitely.

Some problems like these could, rather easily, be detected in small datasets by visually auditing the data. Often, however, mistakes like these are notoriously easy to miss. Further, visual inspection is an untenable solution for datasets with thousands of rows and hundreds of columns. Any sustainable solution must off-load this auditing process to R. But how do we describe aberrant behavior to R so that it can catch mistakes on its own?

The package assertr seeks to do this by introducing a number of data-checking verbs. Using assertr grammar, these verbs (functions) can be combined with subjects (data) in different ways to express a rich vocabulary of data validation tasks. More prosaically, assertr provides a suite of functions designed to verify assumptions about data early in the analysis process, before any time is wasted computing on bad data. The idea is to provide as much information as you can about how you expect the data to look upfront, so that any deviation from this expectation can be dealt with immediately.

Given that the assertr grammar is designed to describe a bouquet of error-checking routines, rather than list all the functions and functionalities that the package provides, it will be more helpful to visit particular use cases. Two things before we start. First, make sure you install assertr.
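If you haven't already, the package can be installed from CRAN and attached in the usual way:

# install from CRAN (only needed once), then attach the package
install.packages("assertr")
library(assertr)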
Second, bear in mind that all data verification verbs in assertr take a data frame to check as their first argument, and either (a) return the same data frame if the check passes, or (b) produce a fatal error. Since the verbs return a copy of the chosen data frame if the check passes, the main idiom in assertr involves reassignment of the returned data frame after it passes the check.

a_dataset <- CHECKING_VERB(a_dataset, ....)

Checking for out-of-bounds data

It's common for numeric values in a column to have a natural constraint on the values that they should hold. For example, if a column represents a percent of something, we might want to check if all the values in that column are between 0 and 1 (or 0 and 100). In assertr, we typically use the within_bounds function in conjunction with the assert verb to ensure that this is the case. For example, suppose we added a column to mtcars that represents each car's weight as a percent of the heaviest car's weight; we can check it as follows:

library(assertr)
mtcars.copy <- mtcars

mtcars.copy$Percent.Max.Wt <- round(mtcars.copy$wt /
                                    max(mtcars.copy$wt),
                                    2)

mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                      Percent.Max.Wt)

within_bounds is actually a function that takes the lower and upper bounds and returns a predicate, a function that returns TRUE or FALSE. The assert function then applies this predicate to every element of the column specified in the third argument. If there are more than three arguments, assert will assume there are more columns to check.

Using within_bounds, we can also avoid the situation where NA values are encoded as 999, as long as the second argument to within_bounds is less than this value. within_bounds can take other information, such as whether the bounds should be inclusive or exclusive, or whether it should ignore NA values. To see the options for this, and for all the other functions in assertr, use the help function on them.

Let's see an example of what it looks like when the assert function fails:

mtcars.copy$Percent.Max.Wt[c(10,15)] <- 2
mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                      Percent.Max.Wt)
------------------------------------------------------------
Error: Vector 'Percent.Max.Wt' violates assertion 'within_bounds' 2 times (e.g. [2] at index 10)

We get an informative error message that tells us how many times the assertion was violated, and the index and value of the first offending datum.

With assert, we have the option of checking a condition on multiple columns at the same time. For example, none of the measurements in iris can possibly be negative. Here's how we might make sure our dataset is compliant:

iris <- assert(iris, within_bounds(0, Inf),
               Sepal.Length, Sepal.Width,
               Petal.Length, Petal.Width)

# or simply "-Species" because that
# will include all columns *except* Species
iris <- assert(iris, within_bounds(0, Inf),
               -Species)

On occasion, we will want to check elements for adherence to a more complicated pattern. For example, let's say we had a column that we knew was either between -10 and -20, or 10 and 20. We can check for this by using the more flexible verify verb, which takes a logical expression as its second argument; if any of the results of the logical expression are FALSE, verify will cause an error.
vec <- runif(10, min=10, max=20)
# randomly turn some elements negative
vec <- vec * sample(c(1, -1), 10,
                    replace=TRUE)

example <- data.frame(weird=vec)

example <- verify(example, ((weird < 20 & weird > 10) |
                            (weird < -10 & weird > -20)))

# or

example <- verify(example, abs(weird) < 20 & abs(weird) > 10) # passes

example$weird[4] <- 0
example <- verify(example, abs(weird) < 20 & abs(weird) > 10) # fails
-------------------------------------
Error in verify(example, abs(weird) < 20 & abs(weird) > 10) :
  verification failed! (1 failure)

Checking the data type of a column

By default, most of the data import functions in R will attempt to guess the data type of each column at the import phase. This is usually nice, because it saves us from tedious work. However, it can backfire when there are, for example, stray punctuation marks in what are supposed to be numeric columns. To verify that this hasn't happened, we can use the assert function with the is.numeric base function:

iris <- assert(iris, is.numeric, -Species)

We can use the is.character and is.logical functions with assert, too. An alternative method that will disallow the import of unexpected data types is to specify the data type that each column should have at the data import phase, with the colClasses optional argument:

iris <- read.csv("PATH_TO_IRIS_DATA.csv",
                 colClasses=c("numeric", "numeric",
                              "numeric", "numeric",
                              "character"))

This solution comes with the added benefit of speeding up the data import process, since R doesn't have to waste time guessing each column's data type.

Checking for unexpected categories

Another data integrity impropriety that is, unfortunately, very common is the mislabeling of categorical variables. There are two types of mislabeling that can occur: an observation's class is mis-entered/mis-recorded/mistaken for that of another class, or the observation's class is labeled in a way that is not consistent with the rest of the labels. To see an example of what we can do to combat the former case, read assertr's vignette. The latter case covers instances where, for example, the species of iris could be misspelled (such as "versicolour" or "verginica"), or cases where the pattern established by the majority of class names is ignored ("iris setosa", "i. setosa", "SETOSA").

Either way, these misspecifications prove to be a great bane to data analysts, for several reasons. For example, an analysis that is predicated upon a two-class categorical variable (for example, logistic regression) will now have to contend with more than two categories. Yet another way in which unexpected categories can haunt you is by producing statistics grouped by different values of a categorical variable; if the categories were extracted from the main data manually (with subset, for example, as opposed to with by, tapply, or aggregate), you'll be missing potentially crucial observations.

If you know what categories you are expecting from the start, you can use the in_set function, in concert with assert, to confirm that all the categories of a particular column are squarely contained within a predetermined set.
# passes
iris <- assert(iris, in_set("setosa", "versicolor",
                            "virginica"), Species)

# mess up the data
iris.copy <- iris
# We have to make the 'Species' column not
# a factor
iris.copy$Species <- as.vector(iris$Species)
iris.copy$Species[4:9] <- "SETOSA"
iris.copy$Species[135] <- "verginica"
iris.copy$Species[95] <- "i. versicolor"

# fails
iris.copy <- assert(iris.copy, in_set("setosa", "versicolor",
                                      "virginica"), Species)
-------------------------------------------
Error: Vector 'Species' violates assertion 'in_set' 8 times (e.g. [SETOSA] at index 4)

If you don't know, a priori, the categories that you should be expecting, the following incantation, which will tell you how many rows each category contains, may help you identify the categories that are either rare or misspecified:

by(iris.copy, iris.copy$Species, nrow)

Checking for outliers, entry errors, or unlikely data points

Automatic outlier detection (sometimes known as anomaly detection) is something that a lot of analysts scoff at and view as a pipe dream. Though the creation of a routine that automagically detects all erroneous data points with 100 percent specificity and precision is impossible, unmistakably mis-entered data points and flagrant outliers are not hard to detect, even with very simple methods. In my experience, there are a lot of errors of this type.

One simple way to detect the presence of a major outlier is to confirm that every data point is within some number n of standard deviations away from the mean of the group. assertr has a function, within_n_sds, used in conjunction with the insist verb, to do just this; if we wanted to check that every numeric value in iris is within five standard deviations of its respective column's mean, we could express it as follows:

iris <- insist(iris, within_n_sds(5), -Species)

An issue with using standard deviations away from the mean (z-scores) for detecting outliers is that both the mean and standard deviation are influenced heavily by outliers; this means that the very thing we are trying to detect is obstructing our ability to find it. There is a more robust measure of central tendency and dispersion than the mean and standard deviation: the median and the median absolute deviation. The median absolute deviation is the median of the absolute differences between a vector's elements and the vector's median. assertr has a sister function to within_n_sds, within_n_mads, which checks every element of a vector to make sure it is within n median absolute deviations of its column's median.

iris <- insist(iris, within_n_mads(4), -Species)
iris$Petal.Length[5] <- 15
iris <- insist(iris, within_n_mads(4), -Species)
---------------------------------------------
Error: Vector 'Petal.Length' violates assertion 'within_n_mads' 1 time (value [15] at index 5)

In my experience, within_n_mads can be an effective guard against illegitimate univariate outliers if n is chosen carefully. The examples here have been focusing on outlier identification in the univariate case, across one dimension at a time. Often, there are times when an observation is truly anomalous but it wouldn't be evident by looking at the spread of each dimension individually. assertr has support for this type of multivariate outlier analysis, but a full discussion of it would require a background outside the scope of this text.
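For the curious, though, the following is a minimal sketch of what that multivariate checking can look like; it is based on the pattern shown in assertr's vignette, using the maha_dist helper (which reduces each row to a single Mahalanobis-style distance) together with the row-wise insist_rows verb. Treat the threshold of 10 as an illustrative assumption, not a recommendation:

# flag rows of iris whose overall multivariate profile is anomalous:
# maha_dist computes a distance for every row, and insist_rows checks
# that each row's distance is within 10 MADs of the median distance
library(assertr)
iris <- insist_rows(iris, maha_dist, within_n_mads(10), everything())

As with the univariate case, the choice of n deserves care; the package's vignette covers this machinery in far more depth.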
Chaining assertions

assertr aims to make the checking of assumptions so effortless that the user never feels the need to hold back any implicit assumption. Therefore, it's expected that the user will use multiple checks on one data frame. The usage examples that we've seen so far are really only appropriate for one or two checks. For example, a usage pattern such as the following is clearly unworkable:

iris <- CHECKING_CONSTRUCT4(CHECKING_CONSTRUCT3(CHECKING_CONSTRUCT2(CHECKING_CONSTRUCT1(this, ...), ...), ...), ...)

To combat this visual cacophony, assertr provides direct support for chaining multiple assertions by using the "piping" construct from the magrittr package. magrittr's pipe operator, %>%, works as follows: it takes the item on the left-hand side of the pipe and inserts it (by default) into the position of the first argument of the function on the right-hand side. The following are some examples of simple magrittr usage patterns:

library(magrittr)
4 %>% sqrt              # 2
iris %>% head(n=3)      # the first 3 rows of iris
iris <- iris %>% assert(within_bounds(0, Inf), -Species)

Since the return value of a passed assertr check is the validated data frame, you can use the magrittr pipe operator to tack on more checks in a way that lends itself to easier human understanding. For example:

iris <- iris %>%
  assert(is.numeric, -Species) %>%
  assert(within_bounds(0, Inf), -Species) %>%
  assert(in_set("setosa", "versicolor", "virginica"), Species) %>%
  insist(within_n_mads(4), -Species)

# or, equivalently

CHECKS <- . %>%
  assert(is.numeric, -Species) %>%
  assert(within_bounds(0, Inf), -Species) %>%
  assert(in_set("setosa", "versicolor", "virginica"), Species) %>%
  insist(within_n_mads(4), -Species)

iris <- iris %>% CHECKS

When chaining assertions, I like to put the most integral and general ones right at the top. I also like to put the assertions most likely to be violated right at the top, so that execution is terminated before any more checks are run.

There are many other capabilities built into assertr's multivariate outlier checking. For more information about these, read the package's vignette (vignette("assertr")). On the magrittr side, besides the forward-pipe operator, this package sports some other very helpful pipe operators. Additionally, magrittr allows the substitution on the right side of the pipe operator to occur at locations other than the first argument. For more information about the wonderful magrittr package, read its vignette.

Other messiness

As we discussed in this article's preface, there are countless ways in which a dataset may be messy. There are many other messy situations and solutions that we couldn't discuss at length here. In order that you, dear reader, are not left in the dark regarding custodial solutions, here are some other remedies that you may find helpful along your analytics journey.

OpenRefine

Though OpenRefine (formerly Google Refine) doesn't have anything to do with R per se, it is a sophisticated tool for working with and cleaning up messy data. Among its numerous, sophisticated capabilities is the capacity to auto-detect misspelled or misspecified categories and fix them at the click of a button.

Regular expressions

Suppose you find that there are commas separating every third digit of the numbers in a numeric column. How would you remove them? Or suppose you needed to strip a currency symbol from values in columns that hold monetary values so that you can compute with them as numbers.
These, and vastly more complicated text transformations, can be performed using regular expressions (a formal grammar for specifying search patterns in text) and associated R functions like grep and sub. Any time spent learning regular expressions will pay enormous dividends over your career as an analyst, and there are many great, free tutorials available on the web for this purpose.

tidyr

There are a few different ways in which you can represent the same tabular dataset. In one form (called long, narrow, stacked, or the entity-attribute-value model), each row contains an observation ID, a variable name, and the value of that variable. For example:

            member  attribute  value
1     Ringo Starr  birthyear   1940
2  Paul McCartney  birthyear   1942
3 George Harrison  birthyear   1943
4     John Lennon  birthyear   1940
5     Ringo Starr instrument  Drums
6  Paul McCartney instrument   Bass
7 George Harrison instrument Guitar
8     John Lennon instrument Guitar

In another form (called wide or unstacked), each of the observation's variables is stored in its own column:

            member birthyear instrument
1 George Harrison      1943     Guitar
2     John Lennon      1940     Guitar
3  Paul McCartney      1942       Bass
4     Ringo Starr      1940      Drums

If you ever need to convert between these representations (which is a somewhat common operation in practice), tidyr is your tool for the job.

Exercises

The following are a few exercises for you to strengthen your grasp over the concepts learned in this article:

Normally, when there is missing data for a question such as "What is your income?", we strongly suspect an MNAR mechanism, because we live in a dystopia that equates wealth with worth. As a result, the participants with the lowest income may be embarrassed to answer that question. In the relevant section, we assumed that because the question was poorly worded and we could account for whether English was the first language of the participant, the mechanism is MAR. If we were wrong about this reason, and it was really because the lower-income participants were reticent to admit their income, what would the missing data mechanism be now? If, however, the differences in income were fully explained by whether English was the first language of the participant, what would the missing data mechanism be in that case?

Find a dataset on the web with missing data. What does it use to denote that data is missing? Think about that dataset's missing data mechanism. Is there a chance that this data is MNAR?

Find a freely available government dataset on the web. Read the dataset's description, and think about what assumptions you might make about the data when planning a certain analysis. Translate these into actual code so that R can check them for you. Were there any deviations from your expectations?

When two autonomous individuals decide to voluntarily trade, the transaction can be in both parties' best interests. Does it necessarily follow that a voluntary trade between nations benefits both states? Why or why not?

Summary

"Messy data", no matter what definition you use, presents a huge roadblock for people who work with data. This article focused on two of the most notorious and prolific culprits: missing data, and data that has not been cleaned or audited for quality. On the missing data side, you learned how to visualize missing data patterns, and how to recognize different types of missing data.
You saw a few unprincipled ways of tackling the problem, and learned why they are suboptimal solutions. Multiple imputation, as you learned, addresses the shortcomings of these approaches and, through its usage of several imputed datasets, correctly communicates our uncertainty surrounding the imputed values.

On unsanitized data, we saw that the perhaps-optimal solution (visually auditing the data) is untenable for moderately sized datasets or larger. We discovered that the grammar of the assertr package provides a mechanism to off-load this auditing process to R. You now have a few assertr checking "recipes" under your belt for some of the more common manifestations of the mistakes that plague data that has not been scrutinized.

You can check out similar books published by Packt Publishing on R (https://www.packtpub.com/tech/r):

Unsupervised Learning with R by Erik Rodríguez Pacheco (https://www.packtpub.com/big-data-and-business-intelligence/unsupervised-learning-r)
R Data Science Essentials by Raja B. Koushik and Sharan Kumar Ravindran (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science-essentials)
Introduction to Clustering and Unsupervised Learning

Packt
23 Feb 2016
16 min read
The act of clustering, or spotting patterns in data, is not much different from spotting patterns in groups of people. In this article, you will learn:

The ways clustering tasks differ from classification tasks
How clustering defines a group, and how such groups are identified by k-means, a classic and easy-to-understand clustering algorithm
The steps needed to apply clustering to a real-world task of identifying marketing segments among teenage social media users

Before jumping into action, we'll begin by taking an in-depth look at exactly what clustering entails.

Understanding clustering

Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. It does this without having been told how the groups should look ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data.

Without advance knowledge of what comprises a cluster, how can a computer possibly know where one group ends and another begins? The answer is simple. Clustering is guided by the principle that items inside a cluster should be very similar to each other, but very different from those outside. The definition of similarity might vary across applications, but the basic idea is always the same: group the data so that related elements are placed together.

The resulting clusters can then be used for action. For instance, you might find clustering methods employed in the following applications:

Segmenting customers into groups with similar demographics or buying patterns for targeted marketing campaigns
Detecting anomalous behavior, such as unauthorized network intrusions, by identifying patterns of use falling outside the known clusters
Simplifying extremely large datasets by grouping features with similar values into a smaller number of homogeneous categories

Overall, clustering is useful whenever diverse and varied data can be exemplified by a much smaller number of groups. It results in meaningful and actionable data structures that reduce complexity and provide insight into patterns of relationships.

Clustering as a machine learning task

Clustering is somewhat different from the classification, numeric prediction, and pattern detection tasks we have examined so far. In each of those cases, the result is a model that relates features to an outcome or features to other features; conceptually, the model describes the existing patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label that has been inferred entirely from the relationships within the data. For this reason, you will sometimes see the clustering task referred to as unsupervised classification because, in a sense, it classifies unlabeled examples.

The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related (for instance, it might return the groups A, B, and C), but it's up to you to apply an actionable and meaningful label to them.

To see how this impacts the clustering task, let's consider a hypothetical example. Suppose you were organizing a conference on the topic of data science.
To facilitate professional networking and collaboration, you planned to seat people in groups according to one of three research specialties: computer and/or database science, math and statistics, and machine learning. Unfortunately, after sending out the conference invitations, you realize that you had forgotten to include a survey asking which discipline the attendee would prefer to be seated with.

In a stroke of brilliance, you realize that you might be able to infer each scholar's research specialty by examining his or her publication history. To this end, you begin collecting data on the number of articles each attendee published in computer science-related journals and the number of articles published in math or statistics-related journals. Using the data collected for several scholars, you create a scatterplot.

As expected, there seems to be a pattern. We might guess that the upper-left corner, which represents people with many computer science publications but few articles on math, could be a cluster of computer scientists. Following this logic, the lower-right corner might be a group of mathematicians. Similarly, the upper-right corner, those with both math and computer science experience, may be machine learning experts.

Our groupings were formed visually; we simply identified clusters as closely grouped data points. Yet in spite of the seemingly obvious groupings, we unfortunately have no way to know whether they are truly homogeneous without personally asking each scholar about his or her academic specialty. The labels we applied required us to make qualitative, presumptive judgments about the types of people that would fall into each group. For this reason, you might imagine the cluster labels in uncertain terms.

Rather than defining the group boundaries subjectively, it would be nice to use machine learning to define them objectively. This might provide us with a rule of the form: if a scholar has few math publications, then he/she is a computer science expert. Unfortunately, there's a problem with this plan. As we do not have data on the true class value for each point, a supervised learning algorithm would have no ability to learn such a pattern, as it would have no way of knowing which splits would result in homogeneous groups.

On the other hand, clustering algorithms use a process very similar to what we did by visually inspecting the scatterplot. Using a measure of how closely the examples are related, homogeneous groups can be identified. In the next section, we'll start looking at how clustering algorithms are implemented.

This example highlights an interesting application of clustering. If you begin with unlabeled data, you can use clustering to create class labels. From there, you could apply a supervised learner such as decision trees to find the most important predictors of these classes. This is called semi-supervised learning.

The k-means clustering algorithm

The k-means algorithm is perhaps the most commonly used clustering method. Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques. If you understand the simple principles it uses, you will have the knowledge needed to understand nearly any clustering algorithm in use today. Many such methods are listed on the CRAN Task View for clustering at http://cran.r-project.org/web/views/Cluster.html. As k-means has evolved over time, there are many implementations of the algorithm.
One popular approach is described in: Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979; 28:100-108.

Even though clustering methods have advanced since the inception of k-means, this is not to imply that k-means is obsolete. In fact, the method may be more popular now than ever. The following lists some reasons why k-means is still widely used:

Strengths:
Uses simple principles that can be explained in non-statistical terms
Highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings
Performs well enough under many real-world use cases

Weaknesses:
Not as sophisticated as more modern clustering algorithms
Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters
Requires a reasonable guess as to how many clusters naturally exist in the data
Not ideal for non-spherical clusters or clusters of widely varying density

The k-means algorithm assigns each of the n examples to one of the k clusters, where k is a number that has been determined ahead of time. The goal is to minimize the differences within each cluster and maximize the differences between the clusters.

Unless k and n are extremely small, it is not feasible to compute the optimal clusters across all the possible combinations of examples. Instead, the algorithm uses a heuristic process that finds locally optimal solutions. Put simply, this means that it starts with an initial guess for the cluster assignments, and then modifies the assignments slightly to see whether the changes improve the homogeneity within the clusters.

We will cover the process in depth shortly, but the algorithm essentially involves two phases. First, it assigns examples to an initial set of k clusters. Then, it updates the assignments by adjusting the cluster boundaries according to the examples that currently fall into each cluster. The process of updating and assigning occurs several times, until changes no longer improve the cluster fit. At this point, the process stops and the clusters are finalized.

Due to the heuristic nature of k-means, you may end up with somewhat different final results by making only slight changes to the starting conditions. If the results vary dramatically, this could indicate a problem. For instance, the data may not have natural groupings, or the value of k has been poorly chosen. With this in mind, it's a good idea to try a cluster analysis more than once to test the robustness of your findings.

To see how the process of assigning and updating works in practice, let's revisit the case of the hypothetical data science conference. Though this is a simple example, it will illustrate the basics of how k-means operates under the hood.

Using distance to assign and update clusters

As with k-NN, k-means treats feature values as coordinates in a multidimensional feature space. For the conference data, there are only two features, so we can represent the feature space as a two-dimensional scatterplot, as depicted previously.

The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers. These centers are the catalyst that spurs the remaining examples to fall into place. Often, the points are chosen by selecting k random examples from the training dataset. As we hope to identify three clusters, according to this method, k = 3 points will be selected at random.
These points are indicated by the star, triangle, and diamond in the following diagram.

It's worth noting that although the three cluster centers in the preceding diagram happen to be widely spaced apart, this is not always necessarily the case. Since they are selected at random, the three centers could have just as easily been three adjacent points. As the k-means algorithm is highly sensitive to the starting position of the cluster centers, random chance may have a substantial impact on the final set of clusters.

To address this problem, k-means can be modified to use different methods for choosing the initial centers. For example, one variant chooses random values occurring anywhere in the feature space (rather than only selecting among the values observed in the data). Another option is to skip this step altogether; by randomly assigning each example to a cluster, the algorithm can jump ahead immediately to the update phase. Each of these approaches adds a particular bias to the final set of clusters, which you may be able to use to improve your results.

In 2007, an algorithm called k-means++ was introduced, which proposes an alternative method for selecting the initial cluster centers. It purports to be an efficient way to get much closer to the optimal clustering solution while reducing the impact of random chance. For more information, refer to Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. 2007:1027–1035.

After choosing the initial cluster centers, the other examples are assigned to the nearest cluster center, according to the distance function. You will remember that we studied distance functions while learning about k-Nearest Neighbors. Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used. Recall that if n indicates the number of features, the Euclidean distance between example x and example y is:

dist(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

For instance, if we are comparing a guest with five computer science publications and one math publication to a guest with zero computer science papers and two math papers, we could compute this in R as follows:

> sqrt((5 - 0)^2 + (1 - 2)^2)
[1] 5.09902

Using this distance function, we find the distance between each example and each cluster center. The example is then assigned to the nearest cluster center. Keep in mind that as we are using distance calculations, all the features need to be numeric, and the values should be normalized to a standard range ahead of time.

As shown in the following diagram, the three cluster centers partition the examples into three segments labeled Cluster A, Cluster B, and Cluster C. The dashed lines indicate the boundaries of the Voronoi diagram created by the cluster centers. The Voronoi diagram indicates the areas that are closer to one cluster center than any other; the vertex where all three boundaries meet is at the maximal distance from all three cluster centers. Using these boundaries, we can easily see the regions claimed by each of the initial k-means seeds.

Now that the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase. The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the average position of the points currently assigned to that cluster.
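Computing a centroid is nothing more exotic than taking the column-wise mean of a cluster's members. A minimal R sketch follows; the scholars data frame and its cluster assignments here are hypothetical values invented for illustration, not data from this article:

# cs and math are the two publication-count features; cluster holds the
# current assignment ("A", "B", or "C") for each scholar
scholars <- data.frame(cs      = c(5, 4, 1, 0, 4),
                       math    = c(1, 0, 4, 5, 5),
                       cluster = c("A", "A", "B", "B", "C"))

# each cluster's centroid is the mean of its members' coordinates
centroids <- aggregate(cbind(cs, math) ~ cluster,
                       data = scholars, FUN = mean)
centroids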
The following diagram illustrates how, as the cluster centers shift to the new centroids, the boundaries in the Voronoi diagram also shift, and a point that was once in Cluster B (indicated by an arrow) is added to Cluster A.

As a result of this reassignment, the k-means algorithm will continue through another update phase. After shifting the cluster centroids, updating the cluster boundaries, and reassigning points into new clusters (as indicated by arrows), the diagram is redrawn accordingly. Because two more points were reassigned, another update must occur, which moves the centroids and updates the cluster boundaries. However, because these changes result in no reassignments, the k-means algorithm stops. The cluster assignments are now final.

The final clusters can be reported in one of two ways. First, you might simply report the cluster assignments, such as A, B, or C, for each example. Alternatively, you could report the coordinates of the cluster centroids after the final update. Given either reporting method, you are able to define the cluster boundaries by calculating the centroids or assigning each example to its nearest cluster.

Choosing the appropriate number of clusters

In the introduction to k-means, we learned that the algorithm is sensitive to the randomly chosen cluster centers. Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected. Similarly, k-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting k to be very large will improve the homogeneity of the clusters, but at the same time, it risks overfitting the data.

Ideally, you will have a priori knowledge (a prior belief) about the true groupings, and you can apply this information to choosing the number of clusters. For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards. In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited.

Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example, the number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list. Extending this idea to another business case, if the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals.

Without any prior knowledge, one rule of thumb suggests setting k equal to the square root of n/2, where n is the number of examples in the dataset. However, this rule of thumb is likely to result in an unwieldy number of clusters for large datasets. Luckily, there are other statistical methods that can assist in finding a suitable set of k-means clusters.

A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k. As illustrated in the following diagrams, the homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will continue to decrease with more clusters.
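If you want to produce such an elbow curve yourself, a small R sketch along the following lines works. It assumes any numeric data matrix x and uses the tot.withinss field returned by R's built-in kmeans function (the total within-cluster sum of squares, a standard homogeneity measure):

# total within-cluster sum of squares for k = 1..10
x <- scale(iris[, -5])   # any numeric, normalized matrix will do
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# the "elbow" is where this curve stops dropping sharply
plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")

The nstart argument simply reruns each clustering from several random starting points and keeps the best result, which softens the random-initialization sensitivity discussed earlier.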
As you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find k such that there are diminishing returns beyond that point. This value of k is known as the elbow point, because of the shape it makes in the plot.

There are numerous statistics for measuring homogeneity and heterogeneity within clusters that can be used with the elbow method (the following information box provides a citation for more detail). Still, in practice, it is not always feasible to iteratively test a large number of k values. This is in part because clustering large datasets can be fairly time consuming; clustering the data repeatedly is even worse. Regardless, applications requiring the exact optimal set of clusters are fairly rare. In most clustering applications, it suffices to choose a k value based on convenience rather than strict performance requirements.

For a very thorough review of the vast assortment of cluster performance measures, refer to: Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of Intelligent Information Systems. 2001; 17:107-145.

The process of setting k itself can sometimes lead to interesting insights. By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries. Groups that are more tightly clustered will change little, while less homogeneous groups will form and disband over time.

In general, it may be wise to spend little time worrying about getting k exactly right. The next example will demonstrate how even a tiny bit of subject-matter knowledge borrowed from a Hollywood film can be used to set k such that actionable and interesting clusters are found. As clustering is unsupervised, the task is really about what you make of it; the value is in the insights you take away from the algorithm's findings.

Summary

This article covered only the fundamentals of clustering. As a very mature machine learning method, there are many variants of the k-means algorithm, as well as many other clustering algorithms that bring unique biases and heuristics to the task. Based on the foundation in this article, you will be able to understand and apply other clustering methods to new problems.

To learn more about different machine learning techniques, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r)
Mastering Scientific Computing with R (https://www.packtpub.com/application-development/mastering-scientific-computing-r)
R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science)
Training neural networks efficiently using Keras

Packt
22 Feb 2016
9 min read
In this article, we will take a look at Keras, one of the most recently developed libraries for facilitating neural network training. Development of Keras started in the early months of 2015; as of today, it has evolved into one of the most popular and widely used libraries built on top of Theano, and it allows us to utilize our GPU to accelerate neural network training. One of its prominent features is its very intuitive API, which allows us to implement neural networks in only a few lines of code. Once you have Theano installed, you can install Keras from PyPI by executing the following command from your terminal command line:

pip install Keras

For more information about Keras, please visit the official website at http://keras.io.

To see what neural network training via Keras looks like, let's implement a multilayer perceptron to classify the handwritten digits from the MNIST dataset. The MNIST dataset can be downloaded from http://yann.lecun.com/exdb/mnist/ in four parts, as listed here:

train-images-idx3-ubyte.gz: These are training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: These are training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: These are test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: These are test set labels (4542 bytes)

After downloading and unzipping the archives, we place the files into a directory mnist in our current working directory, so that we can load the training as well as the test dataset using the following function:

import os
import struct
import numpy as np

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind)
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)
    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)
    return images, labels

X_train, y_train = load_mnist('mnist', kind='train')
print('Rows: %d, columns: %d' % (X_train.shape[0], X_train.shape[1]))
Rows: 60000, columns: 784

X_test, y_test = load_mnist('mnist', kind='t10k')
print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1]))
Rows: 10000, columns: 784

On the following pages, we will walk through the code examples for using Keras step by step, which you can directly execute from your Python interpreter. However, if you are interested in training the neural network on your GPU, you can either put the code into a Python script, or download the respective code from the Packt Publishing website. In order to run the Python script on your GPU, execute the following command from the directory where the mnist_keras_mlp.py file is located:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python mnist_keras_mlp.py

To continue with the preparation of the training data, let's cast the MNIST image array into 32-bit format:

>>> import theano
>>> theano.config.floatX = 'float32'
>>> X_train = X_train.astype(theano.config.floatX)
>>> X_test = X_test.astype(theano.config.floatX)

Next, we need to convert the class labels (integers 0-9) into the one-hot format.
Fortunately, Keras provides a convenient tool for this:

>>> from keras.utils import np_utils
>>> print('First 3 labels: ', y_train[:3])
First 3 labels: [5 0 4]
>>> y_train_ohe = np_utils.to_categorical(y_train)
>>> print('\nFirst 3 labels (one-hot):\n', y_train_ohe[:3])
First 3 labels (one-hot):
[[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]

Now, we can get to the interesting part and implement a neural network. However, we will replace the logistic units in the hidden layer with hyperbolic tangent activation functions, replace the logistic function in the output layer with softmax, and add an additional hidden layer. Keras makes these tasks very simple, as you can see in the following code implementation:

>>> from keras.models import Sequential
>>> from keras.layers.core import Dense
>>> from keras.optimizers import SGD
>>> np.random.seed(1)
>>> model = Sequential()
>>> model.add(Dense(input_dim=X_train.shape[1],
...                 output_dim=50,
...                 init='uniform',
...                 activation='tanh'))
>>> model.add(Dense(input_dim=50,
...                 output_dim=50,
...                 init='uniform',
...                 activation='tanh'))
>>> model.add(Dense(input_dim=50,
...                 output_dim=y_train_ohe.shape[1],
...                 init='uniform',
...                 activation='softmax'))
>>> sgd = SGD(lr=0.001, decay=1e-7, momentum=.9)
>>> model.compile(loss='categorical_crossentropy', optimizer=sgd)

First, we initialize a new model using the Sequential class to implement a feedforward neural network. Then, we can add as many layers to it as we like. However, since the first layer that we add is the input layer, we have to make sure that the input_dim attribute matches the number of features (columns) in the training set (here, 784). Also, we have to make sure that the number of output units (output_dim) and input units (input_dim) of two consecutive layers match. In the preceding example, we added two hidden layers with 50 hidden units plus 1 bias unit each. Note that bias units are initialized to 0 in fully connected networks in Keras. This is in contrast to the MLP implementation, where we initialized the bias units to 1, which is a more common (not necessarily better) convention. Finally, the number of units in the output layer should be equal to the number of unique class labels, that is, the number of columns in the one-hot encoded class label array.

Before we can compile our model, we also have to define an optimizer. In the preceding example, we chose stochastic gradient descent optimization. Furthermore, we can set values for the weight decay constant and momentum learning to adjust the learning rate at each epoch. Lastly, we set the cost (or loss) function to categorical_crossentropy. The (binary) cross-entropy is just the technical term for the cost function in logistic regression, and categorical cross-entropy is its generalization for multi-class predictions via softmax.

After compiling the model, we can now train it by calling the fit method. Here, we are using mini-batch stochastic gradient descent with a batch size of 300 training samples per batch. We train the MLP over 50 epochs, and we can follow the optimization of the cost function during training by setting verbose=1. The validation_split parameter is especially handy, since it will reserve 10 percent of the training data (here, 6,000 samples) for validation after each epoch, so that we can check if the model is overfitting during training.

>>> model.fit(X_train,
...           y_train_ohe,
...           nb_epoch=50,
...           batch_size=300,
...           verbose=1,
...           validation_split=0.1,
...           show_accuracy=True)
Train on 54000 samples, validate on 6000 samples
Epoch 0
54000/54000 [==============================] - 1s - loss: 2.2290 - acc: 0.3592 - val_loss: 2.1094 - val_acc: 0.5342
Epoch 1
54000/54000 [==============================] - 1s - loss: 1.8850 - acc: 0.5279 - val_loss: 1.6098 - val_acc: 0.5617
Epoch 2
54000/54000 [==============================] - 1s - loss: 1.3903 - acc: 0.5884 - val_loss: 1.1666 - val_acc: 0.6707
Epoch 3
54000/54000 [==============================] - 1s - loss: 1.0592 - acc: 0.6936 - val_loss: 0.8961 - val_acc: 0.7615
[…]
Epoch 49
54000/54000 [==============================] - 1s - loss: 0.1907 - acc: 0.9432 - val_loss: 0.1749 - val_acc: 0.9482

Printing the value of the cost function is extremely useful during training, since we can quickly spot whether the cost is decreasing, stop the algorithm early if it isn't, and tune the hyperparameter values.

To predict the class labels, we can then use the predict_classes method to return the class labels directly as integers:

>>> y_train_pred = model.predict_classes(X_train, verbose=0)
>>> print('First 3 predictions: ', y_train_pred[:3])
First 3 predictions: [5 0 4]

Finally, let's print the model accuracy on the training and test sets:

>>> train_acc = np.sum(
...     y_train == y_train_pred, axis=0) / X_train.shape[0]
>>> print('Training accuracy: %.2f%%' % (train_acc * 100))
Training accuracy: 94.51%

>>> y_test_pred = model.predict_classes(X_test, verbose=0)
>>> test_acc = np.sum(y_test == y_test_pred,
...                   axis=0) / X_test.shape[0]
>>> print('Test accuracy: %.2f%%' % (test_acc * 100))
Test accuracy: 94.39%

Note that this is just a very simple neural network without optimized tuning parameters. If you are interested in playing more with Keras, please feel free to further tweak the learning rate, momentum, weight decay, and number of hidden units.

Although Keras is a great library for implementing and experimenting with neural networks, there are many other Theano wrapper libraries that are worth mentioning. A prominent example is Pylearn2 (http://deeplearning.net/software/pylearn2/), which has been developed in the LISA lab in Montreal. Also, Lasagne (https://github.com/Lasagne/Lasagne) may be of interest to you if you prefer a more minimalistic but extensible library that offers more control over the underlying Theano code.

Summary

We caught a glimpse of the most beautiful and most exciting algorithms in the whole machine learning field: artificial neural networks. I can recommend that you follow the works of the leading experts in this field, such as Geoff Hinton (http://www.cs.toronto.edu/~hinton/), Andrew Ng (http://www.andrewng.org), Yann LeCun (http://yann.lecun.com), Juergen Schmidhuber (http://people.idsia.ch/~juergen/), and Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy), just to name a few.

To learn more about machine learning, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Building Machine Learning Systems with Python (https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python)
Neural Network Programming with Java (https://www.packtpub.com/networking-and-servers/neural-network-programming-java)
Social Media Insight Using Naive Bayes

Packt
22 Feb 2016
48 min read
Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining.

We look at disambiguating terms in social media using the Naive Bayes algorithm, which is a powerful and surprisingly simple algorithm. Naive Bayes takes a few shortcuts to properly compute the probabilities for classification, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model in this article is a baseline for text mining studies, as the process can work reasonably well for a variety of datasets.

We will cover the following topics in this article:

Downloading data from social network APIs
Transformers for text
The Naive Bayes classifier
Using JSON for saving and loading datasets
The NLTK library for extracting features from text
The F-measure for evaluation

Disambiguation

Text is often called an unstructured format. There is a lot of information there, but it is just there; no headings, no required format, loose syntax, and other problems prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references, just not in a format that allows us to easily extract it!

We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there, and the level of interpretation needed is quite low. Information about the data, such as its type or meaning, is called metadata, and text lacks it. A book also contains some metadata in the form of a table of contents and an index, but the degree is significantly lower than that of a database.

One of the problems is term disambiguation. When a person uses the word bank, is this a financial message or an environmental message (such as river bank)? This type of disambiguation is quite easy in many circumstances for humans (although there are still troubles), but much harder for computers to do.

In this article, we will look at disambiguating the use of the term Python in Twitter's stream. A message on Twitter is called a tweet and is limited to 140 characters. This means there is little room for context. There isn't much metadata available, although hashtags are often used to denote the topic of the tweet.

When people talk about Python, they could be talking about the following things:

The programming language Python
Monty Python, the classic comedy group
The snake Python
A make of shoe called Python

There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet.

Downloading data from a social network

We are going to download a corpus of data from Twitter and use it to sort out spam from useful content. Twitter provides a robust API for collecting information from its servers, and this API is free for small-scale usage.
It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting. First, you'll need to sign up for a Twitter account (which is free). Go to http://twitter.com and register an account if you do not already have one. Next, you'll need to ensure that you only make a certain number of requests in a given time period; this limit is currently 180 requests per hour. It can be tricky to ensure that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API.

You will need a key to access Twitter's data. Go to http://twitter.com and sign in to your account. When you are logged in, go to https://apps.twitter.com/ and click on Create New App. Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app; we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application. Keep the resulting website open; you'll need the access keys that are on this page.

Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter, and it is the official Twitter Python library. You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system, check the documentation at https://github.com/sixohsix/twitter.

Create a new IPython Notebook to download the data. We will create several notebooks in this article for various purposes, so it might be a good idea to also create a folder to keep track of them. This first notebook, ch6_get_twitter, is specifically for downloading new Twitter data.

First, we import the twitter library and set our authorization tokens. The consumer key and consumer secret will be available on the Keys and Access Tokens tab of your Twitter app's page. To get the access tokens, you'll need to click on the Create my access token button, which is on the same page. Enter the keys into the appropriate places in the following code:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret,
                              consumer_key, consumer_secret)

We are going to get our tweets from Twitter's search function. We will create a reader that connects to twitter using our authorization, and then use that reader to perform searches. In the Notebook, we set the filename where the tweets will be stored:

import os
output_filename = os.path.join(os.path.expanduser("~"), "Data",
                               "twitter", "python_tweets.json")

We also need the json library for saving our tweets:

import json

Next, create an object that can read from Twitter. We create this object with the authorization object that we set up earlier:

t = twitter.Twitter(auth=authorization)

We then open our output file for writing. We open it for appending; this allows us to rerun the script to obtain more tweets. We then use our Twitter connection to perform a search for the word Python. We only want the statuses that are returned for our dataset. This code takes the tweet, uses the json library to create a string representation using the dumps function, and then writes it to the file.
It then writes a blank line under the tweet so that we can easily distinguish where one tweet starts and ends in our file:

with open(output_filename, 'a') as output_file:
    search_results = t.search.tweets(q="python", count=100)['statuses']
    for tweet in search_results:
        if 'text' in tweet:
            output_file.write(json.dumps(tweet))
            output_file.write("\n\n")

In the preceding loop, we also perform a check to see whether there is text in the tweet or not. Not all of the objects returned by twitter will be actual tweets (some will be actions to delete tweets, for example). The key difference is the inclusion of text as a key, which we test for.

Running this will result in up to 100 tweets being added to the output file. You can keep rerunning this script to add more tweets to your dataset, keeping in mind that you may get some duplicates in the output file if you rerun it too fast (that is, before Twitter gets new tweets to return!).

Loading and classifying the dataset

After we have collected a set of tweets (our dataset), we need labels to perform classification. We are going to label the dataset by setting up a form in an IPython Notebook to allow us to enter the labels.

The dataset we have stored is nearly in a JSON format. JSON is a format for data that doesn't impose much structure and is directly readable in JavaScript (hence the name, JavaScript Object Notation). JSON defines basic objects such as numbers, strings, lists, and dictionaries, making it a good format for storing datasets if they contain data that isn't numerical. If your dataset is fully numerical, you would save space and time using a matrix-based format such as NumPy's.

A key difference between our dataset and real JSON is that we included newlines between tweets. The reason for this was to allow us to easily append new tweets (the actual JSON format doesn't allow this easily). Our format is a JSON representation of a tweet, followed by a newline, followed by the next tweet, and so on. To parse it, we can use the json library, but we will have to first split the file by newlines to get the actual tweet objects themselves.

Set up a new IPython Notebook (I called mine ch6_label_twitter) and enter the dataset's filename. This is the same filename in which we saved the data in the previous section. We also define the filename that we will use to save the labels to. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

As stated, we will use the json library, so import that too:

import json

We create a list that will store the tweets we read from the file:

tweets = []

We then iterate over each line in the file. We aren't interested in lines with no information (they separate the tweets for us), so check if the length of the line (minus any whitespace characters) is zero. If it is, ignore it and move to the next line. Otherwise, load the tweet using json.loads (which loads a JSON object from a string) and add it to our list of tweets. The code is as follows:

with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))

We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means refers to the programming language Python).
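Before building the labeling form, it is worth a quick sanity check that the tweets loaded as expected. The following is a minimal hedged sketch; it only relies on the 'text' key, which is part of Twitter's standard tweet format:

# Quick sanity check: how many tweets did we load, and what does one look like?
print("Loaded {} tweets".format(len(tweets)))
# Each tweet is a dictionary; the 'text' key holds the message itself
print(tweets[0]['text'])

If the count is zero or the text looks mangled, revisit the download step before spending time on labeling.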
We will use the IPython Notebook's ability to embed HTML and talk between JavaScript and Python to create a viewer of tweets, allowing us to easily and quickly classify the tweets as relevant or not. The code will present a new tweet to the user (you) and ask for a label: is it relevant or not? It will then store the input and present the next tweet to be labeled.

First, we create a list for storing the labels. These labels will record whether or not the given tweet refers to the programming language Python, and they will allow our classifier to learn how to differentiate between meanings.

We also check if we have any labels already and load them. This helps if you need to close the notebook down midway through labeling. This code will load the labels from where you left off. It is generally a good idea to consider how to save at midpoints for tasks like this. Nothing hurts quite like losing an hour of work because your computer crashed before you saved the labels! The code is as follows:

labels = []
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)

Next, we create a simple function that will return the next tweet that needs to be labeled. We can work out which is the next tweet by finding the first one that hasn't yet been labeled. The code is as follows:

def get_next_tweet():
    return tweets[len(labels)]['text']

The next step in our experiment is to collect information from the user (you!) on which tweets are referring to Python (the programming language) and which are not. As of yet, there is not a good, straightforward way to get interactive feedback with pure Python in IPython Notebooks. For this reason, we will use some JavaScript and HTML to get this input from the user.

Next, we create some JavaScript in the IPython Notebook to run our input. Notebooks allow us to use magic functions to embed HTML and JavaScript (among other things) directly into the Notebook itself. Start a new cell with the following line at the top:

%%javascript

The code in here will be in JavaScript, hence the curly braces that are coming up. Don't worry, we will get back to Python soon. Keep in mind here that the following code must be in the same cell as the %%javascript magic function.

The first function we will define in JavaScript shows how easy it is to talk to your Python code from JavaScript in IPython Notebooks. This function, if called, will add a label to the labels array (which is in Python code). To do this, we load the IPython kernel as a JavaScript object and give it a Python command to execute. The code is as follows:

function set_label(label){
    var kernel = IPython.notebook.kernel;
    kernel.execute("labels.append(" + label + ")");
    load_next_tweet();
}

At the end of that function, we call the load_next_tweet function. This function loads the next tweet to be labeled. It runs on the same principle; we load the IPython kernel and give it a command to execute (calling the get_next_tweet function we defined earlier). However, in this case we want to get the result. This is a little more difficult. We need to define a callback, which is a function that is called when the data is returned. The format for defining callbacks is outside the scope of this book. If you are interested in more advanced JavaScript/Python integration, consult the IPython documentation.
The code is as follows:

function load_next_tweet(){
    var code_input = "get_next_tweet()";
    var kernel = IPython.notebook.kernel;
    var callbacks = { 'iopub' : {'output' : handle_output}};
    kernel.execute(code_input, callbacks, {silent:false});
}

The callback function is called handle_output, which we will define now. This function gets called when the Python function that kernel.execute calls returns a value. As before, the full format of this is outside the scope of this book. However, for our purposes the result is returned as data of the type text/plain, which we extract and show in the #tweet_text div of the form we are going to create in the next cell. The code is as follows:

function handle_output(out){
    var res = out.content.data["text/plain"];
    $("div#tweet_text").html(res);
}

Our form will have a div that shows the next tweet to be labeled, which we will give the ID #tweet_text. We also create a textbox to enable us to capture key presses (otherwise, the Notebook will capture them and JavaScript won't do anything). This allows us to use the keyboard to set labels of 1 or 0, which is faster than using the mouse to click buttons, given that we will need to label at least 100 tweets.

Run the previous cell to embed some JavaScript into the page, although nothing will be shown to you in the results section.

We are going to use a different magic function now, %%html. Unsurprisingly, this magic function allows us to directly embed HTML into our Notebook. In a new cell, start with this line:

%%html

For this cell, we will be coding in HTML and a little JavaScript. First, define a div element to store our current tweet to be labeled. I've also added some instructions for using this form. Then, create the #tweet_text div that will store the text of the next tweet to be labeled. As stated before, we need to create a textbox to be able to capture key presses. The code is as follows:

<div name="tweetbox">
    Instructions: Click in textbox. Enter a 1 if the tweet is relevant, enter 0 otherwise.<br>
    Tweet: <div id="tweet_text" value="text"></div><br>
    <input type=text id="capture"></input><br>
</div>

Don't run the cell just yet!

We create the JavaScript for capturing the key presses. This has to be defined after creating the form, as the #tweet_text div doesn't exist until the above code runs. We use the jQuery library (which IPython is already using, so we don't need to include the JavaScript file) to add a function that is called when key presses are made on the #capture textbox we defined. However, keep in mind that this is a %%html cell and not a JavaScript cell, so we need to enclose this JavaScript in the <script> tags.

We are only interested in key presses if the user presses the 0 or the 1, in which case the relevant label is added. We can determine which key was pressed by the ASCII value stored in e.which. If the user presses 0 or 1, we append the label and clear out the textbox. The code is as follows:

<script>
$("input#capture").keypress(function(e) {
    if(e.which == 48) {
        set_label(0);
        $("input#capture").val("");
    } else if (e.which == 49){
        set_label(1);
        $("input#capture").val("");
    }
});

All other key presses are ignored.

As a last bit of JavaScript for this article (I promise), we call the load_next_tweet() function. This will set the first tweet to be labeled and then close off the JavaScript. The code is as follows:

load_next_tweet();
</script>

After you run this cell, you will get an HTML textbox, alongside the first tweet's text.
Click in the textbox and enter 1 if the tweet is relevant to our goal (in this case, related to the programming language Python) and a 0 if it is not. After you do this, the next tweet will load. Enter the label and the next one will load. This continues until the tweets run out.

When you finish all of this, simply save the labels to the output filename we defined earlier for the class values:

with open(labels_filename, 'w') as outf:
    json.dump(labels, outf)

You can call the preceding code even if you haven't finished. Any labeling you have done to that point will be saved. Running this Notebook again will pick up where you left off and you can keep labeling your tweets.

This might take a while! If you have a lot of tweets in your dataset, you'll need to classify all of them. If you are pushed for time, you can download the same dataset I used, which contains classifications.

Creating a replicable dataset from Twitter

In data mining, there are lots of variables. These aren't just in the data mining algorithms; they also appear in the data collection, environment, and many other factors. Being able to replicate your results is important, as it enables you to verify or improve upon your results.

Getting 80 percent accuracy on one dataset with algorithm X, and 90 percent accuracy on another dataset with algorithm Y, doesn't mean that Y is better. We need to be able to test on the same dataset in the same conditions to be able to properly compare.

On running the preceding code, you will get a different dataset to the one I created and used. The main reason is that Twitter will return different search results for you than for me, based on the time you performed the search. Even after that, your labeling of tweets might be different from mine. While there are obvious examples where a given tweet relates to the Python programming language, there will always be gray areas where the labeling isn't obvious. One tough gray area I ran into was tweets in non-English languages that I couldn't read. In this specific instance, there are options in Twitter's API for setting the language, but even these aren't going to be perfect.

Due to these factors, it is difficult to replicate experiments on databases that are extracted from social media, and Twitter is no exception. Twitter explicitly disallows sharing datasets directly. One solution to this is to share tweet IDs only, which you can share freely. In this section, we will first create a tweet ID dataset that we can freely share. Then, we will see how to download the original tweets from this file to recreate the original dataset.

First, we save the replicable dataset of tweet IDs. Creating another new IPython Notebook, first set up the filenames. This is done in the same way we did for labeling, but there is a new filename where we can store the replicable dataset.
The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json")

We load the tweets and labels as we did in the previous notebook:

import json
tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line))
if os.path.exists(labels_filename):
    with open(labels_filename) as inf:
        labels = json.load(inf)

Now we create a dataset by looping over both the tweets and labels at the same time, saving the (tweet ID, label) pairs in a list:

dataset = [(tweet['id'], label) for tweet, label in zip(tweets, labels)]

Finally, we save the results in our file:

with open(replicable_dataset, 'w') as outf:
    json.dump(dataset, outf)

Now that we have the tweet IDs and labels saved, we can recreate the original dataset. If you are looking to recreate the dataset I used for this article, it can be found in the code bundle that comes with this book.

Loading the preceding dataset is not difficult, but it can take some time. Start a new IPython Notebook and set the dataset, label, and tweet ID filenames as before. I've adjusted the filenames here to ensure that you don't overwrite your previously collected dataset, but feel free to change these if you want. The code is as follows:

import os
tweets_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json")

Then load the tweet IDs from the file using JSON:

import json
with open(replicable_dataset) as inf:
    tweet_ids = json.load(inf)

Recreating the tweets would be very easy if we could simply download every ID in the list with two lines of code (open file and save tweets). However, we can't guarantee that we will actually get all the tweets we are after (for example, some may have been changed to private since collecting the dataset), and in that case the labels would be incorrectly indexed against the data. As an example, I tried to recreate the dataset just one day after collecting it and already two of the tweets were missing (they might have been deleted or made private by the user). For this reason, it is important to only keep the labels for the tweets we actually recover. To do this, we first create an empty actual_labels list to store the labels for tweets that we actually recover from Twitter, and then create a dictionary mapping the tweet IDs to the labels. The code is as follows:

actual_labels = []
label_mapping = dict(tweet_ids)

Next, we are going to connect to Twitter to collect all of these tweets. This is going to take a little longer.
Import the twitter library that we used before, creating an authorization token and using that to create the twitter object:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)
t = twitter.Twitter(auth=authorization)

Extract the IDs into a list using the following command:

all_ids = [tweet_id for tweet_id, label in tweet_ids]

Then, we open our output file to save the tweets:

with open(tweets_filename, 'a') as output_file:

The Twitter API allows us to get only 100 tweets at a time. Therefore, we iterate over each batch of 100 tweets:

    for start_index in range(0, len(tweet_ids), 100):

To search by ID, we first create a string that joins all of the IDs (in this batch) together:

        id_string = ",".join(str(i) for i in all_ids[start_index:start_index+100])

Next, we perform a statuses/lookup API call, which is defined by Twitter. We pass our list of IDs (which we turned into a string) into the API call in order to have those tweets returned to us:

        search_results = t.statuses.lookup(_id=id_string)

Then for each tweet in the search results, we save it to our file in the same way we did when we were collecting the dataset originally:

        for tweet in search_results:
            if 'text' in tweet:
                output_file.write(json.dumps(tweet))
                output_file.write("\n\n")

As a final step here (and still under the preceding if block), we want to store the labeling of this tweet. We can do this using the label_mapping dictionary we created before, looking up the tweet ID. The code is as follows:

                actual_labels.append(label_mapping[tweet['id']])

Run the previous cell and the code will collect all of the tweets for you. If you created a really big dataset, this may take a while; Twitter does rate-limit requests. As a final step here, save the actual_labels to our classes file:

with open(labels_filename, 'w') as outf:
    json.dump(actual_labels, outf)

Text transformers

Now that we have our dataset, how are we going to perform data mining on it? Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithm can deal with?

There are a number of measurements that could be taken. For instance, average word length and average sentence length are used to predict the readability of a document. However, there are lots of feature types, such as word occurrence, which we will now investigate.

Bag-of-words

One of the simplest but highly effective models is to simply count each word in the dataset. We create a matrix, where each row represents a document in our dataset and each column represents a word. The value of the cell is the frequency of that word in the document. Here's an excerpt from The Lord of the Rings, J.R.R. Tolkien:

Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie.

- J.R.R.
Tolkien's epigraph to The Lord of the Rings

The word the appears nine times in this quote, while the words in, for, to, and one each appear four times. The word ring appears three times, as does the word of. We can create a dataset from this, choosing a subset of words and counting the frequency:

Word:       the  one  ring  to
Frequency:    9    4     3   4

We can use the Counter class to do a simple count for a given string. When counting words, it is normal to convert all letters to lowercase, which we do when creating the string. The code is as follows:

s = """Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie. """.lower()
words = s.split()
from collections import Counter
c = Counter(words)

Printing c.most_common(5) gives the list of the top five most frequently occurring words. Ties are not handled well, as only five are given and a very large number of words all share a tie for fifth place.

The bag-of-words model has three major types. The first is to use the raw frequencies, as shown in the preceding example. This does have a drawback when documents vary in size from fewer words to many words, as the overall values will be very different. The second model is to use the normalized frequency, where each document's sum equals 1. This is a much better solution, as the length of the document doesn't matter as much. The third type is to simply use binary features: a value is 1 if the word occurs at all and 0 if it doesn't. We will use the binary representation in this article.

Another popular (arguably more popular) method for performing normalization is called term frequency - inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then weighted by the inverse of the proportion of documents in the corpus containing the term (usually via a logarithm), so that words appearing in nearly every document are downweighted.

There are a number of libraries for working with text data in Python. We will use a major one, called the Natural Language ToolKit (NLTK). The scikit-learn library also has the CountVectorizer class that performs a similar action, and it is recommended you take a look at it. However, the NLTK version has more options for word tokenization. If you are doing natural language processing in Python, NLTK is a great library to use.

N-grams

A step up from single bag-of-words features is that of n-grams. An n-gram is a subsequence of n consecutive tokens. In this context, a word n-gram is a set of n words that appear in a row. They are counted the same way, with each n-gram forming a word that is put in the bag. The value of a cell in this dataset is the frequency that a particular n-gram appears in the given document.

The value of n is a parameter. For English, setting it to between 2 and 5 is a good start, although some applications call for higher values. As an example, for n=3, we extract the first few n-grams in the following quote:

Always look on the bright side of life.

The first n-gram (of size 3) is Always look on, the second is look on the, the third is on the bright. As you can see, the n-grams overlap and each covers three words.

Word n-grams have advantages over using single words. This simple concept introduces some context to word use by considering its local environment, without a large overhead of understanding the language computationally.
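To make the n-gram idea concrete, here is a small hedged sketch of extracting the word trigrams from the quote above. A plain list comprehension that slides a window over the token list is one common way to do it; NLTK also provides an ngrams helper in nltk.util that returns the same windows as tuples:

quote = "Always look on the bright side of life."
words = quote.split()
n = 3
# Slide a window of size n over the token list; each window is one n-gram
trigrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
print(trigrams[:3])
# ['Always look on', 'look on the', 'on the bright']

The same windows could be produced with list(nltk.util.ngrams(words, 3)); note that this toy example splits on whitespace only, so punctuation stays attached to its word.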
A disadvantage of using n-grams is that the matrix becomes even sparser; a given word n-gram rarely appears in more than one document, and for social media and other short documents it is unlikely to appear in many different tweets, unless it is a retweet. However, in larger documents, word n-grams are quite effective for many applications.

Another form of n-gram for text documents is that of a character n-gram. Rather than using sets of words, we simply use sets of characters (although character n-grams have lots of options for how they are computed!). This type of dataset can pick up words that are misspelled, as well as providing other benefits.

Other features

There are other features that can be extracted too. These include syntactic features, such as the usage of particular words in sentences. Part-of-speech tags are also popular for data mining applications that need to understand meaning in text. Such feature types won't be covered in this book. If you are interested in learning more, I recommend Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt Publishing.

Naive Bayes

Naive Bayes is a probabilistic model that is unsurprisingly built upon a naive interpretation of Bayesian statistics. Despite the naive aspect, the method performs very well in a large number of contexts. It can be used for classification of many different feature types and formats, but we will focus on one in this article: binary features in the bag-of-words model.

Bayes' theorem

For most of us, when we were taught statistics, we started from a frequentist approach. In this approach, we assume the data comes from some distribution and we aim to determine what the parameters are for that distribution. However, those parameters are (perhaps incorrectly) assumed to be fixed. We use our model to describe the data, even testing to ensure the data fits our model.

Bayesian statistics instead models how people (non-statisticians) actually reason. We have some data and we use that data to update our model about how likely something is to occur. In Bayesian statistics, we use the data to describe the model, rather than using a model and confirming it with data (as per the frequentist approach).

Bayes' theorem computes the value of P(A|B), that is, knowing that B has occurred, what is the probability of A. In most cases, B is an observed event such as it rained yesterday, and A is a prediction that it will rain today. For data mining, B is usually we observed this sample and A is it belongs to this class. We will see how to use Bayes' theorem for data mining in the next section. The equation for Bayes' theorem is given as follows:

P(A|B) = P(B|A) x P(A) / P(B)

As an example, we want to determine the probability that an e-mail containing the word drugs is spam (as we believe that such an e-mail may be pharmaceutical spam).

A, in this context, is the event that this e-mail is spam. We can compute P(A), called the prior belief, directly from a training dataset by computing the percentage of e-mails in our dataset that are spam. If our dataset contains 30 spam messages for every 100 e-mails, P(A) is 30/100 or 0.3.

B, in this context, is the event that this e-mail contains the word drugs. Likewise, we can compute P(B) by computing the percentage of e-mails in our dataset containing the word drugs. If 10 e-mails in every 100 of our training dataset contain the word drugs, P(B) is 10/100 or 0.1.
Note that we don't care whether the e-mail is spam or not when computing this value.

P(B|A) is the probability that an e-mail contains the word drugs if it is spam. It is also easy to compute from our training dataset. We look through our training set for spam e-mails and compute the percentage of them that contain the word drugs. Of our 30 spam e-mails, if 6 contain the word drugs, then P(B|A) is calculated as 6/30 or 0.2.

From here, we use Bayes' theorem to compute P(A|B), which is the probability that an e-mail containing the word drugs is spam. Using the previous equation, we see the result is 0.2 x 0.3 / 0.1 = 0.6. This indicates that if an e-mail has the word drugs in it, there is a 60 percent chance that it is spam.

Note the empirical nature of the preceding example: we use evidence directly from our training dataset, not from some preconceived distribution. In contrast, a frequentist view would rely on us creating a distribution of the probability of words in e-mails to compute similar equations.

Naive Bayes algorithm

Looking back at our Bayes' theorem equation, we can use it to compute the probability that a given sample belongs to a given class. This allows the equation to be used as a classification algorithm. With C as a given class and D as a sample in our dataset, we create the elements necessary for Bayes' theorem, and subsequently Naive Bayes. Naive Bayes is a classification algorithm that utilizes Bayes' theorem to compute the probability that a new data sample belongs to a particular class.

P(C) is the probability of a class, which is computed from the training dataset itself (as we did with the spam example). We simply compute the percentage of samples in our training dataset that belong to the given class.

P(D) is the probability of a given data sample. It can be difficult to compute this, as the sample is a complex interaction between different features, but luckily it is a constant across all classes. Therefore, we don't need to compute it at all. We will see later how to get around this issue.

P(D|C) is the probability of the data point given the class. This could also be difficult to compute due to the different features. However, this is where we introduce the naive part of the Naive Bayes algorithm. We naively assume that each feature is independent of every other. Rather than computing the full probability of P(D|C), we compute the probability of each feature D1, D2, D3, and so on. Then, we multiply them together:

P(D|C) = P(D1|C) x P(D2|C) x ... x P(Dn|C)

Each of these values is relatively easy to compute with binary features; we simply compute the percentage of samples of that class in our training dataset for which the feature takes the given value. In contrast, if we were to perform a non-naive Bayes version of this part, we would need to compute the correlations between different features for each class. Such computation is infeasible at best, and nearly impossible without vast amounts of data or adequate language analysis models.

From here, the algorithm is straightforward. We compute P(C|D) for each possible class, ignoring the P(D) term. Then we choose the class with the highest probability. As the P(D) term is consistent across each of the classes, ignoring it has no impact on the final prediction.

How it works

As an example, suppose we have the following (binary) feature values from a sample in our dataset: [1, 0, 0, 1]. Our training dataset contains two classes, with 75 percent of samples belonging to the class 0 and 25 percent belonging to the class 1.
The likelihoods of a value of 1 for each feature under each class are as follows:

For class 0: [0.3, 0.4, 0.4, 0.7]
For class 1: [0.7, 0.3, 0.4, 0.9]

These values are to be interpreted as: for feature 1, it is a 1 in 30 percent of cases for class 0. We can now compute the probability that this sample should belong to the class 0. P(C=0) = 0.75, which is the probability that the class is 0. P(D) isn't needed for the Naive Bayes algorithm. Let's take a look at the calculation:

P(D|C=0) = P(D1|C=0) x P(D2|C=0) x P(D3|C=0) x P(D4|C=0)
         = 0.3 x 0.6 x 0.6 x 0.7
         = 0.0756

The second and third values are 0.6, because the values of those features in the sample were 0. The listed probabilities are for values of 1 for each feature. Therefore, the probability of a 0 is its inverse: P(0) = 1 - P(1).

Now, we can compute the probability of the data point belonging to this class. An important point to note is that we haven't computed P(D), so this isn't a real probability. However, it is good enough to compare against the same value for the probability of the class 1. Let's take a look at the calculation:

P(C=0|D) = P(C=0) x P(D|C=0)
         = 0.75 x 0.0756
         = 0.0567

Now, we compute the same values for the class 1:

P(C=1) = 0.25

P(D) isn't needed for Naive Bayes. Let's take a look at the calculation:

P(D|C=1) = P(D1|C=1) x P(D2|C=1) x P(D3|C=1) x P(D4|C=1)
         = 0.7 x 0.7 x 0.6 x 0.9
         = 0.2646

P(C=1|D) = P(C=1) x P(D|C=1)
         = 0.25 x 0.2646
         = 0.06615

Normally, P(C=0|D) + P(C=1|D) should sum to 1. After all, those are the only two possible options! However, the values here do not, because we haven't included the computation of P(D) in our equations.

The data point should be classified as belonging to the class 1. You may have guessed this while going through the equations anyway; however, you may have been a bit surprised that the final decision was so close. After all, the probabilities in computing P(D|C) were much, much higher for the class 1. This is because we introduced a prior belief that most samples generally belong to the class 0. If the classes had been of equal size, the resulting probabilities would be much different. Try it yourself by changing both P(C=0) and P(C=1) to 0.5 for equal class sizes and computing the result again.

Application

We will now create a pipeline that takes a tweet and determines whether it is relevant or not, based only on the content of that tweet. To perform the word extraction, we will be using NLTK, a library that contains a large number of tools for performing analysis on natural language. We will use NLTK in future articles as well.

To get NLTK on your computer, use pip to install the package:

pip3 install nltk

If that doesn't work, see the NLTK installation instructions at www.nltk.org/install.html.

We are going to create a pipeline to extract the word features and classify the tweets using Naive Bayes. Our pipeline has the following steps:

1. Transform the original text documents into a dictionary of counts using NLTK's word_tokenize function.
2. Transform those dictionaries into a vector matrix using the DictVectorizer transformer in scikit-learn. This is necessary to enable the Naive Bayes classifier to read the feature values extracted in the first step.
3. Train the Naive Bayes classifier, as we have seen in previous articles.

We will need to create another Notebook (the last one for this article!) called ch6_classify_twitter for performing the classification.

Extracting word counts

We are going to use NLTK to extract our word counts.
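Before wrapping it in a transformer, it helps to see what NLTK's word_tokenize actually returns. Below is a quick hedged example; depending on your NLTK installation, you may first need to download the tokenizer models with nltk.download('punkt'), and the exact tokens can vary between NLTK versions:

from nltk import word_tokenize

# Tokenize a sample tweet; punctuation and symbols come back as separate tokens
print(word_tokenize("Just released a new #python library!"))
# A possible output: ['Just', 'released', 'a', 'new', '#', 'python', 'library', '!']

This also hints at why symbols such as #, @, and : can show up later as individual features: the tokenizer tends to split them off as tokens of their own.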
We still want to use it in a pipeline, but NLTK doesn't conform to our transformer interface. We will therefore need to create a basic transformer wrapping it, with both fit and transform methods, enabling us to use it in a pipeline.

First, set up the transformer class. We don't need to fit anything in this class, as this transformer simply extracts the words in the document. Therefore, our fit is an empty function, except that it returns self, which is necessary for transformer objects.

Our transform is a little more complicated. We want to extract each word from each document and record True if it was discovered. We are only using binary features here: True if the word is in the document, False otherwise. If we wanted to use the frequency, we would set up counting dictionaries instead. Note that we also import word_tokenize, which the transformer relies on. Let's take a look at the code:

from sklearn.base import TransformerMixin
from nltk import word_tokenize

class NLTKBOW(TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [{word: True for word in word_tokenize(document)}
                for document in X]

The result is a list of dictionaries, where the first dictionary holds the words in the first tweet, and so on. Each dictionary has a word as a key and the value True to indicate that this word was discovered. Any word not in the dictionary will be assumed to have not occurred in the tweet. Explicitly stating that a word's occurrence is False would also work, but it would take up needless space to store.

Converting dictionaries to a matrix

This step converts the dictionaries built in the previous step into a matrix that can be used with a classifier. This step is made quite simple through the DictVectorizer transformer.

The DictVectorizer class simply takes a list of dictionaries and converts them into a matrix. The features in this matrix are the keys in each of the dictionaries, and the values correspond to the occurrence of those features in each sample. Dictionaries are easy to create in code, but many data algorithm implementations prefer matrices. This makes DictVectorizer a very useful class.

In our dataset, each dictionary has words as keys, and a word only appears as a key if it actually occurs in the tweet. Therefore, our matrix will have each word as a feature and a value of True in the cell if the word occurred in the tweet.

To use DictVectorizer, simply import it using the following command:

from sklearn.feature_extraction import DictVectorizer

Training the Naive Bayes classifier

Finally, we need to set up a classifier, and we are using Naive Bayes for this article. As our dataset contains only binary features, we use the BernoulliNB classifier, which is designed for binary features. As a classifier, it is very easy to use. As with DictVectorizer, we simply import it and add it to our pipeline:

from sklearn.naive_bayes import BernoulliNB

Putting it all together

Now comes the moment to put all of these pieces together. In our IPython Notebook, set the filenames and load the dataset and classes as we have done before. Set the filenames for both the tweets themselves (not the IDs!) and the labels that we assigned to them. The code is as follows:

import os
import json

input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

Load the tweets themselves. We are only interested in the content of the tweets, so we extract the text value and store only that.
The code is as follows:

tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line)['text'])

Load the labels for each of the tweets:

with open(labels_filename) as inf:
    labels = json.load(inf)

Now, create a pipeline putting together the components from before. Our pipeline has three parts:

1. The NLTKBOW transformer we created
2. A DictVectorizer transformer
3. A BernoulliNB classifier

The code is as follows:

from sklearn.pipeline import Pipeline
pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())
                     ])

We can nearly run our pipeline now, which we will do with cross_val_score as we have done many times before. Before that though, we will introduce a better evaluation metric than the accuracy metric we used before. As we will see, the use of accuracy is not adequate for datasets where the number of samples in each class is different.

Evaluation using the F1-score

When choosing an evaluation metric, it is always important to consider cases where that evaluation metric is not useful. Accuracy is a good evaluation metric in many cases, as it is easy to understand and simple to compute. However, it can be easily faked. In other words, in many cases you can create algorithms that have a high accuracy but poor utility.

While our dataset of tweets contains about 50 percent programming-related and 50 percent nonprogramming (your results may vary), many datasets aren't as balanced as this. As an example, an e-mail spam filter may expect to see more than 80 percent of incoming e-mails be spam. A spam filter that simply labels everything as spam is quite useless; however, it will obtain an accuracy of 80 percent!

To get around this problem, we can use other evaluation metrics. One of the most commonly employed is called the f1-score (also called f-score, f-measure, or one of many other variations on this term). The f1-score is defined on a per-class basis and is based on two concepts: precision and recall. The precision is the percentage of the samples that were predicted as belonging to a specific class that were actually from that class. The recall is the percentage of samples in the dataset that are in a class and are actually labeled as belonging to that class.

In the case of our application, we could compute the value for both classes (relevant and not relevant). However, we are really interested in the relevant tweets. Therefore, our precision computation becomes the question: of all the tweets that were predicted as being relevant, what percentage were actually relevant? Likewise, the recall becomes the question: of all the relevant tweets in the dataset, how many were predicted as being relevant?

After you compute both the precision and recall, the f1-score is the harmonic mean of the precision and recall:

F1 = 2 x (precision x recall) / (precision + recall)

To use the f1-score in scikit-learn methods, simply set the scoring parameter to f1. By default, this will return the f1-score of the class with label 1. Running the code on our dataset, we use the following lines of code (note that in the scikit-learn version used here, cross_val_score is imported from sklearn.cross_validation; in recent versions it has moved to sklearn.model_selection):

from sklearn.cross_validation import cross_val_score
scores = cross_val_score(pipeline, tweets, labels, scoring='f1')

We then print out the average of the scores:

import numpy as np
print("Score: {:.3f}".format(np.mean(scores)))

The result is 0.798, which means we can accurately determine if a tweet using Python relates to the programming language nearly 80 percent of the time. This is using a dataset with only 200 tweets in it.
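To see how the harmonic mean penalizes an imbalance between precision and recall, here is a small hedged example using scikit-learn's f1_score directly; the label vectors below are invented purely for demonstration:

from sklearn.metrics import f1_score

# Invented ground truth and predictions for ten samples (1 = relevant)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
# Precision = 2/3, recall = 2/4; the harmonic mean sits between them,
# pulled toward the smaller value: 2 * (2/3 * 1/2) / (2/3 + 1/2) = 4/7
print("{:.3f}".format(f1_score(y_true, y_pred)))  # 0.571

A classifier can only score well on this metric by doing reasonably on both precision and recall at once, which is exactly why it resists the labels-everything-as-one-class trick that fools accuracy.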
Go back and collect more data and you will find that the results improve! More data usually means a better accuracy, but it is not guaranteed!

Getting useful features from models

One question you may ask is: what are the best features for determining whether a tweet is relevant or not? We can extract this information from our Naive Bayes model and find out which features are the best individually, according to Naive Bayes.

First, we fit a new model. While cross_val_score gives us a score across different folds of cross-validated testing data, it doesn't easily give us the trained models themselves. To do this, we simply fit our pipeline with the tweets, creating a new model. The code is as follows:

model = pipeline.fit(tweets, labels)

Note that we aren't really evaluating the model here, so we don't need to be as careful with the training/testing split. However, before you put these features into practice, you should evaluate on a separate test split. We skip over that here for the sake of clarity.

A pipeline gives you access to the individual steps through the named_steps attribute and the name of the step (we defined these names ourselves when we created the pipeline object). For instance, we can get the Naive Bayes model and its per-feature log probabilities:

nb = model.named_steps['naive-bayes']
feature_probabilities = nb.feature_log_prob_

From this model, we can extract the probabilities for each word. These are stored as log probabilities, that is, log(P(f|C)) for a given feature f and class C. The reason these are stored as log probabilities is that the actual values are very low. For instance, the first value is -3.486, which corresponds to a probability of about 0.03, or roughly 3 percent. Logarithmic probabilities are used in computations involving small probabilities like this, as they stop underflow errors where very small values are just rounded to zero. Given that all of the probabilities are multiplied together, a single value of 0 would result in the whole answer always being 0! Regardless, the relationship between values is still the same; the higher the value, the more useful that feature is.

We can get the most useful features by sorting the array of logarithm probabilities. We want descending order, so we simply negate the values first. The code is as follows:

top_features = np.argsort(-feature_probabilities[1])[:50]

The preceding code will just give us the indices and not the actual feature values. This isn't very useful, so we will map the features' indices to the actual values. The key is the DictVectorizer step of the pipeline, which created the matrices for us. Luckily, this also records the mapping, allowing us to find the feature names that correspond to different columns. We can extract that transformer from the pipeline:

dv = model.named_steps['vectorizer']

From here, we can print out the names of the top features by looking them up in the feature_names_ attribute of DictVectorizer. Enter the following lines into a new cell and run it to print out a list of the top features:

for i, feature_index in enumerate(top_features):
    print(i, dv.feature_names_[feature_index], np.exp(feature_probabilities[1][feature_index]))

The first few features include :, http, #, and @. These are likely to be noise (although the use of a colon is not very common outside programming), based on the data we collected. Collecting more data is critical to smoothing out these issues.
Looking through the list, though, we get a number of more obvious programming features:

7 for 0.188679245283
11 with 0.141509433962
28 installing 0.0660377358491
29 Top 0.0660377358491
34 Developer 0.0566037735849
35 library 0.0566037735849
36 ] 0.0566037735849
37 [ 0.0566037735849
41 version 0.0471698113208
43 error 0.0471698113208

There are some others too that refer to Python in a work context, and therefore might be referring to the programming language (although freelance snake handlers may also use similar terms, they are less common on Twitter):

22 jobs 0.0660377358491
30 looking 0.0566037735849
31 Job 0.0566037735849
34 Developer 0.0566037735849
38 Freelancer 0.0471698113208
40 projects 0.0471698113208
47 We're 0.0471698113208

That last one usually occurs in the format: We're looking for a candidate for this job.

Looking through these features gives us quite a few benefits. We could train people to recognize these tweets, look for commonalities (which give insight into a topic), or even get rid of features that make no sense. For example, the word RT appears quite high in this list; however, this is a common Twitter abbreviation for retweet (that is, forwarding on someone else's tweet). An expert could decide to remove this word from the list, making the classifier less prone to the noise we introduced by having a small dataset.

Summary

In this article, we looked at text mining: how to extract features from text, how to use those features, and ways of extending those features. In doing this, we looked at putting a tweet in context: was this tweet mentioning python referring to the programming language? We downloaded data from a web-based API, getting tweets from the popular microblogging website Twitter. This gave us a dataset that we labeled using a form we built directly in the IPython Notebook.

We also looked at reproducibility of experiments. While Twitter doesn't allow you to send copies of your data to others, it allows you to send the tweets' IDs. Using this, we created code that saved the IDs and recreated most of the original dataset. Not all tweets were returned; some had been deleted in the time since the ID list was created and the dataset was reproduced.

We used a Naive Bayes classifier to perform our text classification. This is built upon Bayes' theorem, which uses data to update the model, unlike the frequentist method that often starts with the model first. This allows the model to incorporate and update with new data, and to incorporate a prior belief. In addition, the naive part allows us to easily compute the probabilities without dealing with complex correlations between features.

The features we extracted were word occurrences: did this word occur in this tweet? This model is called bag-of-words. While this discards information about where a word was used, it still achieves a high accuracy on many datasets. This entire pipeline of using the bag-of-words model with Naive Bayes is quite robust. You will find that it can achieve quite good scores on most text-based tasks. It is a great baseline to try before attempting more advanced models. As another advantage, the Naive Bayes classifier doesn't have any parameters that need to be set (although there are some if you wish to do some tinkering).

In the next article, we will look at extracting features from another type of data, graphs, in order to make recommendations on who to follow on social media.
Resources for Article: Further resources on this subject: Putting the Fun in Functional Python[article] Python Data Analysis Utilities[article] Leveraging Python in the World of Big Data[article]

Adding a Spark to R

Packt
22 Feb 2016
3 min read
Spark is written in a language called Scala. It has interfaces for use from Java and Python and, as of the recent version 1.4.0, it also supports R. This is called SparkR, which we will describe in the next section. The four classes of libraries available in Spark are SQL and DataFrames, Spark Streaming, MLlib (machine learning), and GraphX (graph algorithms). Currently, SparkR supports only SQL and DataFrames; the others are definitely on the roadmap. Spark can be downloaded from the Apache project page at http://spark.apache.org/downloads.html. Starting from the 1.4.0 version, SparkR is included in Spark and no separate download is required.

SparkR

Similar to RHadoop, SparkR is an R package that allows R users to use Spark APIs through the RDD class. For example, using SparkR, users can run jobs on Spark from RStudio.

SparkR can be invoked from RStudio. To enable this, include the following lines in the .Rprofile file that R uses at startup to initialize the environments:

Sys.setenv(SPARK_HOME = "/.../spark-1.5.0-bin-hadoop2.6")
# provide the correct path to the downloaded Spark folder for SPARK_HOME
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

Once this is done, start RStudio and enter the following commands to start using SparkR:

> library(SparkR)
> sc <- sparkR.init(master = "local")

As mentioned, as of version 1.5, the latest at the time of writing, SparkR supports limited functionalities of R. This mainly includes data slicing and dicing and summary statistics functions. The current version does not support the use of contributed R packages; however, this is planned for a future release. For machine learning, SparkR currently supports the glm() function. We will do an example in the next section.

Linear regression using SparkR

In the following example, we will illustrate how to use SparkR for machine learning:

> library(SparkR)
> sc <- sparkR.init(master = "local")
> sqlContext <- sparkRSQL.init(sc)
> # Importing data
> df <- read.csv("/Users/harikoduvely/Projects/Book/Data/ENB2012_data.csv", header = T)
> # Excluding variables Y2, X6, X8 and keeping the first 768 records,
> # since later records contain mainly null values
> df <- df[1:768, c(1, 2, 3, 4, 5, 7, 9)]
> # Converting to a SparkR DataFrame
> dfsr <- createDataFrame(sqlContext, df)
> model <- glm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X7, data = dfsr, family = "gaussian")
> summary(model)

Summary

In this article, we have seen an example of SparkR and of linear regression using SparkR. For more information on Spark, you can refer to:

https://www.packtpub.com/big-data-and-business-intelligence/spark-python-developers
https://www.packtpub.com/big-data-and-business-intelligence/spark-beginners

Resources for Article:

Further resources on this subject:
Data Analysis Using R[article]
Introducing Bayesian Inference[article]
Bayesian Network Fundamentals[article]

Customizing heat maps (Intermediate)

Packt
22 Feb 2016
11 min read
This article will help you explore more advanced functions to customize the layout of heat maps. The main focus lies on the usage of different color palettes, but we will also cover other useful features, such as cell notes, that will be used in this recipe.

To ensure that our heat maps look good in any situation, we will make use of different color palettes in this recipe, and we will even learn how to create our own. Further, we will add some more extras to our heat maps, including visual aids such as cell note labels, which will make them even more useful and accessible as a tool for visual data analysis. The following image shows a heat map with cell notes and an alternative color palette created from the arabidopsis_genes.csv data set.

Getting ready

Download the 5644OS_03_01.r script and the arabidopsis_genes.csv data set from your account at http://www.packtpub.com and save them to your hard drive. I recommend that you save the script and data file to the same folder on your hard drive. If you execute the script from a different location to the data file, you will have to change the current R working directory accordingly. The script will check automatically if any additional packages need to be installed in R.

How to do it...

Execute the following code in R via the 5644OS_03_01.r script and take a look at the PDF file custom_heatmaps.pdf that will be created in the current working directory:

### loading packages
if (!require("gplots")) {
    install.packages("gplots", dependencies = TRUE)
    library(gplots)
}
if (!require("RColorBrewer")) {
    install.packages("RColorBrewer", dependencies = TRUE)
    library(RColorBrewer)
}

### reading in data
gene_data <- read.csv("arabidopsis_genes.csv")
row_names <- gene_data[,1]
gene_data <- data.matrix(gene_data[,2:ncol(gene_data)])
rownames(gene_data) <- row_names

### setting heatmap.2() default parameters
heat2 <- function(...) heatmap.2(gene_data, tracecol = "black",
    dendrogram = "column", Rowv = NA, trace = "none",
    margins = c(8,10), density.info = "density", ...)

pdf("custom_heatmaps.pdf")

### 1) customizing colors
# 1.1) in-built color palettes
heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors")

# 1.2) RColorBrewer palettes
heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette")

# 1.3) creating own color palettes
my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D",
               b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0",
               r1 = "#FA8E8E", r2 = "#F26666", r3 = "#C70404")
heat2(col = my_colors, main = "1.3) Own Color Palette")

my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000)
heat2(col = my_palette, main = "1.3) ColorRampPalette")

# 1.4) gray scale
heat2(col = gray(level = (0:100)/100), main = "1.4) Gray Scale")

### 2) adding cell notes
fold_change <- 2^gene_data
rounded_fold_changes <- round(fold_change, 2)
heat2(cellnote = rounded_fold_changes, notecex = 0.5,
      notecol = "black", col = my_palette, main = "2) Cell Notes")

### 3) adding column side colors
heat2(ColSideColors = c("red", "gray", "red", rep("green", 13)),
      main = "3) ColSideColors")

dev.off()

How it works...

Primarily, we use read.csv() and heatmap.2() to read data into R and construct our heat maps.
In this recipe, however, we will focus on advanced features to enhance our heat maps, such as customizing colors and other visual elements.

Inspecting the arabidopsis_genes.csv data set: The arabidopsis_genes.csv file contains a compilation of gene expression data from the model plant Arabidopsis thaliana. I obtained the freely available data of 16 different genes as log 2 ratios of target and reference gene from the Arabidopsis eFP Browser (http://bar.utoronto.ca/efp_arabidopsis/). For each gene, expression data of 47 different areas of the plant is available in this data file.

Reading the data and converting it into a numeric matrix: We have to convert the data table into a numeric matrix before we can construct our heat maps:

gene_data <- read.csv("arabidopsis_genes.csv")
row_names <- gene_data[,1]
gene_data <- data.matrix(gene_data[,2:ncol(gene_data)])
rownames(gene_data) <- row_names

Creating a customized heatmap.2() function: To reduce typing effort, we now define our own version of the heatmap.2() function, in which we include the arguments that we plan to keep using throughout this recipe:

heat2 <- function(...) heatmap.2(gene_data, tracecol = "black",
    dendrogram = "column", Rowv = NA, trace = "none",
    margins = c(8,10), density.info = "density", ...)

So, each time we call our newly defined heat2() function, it will behave like the heatmap.2() function, except for the additional arguments that we pass along. We also assign the value "black" to the tracecol parameter to better distinguish the density plot in the color key from the background.

The built-in color palettes: There are four more color palettes available in base R that we could use instead of the heat.colors palette: rainbow, terrain.colors, topo.colors, and cm.colors. So let us make use of the terrain.colors palette now, which will give us a nice color transition from green through yellow to rose:

heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors")

Increasing the parameter n adds additional color shades, which makes the transition smoother. A value of 1000 for the n parameter should be more than sufficient to make the transition between the individual colors indistinguishable to the human eye. The following image shows a side-by-side comparison of the heat.colors and terrain.colors color palettes using a different number of color shades.

Further, it is also possible to reverse the direction of the color transition. For example, if we want to have a heat.colors transition from yellow to red instead of red to yellow in our heat map, we can simply define a reverse function:

rev_heat.colors <- function(x) rev(heat.colors(x))
heat2(col = rev_heat.colors(500))

RColorBrewer palettes: A lot of color palettes are available from the RColorBrewer package. To see how they look, you can type display.brewer.all() into the R command line after loading the RColorBrewer package. However, in contrast to the dynamic-range color palettes that we have seen previously, each RColorBrewer palette has a fixed number of distinct colors.
So to select all nine colors from the YlOrRd palette, a gradient from yellow to red, we use the following command:

heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette")

The following image gives you a good overview of all the different color palettes that are available from the RColorBrewer package:

Creating our own color palettes: Next, we will see how we can create our own color palettes. A whole bunch of different colors are already defined in R; an overview of those colors can be seen by typing colors() into the R command line. The most convenient way to assign new colors to a color palette is using hex colors (hexadecimal colors). Many different online tools are freely available that allow us to obtain the necessary hex codes. A great example is Color Picker (http://www.colorpicker.com), which allows us to choose from a rich color table and provides us with the corresponding hex codes. Once we have gathered all the hexadecimal codes for the colors that we want to use for our color palette, we can assign them to a variable, as we did before with the explicit color names:

my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D",
               b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0",
               r1 = "#FA8E8E", r2 = "#F26666", r3 = "#C70404")
heat2(col = my_colors, main = "1.3) Own Color Palette")

This is a very handy approach for creating a color key with very distinct colors. However, the downside of this method is that we have to provide a lot of different colors if we want to create a smooth color gradient; we used 1000 different colors for the terrain.colors() palette to get a smooth transition in the color key!

Using colorRampPalette for smoother color gradients: A convenient approach to create a smoother color gradient is to use the colorRampPalette() function, so we don't have to insert all the different colors manually. The function takes a vector of different colors as an argument. Here, we provide three colors: blue for the lower end of the color key, yellow for the middle range, and red for the higher end. As we did for the built-in color palettes, such as heat.colors, we assign the value 1000 to the n parameter:

my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000)
heat2(col = my_palette, main = "1.3) ColorRampPalette")

In this case, it is more convenient to use descriptive color names over hex colors, since we are using the colorRampPalette() function to create a gradient and do not need all the different shades of a particular color.

Grayscales: It might happen that the medium or device that we use to display our heat maps does not support colors. Under these circumstances, we can use the gray palette to create a heat map that is optimized for those conditions. The level parameter of the gray() function takes a vector with values between 0 and 1 as an argument, where 0 represents black and 1 represents white, respectively. For a smooth gradient, we use a vector with 101 equally spaced shades of gray ranging from 0 to 1:

heat2(col = gray(level = (0:100)/100), main = "1.4) Gray Scale")

We can make use of the same color palettes for the levelplot() function too. It works in a similar way as it does for the heatmap.2() function that we are using in this recipe. However, inside the levelplot() function call, we must use col.regions instead of the simple col to pass a color palette argument, as shown in the sketch below.
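Here is a minimal sketch of this idea; it assumes that the lattice package is installed and reuses the gene_data matrix and my_palette from above:

# the same custom palette, passed to lattice's levelplot() via col.regions
library(lattice)
levelplot(t(gene_data), col.regions = my_palette,
          xlab = "", ylab = "", scales = list(x = list(rot = 90)))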
Adding cell notes to our heat map: Sometimes, we want to show a data set along with our heat map. A neat way to do this is to use so-called cell notes to display data values inside the individual heat map cells. The underlying data matrix for the cell notes does not necessarily have to be the same numeric matrix we used to construct our heat map, as long as it has the same number of rows and columns. As we recall, the data we read from arabidopsis_genes.csv represents log 2 ratios of sample and reference gene expression levels. Let us calculate the fold changes of the gene expression levels now and display them, rounded to two digits after the decimal point, as cell notes on our heat map:

fold_change <- 2^gene_data
rounded_fold_changes <- round(fold_change, 2)
heat2(cellnote = rounded_fold_changes, notecex = 0.5,
      notecol = "black", col = rev_heat.colors, main = "Cell Notes")

The notecex parameter controls the size of the cell notes. Its default value is 1; every value between 0 and 1 will make the font smaller, whereas values larger than 1 will make the font larger. Here, we decreased the font size of the cell notes by 50 percent to fit them into the cell boundaries. Also, we want to display the cell notes in black to have a nice contrast to the colored background; this is controlled by the notecol parameter.

Row and column side colors: Another approach to emphasize certain regions, that is, rows or columns on the heat map, is to make use of row and column side colors. The ColSideColors argument will place a colored box between the dendrogram and the heat map that can be used to annotate certain columns. We pass our vector of colors to ColSideColors, where its length must be equal to the number of columns of the heat map. Here, we want to color the first and third columns red, the second one gray, and all the remaining 13 columns green:

heat2(ColSideColors = c("red", "gray", "red", rep("green", 13)),
      main = "ColSideColors")

You can see in the following image what the column side colors look like when we include the ColSideColors argument as shown previously:

Attentive readers may have noticed that the order of colors in the column color box slightly differs from the order of colors we passed as a vector to ColSideColors. We see red two times next to each other, followed by a green and a gray box. This is due to the fact that the columns of our heat map have been reordered by the hierarchical clustering algorithm.

Summary

To learn more about similar technologies, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:
Instant R Starter (https://www.packtpub.com/big-data-and-business-intelligence/instant-r-starter-instant)
Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition)
Mastering RStudio – Develop, Communicate, and Collaborate with R (https://www.packtpub.com/application-development/mastering-rstudio-%E2%80%93-develop-communicate-and-collaborate-r)

Resources for Article:
Further resources on this subject:
Data Analysis Using R [article]
Big Data Analysis [article]
Big Data Analysis (R and Hadoop) [article]

What is logistic regression?

Packt
19 Feb 2016
9 min read
In logistic regression, input features are linearly scaled just as with linear regression; however, the result is then fed as an input to the logistic function. This function provides a nonlinear transformation on its input and ensures that the range of the output, which is interpreted as the probability of the input belonging to class 1, lies in the interval [0,1]. (For more resources related to this topic, see here.)

The form of the logistic function is as follows:

f(x) = 1 / (1 + e^(-x))

The plot of this function is an S-shaped curve. When x = 0, the logistic function takes the value 0.5. As x tends to +∞, the exponential in the denominator vanishes and the function approaches the value 1. As x tends to -∞, the exponential, and hence the denominator, tends to move toward infinity and the function approaches the value 0. Thus, our output is guaranteed to be in the interval [0,1], which is necessary for it to be a probability.

Generalized linear models

Logistic regression belongs to a class of models known as generalized linear models (GLMs). Generalized linear models have three unifying characteristics. The first of these is that they all involve a linear combination of the input features, thus explaining part of their name. The second characteristic is that the output is considered to have an underlying probability distribution belonging to the family of exponential distributions. These include the normal, Poisson, and binomial distributions. Finally, the mean of the output distribution is related to the linear combination of input features by way of a function, known as the link function.

Let's see how this all ties in with logistic regression, which is just one of many examples of a GLM. We know that we begin with a linear combination of the input features, so for example, in the case of one input feature, we can build up an x term as follows:

x = β0 + β1X1

Note that in the case of logistic regression, we are modeling a probability that the output belongs to class 1, rather than the output directly as we were in linear regression. As a result, we do not need to model the error term, because our output, which is a probability, incorporates nondeterministic aspects of our model, such as measurement uncertainties, directly. Next, we apply the logistic function to this term in order to produce our model's output:

P(Y=1|X1) = 1 / (1 + e^-(β0 + β1X1))

Here, the left term tells us directly that we are computing the probability that our output belongs to class 1 based on our evidence of seeing the values of the input feature X1. For logistic regression, the underlying probability distribution of the output is the Bernoulli distribution. This is the same as the binomial distribution with a single trial, and is the distribution we would obtain in an experiment with only two possible outcomes having constant probability, such as a coin flip. The mean of the Bernoulli distribution, μy, is the probability of the (arbitrarily chosen) outcome for success, in this case, class 1. Consequently, the left-hand side in the previous equation is also the mean of our underlying output distribution. For this reason, the function that transforms our linear combination of input features is sometimes known as the mean function, and we just saw that this function is the logistic function for logistic regression. Now, to determine the link function for logistic regression, we can perform some simple algebraic manipulations in order to isolate our linear combination of input features:

ln( P(Y=1|X1) / (1 - P(Y=1|X1)) ) = β0 + β1X1
The term on the left-hand side is known as the log-odds or logit function, and is the link function for logistic regression. The denominator of the fraction inside the logarithm is the probability of the output being class 0 given the data. Consequently, this fraction represents the ratio of probability between class 1 and class 0, which is also known as the odds ratio.

A good reference for logistic regression, along with examples of other GLMs such as Poisson regression, is Extending the Linear Model with R, Julian J. Faraway, CRC Press.

Interpreting coefficients in logistic regression

Looking at the right-hand side of the last equation, we can see that we have almost exactly the same form as we had for simple linear regression, barring the error term. The fact that we have the logit function on the left-hand side, however, means we cannot interpret our regression coefficients in the same way that we did with linear regression. In logistic regression, a unit increase in feature Xi results in multiplying the odds ratio by an amount e^βi. When a coefficient βi is positive, we multiply the odds ratio by a number greater than 1, so we know that increasing the feature Xi will effectively increase the probability of the output being labeled as class 1. Similarly, increasing a feature with a negative coefficient shifts the balance toward predicting class 0. Finally, note that when we change the value of an input feature, the effect is a multiplication on the odds ratio and not on the model output itself, which we saw is the probability of predicting class 1. In absolute terms, the change in the output of our model as a result of a change in the input is not constant throughout, but depends on the current value of our input features. This is, again, different from linear regression, where no matter what the values of the input features, the regression coefficients always represent a fixed increase in the output per unit increase of an input feature.

Assumptions of logistic regression

Logistic regression makes fewer assumptions about the input than linear regression. In particular, the nonlinear transformation of the logistic function means that we can model more complex input-output relationships. We still have a linearity assumption, but in this case, it is between the features and the log-odds. We no longer require a normality assumption for the residuals, nor do we need the homoscedasticity assumption. On the other hand, our error terms still need to be independent. Strictly speaking, the features themselves no longer need to be independent, but in practice, our model will still face issues if the features exhibit a high degree of multicollinearity. Finally, we'll note that just like with unregularized linear regression, feature scaling does not affect the logistic regression model. This means that centering and scaling a particular input feature will simply result in an adjusted coefficient in the output model, without any repercussions on the model performance. It turns out that for logistic regression, this is the result of a property known as the invariance property of maximum likelihood, which is the method used to select the coefficients and will be the focus of the next section. It should be noted, however, that centering and scaling features might still be a good idea if they are on very different scales. This is done to assist the optimization procedure during training. In short, we should turn to feature scaling only if we run into model convergence issues.
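To make the coefficient interpretation concrete, here is a minimal sketch in R; the data is simulated (hypothetical), but glm() with family = binomial is the standard way to fit a logistic regression:

# simulate a single feature and a binary outcome from a known logistic model
set.seed(1)
x1 <- rnorm(100)
p  <- 1 / (1 + exp(-(0.5 + 1.2 * x1)))   # true P(Y = 1 | x1)
y  <- rbinom(100, size = 1, prob = p)

model <- glm(y ~ x1, family = binomial)
exp(coef(model))  # a unit increase in x1 multiplies the odds ratio by exp(beta1)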
Maximum likelihood estimation

When we studied linear regression, we found our coefficients by minimizing the sum of squared error terms. For logistic regression, we do this by maximizing the likelihood of the data. The likelihood of an observation is the probability of seeing that observation under a particular model. In our case, the likelihood of seeing an observation X for class 1 is simply given by the probability P(Y=1|X), the form of which was given earlier in this article. As we only have two classes, the likelihood of seeing an observation for class 0 is given by 1 - P(Y=1|X). The overall likelihood of seeing our entire data set of observations is the product of all the individual likelihoods for each data point, as we consider our observations to be independently obtained. As the likelihood of each observation is parameterized by the regression coefficients βi, the likelihood function for our entire data set is also, therefore, parameterized by these coefficients. We can express our likelihood function as an equation, as shown here:

L(β) = Π(i : yi = 1) P(Y=1|Xi) × Π(j : yj = 0) (1 - P(Y=1|Xj))

Now, this equation simply computes the probability that a logistic regression model with a particular set of regression coefficients could have generated our training data. The idea is to choose our regression coefficients so that this likelihood function is maximized. We can see that the form of the likelihood function is a product of two large products from the two big Π symbols. The first product contains the likelihood of all our observations for class 1, and the second product contains the likelihood of all our observations for class 0. We often refer to the log likelihood of the data, which is computed by taking the logarithm of the likelihood function and using the fact that the logarithm of a product of terms is the sum of the logarithm of each term:

log L(β) = Σ(i : yi = 1) ln P(Y=1|Xi) + Σ(j : yj = 0) ln(1 - P(Y=1|Xj))

We can simplify this even further using a classic trick to form just a single sum:

log L(β) = Σi [ yi · ln P(Y=1|Xi) + (1 - yi) · ln(1 - P(Y=1|Xi)) ]

To see why this is true, note that for the observations where the actual value of the output variable y is 1, the right term inside the summation is zero, so we are effectively left with the first sum from the previous equation. Similarly, when the actual value of y is 0, we are left with the second summation from the previous equation. Note that maximizing the likelihood is equivalent to maximizing the log likelihood. Maximum likelihood estimation is a fundamental technique of parameter fitting, and we will encounter it in other models in this book. Despite its popularity, it should be noted that maximum likelihood is not a panacea. Alternative training criteria on which to build a model exist, and there are some well-known scenarios under which this approach does not lead to a good model. Finally, note that the details of the actual optimization procedure that finds the values of the regression coefficients for maximum likelihood are beyond the scope of this book and, in general, we can rely on R to implement this for us.
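As an illustration of the quantity being maximized, here is a small sketch that computes the log likelihood directly; it reuses the simulated x1 and y from the earlier sketch, and the coefficient values passed in are hypothetical:

# log likelihood of a logistic regression for a given coefficient vector
log_likelihood <- function(beta, X, y) {
  p <- 1 / (1 + exp(-(X %*% beta)))        # the logistic mean function
  sum(y * log(p) + (1 - y) * log(1 - p))   # the single-sum form shown above
}

X <- cbind(1, x1)                  # design matrix with an intercept column
log_likelihood(c(0.5, 1.2), X, y)  # glm() searches for the maximizing beta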
Summary

In this article, we demonstrated why logistic regression is a better way to approach classification problems compared to linear regression with a threshold, by showing that the least squares criterion is not the most appropriate criterion to use when trying to separate two classes. It should also be noted that logistic regression is not a great choice for multiclass settings in general.

To learn more about predictive analytics, the following book published by Packt Publishing (https://www.packtpub.com/) is recommended:
Learning Predictive Analytics with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-predictive-analytics-r)

Resources for Article:
Further resources on this subject:
Machine learning in practice [article]
Introduction to Machine Learning with R [article]
Training and Visualizing a neural network with R [article]

Twitter Sentiment Analysis

Packt
19 Feb 2016
30 min read
In this article, we will cover:

Twitter and its importance
Getting hands-on with Twitter's data and using various Twitter APIs
Use of data to solve business problems: comparison of various businesses based on tweets

(For more resources related to this topic, see here.)

Twitter and its importance

Twitter can be considered as an extension of the short message service, or SMS, but on an Internet-based platform. In the words of Jack Dorsey, co-founder and co-creator of Twitter:

"...We came across the word 'twitter', and it was just perfect. The definition was 'a short burst of inconsequential information,' and 'chirps from birds'. And that's exactly what the product was"

Twitter acts as a utility where one can send their SMSs to the whole world. It enables people to instantaneously get heard and get a response. Since the audience of this SMS is so large, responses often come very quickly. So, Twitter facilitates the basic social instincts of humans. By sharing on Twitter, a user can easily express his/her opinion on just about everything, at any time. Friends who are connected, or, in the case of Twitter, followers, immediately get the information about what's going on in someone's life. This in turn serves another human emotion: the innate need to know what is going on in someone's life. Apart from being real time, Twitter's UI is really easy to work with; it is naturally and instinctively understood, that is, the UI is very intuitive in nature.

Each tweet on Twitter is a short message with a maximum of 140 characters. Twitter is an excellent example of a microblogging service. As of July 2014, the Twitter user base had grown above 500 million, with more than 271 million active users. Around 23 percent of adult Internet users are on Twitter, which is also about 19 percent of the entire adult population. If we can properly mine what users are tweeting about, Twitter can act as a great tool for advertisement and marketing. But this is not the only information Twitter provides. Because of its non-symmetric nature in terms of followers and followings, Twitter is better suited to understanding user interests than the impact of users on the social network.

An interest graph can be thought of as a method to learn the links between individuals and their diverse interests. Computing the degree of association or correlation between an individual's interests and potential advertisements is one of the most important applications of interest graphs. Based on these correlations, a user can be targeted so as to attain a maximum response to an advertisement campaign, along with follower recommendations. One interesting fact about Twitter (and Facebook) is that the user does not need to be a real person. A user on Twitter (or on Facebook) can be anything and anyone, for example, an organization, a campaign itself, or a famous but imaginary personality (a fictional character recognizable in the media), apart from a real/actual person. If a real person follows these users on Twitter, a lot can be inferred about their personality, and hence they can be recommended ads or other users to follow based on such information. For example, @fakingnews is an Indian blog that publishes news satires ranging from Indian politics to typical Indian mindsets. People who follow @fakingnews are the ones who, in general, like to read sarcasm. Hence, these people can be thought of as belonging to the same cluster or community.
If we have another sarcastic blog, we can always recommend it to this community and improve the advertisement return on investment. The chances of getting more hits via people belonging to this community will be higher than with a community that doesn't follow @fakingnews, or any such news, in general. Once you have comprehended that Twitter allows you to create, link, and investigate a community of interest for an arbitrary topic, the influence of Twitter and the knowledge one can find from mining it become clearer.

Understanding Twitter's API

Twitter APIs provide a means to access the Twitter data, that is, tweets sent by its millions of users. Let's get to know these APIs a bit better.

Twitter vocabulary

As described earlier, Twitter is a microblogging service with a social aspect attached. It allows its users to express their views/sentiments by means of an Internet SMS, called a tweet in the context of Twitter. These tweets are entities formed of a maximum of 140 characters. The content of these tweets can be anything, ranging from a person's mood to a person's location to a person's curiosity. The platform where these tweets are posted is called the timeline. To use Twitter's APIs, one must understand the basic terminology.

Tweets are the crux of Twitter. Theoretically, a tweet is just 140 characters of text content tweeted by a user, but there is more to it than just that. There is more metadata associated with the same tweet, which is classified by Twitter as entities and places. The entities consist of hashtags, URLs, and other media data that users have included in their tweet. The places are nothing but locations associated with the tweet. It is possible that the place is a real-world location from where the tweet was sent, or it can be a location mentioned in the text of the tweet. Take the following tweet as an example:

Learn how to consume millions of tweets with @twitterapi at #TDC2014 in São Paulo #bigdata tomorrow at 2:10pm http://t.co/pTBlWzTvVd

The preceding tweet was tweeted by @TwitterDev and it's about 132 characters long. The following are the entities mentioned in this tweet:

Handle: @twitterapi
Hashtags: #TDC2014, #bigdata
URL: http://t.co/pTBlWzTvVd

São Paulo is the place mentioned in this tweet. This is one such example of a tweet with a fairly good amount of metadata. Although the actual tweet's length is well within the 140-character limit, it contains more information than one might think. This actually enables us to figure out that this tweet belongs to a specific community, based on cross-referencing the topics present in the hashtags, the URL to the website, the different users mentioned in it, and so on.

The interface (web or mobile) on which the tweets are displayed is called the timeline. The tweets are, in general, arranged in chronological order of posting time. On a specific user's account, only a certain number of tweets are displayed by Twitter. This is generally based on the users the given user is following and is being followed by. This is the interface a user sees when he/she logs in to his/her Twitter account. A Twitter stream is different from a Twitter timeline in the sense that it is not for a specific user. The tweets on a user's timeline come only from a certain number of users and are displayed/updated less frequently, while the Twitter stream is a chronological collection of all the tweets posted by all the users. The number of active users on Twitter is in the order of hundreds of millions.
All the users tweeting during public events of widespread interest, such as presidential debates, can reach speeds of several hundred thousand tweets per minute. The behavior is very similar to a stream; hence, the name of such a collection is the Twitter stream.

You can try the following by creating a Twitter account (it would be more insightful if you have fewer followers already with you). Before creating the account, it is advised that you read all its terms and conditions. You can also start reading its API documentation.

Creating a Twitter API connection

We need to have an app created at https://dev.twitter.com/apps before making any API requests to Twitter. It's a standard method for developers to gain API access, and more importantly, it helps Twitter observe and restrict developers from making high-load API requests. The ROAuth package is the one we are going to use in our experiments. Tokens allow users to authorize third-party apps to access the data from any user account without the need to have their passwords (or other sensitive information). ROAuth basically facilitates the same.

Creating a new app

The first step to getting any kind of token access from Twitter is to create an app on it. The user has to go to https://dev.twitter.com/ and log in with their Twitter credentials. Once you are logged in with your credentials, the steps for creating an app are as follows:

Go to https://apps.twitter.com/app/new.
Put the name of your application in the Name field. This name can be anything you like.
Similarly, enter the description in the Description field.
The Website field needs to be filled with a valid URL, but again, that can be any random URL.
You can leave the Callback URL field blank.

After the creation of this app, we need to find the API Key and API Secret values from the Keys and Access Tokens tab. Consider the example shown in the following figure:

Under the Keys and Access Tokens tab, you will find a button to generate access tokens. Click on it and you will be provided with an Access Token and an Access Token Secret value. Before using the preceding keys, we need to install twitteR to access the data in R using the app we just created, using the following code:

install.packages(c("devtools", "rjson", "bit64", "httr"))
library(devtools)
install_github("geoffjentry/twitteR")
library(twitteR)

Here's sample code that helps us access the tweets posted since any given date and which contain a specific keyword. In this example, we are searching for tweets containing the word Earthquake posted since September 29, 2014. In order to get this information, we provide four special pieces of information to get the authorization token:

key
secret
access token
access token secret

We'll show you how to use the preceding information to get an app authorized by the user and access its resources on Twitter. The setup_twitter_oauth() function in twitteR, which builds on ROAuth, will make our next steps very smooth and clear:

api_key <- "your_api_key"
api_secret <- "your_api_secret"
access_token <- "your_access_token"
access_token_secret <- "your_access_token_secret"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
EarthQuakeTweets = searchTwitter("EarthQuake", since='2014-09-29')

The results of this example should simply display Using direct authentication, with 25 tweets loaded in the EarthQuakeTweets variable, as shown here:

head(EarthQuakeTweets, 2)
[[1]]
[1] "TamamiJapan: RT @HistoricalPics: Japan. Top: One Month After Hiroshima, 1945. Bottom: One Month After The Earthquake and Tsunami, 2011. Incredible. http…"

[[2]]
[1] "OldhamDs: RT @HistoricalPics: Japan. Top: One Month After Hiroshima, 1945. Bottom: One Month After The Earthquake and Tsunami, 2011. Incredible. http…"

Here we have shown the first two of the 25 tweets containing the word Earthquake posted since September 29, 2014. If you closely observe the results, you'll find all the metadata using str(EarthQuakeTweets[1]).
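Each element of this list is a status object, so individual pieces of metadata can also be read through accessor methods (the same style of accessor, getText(), is used again later in this article). A quick sketch:

# inspecting the metadata of a single status object
tweet1 <- EarthQuakeTweets[[1]]
tweet1$getText()        # the tweet text
tweet1$getScreenName()  # the user who posted it
tweet1$getCreated()     # the timestamp of the tweet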
Finding trending topics

Now that we understand how to create API connections to Twitter and fetch data using them, we will see how to find out what is trending on Twitter, that is, to list the topics (worldwide or local) being talked about the most right now. Using the same API, we can easily access the trending information:

# return a data frame with name, country & woeid,
# where woeid is a numerical identification code describing a location ID
Locs <- availableTrendLocations()
# Filter the data frame for Delhi (India) and extract the woeid of the same
LocsIndia = subset(Locs, country == "India")
woeidDelhi = subset(LocsIndia, name == "Delhi")$woeid
# getTrends takes a specified woeid and returns the trending topics associated with that woeid
trends = getTrends(woeid=woeidDelhi)

The function availableTrendLocations() returns an R data frame containing the name, country, and woeid columns. We then filter this data frame for a location of our choosing; in this example, it's Delhi, India. The function getTrends() fetches the top 10 trends in the location determined by the woeid. Here are the top four trending hashtags in the region defined by woeid = 20070458, that is, Delhi, India:

head(trends)
                   name                                                  url                    query    woeid
1 #AntiHinduNGOsExposed http://twitter.com/search?q=%23AntiHinduNGOsExposed %23AntiHinduNGOsExposed 20070458
2           #KhaasAadmi           http://twitter.com/search?q=%23KhaasAadmi           %23KhaasAadmi 20070458
3            #WinGOSF14            http://twitter.com/search?q=%23WinGOSF14            %23WinGOSF14 20070458
4     #ItsForRealONeBay     http://twitter.com/search?q=%23ItsForRealONeBay     %23ItsForRealONeBay 20070458

Searching tweets

Now, similar to the trends, there is one more important function that comes with the twitteR package: searchTwitter(). This function returns tweets containing the searched string, subject to additional constraints. Some of the constraints that can be imposed are as follows:

lang: This constrains the tweets to the given language.
since/until: This constrains the tweets to those posted since the given date or until the given date.
geocode: This constrains the tweets to only those users who are located within a certain distance from the given latitude/longitude.

For example, extracting tweets about the cricketer Sachin Tendulkar in the month of November 2014:

head(searchTwitter('Sachin Tendulkar', since='2014-11-01', until='2014-11-30'))
[[1]]
[1] "TendulkarFC: RT @Moulinparikh: Sachin Tendulkar had a long session with the Mumbai Ranji Trophy team after today's loss."

[[2]]
[1] "tyagi_niharika: @WahidRuba @Anuj_dvn @Neel_D_ @alishatariq3 @VWellwishers @Meenal_Rathore oh... Yaadaaya....hmaraesachuuu sirxedxa0xbdxedxb8x8d..i mean sachin Tendulkar"

[[3]]
[1] "Meenal_Rathore: @WahidRuba @Anuj_dvn @tyagi_niharika @Neel_D_ @alishatariq3 @AliaaFcc @VWellwishers .. Sachin Tendulkar xedxa0xbdxedxb8x8a☺️"

[[4]]
[1] "MishraVidyanand: Vidyanand Mishra is following the Interest "The Living Legend SachinTendu..." on http://t.co/tveHXMB4BM - http://t.co/CocNMcxFge"

[[5]]
[1] "CSKalwaysWin: I have never tried to compare myself to anyone else.n - Sachin Tendulkar"
Twitter sentiment analysis

Depending on the objective, and based on the ability to search for any type of tweets from the public timeline, one can always collect the required corpus. For example, you may want to learn about customer satisfaction levels with the various cab services that are entering the Indian market. These start-ups are offering various discounts and coupons to attract customers, but at the end of the day, the service quality determines the business of any organization. These start-ups are constantly promoting themselves on various social media websites, and customers are expressing various levels of sentiment on the same platforms. Let's target the following:

Meru Cabs: A radio cabs service based in Mumbai, India. Launched in 2007.
Ola Cabs: A taxi aggregator company based in Bangalore, India. Launched in 2011.
TaxiForSure: A taxi aggregator company based in Bangalore, India. Launched in 2011.
Uber India: A taxi aggregator company headquartered in San Francisco, California. Launched in India in 2014.

Let's set our goal to get the general sentiments about each of the preceding service providers based on the customer sentiments present in the tweets on Twitter.

Collecting tweets as a corpus

We'll start with the searchTwitter() function (discussed previously) from the twitteR package to gather the tweets for each of the preceding organizations. In order to avoid writing the same code again and again, we push the following authorization code into a file called authenticate.R:

library(twitteR)
api_key <- "xx"
api_secret <- "xx"
access_token <- "xx"
access_token_secret <- "xx"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

We run the following script to get the required tweets:

# Load the necessary packages
source('authenticate.R')
Meru_tweets = searchTwitter("MeruCabs", n=2000, lang="en")
Ola_tweets = searchTwitter("OlaCabs", n=2000, lang="en")
TaxiForSure_tweets = searchTwitter("TaxiForSure", n=2000, lang="en")
Uber_tweets = searchTwitter("Uber_Delhi", n=2000, lang="en")

Now, as mentioned in Twitter's REST API documentation, we get the message "Due to capacity constraints, the index currently only covers about a week's worth of tweets". So we do not always get the desired number of tweets (here, 2000). Instead, the following are the sizes of each of the above tweet lists:

> length(Meru_tweets)
[1] 393
> length(Ola_tweets)
[1] 984
> length(TaxiForSure_tweets)
[1] 720
> length(Uber_tweets)
[1] 2000

As you can see from the preceding output, the number of tweets returned is not always equal to the number of tweets we asked for in our query scripts. There are many takeaways from this information. Since these tweets are only from the last week on Twitter, they suggest there is more discussion about these taxi services in the following order:

Uber India
Ola Cabs
TaxiForSure
Meru Cabs

A ban was imposed on Uber India after an alleged rape incident involving an Uber India driver. The decision to put a ban on the entire organization because one of its drivers committed a crime became a matter of public outcry. Hence, the number of tweets about Uber increased on social media. Meru Cabs, on the other hand, has been in India for almost 7 years now, and is quite a stable organization. The amount of promotion Ola Cabs and TaxiForSure are doing is also much higher than that of Meru Cabs.
This can be one reason for Meru Cabs having the least number (393) of tweets in the last week. The numbers of tweets in the last week are comparable for Ola Cabs (984) and TaxiForSure (720). There can be several reasons for this. They both started their business in the same year and, more importantly, they follow the same business model: while Meru Cabs is a radio taxi service that owns and manages a fleet of cars, Ola Cabs, TaxiForSure, and Uber are marketplaces for users to compare the offerings of various operators and book easily. Let's dive deep into the data and get more insights.

Cleaning the corpus

Before applying any intelligent algorithms to gather more insights out of the tweets collected so far, let's first clean them. In order to clean up, we should understand what the list of tweets looks like:

head(Meru_tweets)
[[1]]
[1] "MeruCares: @KapilTwitts 2&gt;...and other details at feedback@merucabs.com We'll check back and reach out soon."

[[2]]
[1] "vikasraidhan: @MeruCabs really disappointed with @GenieCabs. Cab is never assigned on time. Driver calls after 30 minutes. Why would I ride with Meru?"

[[3]]
[1] "shiprachowdhary: fallback of #ubershame , #MERUCABS taking customers for a ride"

[[4]]
[1] "shiprachowdhary: They book Genie, but JIT inform of cancellation &amp; send full fare #MERUCABS . Very disappointed.Always used these guys 4 and recommend them."

[[5]]
[1] "shiprachowdhary: No choice bt to take the #merucabs premium service. Driver told me that this happens a lot with #merucabs."

[[6]]
[1] "shiprachowdhary: booked #Merucabsyestrdy. Asked for Meru Genie. 10 mins 4 pick up time, they call to say Genie not available, so sending the full fare cab"

The first tweet here is a grievance response, while the second, fourth, and fifth are actual customer sentiments about the services provided by Meru Cabs. We see:
return(y) } cleanTweets<- function(tweet){ # Clean the tweet for sentiment analysis # remove html links, which are not required for sentiment analysis tweet = gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", " ", tweet) # First we will remove retweet entities from the stored tweets (text) tweet = gsub("(RT|via)((?:\b\W*@\w+)+)", " ", tweet) # Then remove all "#Hashtag" tweet = gsub("#\w+", " ", tweet) # Then remove all "@people" tweet = gsub("@\w+", " ", tweet) # Then remove all the punctuation tweet = gsub("[[:punct:]]", " ", tweet) # Then remove numbers, we need only text for analytics tweet = gsub("[[:digit:]]", " ", tweet) # finally, we remove unnecessary spaces (white spaces, tabs etc) tweet = gsub("[ t]{2,}", " ", tweet) tweet = gsub("^\s+|\s+$", "", tweet) # if anything else, you feel, should be removed, you can. For example "slang words" etc using the above function and methods. # Next we'll convert all the word in lower case. This makes uniform pattern. tweet = catch.error(tweet) tweet } cleanTweetsAndRemoveNAs<- function(Tweets) { TweetsCleaned = sapply(Tweets, cleanTweets) # Remove the "NA" tweets from this tweet list TweetsCleaned = TweetsCleaned[!is.na(TweetsCleaned)] names(TweetsCleaned) = NULL # Remove the repetitive tweets from this tweet list TweetsCleaned = unique(TweetsCleaned) TweetsCleaned } MeruTweetsCleaned = cleanTweetsAndRemoveNAs(MeruTweets) OlaTweetsCleaned = cleanTweetsAndRemoveNAs(OlaTweets) TaxiForSureTweetsCleaned <- cleanTweetsAndRemoveNAs(TaxiForSureTweets) UberTweetsCleaned = cleanTweetsAndRemoveNAs(UberTweets) Here's the size of each of the cleaned tweet lists: > length(MeruTweetsCleaned) [1] 309 > length(OlaTweetsCleaned) [1] 811 > length(TaxiForSureTweetsCleaned) [1] 574 > length(UberTweetsCleaned) [1] 1355 Estimating sentiment (A) There are many sophisticated resources available to estimate sentiments. Many research papers and software packages are available open source,and they implement very complex algorithms for sentiments analysis. After getting the cleaned Twitter data, we are going to use few of such R packages available to assess the sentiments in the tweets. It's worth mentioning here that not all the tweets represent a sentiment. Few tweets can be just information/facts, while others can be customer care responses. Ideally, they should not be used to assess the customer sentiment about a particular organization. As a first step, we'll use a Naïve algorithm, which gives a score based on the number of times a positive or a negative word occurred in the given sentence (and in our case, in a tweet). Please download the positive and negative opinion/sentiment (nearly 68, 000) words from English language. These opinion lexicon will be used as a first example in our sentiment analysis experiment. The good thing about this approach is that we are relying on a highly researched upon and at the same time customizable input parameters. 
Estimating sentiment (A)

There are many sophisticated resources available to estimate sentiments. Many research papers and software packages are available as open source, and they implement very complex algorithms for sentiment analysis. After getting the cleaned Twitter data, we are going to use a few of the available R packages to assess the sentiments in the tweets. It's worth mentioning here that not all tweets represent a sentiment; some tweets can be just information/facts, while others can be customer care responses. Ideally, the latter should not be used to assess the customer sentiment about a particular organization.

As a first step, we'll use a naïve algorithm, which gives a score based on the number of times a positive or a negative word occurs in the given sentence (in our case, in a tweet). Please download the positive and negative opinion/sentiment word lists for the English language (a lexicon of nearly 6,800 words). This opinion lexicon will be used as a first example in our sentiment analysis experiment. The good thing about this approach is that we are relying on highly researched, and at the same time customizable, input parameters. Here are a few examples of existing positive and negative sentiment words:

Positive: Love, best, cool, great, good, and amazing
Negative: Hate, worst, sucks, awful, and nightmare

> opinion.lexicon.pos = scan('opinion-lexicon-English/positive-words.txt', what='character', comment.char=';')
> opinion.lexicon.neg = scan('opinion-lexicon-English/negative-words.txt', what='character', comment.char=';')
> head(opinion.lexicon.neg)
[1] "2-faced" "2-faces" "abnormal" "abolish" "abominable" "abominably"
> head(opinion.lexicon.pos)
[1] "a+" "abound" "abounds" "abundance" "abundant" "accessable"

We'll add a few industry-specific and/or especially emphatic terms based on our requirements:

pos.words = c(opinion.lexicon.pos, 'upgrade')
neg.words = c(opinion.lexicon.neg, 'wait', 'waiting', 'wtf', 'cancellation')

Now, we create a function, getSentimentScore(), which computes the raw sentiment based on the simple matching algorithm:

getSentimentScore = function(sentences, words.positive, words.negative, .progress='none') {
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, words.positive, words.negative) {
    # Let us first remove the digit, punctuation, and control characters:
    sentence = gsub('[[:cntrl:]]', '', gsub('[[:punct:]]', '', gsub('\\d+', '', sentence)))
    # Then let's convert the sentence to lower case:
    sentence = tolower(sentence)
    # Now let's split each sentence by the space delimiter
    words = unlist(str_split(sentence, '\\s+'))
    # Get the boolean match of each word with the positive & negative opinion-lexicon
    pos.matches = !is.na(match(words, words.positive))
    neg.matches = !is.na(match(words, words.negative))
    # Now get the score as total positive sentiment minus the total negatives
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, words.positive, words.negative, .progress=.progress)
  # Return a data frame with the respective sentences and scores
  return(data.frame(text=sentences, score=scores))
}

Now, we apply the preceding function to the corpus of tweets collected and cleaned so far:

MeruResult = getSentimentScore(MeruTweetsCleaned, pos.words, neg.words)
OlaResult = getSentimentScore(OlaTweetsCleaned, pos.words, neg.words)
TaxiForSureResult = getSentimentScore(TaxiForSureTweetsCleaned, pos.words, neg.words)
UberResult = getSentimentScore(UberTweetsCleaned, pos.words, neg.words)

Here are some sample results, with each tweet's score given in brackets:
Sample scores for Meru Cabs:
- "gt and other details at feedback com we ll check back and reach out soon" [0]
- "really disappointed with cab is never assigned on time driver calls after minutes why would i ride with meru" [-1]
- "so after years of bashing today i m pleasantly surprised clean car courteous driver prompt pickup mins efficient route" [4]
- "a min drive cost hrs used to cost less ur unreliable and expensive trying to lose ur customers" [-3]

Sample scores for Ola Cabs:
- "the service is going from bad to worse the drivers deny to come after a confirmed booking" [-3]
- "love the olacabs app give it a swirl sign up with my referral code dxf n and earn rs download the app from" [1]
- "crn kept me waiting for mins amp at last moment driver refused pickup so unreliable amp irresponsible" [-4]
- "this is not the first time has delighted me punctuality and free upgrade awesome that" [4]

Sample scores for TaxiForSure:
- "great service now i have become a regular customer of tfs thank you for the upgrade as well happy taxi ing saving" [5]
- "really disappointed with cab is never assigned on time driver calls after minutes why would i ride with meru" [-1]
- "horrible taxi service had to wait for one hour with a new born in the chilly weather of new delhi waiting for them" [-4]
- "what do i get now if you resolve the issue after i lost a crucial business because of the taxi delay" [-3]

Sample scores for Uber India:
- "that s good uber s fares will prob be competitive til they gain local monopoly then will go sky high as in new york amp delhi saving" [3]
- "from a shabby backend app stack to daily pr fuck ups its increasingly obvious that is run by child minded blow hards" [-3]
- "you say that uber is illegally running were you stupid to not ban earlier and only ban it now after the rape" [-3]
- "perhaps uber biz model does need some looking into it s not just in delhi that this happens but in boston too" [0]

From the preceding observations, it's clear that this basic sentiment analysis method works fine in normal circumstances, but in the case of Uber India, the results deviate too much from a subjective score. It's safe to say that basic word matching gives a good indicator of overall customer sentiment, except when the data itself is not reliable. In our case, the tweets about Uber India are not really about the services that Uber provides, but about the one incident of crime by its driver, so the whole score went haywire.

Let's now compute a point statistic of the scores we have computed so far. Since the numbers of tweets are not equal for each of the four organizations, we compute a mean and standard deviation for each:

Organization    Mean Sentiment Score    Standard Deviation
Meru Cabs       -0.2218543              1.301846
Ola Cabs         0.197724               1.170334
TaxiForSure     -0.09841828             1.154056
Uber India      -0.6132666              1.071094
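These summary statistics can be reproduced directly from the data frames returned by getSentimentScore(), since each of them carries a numeric score column; for example:

# point statistics of the raw sentiment scores per organization
mean(MeruResult$score); sd(MeruResult$score)
mean(OlaResult$score); sd(OlaResult$score)
mean(TaxiForSureResult$score); sd(TaxiForSureResult$score)
mean(UberResult$score); sd(UberResult$score)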
Estimating sentiment (B)

Let's now move one step further. Instead of using simple matching against an opinion lexicon, we'll use a naive Bayes classifier to decide on the emotion present in any tweet. We require the packages Rstem and sentiment to assist in this. It's important to mention here that both these packages are no longer available on CRAN, and hence we have to provide either the repository location or the archive URL when installing them. Here's the R script to install the required packages:

install.packages("Rstem", repos = "http://www.omegahat.org/R", type="source")
require(devtools)
install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")
require(sentiment)
ls("package:sentiment")

Now that we have the sentiment and Rstem packages installed in our R workspace, we can build the Bayes classifier for sentiment analysis:

library(sentiment)
# The classify_emotion function returns an object of class data frame
# with seven columns (anger, disgust, fear, joy, sadness, surprise,
# best_fit) and one row for each document:
MeruTweetsClassEmo = classify_emotion(MeruTweetsCleaned, algorithm="bayes", prior=1.0)
OlaTweetsClassEmo = classify_emotion(OlaTweetsCleaned, algorithm="bayes", prior=1.0)
TaxiForSureTweetsClassEmo = classify_emotion(TaxiForSureTweetsCleaned, algorithm="bayes", prior=1.0)
UberTweetsClassEmo = classify_emotion(UberTweetsCleaned, algorithm="bayes", prior=1.0)

The following figure shows a few results from the Bayesian analysis using the sentiment package for the Meru Cabs tweets. Similarly, we generated results for the other cab services from our problem setup. The sentiment package was built on a trained data set of emotion words (nearly 1,500 words). The function classify_emotion() generates results belonging to one of the following six emotions: anger, disgust, fear, joy, sadness, and surprise. Hence, when the system is not able to assign the overall emotion to any of the six, NA is returned:

Let's substitute these NA values with the word unknown to make the further analysis easier:

# we will fetch the emotion category best_fit for our analysis purposes.
MeruEmotion = MeruTweetsClassEmo[,7]
OlaEmotion = OlaTweetsClassEmo[,7]
TaxiForSureEmotion = TaxiForSureTweetsClassEmo[,7]
UberEmotion = UberTweetsClassEmo[,7]

MeruEmotion[is.na(MeruEmotion)] = "unknown"
OlaEmotion[is.na(OlaEmotion)] = "unknown"
TaxiForSureEmotion[is.na(TaxiForSureEmotion)] = "unknown"
UberEmotion[is.na(UberEmotion)] = "unknown"

The best-fit emotions present in these tweets are as follows:

Further, we'll use another function, classify_polarity(), provided by the sentiment package, to classify the tweets into two classes, pos (positive sentiment) or neg (negative sentiment). The idea is to compute the log likelihood of a tweet assuming it belongs to either of the two classes. Once these likelihoods are calculated, the ratio of the pos-likelihood to the neg-likelihood is calculated, and based on this ratio, the tweets are classified as belonging to a particular class. It's important to note that if this ratio turns out to be 1, the overall sentiment of the tweet is assumed to be "neutral". The code is as follows:

MeruTweetsClassPol = classify_polarity(MeruTweetsCleaned, algorithm="bayes")
OlaTweetsClassPol = classify_polarity(OlaTweetsCleaned, algorithm="bayes")
TaxiForSureTweetsClassPol = classify_polarity(TaxiForSureTweetsCleaned, algorithm="bayes")
UberTweetsClassPol = classify_polarity(UberTweetsCleaned, algorithm="bayes")

We get the following output:

The preceding figure shows a few results obtained using the classify_polarity() function of the sentiment package for the Meru Cabs tweets.
We'll now generate consolidated results from the two functions in a data frame for each cab service for plotting purposes:

# we will fetch the polarity category best_fit for our analysis purposes,
MeruPol = MeruTweetsClassPol[,4]
OlaPol = OlaTweetsClassPol[,4]
TaxiForSurePol = TaxiForSureTweetsClassPol[,4]
UberPol = UberTweetsClassPol[,4]

# Let us now create a data frame with the above results
MeruSentimentDataFrame = data.frame(text=MeruTweetsCleaned, emotion=MeruEmotion, polarity=MeruPol, stringsAsFactors=FALSE)
OlaSentimentDataFrame = data.frame(text=OlaTweetsCleaned, emotion=OlaEmotion, polarity=OlaPol, stringsAsFactors=FALSE)
TaxiForSureSentimentDataFrame = data.frame(text=TaxiForSureTweetsCleaned, emotion=TaxiForSureEmotion, polarity=TaxiForSurePol, stringsAsFactors=FALSE)
UberSentimentDataFrame = data.frame(text=UberTweetsCleaned, emotion=UberEmotion, polarity=UberPol, stringsAsFactors=FALSE)

# rearrange the data inside each frame by sorting it
MeruSentimentDataFrame = within(MeruSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
OlaSentimentDataFrame = within(OlaSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
TaxiForSureSentimentDataFrame = within(TaxiForSureSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
UberSentimentDataFrame = within(UberSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

The preceding data frames consolidate, for each business, the results generated by both functions on the cleaned tweets. Let's now plot them one by one. First, we create a single function, plotSentiments1(), to be used for each business's tweets, and then we plot it for each business:

plotSentiments1 <- function (sentiment_dataframe, title) {
  library(ggplot2)
  ggplot(sentiment_dataframe, aes(x=emotion)) +
    geom_bar(aes(y=..count.., fill=emotion)) +
    scale_fill_brewer(palette="Dark2") +
    ggtitle(title) +
    theme(legend.position='right') +
    ylab('Number of Tweets') +
    xlab('Emotion Categories')
}

plotSentiments1(MeruSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about MeruCabs')
plotSentiments1(OlaSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about OlaCabs')
plotSentiments1(TaxiForSureSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about TaxiForSure')
plotSentiments1(UberSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about UberIndia')

The first dashboard shows the analysis for Meru Cabs:

The following dashboard shows the analysis for Ola Cabs:

The following dashboard shows the analysis for TaxiForSure:

The following dashboard shows the analysis for Uber India:

These sentiments reflect more or less the same observations as we made with the basic word-matching algorithm. The tweets with joy constitute the largest part of the tweets for all these organizations, indicating that these organizations are trying their best to provide good business in the country. The sadness tweets are less numerous than the joy tweets. However, when compared with each other, they indicate the overall market share versus the level of customer satisfaction of each service provider in question. Similarly, these graphs can be used to assess the level of dissatisfaction in terms of anger and disgust in the tweets.
Let's now consider only the positive and negative sentiments present in the tweets:

# Similarly, we will plot the distribution of polarity in the tweets
plotSentiments2 <- function (sentiment_dataframe, title) {
  library(ggplot2)
  ggplot(sentiment_dataframe, aes(x=polarity)) +
    geom_bar(aes(y=..count.., fill=polarity)) +
    scale_fill_brewer(palette="RdGy") +
    ggtitle(title) +
    theme(legend.position='right') +
    ylab('Number of Tweets') +
    xlab('Polarity Categories')
}

plotSentiments2(MeruSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about MeruCabs')
plotSentiments2(OlaSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about OlaCabs')
plotSentiments2(TaxiForSureSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about TaxiForSure')
plotSentiments2(UberSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about UberIndia')

The first dashboard shows the polarity analysis for Meru Cabs:

The following dashboard shows the polarity analysis for Ola Cabs:

The following dashboard shows the analysis for TaxiForSure:

The following dashboard shows the analysis for Uber India:

It's a basic human trait to report what went wrong rather than what went right. That is to say, we tend to tweet/report when something bad has happened rather than when the experience was good. Hence, the negative tweets are expected to outnumber the positive tweets in general. Still, over a period of time (a week in our case), the ratio of the two easily reflects the overall market share versus the level of customer satisfaction of each service provider in question.

Next, we try to get a sense of the overall content of the tweets using word clouds:

# the tm package provides removeWords(), stopwords(), Corpus(), and TermDocumentMatrix()
library(tm)

removeCustomeWords <- function (TweetsCleaned) {
  for (i in 1:length(TweetsCleaned)) {
    TweetsCleaned[i] <- tryCatch({
      TweetsCleaned[i] = removeWords(TweetsCleaned[i],
          c(stopwords("english"), "care", "guys", "can", "dis",
            "didn", "guy", "booked", "plz"))
      TweetsCleaned[i]
    }, error=function(cond) {
      TweetsCleaned[i]
    }, warning=function(cond) {
      TweetsCleaned[i]
    })
  }
  return(TweetsCleaned)
}

getWordCloud <- function (sentiment_dataframe, TweetsCleaned, Emotion) {
  emos = levels(factor(sentiment_dataframe$emotion))
  n_emos = length(emos)
  emo.docs = rep("", n_emos)
  TweetsCleaned = removeCustomeWords(TweetsCleaned)
  for (i in 1:n_emos) {
    emo.docs[i] = paste(TweetsCleaned[Emotion == emos[i]], collapse=" ")
  }
  corpus = Corpus(VectorSource(emo.docs))
  tdm = TermDocumentMatrix(corpus)
  tdm = as.matrix(tdm)
  colnames(tdm) = emos
  require(wordcloud)
  suppressWarnings(comparison.cloud(tdm, colors = brewer.pal(n_emos, "Dark2"),
      scale = c(3,.5), random.order = FALSE, title.size = 1.5))
}

getWordCloud(MeruSentimentDataFrame, MeruTweetsCleaned, MeruEmotion)
getWordCloud(OlaSentimentDataFrame, OlaTweetsCleaned, OlaEmotion)
getWordCloud(TaxiForSureSentimentDataFrame, TaxiForSureTweetsCleaned, TaxiForSureEmotion)
getWordCloud(UberSentimentDataFrame, UberTweetsCleaned, UberEmotion)

The first figure shows the word cloud from tweets about Meru Cabs, the second the word cloud for Ola Cabs, the third the word cloud for TaxiForSure, and the fourth the word cloud for Uber India.

Summary

In this article, we gained knowledge of the various Twitter APIs, discussed how to create a connection with Twitter, and saw how to retrieve tweets with various attributes. We saw the power of Twitter in helping us determine customer attitudes toward today's various businesses.
The activity can be done on a weekly basis, and one can easily track the monthly, quarterly, or yearly changes in customer sentiment. This can not only help customers identify the trending businesses, but the business itself can obtain a well-defined metric of its own performance and use such scores/graphs to improve. We also discussed various methods of sentiment analysis, ranging from basic word matching to advanced Bayesian algorithms.

Resources for Article:
Further resources on this subject:
Find Friends on Facebook [article]
Supervised learning [article]
Warming Up [article]

Building A Recommendation System with Azure

Packt
19 Feb 2016
7 min read
Recommender systems are common these days. You may not have noticed, but you might already be a user or receiver of such a system somewhere. Most of the well-performing e-commerce platforms use recommendation systems to recommend items to their users. When the Amazon website recommends a book to you based on your earlier preferences, purchases, and browsing history, Amazon is using such a recommendation system. Similarly, Netflix uses its recommendation system to suggest movies for you.

(For more resources related to this topic, see here.)

A recommender or recommendation system is used to recommend a product or information, often based on user characteristics, preferences, history, and so on. A recommendation is therefore always personalized. Until recently, it was not easy or straightforward to build a recommender, but Azure ML makes it really easy to build one as long as you have your data ready. This article introduces you to the concept of recommendation systems and also the model available in ML Studio for you to build your own recommender system. It then walks you through the process of building a recommendation system with a simple example.

The Matchbox recommender

Microsoft has developed a large-scale recommender system based on a probabilistic (Bayesian) model called Matchbox. This model can learn about a user's preferences through observations made on how they rate items, such as movies, content, or other products. Based on those observations, it recommends new items to the users when requested. Matchbox uses the available data for each user in the most efficient way possible, and its learning algorithm is designed specifically for big data. Its main feature, however, is that Matchbox takes advantage of the metadata available for both users and items. This means that the things it learns about one user or item can be transferred across to other users or items. You can find more information about the Matchbox model at the Microsoft Research project link.

Kinds of recommendations

The Matchbox recommender supports the building of four kinds of recommenders, which cover most scenarios:

Rating Prediction: This predicts the rating for a given user and item; for example, when a new movie is released, the system predicts what your rating for that movie would be on a scale of 1-5.
Item Recommendation: This recommends items to a given user; for example, Amazon suggests books to you, or YouTube suggests videos to watch on its home page (especially when you are logged in).
Related Users: This finds users that are related to a given user; for example, LinkedIn suggests people you can connect to, or Facebook suggests friends to you.
Related Items: This finds items related to a given item; for example, a blog site suggests related posts while you are reading a blog post.

Understanding the recommender modules

The Matchbox recommender comes with three components; as you might have guessed, a module each to train, score, and evaluate the data. The modules are described as follows.

The train Matchbox recommender

This module contains the learning algorithm and generates the trained model, as shown in the following screenshot:

This module takes values for the following two parameters.

The number of traits

This value decides how many implicit features (traits) the algorithm will learn about every user and item. The higher this value, the more precisely these preferences are captured, which leads to better predictions.
Typically, it takes a value in the range of 2 to 20.

The number of recommendation algorithm iterations

This is the number of times the algorithm iterates over the data. The higher this value, the better the predictions. Typically, it takes a value in the range of 1 to 10.

The score Matchbox recommender

This module lets you specify the kind of recommendation you want and the corresponding parameters:

Rating Prediction
Item Recommendation
Related Users
Related Items

Let's take a look at the following screenshot:

The ML Studio help page for the module provides details of all the corresponding parameters.

The evaluate recommender

This module takes a test dataset and a scored dataset and generates evaluation metrics, as shown in the following screenshot:

Like the score module, it also lets you specify the kind of recommendation and the corresponding parameters.

Building a recommendation system

Now, it is worthwhile to learn to build one yourself. We will build a simple recommender system to recommend restaurants to a given user. ML Studio includes three sample datasets, described as follows:

Restaurant customer data: This is a set of metadata about customers, including demographics and preferences, for example, latitude, longitude, interest, and personality.
Restaurant feature data: This is a set of metadata about restaurants and their features, such as food type, dining style, and location, for example, placeID, latitude, longitude, and price.
Restaurant ratings: This contains the ratings given by users to restaurants on a scale of 0 to 2. It contains the columns: userID, placeID, and rating.

Now, we will build a recommender that recommends a given number of restaurants to a user (userID). To build the recommender, perform the following steps:

1. Create a new experiment.
2. In the Search box in the modules palette, type Restaurant. The preceding three datasets get listed. Drag them all to the canvas one after another.
3. Drag a Split module and connect it to the output port of the Restaurant ratings module.
4. In the properties section to the right, choose Splitting mode as Recommender Split. Leave the other parameters at their default values.
5. Drag a Project Columns module to the canvas, connect it to the Restaurant customer data module, and select the columns: userID, latitude, longitude, interest, and personality.
6. Similarly, drag another Project Columns module, connect it to the Restaurant feature data module, and select the columns: placeID, latitude, longitude, price, the_geom_meter, address, and zip.
7. Drag a Train Matchbox Recommender module to the canvas and make connections to the three input ports, as shown in the following screenshot:
8. Drag a Score Matchbox Recommender module to the canvas, make connections to the three input ports, and set the property values, as shown in the following screenshot:
9. Run the experiment and, when it completes, right-click on the output of the Score Matchbox Recommender module and click on Visualize to explore the scored data. Note the different restaurants (IDs) recommended as items for a user from the test dataset.
10. The next step is to evaluate the scored prediction. Drag the Evaluate Recommender module to the canvas, connect the second output of the Split module to its first input port, and connect the output of the Score Matchbox Recommender module to its second input. Leave the module at its default properties.
11. Run the experiment again and, when it finishes, right-click on the output port of the Evaluate Recommender module and click on Visualize to find the evaluation metric.
The evaluation metric, Normalized Discounted Cumulative Gain (NDCG), is estimated from the ground-truth ratings given in the test set. Its value ranges from 0.0 to 1.0, where 1.0 represents the most ideal ranking of the entities.
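Although ML Studio computes NDCG for you, it can help to see how the metric is defined. The following R sketch is an addition to the article and shows one common formulation of NDCG; the sample ratings are made up for illustration:

# Discounted cumulative gain for ratings listed in ranked order
dcg <- function(relevance) {
  sum((2^relevance - 1) / log2(seq_along(relevance) + 1))
}

# NDCG normalizes DCG by the best achievable (ideal) ranking
ndcg <- function(relevance) {
  ideal <- sort(relevance, decreasing = TRUE)
  dcg(relevance) / dcg(ideal)
}

# Ground-truth ratings (0-2 scale) in the order the recommender ranked them
ndcg(c(2, 0, 1, 2, 0))  # 1.0 would indicate a perfect ranking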
Summary

You started by gaining basic knowledge about recommender systems. You then got to know the Matchbox recommender that comes with ML Studio, along with its components, and explored the different kinds of recommendations that you can make with it. Finally, you ended up building a simple recommendation system to recommend restaurants to a given user.

For more information on Azure, take a look at the following books, also by Packt Publishing:
Learning Microsoft Azure (https://www.packtpub.com/networking-and-servers/learning-microsoft-azure)
Microsoft Windows Azure Development Cookbook (https://www.packtpub.com/application-development/microsoft-windows-azure-development-cookbook)

Resources for Article:

Further resources on this subject:
Introduction to Microsoft Azure Cloud Services [article]
Microsoft Azure – Developing Web API for Mobile Apps [article]
Security in Microsoft Azure [article]

Introduction to Machine Learning with R

Packt
18 Feb 2016
7 min read
If science fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. In the early stages, computers are taught to play simple games of tic-tac-toe and chess. Later, machines are given control of traffic lights and communications, followed by military drones and missiles. The machines' evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then deleted.

(For more resources related to this topic, see here.)

Thankfully, at the time of writing this, machines still require user input. Though your impressions of machine learning may be colored by these mass-media depictions, today's algorithms are too application-specific to pose any danger of becoming self-aware. The goal of today's machine learning is not to create an artificial brain, but rather to assist us in making sense of the world's massive data stores.

Putting popular misconceptions aside, in this article we will learn the following topics:

Installing R packages
Loading and unloading R packages

Machine learning with R

Many of the algorithms needed for machine learning with R are not included as part of the base installation. Instead, the algorithms needed for machine learning are available via a large community of experts who have shared their work freely. These must be installed on top of base R manually. Thanks to R's status as free open source software, there is no additional charge for this functionality.

A collection of R functions that can be shared among users is called a package. Free packages exist for each of the machine learning algorithms covered in this book. In fact, this book only covers a small portion of all of R's machine learning packages.

If you are interested in the breadth of R packages, you can view a list at the Comprehensive R Archive Network (CRAN), a collection of web and FTP sites located around the world that provide the most up-to-date versions of R software and packages. If you obtained the R software via download, it was most likely from CRAN at http://cran.r-project.org/index.html. If you do not already have R, the CRAN website also provides installation instructions and information on where to find help if you have trouble.

The Packages link on the left side of the page will take you to a page where you can browse packages in alphabetical order or sorted by publication date. At the time of writing, a total of 6,779 packages were available, a jump of over 60% since the first edition was written, and this trend shows no sign of slowing! The Task Views link on the left side of the CRAN page provides a curated list of packages per subject area. The task view for machine learning, which lists the packages covered in this book (and many more), is available at http://cran.r-project.org/web/views/MachineLearning.html.

Installing R packages

Despite the vast set of available R add-ons, the package format makes installation and use a virtually effortless process. To demonstrate the use of packages, we will install and load the RWeka package, which was developed by Kurt Hornik, Christian Buchta, and Achim Zeileis (see Open-Source Machine Learning: R Meets Weka in Computational Statistics 24: 225-232 for more information). The RWeka package provides a collection of functions that give R access to the machine learning algorithms in the Java-based Weka software package by Ian H. Witten and Eibe Frank.
More information on Weka is available at http://www.cs.waikato.ac.nz/~ml/weka/.

To use the RWeka package, you will need to have Java installed (many computers come with Java preinstalled). Java is a set of programming tools, available for free, that allow the use of cross-platform applications such as Weka. For more information, and to download Java for your system, you can visit http://java.com.

The most direct way to install a package is via the install.packages() function. To install the RWeka package, at the R command prompt, simply type:

> install.packages("RWeka")

R will then connect to CRAN and download the package in the correct format for your OS. Some packages, such as RWeka, require additional packages to be installed before they can be used (these are called dependencies). By default, the installer will automatically download and install any dependencies.

The first time you install a package, R may ask you to choose a CRAN mirror. If this happens, choose a mirror residing at a location close to you. This will generally provide the fastest download speed.

The default installation options are appropriate for most systems. However, in some cases, you may want to install a package to another location. For example, if you do not have root or administrator privileges on your system, you may need to specify an alternative installation path. This can be accomplished using the lib option, as follows:

> install.packages("RWeka", lib="/path/to/library")

The installation function also provides additional options for installation from a local file, installation from source, or using experimental versions. You can read about these options in the help file, by using the following command:

> ?install.packages

More generally, the question mark operator can be used to obtain help on any R function. Simply type ? before the name of the function.

Loading and unloading R packages

In order to conserve memory, R does not load every installed package by default. Instead, packages are loaded by users as they are needed, using the library() function. The name of this function leads some people to incorrectly use the terms library and package interchangeably. However, to be precise, a library refers to the location where packages are installed and never to a package itself. To load the RWeka package we installed previously, you can type the following:

> library(RWeka)

Aside from RWeka, you will load several other R packages in the course of your work. To unload an R package, use the detach() function. For example, to unload the RWeka package loaded previously, use the following command:

> detach("package:RWeka", unload = TRUE)

This will free up any resources used by the package.
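As a convenience, you can combine checking, installing, and loading into one small idiom. This snippet is an addition to the text, but it uses only base R functions:

# Install the package only if it is not already available, then load it
if (!requireNamespace("RWeka", quietly = TRUE)) {
  install.packages("RWeka")
}
library(RWeka)

This pattern is handy in scripts shared with others, as it avoids reinstalling packages that are already present.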
Summary

Machine learning originated at the intersection of statistics, database science, and computer science. It is a powerful tool, capable of finding actionable insight in large quantities of data. Still, caution must be used in order to avoid common abuses of machine learning in the real world. Conceptually, learning involves the abstraction of data into a structured representation and the generalization of this structure into action that can be evaluated for utility. In practical terms, a machine learner uses data containing examples and features of the concept to be learned and summarizes this data in the form of a model, which is then used for predictive or descriptive purposes. These purposes can be grouped into tasks, including classification, numeric prediction, pattern detection, and clustering. Among the many options, machine learning algorithms are chosen on the basis of the input data and the learning task. R provides support for machine learning in the form of community-authored packages. These powerful tools are free to download; however, they need to be installed before they can be used.

To learn more about R, you can refer to the following books published by Packt Publishing (https://www.packtpub.com/):
Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition)
R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science)
R Data Science Essentials (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science-essentials)
R Graphs Cookbook Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/r-graph-cookbook-%E2%80%93-second-edition)

Resources for Article:

Further resources on this subject:
Machine Learning [article]
Introducing Test-driven Machine Learning [article]
Machine Learning with R [article]


Machine learning in practice

Packt
18 Feb 2016
13 min read
In this article, we will learn how to implement machine learning in practice. To apply the learning process to real-world tasks, we'll use a five-step process. Regardless of the task at hand, any machine learning algorithm can be deployed by following this plan:

Data collection: The data collection step involves gathering the learning material an algorithm will use to generate actionable knowledge. In most cases, the data will need to be combined into a single source, such as a text file, spreadsheet, or database.
Data exploration and preparation: The quality of any machine learning project is based largely on the quality of its input data. Thus, it is important to learn more about the data and its nuances during a practice called data exploration. Additional work is required to prepare the data for the learning process. This involves fixing or cleaning so-called "messy" data, eliminating unnecessary data, and recoding the data to conform to the learner's expected inputs.
Model training: By the time the data has been prepared for analysis, you are likely to have a sense of what you are capable of learning from the data. The specific machine learning task chosen will inform the selection of an appropriate algorithm, and the algorithm will represent the data in the form of a model.
Model evaluation: Because each machine learning model results in a biased solution to the learning problem, it is important to evaluate how well the algorithm learned from its experience. Depending on the type of model used, you might be able to evaluate the accuracy of the model using a test dataset, or you may need to develop measures of performance specific to the intended application.
Model improvement: If better performance is needed, it becomes necessary to utilize more advanced strategies to augment the performance of the model. Sometimes, it may be necessary to switch to a different type of model altogether. You may need to supplement your data with additional data or perform additional preparatory work, as in step two of this process.

(For more resources related to this topic, see here.)

After these steps are completed, if the model appears to be performing well, it can be deployed for its intended task. As the case may be, you might utilize your model to score data for predictions (possibly in real time), to project financial data, to generate useful insight for marketing or research, or to automate tasks such as mail delivery or flying aircraft. The successes and failures of the deployed model might even provide additional data to train your next-generation learner.

Types of input data

The practice of machine learning involves matching the characteristics of input data to the biases of the available approaches. Thus, before applying machine learning to real-world problems, it is important to understand the terminology that distinguishes among input datasets. The phrase unit of observation is used to describe the smallest entity with measured properties of interest for a study. Commonly, the unit of observation is in the form of persons, objects or things, transactions, time points, geographic regions, or measurements. Sometimes, units of observation are combined to form units such as person-years, which denote cases where the same person is tracked over multiple years; each person-year comprises a person's data for one year. The unit of observation is related but not identical to the unit of analysis, which is the smallest unit from which the inference is made.
Although it is often the case, the observed and analyzed units are not always the same. For example, data observed from people might be used to analyze trends across countries. Datasets that store the units of observation and their properties can be imagined as collections of data consisting of:

Examples: Instances of the unit of observation for which properties have been recorded
Features: Recorded properties or attributes of examples that may be useful for learning

It is easiest to understand features and examples through real-world cases. To build a learning algorithm to identify spam e-mail, the unit of observation could be e-mail messages, the examples would be specific messages, and the features might consist of the words used in the messages. For a cancer detection algorithm, the unit of observation could be patients, the examples might include a random sample of cancer patients, and the features may be the genomic markers from biopsied cells as well as patient characteristics such as weight, height, or blood pressure.

While examples and features do not have to be collected in any specific form, they are commonly gathered in the matrix format, which means that each example has exactly the same features. The following spreadsheet shows a dataset in the matrix format. In matrix data, each row in the spreadsheet is an example and each column is a feature. Here, the rows indicate examples of automobiles, while the columns record each automobile's various features, such as price, mileage, color, and transmission type. Matrix format data is by far the most common form used in machine learning, though other forms are used occasionally in specialized cases.

Features also come in various forms. If a feature represents a characteristic measured in numbers, it is unsurprisingly called numeric. Alternatively, if a feature is an attribute that consists of a set of categories, the feature is called categorical or nominal. A special case of categorical variables is called ordinal, which designates a nominal variable with categories falling in an ordered list. Some examples of ordinal variables include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from "not at all happy" to "very happy." It is important to consider what the features represent, as the type and number of features in your dataset will assist in determining an appropriate machine learning algorithm for your task.
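To make these distinctions concrete, the following small snippet is an addition to the text (the values are made up) and builds a matrix-format dataset in R containing numeric, nominal, and ordinal features:

# Each row is an example (an automobile); each column is a feature
autos <- data.frame(
  price        = c(14500, 21000, 9800),                 # numeric
  mileage      = c(41000, 12000, 88000),                # numeric
  color        = factor(c("red", "black", "silver")),   # nominal
  transmission = factor(c("manual", "auto", "manual")), # nominal
  size         = ordered(c("small", "large", "medium"),
                         levels = c("small", "medium", "large"))  # ordinal
)
str(autos)  # displays the type of each feature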
Types of machine learning algorithms

Machine learning algorithms are divided into categories according to their purpose. Understanding the categories of learning algorithms is an essential first step towards using data to drive the desired action.

A predictive model is used for tasks that involve, as the name implies, the prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship between the target feature (the feature being predicted) and the other features. Despite the common use of the word "prediction" to imply forecasting, predictive models need not necessarily foresee events in the future. For instance, a predictive model could be used to predict past events, such as the date of a baby's conception using the mother's present-day hormone levels. Predictive models can also be used in real time to control traffic lights during rush hours.

Because predictive models are given clear instructions on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as supervised learning. The supervision does not refer to human involvement, but rather to the fact that the target values provide a way for the learner to know how well it has learned the desired task. Stated more formally, given a set of data, a supervised learning algorithm attempts to optimize a function (the model) to find the combination of feature values that result in the target output.

The often-used supervised machine learning task of predicting which category an example belongs to is known as classification. It is easy to think of potential uses for a classifier. For instance, you could predict whether:

An e-mail message is spam
A person has cancer
A football team will win or lose
An applicant will default on a loan

In classification, the target feature to be predicted is a categorical feature known as the class, and is divided into categories called levels. A class can have two or more levels, and the levels may or may not be ordinal. Because classification is so widely used in machine learning, there are many types of classification algorithms, with strengths and weaknesses suited to different types of input data.

Supervised learners can also be used to predict numeric data such as income, laboratory values, test scores, or counts of items. To predict such numeric values, a common form of numeric prediction fits linear regression models to the input data. Although regression models are not the only type of numeric model, they are, by far, the most widely used. Regression methods are widely used for forecasting, as they quantify in exact terms the association between the inputs and the target, including both the magnitude and the uncertainty of the relationship. Since it is easy to convert numbers into categories (for example, ages 13 to 19 are teenagers) and categories into numbers (for example, assign 1 to all males, 0 to all females), the boundary between classification models and numeric prediction models is not necessarily firm.

A descriptive model is used for tasks that would benefit from the insight gained from summarizing data in new and interesting ways. As opposed to predictive models that predict a target of interest, in a descriptive model no single feature is more important than any other. In fact, because there is no target to learn, the process of training a descriptive model is called unsupervised learning. Although it can be more difficult to think of applications for descriptive models (after all, what good is a learner that isn't learning anything in particular?), they are used quite regularly for data mining.

For example, the descriptive modeling task called pattern discovery is used to identify useful associations within data. Pattern discovery is often used for market basket analysis on retailers' transactional purchase data. Here, the goal is to identify items that are frequently purchased together, such that the learned information can be used to refine marketing tactics. For instance, if a retailer learns that swimming trunks are commonly purchased at the same time as sunscreen, the retailer might reposition the items more closely in the store or run a promotion to "up-sell" customers on associated items. Originally used only in retail contexts, pattern discovery is now starting to be used in quite innovative ways. For instance, it can be used to detect patterns of fraudulent behavior, screen for genetic defects, or identify hot spots for criminal activity.

The descriptive modeling task of dividing a dataset into homogeneous groups is called clustering. This is sometimes used for segmentation analysis, which identifies groups of individuals with similar behavior or demographic information so that advertising campaigns can be tailored to particular audiences. Although the machine is capable of identifying the clusters, human intervention is required to interpret them. For example, given five different clusters of shoppers at a grocery store, the marketing team will need to understand the differences among the groups in order to create a promotion that best suits each group, as in the sketch that follows. This is almost certainly easier than trying to create a unique appeal for each customer.
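As an illustration of clustering, the following minimal sketch is an addition to the text (the shopper data is made up) and groups shoppers with base R's kmeans() function:

# Cluster shoppers by visit frequency and basket value
set.seed(42)  # for reproducible cluster assignments
shoppers <- data.frame(
  visits_per_month = c(2, 3, 2, 12, 14, 11, 6, 7),
  avg_basket_value = c(80, 95, 70, 20, 25, 22, 50, 45)
)
clusters <- kmeans(scale(shoppers), centers = 3)
clusters$cluster  # group assignment for each shopper

The marketing team would then profile each of the resulting groups to decide which promotion suits it best.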
Lastly, a class of machine learning algorithms known as meta-learners is not tied to a specific learning task, but is rather focused on learning how to learn more effectively. A meta-learning algorithm uses the result of some learnings to inform additional learning. This can be beneficial for very challenging problems or when a predictive algorithm's performance needs to be as accurate as possible.

Matching input data to algorithms

The following table lists the general types of machine learning algorithms, each of which may be implemented in several ways. These form the basis on which nearly all the other, more advanced methods are built. Although this covers only a fraction of the entire set of machine learning algorithms, learning these methods will provide a sufficient foundation to make sense of any other method you may encounter in the future.

Model                            Learning task
Supervised Learning Algorithms
Nearest Neighbor                 Classification
Naive Bayes                      Classification
Decision Trees                   Classification
Classification Rule Learners     Classification
Linear Regression                Numeric prediction
Regression Trees                 Numeric prediction
Model Trees                      Numeric prediction
Neural Networks                  Dual use
Support Vector Machines          Dual use
Unsupervised Learning Algorithms
Association Rules                Pattern detection
k-means clustering               Clustering
Meta-Learning Algorithms
Bagging                          Dual use
Boosting                         Dual use
Random Forests                   Dual use

To begin applying machine learning to a real-world project, you will need to determine which of the four learning tasks your project represents: classification, numeric prediction, pattern detection, or clustering. The task will drive the choice of algorithm. For instance, if you are undertaking pattern detection, you are likely to employ association rules. Similarly, a clustering problem will likely utilize the k-means algorithm, and numeric prediction will utilize regression analysis or regression trees.

For classification, more thought is needed to match the learning problem to an appropriate classifier. In these cases, it is helpful to consider the various distinctions among algorithms, distinctions that will only be apparent by studying each of the classifiers in depth. For instance, within classification problems, decision trees result in models that are readily understood, while the models of neural networks are notoriously difficult to interpret. If you were designing a credit-scoring model, this could be an important distinction, because the law often requires that the applicant must be notified about the reasons he or she was rejected for the loan.
Even if the neural network is better at predicting loan defaults, if its predictions cannot be explained, then it is useless for this application. Although these characteristics will sometimes exclude certain models from consideration, in many cases the choice of algorithm is arbitrary. When this is true, feel free to use whichever algorithm you are most comfortable with. At other times, when predictive accuracy is primary, you may need to test several algorithms and choose the one that fits best, or use a meta-learning algorithm that combines several different learners to utilize the strengths of each.

Summary

Machine learning originated at the intersection of statistics, database science, and computer science. It is a powerful tool, capable of finding actionable insight in large quantities of data. Still, caution must be used in order to avoid common abuses of machine learning in the real world. Conceptually, learning involves the abstraction of data into a structured representation and the generalization of this structure into action that can be evaluated for utility. In practical terms, a machine learner uses data containing examples and features of the concept to be learned and summarizes this data in the form of a model, which is then used for predictive or descriptive purposes. These purposes can be grouped into tasks, including classification, numeric prediction, pattern detection, and clustering. Among the many options, machine learning algorithms are chosen on the basis of the input data and the learning task. R provides support for machine learning in the form of community-authored packages. These powerful tools are free to download, but need to be installed before they can be used.

Resources for Article:

Further resources on this subject:
Introduction to Machine Learning with R [article]
Machine Learning with R [article]
Spark – Architecture and First Program [article]