Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7019 Articles
article-image-getting-started-making-video-games
John Horton
25 Feb 2016
7 min read
Save for later

Getting Started Making Video Games

John Horton
25 Feb 2016
7 min read
If you’re looking to make the jump into the world of making games then be sure to download our new free eBook, “Getting Started Making Video Games”! Here's an extract to get you started... You’ve made the decision to finally enter the world of game development and turn that dream game idea you’ve had at the back of your mind into reality. Where do you begin though, and what options are available to you? John Horton gives us everything we need to get the right mindset when making these essential first few steps. How do I get started making video games? Everybody has at least one game in them. I believe this is the 21st century equivalent of "everybody has a book in them". Games are powerful; they can tell a story, entertain, persuade and bring joy or sadness. To their creator, video games offer satisfaction, educational advancement and even personal wealth. What more powerful reasons do we need to get that game out from within us and onto the Google Play, Apple App store, Steam, XBOXLive Arcade, or where ever we think our video game should be? The problem The problem of course is that you want to make video games but you just don't know where, how, or the best way to start. This brief article was written for you if any of the following 3 questions are going round in your mind and you have so far not managed to find an answer: Which is the best language (C++, C#, Java, Python, Objective C, HTML5, etc.) to learn? Which is the best platform (PC, Android, iOS, Mac, SteamOS etc.)? Which Engine (UnrealEngine, Unity, GameMaker, Cocos, LibGDX, AndEngine, CryEngine, etc.)? The first thing to point out is that there is no "best" platform, engine or language and anyone who tells you there is, is either biased, blinkered or just plain wrong. The solution The answer to all these questions can be much more easily found by talking about yourself and your game. It is desperately important to have this discussion with yourself because if you head off down the wrong path you could blow a serious amount of time before you realise you should have done things differently. If you ever get that sinking feeling knowing you have just burnt an unrecoverable hour of your life on Facebook or Candy Crush, trust me, that is nothing compared to learning a programming language which appeared to offer so much but turns out it can never deliver what you want. Furthermore, using a scatter-gun approach and trying to learn a bit of everything will make progress very slow and possibly cause confusion. Talking about you and your game To make sure you get it right first time, write down on a piece of paper or in a text editor, your answers to all the following questions. Wherever possible, elaborate a little so at the end of this short exercise you will have a few paragraphs that detail everything about you and your game. Make sure to do this before we move on to the next part of the tutorial that will allow you to match your goals to languages, platforms, and engines. Q1: Where are you starting from? Are you already a programming guru in one or more languages or are you a complete beginner with absolutely no programming experience at all? Perhaps you are somewhere in the middle. Write it down and then move on to the next question. Q2: Where do you want to end up and when? What do you see as a successful conclusion to your efforts at learning to make games? Do you want to be the lead programmer at Rockstar or Infinity Ward? Perhaps you have seen Indie Game the Movie and have a passionate drive to become an indie dev'. Maybe you just want to have fun? Perhaps you are just looking for the absolute easiest path to getting published or simply finishing your game for yourself? How much time are you prepared to put in to this? A weekend, a year, as long as it takes? Q3: How do you like to learn? Do you want to learn the absolute 'proper' way without any shortcuts or useful tricks? You want a fully comprehensive A-Z learning pathway with zero shortcuts- no matter how much fun the shortcuts might be. Do you want the polar opposite of this and want to get to the games straightway or maybe your way is somewhere in between the two. Q4: Do you have a preferred target platform? You might not have an answer to this one; you might have several platforms in mind. It is even possible you absolutely must develop your game for every platform. Whatever the case, write it down before moving on. Q5: Do you know what type of game you want to make? There are so many different types of game and which one you want to make will certainly steer you towards different learning pathways, engines and languages. Write a sentence or two about the game you want to make. Be sure to mention the genre, perhaps, 3d, 2d, FPS, RPG, survival, retro arcade, multiplayer sandbox or mobile match-three. Obviously the preceding list is not exhaustive and might not have mentioned the type of game that you want to make. Q6: The important question Which of the above aspects about you and your game is the most important to you? Some choices are occasionally hard to reconcile together. Often some kind of compromise of goals is necessary. For example; how important is it that you make your game for your favourite platform/genre compared to how fast you want to see results, etc. You and your game conclusion Hopefully the above questions will have left you with a statement about you and your game, perhaps something like the following: "I did a little bit of programming at school but it is probably best to start again at the beginning. I have a strong desire to be a successful indie dev' and I am prepared to do whatever it takes to achieve this but I must be able to learn alongside my existing job which pays the bills. I want to learn everything thoroughly but I also want to be building games as fast as possible. I wouldn't mind making games for any or even all platforms but most of all I would like to make my game for desktop PC's and, one-day, get my new game green-lit on Steam. That would be a real buzz! I want to make a 2d game with retro graphics but it must feel new and exciting to play. I don't have all the details yet but I have loads of ideas. Maybe a platform stealth, rogue-like set in a dystopian world run by an evil dictator and the player has to make his way through the world taking on progressively tougher enemies and bosses before the final show-down with the dictator himself. The most important thing is to get it on Steam, anything else is a bonus." Now you’ve got a clear picture of what you want as a game developer it’s time to think about the next steps: what game you want to create and, perhaps more importantly, what language and engine you should focus on. Be sure to continue making the right choices for you by downloading our free eBook “Getting Started Making Video Games” now! Author bio John Horton is a coding and gaming enthusiast based in the UK. He has a passion for writing apps, games, books and blog articles about coding, especially for beginners. He is the founder of Game Code School, which is dedicated to helping complete beginners get started game coding using the language and platform which is best for them. John sincerely believes that anyone can learn to code and that everybody has a game or an app inside of them; but they just need to do enough work to bring it out. He has authored around a dozen technology books most recently the following: Android Programming for Beginners Android Game programming by Example Learning Java Building Android Games
Read more
  • 0
  • 0
  • 2217

article-image-making-web-server-nodejs
Packt
25 Feb 2016
38 min read
Save for later

Making a Web Server in Node.js

Packt
25 Feb 2016
38 min read
In this article, we will cover the following topics: Setting up a router Serving static files Caching content in memory for immediate delivery Optimizing performance with streaming Securing against filesystem hacking exploits (For more resources related to this topic, see here.) One of the great qualities of Node is its simplicity. Unlike PHP or ASP, there is no separation between the web server and code, nor do we have to customize large configuration files to get the behavior we want. With Node, we can create the web server, customize it, and deliver content. All this can be done at the code level. This article demonstrates how to create a web server with Node and feed content through it, while implementing security and performance enhancements to cater for various situations. If we don't have Node installed yet, we can head to http://nodejs.org and hit the INSTALL button appearing on the homepage. This will download the relevant file to install Node on our operating system. Setting up a router In order to deliver web content, we need to make a Uniform Resource Identifier (URI) available. This recipe walks us through the creation of an HTTP server that exposes routes to the user. Getting ready First let's create our server file. If our main purpose is to expose server functionality, it's a general practice to call the server.js file (because the npm start command runs the node server.js command by default). We could put this new server.js file in a new folder. It's also a good idea to install and use supervisor. We use npm (the module downloading and publishing command-line application that ships with Node) to install. On the command-line utility, we write the following command: sudo npm -g install supervisor Essentially, sudo allows administrative privileges for Linux and Mac OS X systems. If we are using Node on Windows, we can drop the sudo part in any of our commands. The supervisor module will conveniently autorestart our server when we save our changes. To kick things off, we can start our server.js file with the supervisor module by executing the following command: supervisor server.js For more on possible arguments and the configuration of supervisor, check out https://github.com/isaacs/node-supervisor. How to do it... In order to create the server, we need the HTTP module. So let's load it and use the http.createServer method as follows: var http = require('http'); http.createServer(function (request, response) {   response.writeHead(200, {'Content-Type': 'text/html'});   response.end('Woohoo!'); }).listen(8080); Now, if we save our file and access localhost:8080 on a web browser or using curl, our browser (or curl) will exclaim Woohoo! But the same will occur at localhost:8080/foo. Indeed, any path will render the same behavior. So let's build in some routing. We can use the path module to extract the basename variable of the path (the final part of the path) and reverse any URI encoding from the client with decodeURI as follows: var http = require('http'); var path = require('path'); http.createServer(function (request, response) {   var lookup=path.basename(decodeURI(request.url)); We now need a way to define our routes. One option is to use an array of objects as follows: var pages = [   {route: '', output: 'Woohoo!'},   {route: 'about', output: 'A simple routing with Node example'},   {route: 'another page', output: function() {return 'Here's     '+this.route;}}, ]; Our pages array should be placed above the http.createServer call. Within our server, we need to loop through our array and see if the lookup variable matches any of our routes. If it does, we can supply the output. We'll also implement some 404 error-related handling as follows: http.createServer(function (request, response) {   var lookup=path.basename(decodeURI(request.url));   pages.forEach(function(page) {     if (page.route === lookup) {       response.writeHead(200, {'Content-Type': 'text/html'});       response.end(typeof page.output === 'function'       ? page.output() : page.output);     }   });   if (!response.finished) {      response.writeHead(404);      response.end('Page Not Found!');   } }).listen(8080); How it works... The callback function we provide to http.createServer gives us all the functionality we need to interact with our server through the request and response objects. We use request to obtain the requested URL and then we acquire its basename with path. We also use decodeURI, without which another page route would fail as our code would try to match another%20page against our pages array and return false. Once we have our basename, we can match it in any way we want. We could send it in a database query to retrieve content, use regular expressions to effectuate partial matches, or we could match it to a filename and load its contents. We could have used a switch statement to handle routing, but our pages array has several advantages—it's easier to read, easier to extend, and can be seamlessly converted to JSON. We loop through our pages array using forEach. Node is built on Google's V8 engine, which provides us with a number of ECMAScript 5 (ES5) features. These features can't be used in all browsers as they're not yet universally implemented, but using them in Node is no problem! The forEach function is an ES5 implementation; the ES3 way is to use the less convenient for loop. While looping through each object, we check its route property. If we get a match, we write the 200 OK status and content-type headers, and then we end the response with the object's output property. The response.end method allows us to pass a parameter to it, which it writes just before finishing the response. In response.end, we have used a ternary operator (?:) to conditionally call page.output as a function or simply pass it as a string. Notice that the another page route contains a function instead of a string. The function has access to its parent object through the this variable, and allows for greater flexibility in assembling the output we want to provide. In the event that there is no match in our forEach loop, response.end would never be called and therefore the client would continue to wait for a response until it times out. To avoid this, we check the response.finished property and if it's false, we write a 404 header and end the response. The response.finished flag is affected by the forEach callback, yet it's not nested within the callback. Callback functions are mostly used for asynchronous operations, so on the surface this looks like a potential race condition; however, the forEach loop does not operate asynchronously; it blocks until all loops are complete. There's more... There are many ways to extend and alter this example. There are also some great non-core modules available that do the legwork for us. Simple multilevel routing Our routing so far only deals with a single level path. A multilevel path (for example, /about/node) will simply return a 404 error message. We can alter our object to reflect a subdirectory-like structure, remove path, and use request.url for our routes instead of path.basename as follows: var http=require('http'); var pages = [   {route: '/', output: 'Woohoo!'},   {route: '/about/this', output: 'Multilevel routing with Node'},   {route: '/about/node', output: 'Evented I/O for V8 JavaScript.'},   {route: '/another page', output: function () {return 'Here's '     + this.route; }} ]; http.createServer(function (request, response) {   var lookup = decodeURI(request.url); When serving static files, request.url must be cleaned prior to fetching a given file. Check out the Securing against filesystem hacking exploits recipe in this article. Multilevel routing could be taken further; we could build and then traverse a more complex object as follows: {route: 'about', childRoutes: [   {route: 'node', output: 'Evented I/O for V8 JavaScript'},   {route: 'this', output: 'Complex Multilevel Example'} ]} After the third or fourth level, this object would become a leviathan to look at. We could alternatively create a helper function to define our routes that essentially pieces our object together for us. Alternatively, we could use one of the excellent noncore routing modules provided by the open source Node community. Excellent solutions already exist that provide helper methods to handle the increasing complexity of scalable multilevel routing. Parsing the querystring module Two other useful core modules are url and querystring. The url.parse method allows two parameters: first the URL string (in our case, this will be request.url) and second a Boolean parameter named parseQueryString. If the url.parse method is set to true, it lazy loads the querystring module (saving us the need to require it) to parse the query into an object. This makes it easy for us to interact with the query portion of a URL as shown in the following code: var http = require('http'); var url = require('url'); var pages = [   {id: '1', route: '', output: 'Woohoo!'},   {id: '2', route: 'about', output: 'A simple routing with Node     example'},   {id: '3', route: 'another page', output: function () {     return 'Here's ' + this.route; }   }, ]; http.createServer(function (request, response) {   var id = url.parse(decodeURI(request.url), true).query.id;   if (id) {     pages.forEach(function (page) {       if (page.id === id) {         response.writeHead(200, {'Content-Type': 'text/html'});         response.end(typeof page.output === 'function'         ? page.output() : page.output);       }     });   }   if (!response.finished) {     response.writeHead(404);     response.end('Page Not Found');   } }).listen(8080); With the added id properties, we can access our object data by, for instance, localhost:8080?id=2. The routing modules There's an up-to-date list of various routing modules for Node at https://github.com/joyent/node/wiki/modules#wiki-web-frameworks-routers. These community-made routers cater to various scenarios. It's important to research the activity and maturity of a module before taking it into a production environment. NodeZoo (http://nodezoo.com) is an excellent tool to research the state of a NODE module. See also The Serving static files and Securing against filesystem hacking exploits recipes discussed in this article Serving static files If we have information stored on disk that we want to serve as web content, we can use the fs (filesystem) module to load our content and pass it through the http.createServer callback. This is a basic conceptual starting point to serve static files; as we will learn in the following recipes, there are much more efficient solutions. Getting ready We'll need some files to serve. Let's create a directory named content, containing the following three files: index.html styles.css script.js Add the following code to the HTML file index.html: <html>   <head>     <title>Yay Node!</title>     <link rel=stylesheet href=styles.css type=text/css>     <script src=script.js type=text/javascript></script>   </head>   <body>     <span id=yay>Yay!</span>   </body> </html> Add the following code to the script.js JavaScript file: window.onload = function() { alert('Yay Node!'); }; And finally, add the following code to the CSS file style.css: #yay {font-size:5em;background:blue;color:yellow;padding:0.5em} How to do it... As in the previous recipe, we'll be using the core modules http and path. We'll also need to access the filesystem, so we'll require fs as well. With the help of the following code, let's create the server and use the path module to check if a file exists: var http = require('http'); var path = require('path'); var fs = require('fs'); http.createServer(function (request, response) {   var lookup = path.basename(decodeURI(request.url)) ||     'index.html';   var f = 'content/' + lookup;   fs.exists(f, function (exists) {     console.log(exists ? lookup + " is there"     : lookup + " doesn't exist");   }); }).listen(8080); If we haven't already done it, then we can initialize our server.js file by running the following command: supervisor server.js Try loading localhost:8080/foo. The console will say foo doesn't exist, because it doesn't. The localhost:8080/script.js URL will tell us that script.js is there, because it is. Before we can serve a file, we are supposed to let the client know the content-type header, which we can determine from the file extension. So let's make a quick map using an object as follows: var mimeTypes = {   '.js' : 'text/javascript',   '.html': 'text/html',   '.css' : 'text/css' }; We could extend our mimeTypes map later to support more types. Modern browsers may be able to interpret certain mime types (like text/javascript), without the server sending a content-type header, but older browsers or less common mime types will rely upon the correct content-type header being sent from the server. Remember to place mimeTypes outside of the server callback, since we don't want to initialize the same object on every client request. If the requested file exists, we can convert our file extension into a content-type header by feeding path.extname into mimeTypes and then pass our retrieved content-type to response.writeHead. If the requested file doesn't exist, we'll write out a 404 error and end the response as follows: //requires variables, mimeType object... http.createServer(function (request, response) {     var lookup = path.basename(decodeURI(request.url)) ||     'index.html';   var f = 'content/' + lookup;   fs.exists(f, function (exists) {     if (exists) {       fs.readFile(f, function (err, data) {         if (err) {response.writeHead(500); response.end('Server           Error!'); return; }         var headers = {'Content-type': mimeTypes[path.extname          (lookup)]};         response.writeHead(200, headers);         response.end(data);       });       return;     }     response.writeHead(404); //no such file found!     response.end();   }); }).listen(8080); At the moment, there is still no content sent to the client. We have to get this content from our file, so we wrap the response handling in an fs.readFile method callback as follows: //http.createServer, inside fs.exists: if (exists) {   fs.readFile(f, function(err, data) {     var headers={'Content-type': mimeTypes[path.extname(lookup)]};     response.writeHead(200, headers);     response.end(data);   });  return; } Before we finish, let's apply some error handling to our fs.readFile callback as follows: //requires variables, mimeType object... //http.createServer,  path exists, inside if(exists):  fs.readFile(f, function(err, data) {     if (err) {response.writeHead(500); response.end('Server       Error!');  return; }     var headers = {'Content-type': mimeTypes[path.extname      (lookup)]};     response.writeHead(200, headers);     response.end(data);   }); return; } Notice that return stays outside of the fs.readFile callback. We are returning from the fs.exists callback to prevent further code execution (for example, sending the 404 error). Placing a return statement in an if statement is similar to using an else branch. However, the pattern of the return statement inside the if loop is encouraged instead of if else, as it eliminates a level of nesting. Nesting can be particularly prevalent in Node due to performing a lot of asynchronous tasks, which tend to use callback functions. So, now we can navigate to localhost:8080, which will serve our index.html file. The index.html file makes calls to our script.js and styles.css files, which our server also delivers with appropriate mime types. We can see the result in the following screenshot: This recipe serves to illustrate the fundamentals of serving static files. Remember, this is not an efficient solution! In a real world situation, we don't want to make an I/O call every time a request hits the server; this is very costly especially with larger files. In the following recipes, we'll learn better ways of serving static files. How it works... Our script creates a server and declares a variable called lookup. We assign a value to lookup using the double pipe || (OR) operator. This defines a default route if path.basename is empty. Then we pass lookup to a new variable that we named f in order to prepend our content directory to the intended filename. Next, we run f through the fs.exists method and check the exist parameter in our callback to see if the file is there. If the file does exist, we read it asynchronously using fs.readFile. If there is a problem accessing the file, we write a 500 server error, end the response, and return from the fs.readFile callback. We can test the error-handling functionality by removing read permissions from index.html as follows: chmod -r index.html Doing so will cause the server to throw the 500 server error status code. To set things right again, run the following command: chmod +r index.html chmod is a Unix-type system-specific command. If we are using Windows, there's no need to set file permissions in this case. As long as we can access the file, we grab the content-type header using our handy mimeTypes mapping object, write the headers, end the response with data loaded from the file, and finally return from the function. If the requested file does not exist, we bypass all this logic, write a 404 error message, and end the response. There's more... The favicon icon file is something to watch out for. We will explore the file in this section. The favicon gotcha When using a browser to test our server, sometimes an unexpected server hit can be observed. This is the browser requesting the default favicon.ico icon file that servers can provide. Apart from the initial confusion of seeing additional hits, this is usually not a problem. If the favicon request does begin to interfere, we can handle it as follows: if (request.url === '/favicon.ico') {   console.log('Not found: ' + f);   response.end();   return; } If we wanted to be more polite to the client, we could also inform it of a 404 error by using response.writeHead(404) before issuing response.end. See also The Caching content in memory for immediate delivery recipe The Optimizing performance with streaming recipe The Securing against filesystem hacking exploits recipe Caching content in memory for immediate delivery Directly accessing storage on each client request is not ideal. For this task, we will explore how to enhance server efficiency by accessing the disk only on the first request, caching the data from file for that first request, and serving all further requests out of the process memory. Getting ready We are going to improve upon the code from the previous task, so we'll be working with server.js and in the content directory, with index.html, styles.css, and script.js. How to do it... Let's begin by looking at our following script from the previous recipe Serving Static Files: var http = require('http'); var path = require('path'); var fs = require('fs');    var mimeTypes = {   '.js' : 'text/javascript',   '.html': 'text/html',   '.css' : 'text/css' };   http.createServer(function (request, response) {   var lookup = path.basename(decodeURI(request.url)) ||     'index.html';   var f = 'content/'+lookup;   fs.exists(f, function (exists) {     if (exists) {       fs.readFile(f, function(err,data) {         if (err) {           response.writeHead(500); response.end('Server Error!');           return;         }         var headers = {'Content-type': mimeTypes[path.extname          (lookup)]};         response.writeHead(200, headers);         response.end(data);       });     return;     }     response.writeHead(404); //no such file found!     response.end('Page Not Found');   }); } We need to modify this code to only read the file once, load its contents into memory, and respond to all requests for that file from memory afterwards. To keep things simple and preserve maintainability, we'll extract our cache handling and content delivery into a separate function. So above http.createServer, and below mimeTypes, we'll add the following: var cache = {}; function cacheAndDeliver(f, cb) {   if (!cache[f]) {     fs.readFile(f, function(err, data) {       if (!err) {         cache[f] = {content: data} ;       }       cb(err, data);     });     return;   }   console.log('loading ' + f + ' from cache');   cb(null, cache[f].content); } //http.createServer A new cache object and a new function called cacheAndDeliver have been added to store our files in memory. Our function takes the same parameters as fs.readFile so we can replace fs.readFile in the http.createServer callback while leaving the rest of the code intact as follows: //...inside http.createServer:   fs.exists(f, function (exists) {   if (exists) {     cacheAndDeliver(f, function(err, data) {       if (err) {         response.writeHead(500);         response.end('Server Error!');         return; }       var headers = {'Content-type': mimeTypes[path.extname(f)]};       response.writeHead(200, headers);       response.end(data);     }); return;   } //rest of path exists code (404 handling)... When we execute our server.js file and access localhost:8080 twice, consecutively, the second request causes the console to display the following output: loading content/index.html from cache loading content/styles.css from cache loading content/script.js from cache How it works... We defined a function called cacheAndDeliver, which like fs.readFile, takes a filename and callback as parameters. This is great because we can pass the exact same callback of fs.readFile to cacheAndDeliver, padding the server out with caching logic without adding any extra complexity visually to the inside of the http.createServer callback. As it stands, the worth of abstracting our caching logic into an external function is arguable, but the more we build on the server's caching abilities, the more feasible and useful this abstraction becomes. Our cacheAndDeliver function checks to see if the requested content is already cached. If not, we call fs.readFile and load the data from disk. Once we have this data, we may as well hold onto it, so it's placed into the cache object referenced by its file path (the f variable). The next time anyone requests the file, cacheAndDeliver will see that we have the file stored in the cache object and will issue an alternative callback containing the cached data. Notice that we fill the cache[f] property with another new object containing a content property. This makes it easier to extend the caching functionality in the future as we would just have to place extra properties into our cache[f] object and supply logic that interfaces with these properties accordingly. There's more... If we were to modify the files we are serving, the changes wouldn't be reflected until we restart the server. We can do something about that. Reflecting content changes To detect whether a requested file has changed since we last cached it, we must know when the file was cached and when it was last modified. To record when the file was last cached, let's extend the cache[f] object as follows: cache[f] = {content: data,timestamp: Date.now() // store a Unix                                                 // time stamp }; That was easy! Now let's find out when the file was updated last. The fs.stat method returns an object as the second parameter of its callback. This object contains the same useful information as the command-line GNU (GNU's Not Unix!) coreutils stat. The fs.stat function supplies three time-related properties: last accessed (atime), last modified (mtime), and last changed (ctime). The difference between mtime and ctime is that ctime will reflect any alterations to the file, whereas mtime will only reflect alterations to the content of the file. Consequently, if we changed the permissions of a file, ctime would be updated but mtime would stay the same. We want to pay attention to permission changes as they happen so let's use the ctime property as shown in the following code: //requires and mimeType object.... var cache = {}; function cacheAndDeliver(f, cb) {   fs.stat(f, function (err, stats) {     if (err) { return console.log('Oh no!, Error', err); }     var lastChanged = Date.parse(stats.ctime),     isUpdated = (cache[f]) && lastChanged  > cache[f].timestamp;     if (!cache[f] || isUpdated) {       fs.readFile(f, function (err, data) {         console.log('loading ' + f + ' from file');         //rest of cacheAndDeliver   }); //end of fs.stat } If we're using Node on Windows, we may have to substitute ctime with mtime, since ctime supports at least Version 0.10.12. The contents of cacheAndDeliver have been wrapped in an fs.stat callback, two variables have been added, and the if(!cache[f]) statement has been modified. We parse the ctime property of the second parameter dubbed stats using Date.parse to convert it to milliseconds since midnight, January 1st, 1970 (the Unix epoch) and assign it to our lastChanged variable. Then we check whether the requested file's last changed time is greater than when we cached the file (provided the file is indeed cached) and assign the result to our isUpdated variable. After that, it's merely a case of adding the isUpdated Boolean to the conditional if(!cache[f]) statement via the || (or) operator. If the file is newer than our cached version (or if it isn't yet cached), we load the file from disk into the cache object. See also The Optimizing performance with streaming recipe discussed in this article Optimizing performance with streaming Caching content certainly improves upon reading a file from disk for every request. However, with fs.readFile, we are reading the whole file into memory before sending it out in a response object. For better performance, we can stream a file from disk and pipe it directly to the response object, sending data straight to the network socket a piece at a time. Getting ready We are building on our code from the last example, so let's get server.js, index.html, styles.css, and script.js ready. How to do it... We will be using fs.createReadStream to initialize a stream, which can be piped to the response object. In this case, implementing fs.createReadStream within our cacheAndDeliver function isn't ideal because the event listeners of fs.createReadStream will need to interface with the request and response objects, which for the sake of simplicity would preferably be dealt with in the http.createServer callback. For brevity's sake, we will discard our cacheAndDeliver function and implement basic caching within the server callback as follows: //...snip... requires, mime types, createServer, lookup and f //  vars...   fs.exists(f, function (exists) {   if (exists) {     var headers = {'Content-type': mimeTypes[path.extname(f)]};     if (cache[f]) {       response.writeHead(200, headers);       response.end(cache[f].content);       return;    } //...snip... rest of server code... Later on, we will fill cache[f].content while we are interfacing with the readStream object. The following code shows how we use fs.createReadStream: var s = fs.createReadStream(f); The preceding code will return a readStream object that streams the file, which is pointed at by variable f. The readStream object emits events that we need to listen to. We can listen with addEventListener or use the shorthand on method as follows: var s = fs.createReadStream(f).on('open', function () {   //do stuff when the readStream opens }); Because createReadStream returns the readStream object, we can latch our event listener straight onto it using method chaining with dot notation. Each stream is only going to open once; we don't need to keep listening to it. Therefore, we can use the once method instead of on to automatically stop listening after the first event occurrence as follows: var s = fs.createReadStream(f).once('open', function () {   //do stuff when the readStream opens }); Before we fill out the open event callback, let's implement some error handling as follows: var s = fs.createReadStream(f).once('open', function () {   //do stuff when the readStream opens }).once('error', function (e) {   console.log(e);   response.writeHead(500);   response.end('Server Error!'); }); The key to this whole endeavor is the stream.pipe method. This is what enables us to take our file straight from disk and stream it directly to the network socket via our response object as follows: var s = fs.createReadStream(f).once('open', function () {   response.writeHead(200, headers);   this.pipe(response); }).once('error', function (e) {   console.log(e);   response.writeHead(500);   response.end('Server Error!'); }); But what about ending the response? Conveniently, stream.pipe detects when the stream has ended and calls response.end for us. There's one other event we need to listen to, for caching purposes. Within our fs.exists callback, underneath the createReadStream code block, we write the following code: fs.stat(f, function(err, stats) {   var bufferOffset = 0;   cache[f] = {content: new Buffer(stats.size)};   s.on('data', function (chunk) {     chunk.copy(cache[f].content, bufferOffset);     bufferOffset += chunk.length;   }); }); //end of createReadStream We've used the data event to capture the buffer as it's being streamed, and copied it into a buffer that we supplied to cache[f].content, using fs.stat to obtain the file size for the file's cache buffer. For this case, we're using the classic mode data event instead of the readable event coupled with stream.read() (see http://nodejs.org/api/stream.html#stream_readable_read_size_1) because it best suits our aim, which is to grab data from the stream as soon as possible. How it works... Instead of the client waiting for the server to load the entire file from disk prior to sending it to the client, we use a stream to load the file in small ordered pieces and promptly send them to the client. With larger files, this is especially useful as there is minimal delay between the file being requested and the client starting to receive the file. We did this by using fs.createReadStream to start streaming our file from disk. The fs.createReadStream method creates a readStream object, which inherits from the EventEmitter class. The EventEmitter class accomplishes the evented part pretty well. Due to this, we'll be using listeners instead of callbacks to control the flow of stream logic. We then added an open event listener using the once method since we want to stop listening to the open event once it is triggered. We respond to the open event by writing the headers and using the stream.pipe method to shuffle the incoming data straight to the client. If the client becomes overwhelmed with processing, stream.pipe applies backpressure, which means that the incoming stream is paused until the backlog of data is handled. While the response is being piped to the client, the content cache is simultaneously being filled. To achieve this, we had to create an instance of the Buffer class for our cache[f].content property. A Buffer class must be supplied with a size (or array or string), which in our case is the size of the file. To get the size, we used the asynchronous fs.stat method and captured the size property in the callback. The data event returns a Buffer variable as its only callback parameter. The default value of bufferSize for a stream is 64 KB; any file whose size is less than the value of the bufferSize property will only trigger one data event because the whole file will fit into the first chunk of data. But for files that are greater than the value of the bufferSize property, we have to fill our cache[f].content property one piece at a time. Changing the default readStream buffer size We can change the buffer size of our readStream object by passing an options object with a bufferSize property as the second parameter of fs.createReadStream. For instance, to double the buffer, you could use fs.createReadStream(f,{bufferSize: 128 * 1024});. We cannot simply concatenate each chunk with cache[f].content because this will coerce binary data into string format, which, though no longer in binary format, will later be interpreted as binary. Instead, we have to copy all the little binary buffer chunks into our binary cache[f].content buffer. We created a bufferOffset variable to assist us with this. Each time we add another chunk to our cache[f].content buffer, we update our new bufferOffset property by adding the length of the chunk buffer to it. When we call the Buffer.copy method on the chunk buffer, we pass bufferOffset as the second parameter, so our cache[f].content buffer is filled correctly. Moreover, operating with the Buffer class renders performance enhancements with larger files because it bypasses the V8 garbage-collection methods, which tend to fragment a large amount of data, thus slowing down Node's ability to process them. There's more... While streaming has solved the problem of waiting for files to be loaded into memory before delivering them, we are nevertheless still loading files into memory via our cache object. With larger files or a large number of files, this could have potential ramifications. Protecting against process memory overruns Streaming allows for intelligent and minimal use of memory for processing large memory items. But even with well-written code, some apps may require significant memory. There is a limited amount of heap memory. By default, V8's memory is set to 1400 MB on 64-bit systems and 700 MB on 32-bit systems. This can be altered by running node with --max-old-space-size=N, where N is the amount of megabytes (the actual maximum amount that it can be set to depends upon the OS, whether we're running on a 32-bit or 64-bit architecture—a 32-bit may peak out around 2 GB and of course the amount of physical RAM available). The --max-old-space-size method doesn't apply to buffers, since it applies to the v8 heap (memory allocated for JavaScript objects and primitives) and buffers are allocated outside of the v8 heap. If we absolutely had to be memory intensive, we could run our server on a large cloud platform, divide up the logic, and start new instances of node using the child_process class, or better still the higher level cluster module. There are other more advanced ways to increase the usable memory, including editing and recompiling the v8 code base. The http://blog.caustik.com/2012/04/11/escape-the-1-4gb-v8-heap-limit-in-node-js link has some tips along these lines. In this case, high memory usage isn't necessarily required and we can optimize our code to significantly reduce the potential for memory overruns. There is less benefit to caching larger files because the slight speed improvement relative to the total download time is negligible, while the cost of caching them is quite significant in ratio to our available process memory. We can also improve cache efficiency by implementing an expiration time on cache objects, which can then be used to clean the cache, consequently removing files in low demand and prioritizing high demand files for faster delivery. Let's rearrange our cache object slightly as follows: var cache = {   store: {},   maxSize : 26214400, //(bytes) 25mb } For a clearer mental model, we're making a distinction between the cache object as a functioning entity and the cache object as a store (which is a part of the broader cache entity). Our first goal is to only cache files under a certain size; we've defined cache.maxSize for this purpose. All we have to do now is insert an if condition within the fs.stat callback as follows: fs.stat(f, function (err, stats) {   if (stats.size<cache.maxSize) {     var bufferOffset = 0;     cache.store[f] = {content: new Buffer(stats.size),       timestamp: Date.now() };     s.on('data', function (data) {       data.copy(cache.store[f].content, bufferOffset);       bufferOffset += data.length;     });   } }); Notice that we also slipped in a new timestamp property into our cache.store[f] method. This is for our second goal—cleaning the cache. Let's extend cache as follows: var cache = {   store: {},   maxSize: 26214400, //(bytes) 25mb   maxAge: 5400 * 1000, //(ms) 1 and a half hours   clean: function(now) {     var that = this;     Object.keys(this.store).forEach(function (file) {       if (now > that.store[file].timestamp + that.maxAge) {         delete that.store[file];       }     });   } }; So in addition to maxSize, we've created a maxAge property and added a clean method. We call cache.clean at the bottom of the server with the help of the following code: //all of our code prior   cache.clean(Date.now()); }).listen(8080); //end of the http.createServer The cache.clean method loops through the cache.store function and checks to see if it has exceeded its specified lifetime. If it has, we remove it from the store. One further improvement and then we're done. The cache.clean method is called on each request. This means the cache.store function is going to be looped through on every server hit, which is neither necessary nor efficient. It would be better if we clean the cache, say, every two hours or so. We'll add two more properties to cache—cleanAfter to specify the time between cache cleans, and cleanedAt to determine how long it has been since the cache was last cleaned, as follows: var cache = {   store: {},   maxSize: 26214400, //(bytes) 25mb   maxAge : 5400 * 1000, //(ms) 1 and a half hours   cleanAfter: 7200 * 1000,//(ms) two hours   cleanedAt: 0, //to be set dynamically   clean: function (now) {     if (now - this.cleanAfter>this.cleanedAt) {       this.cleanedAt = now;       that = this;       Object.keys(this.store).forEach(function (file) {         if (now > that.store[file].timestamp + that.maxAge) {           delete that.store[file];         }       });     }   } }; So we wrap our cache.clean method in an if statement, which will allow a loop through cache.store only if it has been longer than two hours (or whatever cleanAfter is set to) since the last clean. See also The Securing against filesystem hacking exploits recipe discussed in this article Securing against filesystem hacking exploits For a Node app to be insecure, there must be something an attacker can interact with for exploitation purposes. Due to Node's minimalist approach, the onus is on the programmer to ensure that their implementation doesn't expose security flaws. This recipe will help identify some security risk anti-patterns that could occur when working with the filesystem. Getting ready We'll be working with the same content directory as we did in the previous recipes. But we'll start a new insecure_server.js file (there's a clue in the name!) from scratch to demonstrate mistaken techniques. How to do it... Our previous static file recipes tend to use path.basename to acquire a route, but this ignores intermediate paths. If we accessed localhost:8080/foo/bar/styles.css, our code would take styles.css as the basename property and deliver content/styles.css to us. How about we make a subdirectory in our content folder? Call it subcontent and move our script.js and styles.css files into it. We'd have to alter our script and link tags in index.html as follows: <link rel=stylesheet type=text/css href=subcontent/styles.css> <script src=subcontent/script.js type=text/javascript></script> We can use the url module to grab the entire pathname property. So let's include the url module in our new insecure_server.js file, create our HTTP server, and use pathname to get the whole requested path as follows: var http = require('http'); var url = require('url'); var fs = require('fs');   http.createServer(function (request, response) {   var lookup = url.parse(decodeURI(request.url)).pathname;   lookup = (lookup === "/") ? '/index.html' : lookup;   var f = 'content' + lookup;   console.log(f);   fs.readFile(f, function (err, data) {     response.end(data);   }); }).listen(8080); If we navigate to localhost:8080, everything works great! We've gone multilevel, hooray! For demonstration purposes, a few things have been stripped out from the previous recipes (such as fs.exists); but even with them, this code presents the same security hazards if we type the following: curl localhost:8080/../insecure_server.js Now we have our server's code. An attacker could also access /etc/passwd with a few attempts at guessing its relative path as follows: curl localhost:8080/../../../../../../../etc/passwd If we're using Windows, we can download and install curl from http://curl.haxx.se/download.html. In order to test these attacks, we have to use curl or another equivalent because modern browsers will filter these sort of requests. As a solution, what if we added a unique suffix to each file we wanted to serve and made it mandatory for the suffix to exist before the server coughs it up? That way, an attacker could request /etc/passwd or our insecure_server.js file because they wouldn't have the unique suffix. To try this, let's copy the content folder and call it content-pseudosafe, and rename our files to index.html-serve, script.js-serve, and styles.css-serve. Let's create a new server file and name it pseudosafe_server.js. Now all we have to do is make the -serve suffix mandatory as follows: //requires section ...snip... http.createServer(function (request, response) {   var lookup = url.parse(decodeURI(request.url)).pathname;   lookup = (lookup === "/") ? '/index.html-serve'     : lookup + '-serve';   var f = 'content-pseudosafe' + lookup; //...snip... rest of the server code... For feedback purposes, we'll also include some 404 handling with the help of fs.exists as follows: //requires, create server etc fs.exists(f, function (exists) {   if (!exists) {     response.writeHead(404);     response.end('Page Not Found!');     return;   } //read file etc So, let's start our pseudosafe_server.js file and try out the same exploit by executing the following command: curl -i localhost:8080/../insecure_server.js We've used the -i argument so that curl will output the headers. The result? A 404, because the file it's actually looking for is ../insecure_server.js-serve, which doesn't exist. So what's wrong with this method? Well it's inconvenient and prone to error. But more importantly, an attacker can still work around it! Try this by typing the following: curl localhost:8080/../insecure_server.js%00/index.html And voilà! There's our server code again. The solution to our problem is path.normalize, which cleans up our pathname before it gets to fs.readFile as shown in the following code: http.createServer(function (request, response) {   var lookup = url.parse(decodeURI(request.url)).pathname;   lookup = path.normalize(lookup);   lookup = (lookup === "/") ? '/index.html' : lookup;   var f = 'content' + lookup } Prior recipes haven't used path.normalize and yet they're still relatively safe. The path.basename method gives us the last part of the path, thus removing any preceding double dot paths (../) that would take an attacker higher up the directory hierarchy than should be allowed. How it works... Here we have two filesystem exploitation techniques: the relative directory traversal and poison null byte attacks. These attacks can take different forms, such as in a POST request or from an external file. They can have different effects—if we were writing to files instead of reading them, an attacker could potentially start making changes to our server. The key to security in all cases is to validate and clean any data that comes from the user. In insecure_server.js, we pass whatever the user requests to our fs.readFile method. This is foolish because it allows an attacker to take advantage of the relative path functionality in our operating system by using ../, thus gaining access to areas that should be off limits. By adding the -serve suffix, we didn't solve the problem, we put a plaster on it, which can be circumvented by the poison null byte. The key to this attack is the %00 value, which is a URL hex code for the null byte. In this case, the null byte blinds Node to the ../insecure_server.js portion, but when the same null byte is sent through to our fs.readFile method, it has to interface with the kernel. But the kernel gets blinded to the index.html part. So our code sees index.html but the read operation sees ../insecure_server.js. This is known as null byte poisoning. To protect ourselves, we could use a regex statement to remove the ../ parts of the path. We could also check for the null byte and spit out a 400 Bad Request statement. But we don't have to, because path.normalize filters out the null byte and relative parts for us. There's more... Let's further delve into how we can protect our servers when it comes to serving static files. Whitelisting If security was an extreme priority, we could adopt a strict whitelisting approach. In this approach, we would create a manual route for each file we are willing to deliver. Anything not on our whitelist would return a 404 error. We can place a whitelist array above http.createServer as follows: var whitelist = [   '/index.html',   '/subcontent/styles.css',   '/subcontent/script.js' ]; And inside our http.createServer callback, we'll put an if statement to check if the requested path is in the whitelist array, as follows: if (whitelist.indexOf(lookup) === -1) {   response.writeHead(404);   response.end('Page Not Found!');   return; } And that's it! We can test this by placing a file non-whitelisted.html in our content directory and then executing the following command: curl -i localhost:8080/non-whitelisted.html This will return a 404 error because non-whitelisted.html isn't on the whitelist. Node static The module's wiki page (https://github.com/joyent/node/wiki/modules#wiki-web-frameworks-static) has a list of static file server modules available for different purposes. It's a good idea to ensure that a project is mature and active before relying upon it to serve your content. The node-static module is a well-developed module with built-in caching. It's also compliant with the RFC2616 HTTP standards specification, which defines how files should be delivered over HTTP. The node-static module implements all the essentials discussed in this article and more. For the next example, we'll need the node-static module. You could install it by executing the following command: npm install node-static The following piece of code is slightly adapted from the node-static module's GitHub page at https://github.com/cloudhead/node-static: var static = require('node-static'); var fileServer = new static.Server('./content'); require('http').createServer(function (request, response) {   request.addListener('end', function () {     fileServer.serve(request, response);   }); }).listen(8080); The preceding code will interface with the node-static module to handle server-side and client-side caching, use streams to deliver content, and filter out relative requests and null bytes, among other things. Summary To learn more about Node.js and creating web servers, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Node Cookbook Second Edition (https://www.packtpub.com/web-development/node-cookbook-second-edition) Node.js Design Patterns (https://www.packtpub.com/web-development/nodejs-design-patterns) Node Web Development Second Edition (https://www.packtpub.com/web-development/node-web-development-second-edition) Resources for Article: Further resources on this subject: Working With Commands And Plugins [article] Node.js Fundamentals And Asynchronous Javascript [article] Building A Movie API With Express [article]
Read more
  • 0
  • 0
  • 16435

article-image-building-recommendation-engine-spark
Packt
24 Feb 2016
44 min read
Save for later

Building a Recommendation Engine with Spark

Packt
24 Feb 2016
44 min read
In this article, we will explore individual machine learning models in detail, starting with recommendation engines. (For more resources related to this topic, see here.) Recommendation engines are probably among the best types of machine learning model known to the general public. Even if people do not know exactly what a recommendation engine is, they have most likely experienced one through the use of popular websites such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. Recommendations are a core part of all these businesses, and in some cases, they drive significant percentages of their revenue. The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid in the discovery process (in this way, it is similar and, in fact, often complementary to search engines, which also play a role in discovery). However, unlike search engines, recommendation engines try to present people with relevant content that they did not necessarily search for or that they might not even have heard of. Typically, a recommendation engine tries to model the connections between users and some type of item. If we can do a good job of showing our users movies related to a given movie, we could aid in discovery and navigation on our site, again improving our users' experience, engagement, and the relevance of our content to them. However, recommendation engines are not limited to movies, books, or products. The techniques we will explore in this article can be applied to just about any user-to-item relationship as well as user-to-user connections, such as those found on social networks, allowing us to make recommendations such as people you may know or who to follow. Recommendation engines are most effective in two general scenarios (which are not mutually exclusive). They are explained here: Large number of available options for users: When there are a very large number of available items, it becomes increasingly difficult for the user to find something they want. Searching can help when the user knows what they are looking for, but often, the right item might be something previously unknown to them. In this case, being recommended relevant items, that the user may not already know about, can help them discover new items. A significant degree of personal taste involved: When personal taste plays a large role in selection, recommendation models, which often utilize a wisdom of the crowd approach, can be helpful in discovering items based on the behavior of others that have similar taste profiles. In this article, we will: Introduce the various types of recommendation engines Build a recommendation model using data about user preferences Use the trained model to compute recommendations for a given user as well compute similar items for a given item (that is, related items) Apply standard evaluation metrics to the model that we created to measure how well it performs in terms of predictive capability Types of recommendation models Recommender systems are widely studied, and there are many approaches used, but there are two that are probably most prevalent: content-based filtering and collaborative filtering. Recently, other approaches such as ranking models have also gained in popularity. In practice, many approaches are hybrids, incorporating elements of many different methods into a model or combination of models. Content-based filtering Content-based methods try to use the content or attributes of an item, together with some notion of similarity between two pieces of content, to generate items similar to a given item. These attributes are often textual content (such as titles, names, tags, and other metadata attached to an item), or in the case of media, they could include other features of the item, such as attributes extracted from audio and video content. In a similar manner, user recommendations can be generated based on attributes of users or user profiles, which are then matched to item attributes using the same measure of similarity. For example, a user can be represented by the combined attributes of the items they have interacted with. This becomes their user profile, which is then compared to item attributes to find items that match the user profile. Collaborative filtering Collaborative filtering is a form of wisdom of the crowd approach where the set of preferences of many users with respect to items is used to generate estimated preferences of users for items with which they have not yet interacted. The idea behind this is the notion of similarity. In a user-based approach, if two users have exhibited similar preferences (that is, patterns of interacting with the same items in broadly the same way), then we would assume that they are similar to each other in terms of taste. To generate recommendations for unknown items for a given user, we can use the known preferences of other users that exhibit similar behavior. We can do this by selecting a set of similar users and computing some form of combined score based on the items they have shown a preference for. The overall logic is that if others have tastes similar to a set of items, these items would tend to be good candidates for recommendation. We can also take an item-based approach that computes some measure of similarity between items. This is usually based on the existing user-item preferences or ratings. Items that tend to be rated the same by similar users will be classed as similar under this approach. Once we have these similarities, we can represent a user in terms of the items they have interacted with and find items that are similar to these known items, which we can then recommend to the user. Again, a set of items similar to the known items is used to generate a combined score to estimate for an unknown item. The user- and item-based approaches are usually referred to as nearest-neighbor models, since the estimated scores are computed based on the set of most similar users or items (that is, their neighbors). Finally, there are many model-based methods that attempt to model the user-item preferences themselves so that new preferences can be estimated directly by applying the model to unknown user-item combinations. Matrix factorization Since Spark's recommendation models currently only include an implementation of matrix factorization, we will focus our attention on this class of models. This focus is with good reason; however, these types of models have consistently been shown to perform extremely well in collaborative filtering and were among the best models in well-known competitions such as the Netflix prize. For more information on and a brief overview of the performance of the best algorithms for the Netflix prize, see http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html. Explicit matrix factorization When we deal with data that consists of preferences of users that are provided by the users themselves, we refer to explicit preference data. This includes, for example, ratings, thumbs up, likes, and so on that are given by users to items. We can take these ratings and form a two-dimensional matrix with users as rows and items as columns. Each entry represents a rating given by a user to a certain item. Since in most cases, each user has only interacted with a relatively small set of items, this matrix has only a few non-zero entries (that is, it is very sparse). As a simple example, let's assume that we have the following user ratings for a set of movies: Tom, Star Wars, 5 Jane, Titanic, 4 Bill, Batman, 3 Jane, Star Wars, 2 Bill, Titanic, 3 We will form the following ratings matrix: A simple movie-rating matrix Matrix factorization (or matrix completion) attempts to directly model this user-item matrix by representing it as a product of two smaller matrices of lower dimension. Thus, it is a dimensionality-reduction technique. If we have U users and I items, then our user-item matrix is of dimension U x I and might look something like the one shown in the following diagram: A sparse ratings matrix If we want to find a lower dimension (low-rank) approximation to our user-item matrix with the dimension k, we would end up with two matrices: one for users of size U x k and one for items of size I x k. These are known as factor matrices. If we multiply these two factor matrices, we would reconstruct an approximate version of the original ratings matrix. Note that while the original ratings matrix is typically very sparse, each factor matrix is dense, as shown in the following diagram: The user- and item-factor matrices These models are often also called latent feature models, as we are trying to discover some form of hidden features (which are represented by the factor matrices) that account for the structure of behavior inherent in the user-item rating matrix. While the latent features or factors are not directly interpretable, they might, perhaps, represent things such as the tendency of a user to like movies from a certain director, genre, style, or group of actors, for example. As we are directly modeling the user-item matrix, the prediction in these models is relatively straightforward: to compute a predicted rating for a user and item, we compute the vector dot product between the relevant row of the user-factor matrix (that is, the user's factor vector) and the relevant row of the item-factor matrix (that is, the item's factor vector). This is illustrated with the highlighted vectors in the following diagram: Computing recommendations from user- and item-factor vectors To find out the similarity between two items, we can use the same measures of similarity as we would use in the nearest-neighbor models, except that we can use the factor vectors directly by computing the similarity between two item-factor vectors, as illustrated in the following diagram: Computing similarity with item-factor vectors The benefit of factorization models is the relative ease of computing recommendations once the model is created. However, for very large user and itemsets, this can become a challenge as it requires storage and computation across potentially many millions of user- and item-factor vectors. Another advantage, as mentioned earlier, is that they tend to offer very good performance. Projects such as Oryx (https://github.com/OryxProject/oryx) and Prediction.io (https://github.com/PredictionIO/PredictionIO) focus on model serving for large-scale models, including recommenders based on matrix factorization. On the down side, factorization models are relatively more complex to understand and interpret compared to nearest-neighbor models and are often more computationally intensive during the model's training phase. Implicit matrix factorization So far, we have dealt with explicit preferences such as ratings. However, much of the preference data that we might be able to collect is implicit feedback, where the preferences between a user and item are not given to us, but are, instead, implied from the interactions they might have with an item. Examples include binary data (such as whether a user viewed a movie, whether they purchased a product, and so on) as well as count data (such as the number of times a user watched a movie). There are many different approaches to deal with implicit data. MLlib implements a particular approach that treats the input rating matrix as two matrices: a binary preference matrix, P, and a matrix of confidence weights, C. For example, let's assume that the user-movie ratings we saw previously were, in fact, the number of times each user had viewed that movie. The two matrices would look something like ones shown in the following screenshot. Here, the matrix P informs us that a movie was viewed by a user, and the matrix C represents the confidence weighting, in the form of the view counts—generally, the more a user has watched a movie, the higher the confidence that they actually like it. Representation of an implicit preference and confidence matrix The implicit model still creates a user- and item-factor matrix. In this case, however, the matrix that the model is attempting to approximate is not the overall ratings matrix but the preference matrix P. If we compute a recommendation by calculating the dot product of a user- and item-factor vector, the score will not be an estimate of a rating directly. It will rather be an estimate of the preference of a user for an item (though not strictly between 0 and 1, these scores will generally be fairly close to a scale of 0 to 1). Alternating least squares Alternating Least Squares (ALS) is an optimization technique to solve matrix factorization problems; this technique is powerful, achieves good performance, and has proven to be relatively easy to implement in a parallel fashion. Hence, it is well suited for platforms such as Spark. At the time of writing this, it is the only recommendation model implemented in MLlib. ALS works by iteratively solving a series of least squares regression problems. In each iteration, one of the user- or item-factor matrices is treated as fixed, while the other one is updated using the fixed factor and the rating data. Then, the factor matrix that was solved for is, in turn, treated as fixed, while the other one is updated. This process continues until the model has converged (or for a fixed number of iterations). Spark's documentation for collaborative filtering contains references to the papers that underlie the ALS algorithms implemented each component of explicit and implicit data. You can view the documentation at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html. Extracting the right features from your data In this section, we will use explicit rating data, without additional user or item metadata or other information related to the user-item interactions. Hence, the features that we need as inputs are simply the user IDs, movie IDs, and the ratings assigned to each user and movie pair. Extracting features from the MovieLens 100k dataset Start the Spark shell in the Spark base directory, ensuring that you provide enough memory via the –driver-memory option: >./bin/spark-shell –driver-memory 4g In this example, we will use the same MovieLens dataset. Use the directory in which you placed the MovieLens 100k dataset as the input path in the following code. First, let's inspect the raw ratings dataset: val rawData = sc.textFile("/PATH/ml-100k/u.data") rawData.first() You will see output similar to these lines of code: 14/03/30 11:42:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/03/30 11:42:41 WARN LoadSnappy: Snappy native library not loaded 14/03/30 11:42:41 INFO FileInputFormat: Total input paths to process : 1 14/03/30 11:42:41 INFO SparkContext: Starting job: first at <console>:15 14/03/30 11:42:41 INFO DAGScheduler: Got job 0 (first at <console>:15) with 1 output partitions (allowLocal=true) 14/03/30 11:42:41 INFO DAGScheduler: Final stage: Stage 0 (first at <console>:15) 14/03/30 11:42:41 INFO DAGScheduler: Parents of final stage: List() 14/03/30 11:42:41 INFO DAGScheduler: Missing parents: List() 14/03/30 11:42:41 INFO DAGScheduler: Computing the requested partition locally 14/03/30 11:42:41 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 11:42:41 INFO SparkContext: Job finished: first at <console>:15, took 0.030533 s res0: String = 196  242  3  881250949 Recall that this dataset consisted of the user id, movie id, rating, timestamp fields separated by a tab ("t") character. We don't need the time when the rating was made to train our model, so let's simply extract the first three fields: val rawRatings = rawData.map(_.split("t").take(3)) We will first split each record on the "t" character, which gives us an Array[String] array. We will then use Scala's take function to keep only the first 3 elements of the array, which correspond to user id, movie id, and rating, respectively. We can inspect the first record of our new RDD by calling rawRatings.first(), which collects just the first record of the RDD back to the driver program. This will result in the following output: 14/03/30 12:24:00 INFO SparkContext: Starting job: first at <console>:21 14/03/30 12:24:00 INFO DAGScheduler: Got job 1 (first at <console>:21) with 1 output partitions (allowLocal=true) 14/03/30 12:24:00 INFO DAGScheduler: Final stage: Stage 1 (first at <console>:21) 14/03/30 12:24:00 INFO DAGScheduler: Parents of final stage: List() 14/03/30 12:24:00 INFO DAGScheduler: Missing parents: List() 14/03/30 12:24:00 INFO DAGScheduler: Computing the requested partition locally 14/03/30 12:24:00 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 12:24:00 INFO SparkContext: Job finished: first at <console>:21, took 0.00391 s res6: Array[String] = Array(196, 242, 3) We will use Spark's MLlib library to train our model. Let's take a look at what methods are available for us to use and what input is required. First, import the ALS model from MLlib: import org.apache.spark.mllib.recommendation.ALS On the console, we can inspect the available methods on the ALS object using tab completion. Type in ALS. (note the dot) and then press the Tab key. You should see the autocompletion of the methods: ALS. asInstanceOf    isInstanceOf    main            toString        train           trainImplicit The method we want to use is train. If we type ALS.train and hit Enter, we will get an error. However, this error will tell us what the method signature looks like: ALS.train <console>:12: error: ambiguous reference to overloaded definition, both method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int)org.apache.spark.mllib.recommendation.MatrixFactorizationModel and  method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int, lambda: Double)org.apache.spark.mllib.recommendation.MatrixFactorizationModel match expected type ?               ALS.train                   ^ So, we can see that at a minimum, we need to provide the input arguments, ratings, rank, and iterations. The second method also requires an argument called lambda. We'll cover these three shortly, but let's take a look at the ratings argument. First, let's import the Rating class that it references and use a similar approach to find out what an instance of Rating requires, by typing in Rating() and hitting Enter: import org.apache.spark.mllib.recommendation.Rating Rating() <console>:13: error: not enough arguments for method apply: (user: Int, product: Int, rating: Double)org.apache.spark.mllib.recommendation.Rating in object Rating. Unspecified value parameters user, product, rating.               Rating()                     ^ As we can see from the preceding output, we need to provide the ALS model with an RDD that consists of Rating records. A Rating class, in turn, is just a wrapper around user id, movie id (called product here), and the actual rating arguments. We'll create our rating dataset using the map method and transforming the array of IDs and ratings into a Rating object: val ratings = rawRatings.map { case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) } Notice that we need to use toInt or toDouble to convert the raw rating data (which was extracted as Strings from the text file) to Int or Double numeric inputs. Also, note the use of a case statement that allows us to extract the relevant variable names and use them directly (this saves us from having to use something like val user = ratings(0)). For more on Scala case statements and pattern matching as used here, take a look at http://docs.scala-lang.org/tutorials/tour/pattern-matching.html. We now have an RDD[Rating] that we can verify by calling: ratings.first() 14/03/30 12:32:48 INFO SparkContext: Starting job: first at <console>:24 14/03/30 12:32:48 INFO DAGScheduler: Got job 2 (first at <console>:24) with 1 output partitions (allowLocal=true) 14/03/30 12:32:48 INFO DAGScheduler: Final stage: Stage 2 (first at <console>:24) 14/03/30 12:32:48 INFO DAGScheduler: Parents of final stage: List() 14/03/30 12:32:48 INFO DAGScheduler: Missing parents: List() 14/03/30 12:32:48 INFO DAGScheduler: Computing the requested partition locally 14/03/30 12:32:48 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 12:32:48 INFO SparkContext: Job finished: first at <console>:24, took 0.003752 s res8: org.apache.spark.mllib.recommendation.Rating = Rating(196,242,3.0) Training the recommendation model Once we have extracted these simple features from our raw data, we are ready to proceed with model training; MLlib takes care of this for us. All we have to do is provide the correctly-parsed input RDD we just created as well as our chosen model parameters. Training a model on the MovieLens 100k dataset We're now ready to train our model! The other inputs required for our model are as follows: rank: This refers to the number of factors in our ALS model, that is, the number of hidden features in our low-rank approximation matrices. Generally, the greater the number of factors, the better, but this has a direct impact on memory usage, both for computation and to store models for serving, particularly for large number of users or items. Hence, this is often a trade-off in real-world use cases. A rank in the range of 10 to 200 is usually reasonable. iterations: This refers to the number of iterations to run. While each iteration in ALS is guaranteed to decrease the reconstruction error of the ratings matrix, ALS models will converge to a reasonably good solution after relatively few iterations. So, we don't need to run for too many iterations in most cases (around 10 is often a good default). lambda: This parameter controls the regularization of our model. Thus, lambda controls over fitting. The higher the value of lambda, the more is the regularization applied. What constitutes a sensible value is very dependent on the size, nature, and sparsity of the underlying data, and as with almost all machine learning models, the regularization parameter is something that should be tuned using out-of-sample test data and cross-validation approaches. We'll use rank of 50, 10 iterations, and a lambda parameter of 0.01 to illustrate how to train our model: val model = ALS.train(ratings, 50, 10, 0.01) This returns a MatrixFactorizationModel object, which contains the user and item factors in the form of an RDD of (id, factor) pairs. These are called userFeatures and productFeatures, respectively. For example: model.userFeatures res14: org.apache.spark.rdd.RDD[(Int, Array[Double])] = FlatMappedRDD[659] at flatMap at ALS.scala:231 We can see that the factors are in the form of an Array[Double]. Note that the operations used in MLlib's ALS implementation are lazy transformations, so the actual computation will only be performed once we call some sort of action on the resulting RDDs of the user and item factors. We can force the computation using a Spark action such as count: model.userFeatures.count This will trigger the computation, and we will see a quite a bit of output text similar to the following lines of code: 14/03/30 13:10:40 INFO SparkContext: Starting job: count at <console>:26 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 665 (map at ALS.scala:147) 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 664 (map at ALS.scala:146) 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 674 (mapPartitionsWithIndex at ALS.scala:164) ... 14/03/30 13:10:45 INFO SparkContext: Job finished: count at <console>:26, took 5.068255 s res16: Long = 943 If we call count for the movie factors, we will see the following output: model.productFeatures.count 14/03/30 13:15:21 INFO SparkContext: Starting job: count at <console>:26 14/03/30 13:15:21 INFO DAGScheduler: Got job 10 (count at <console>:26) with 1 output partitions (allowLocal=false) 14/03/30 13:15:21 INFO DAGScheduler: Final stage: Stage 165 (count at <console>:26) 14/03/30 13:15:21 INFO DAGScheduler: Parents of final stage: List(Stage 169, Stage 166) 14/03/30 13:15:21 INFO DAGScheduler: Missing parents: List() 14/03/30 13:15:21 INFO DAGScheduler: Submitting Stage 165 (FlatMappedRDD[883] at flatMap at ALS.scala:231), which has no missing parents 14/03/30 13:15:21 INFO DAGScheduler: Submitting 1 missing tasks from Stage 165 (FlatMappedRDD[883] at flatMap at ALS.scala:231) ... 14/03/30 13:15:21 INFO SparkContext: Job finished: count at <console>:26, took 0.030044 s res21: Long = 1682 As expected, we have a factor array for each user (943 factors) and movie (1682 factors). Training a model using implicit feedback data The standard matrix factorization approach in MLlib deals with explicit ratings. To work with implicit data, you can use the trainImplicit method. It is called in a manner similar to the standard train method. There is an additional parameter, alpha, that can be set (and in the same way, the regularization parameter, lambda, should be selected via testing and cross-validation methods). The alpha parameter controls the baseline level of confidence weighting applied. A higher level of alpha tends to make the model more confident about the fact that missing data equates to no preference for the relevant user-item pair. As an exercise, try to take the existing MovieLens dataset and convert it into an implicit dataset. One possible approach is to convert it to binary feedback (0s and 1s) by applying a threshold on the ratings at some level. Another approach could be to convert the ratings' values into confidence weights (for example, perhaps, low ratings could imply zero weights, or even negative weights, which are supported by MLlib's implementation). Train a model on this dataset and compare the results of the following section with those generated by your implicit model. Using the recommendation model Now that we have our trained model, we're ready to use it to make predictions. These predictions typically take one of two forms: recommendations for a given user and related or similar items for a given item. User recommendations In this case, we would like to generate recommended items for a given user. This usually takes the form of a top-K list, that is, the K items that our model predicts will have the highest probability of the user liking them. This is done by computing the predicted score for each item and ranking the list based on this score. The exact method to perform this computation depends on the model involved. For example, in user-based approaches, the ratings of similar users on items are used to compute the recommendations for a user, while in an item-based approach, the computation is based on the similarity of items the user has rated to the candidate items. In matrix factorization, because we are modeling the ratings matrix directly, the predicted score can be computed as the vector dot product between a user-factor vector and an item-factor vector. Generating movie recommendations from the MovieLens 100k dataset As MLlib's recommendation model is based on matrix factorization, we can use the factor matrices computed by our model to compute predicted scores (or ratings) for a user. We will focus on the explicit rating case using MovieLens data; however, the approach is the same when using the implicit model. The MatrixFactorizationModel class has a convenient predict method that will compute a predicted score for a given user and item combination: val predictedRating = model.predict(789, 123) 14/03/30 16:10:10 INFO SparkContext: Starting job: lookup at MatrixFactorizationModel.scala:45 14/03/30 16:10:10 INFO DAGScheduler: Got job 30 (lookup at MatrixFactorizationModel.scala:45) with 1 output partitions (allowLocal=false) ... 14/03/30 16:10:10 INFO SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.023077 s predictedRating: Double = 3.128545693368485 As we can see, this model predicts a rating of 3.12 for user 789 and movie 123. Note that you might see different results than those shown in this section because the ALS model is initialized randomly. So, different runs of the model will lead to different solutions.  The predict method can also take an RDD of (user, item) IDs as the input and will generate predictions for each of these. We can use this method to make predictions for many users and items at the same time. To generate the top-K recommended items for a user, MatrixFactorizationModel provides a convenience method called recommendProducts. This takes two arguments: user and num, where user is the user ID, and num is the number of items to recommend. It returns the top num items ranked in the order of the predicted score. Here, the scores are computed as the dot product between the user-factor vector and each item-factor vector. Let's generate the top 10 recommended items for user 789: val userId = 789 val K = 10 val topKRecs = model.recommendProducts(userId, K) We now have a set of predicted ratings for each movie for user 789. If we print this out, we could inspect the top 10 recommendations for this user: println(topKRecs.mkString("n")) You should see the following output on your console: Rating(789,715,5.931851273771102) Rating(789,12,5.582301095666215) Rating(789,959,5.516272981542168) Rating(789,42,5.458065302395629) Rating(789,584,5.449949837103569) Rating(789,750,5.348768847643657) Rating(789,663,5.30832117499004) Rating(789,134,5.278933936827717) Rating(789,156,5.250959077906759) Rating(789,432,5.169863417126231) Inspecting the recommendations We can give these recommendations a sense check by taking a quick look at the titles of the movies a user has rated and the recommended movies. First, we need to load the movie data. We'll collect this data as a Map[Int, String] method mapping the movie ID to the title: val movies = sc.textFile("/PATH/ml-100k/u.item") val titles = movies.map(line => line.split("\|").take(2)).map(array => (array(0).toInt,  array(1))).collectAsMap() titles(123) res68: String = Frighteners, The (1996) For our user 789, we can find out what movies they have rated, take the 10 movies with the highest rating, and then check the titles. We will do this now by first using the keyBy Spark function to create an RDD of key-value pairs from our ratings RDD, where the key will be the user ID. We will then use the lookup function to return just the ratings for this key (that is, that particular user ID) to the driver: val moviesForUser = ratings.keyBy(_.user).lookup(789) Let's see how many movies this user has rated. This will be the size of the moviesForUser collection: println(moviesForUser.size) We will see that this user has rated 33 movies. Next, we will take the 10 movies with the highest ratings by sorting the moviesForUser collection using the rating field of the Rating object. We will then extract the movie title for the relevant product ID attached to the Rating class from our mapping of movie titles and print out the top 10 titles with their ratings: moviesForUser.sortBy(-_.rating).take(10).map(rating => (titles(rating.product), rating.rating)).foreach(println) You will see the following output displayed: (Godfather, The (1972),5.0) (Trainspotting (1996),5.0) (Dead Man Walking (1995),5.0) (Star Wars (1977),5.0) (Swingers (1996),5.0) (Leaving Las Vegas (1995),5.0) (Bound (1996),5.0) (Fargo (1996),5.0) (Last Supper, The (1995),5.0) (Private Parts (1997),4.0) Now, let's take a look at the top 10 recommendations for this user and see what the titles are using the same approach as the one we used earlier (note that the recommendations are already sorted): topKRecs.map(rating => (titles(rating.product), rating.rating)).foreach(println) (To Die For (1995),5.931851273771102) (Usual Suspects, The (1995),5.582301095666215) (Dazed and Confused (1993),5.516272981542168) (Clerks (1994),5.458065302395629) (Secret Garden, The (1993),5.449949837103569) (Amistad (1997),5.348768847643657) (Being There (1979),5.30832117499004) (Citizen Kane (1941),5.278933936827717) (Reservoir Dogs (1992),5.250959077906759) (Fantasia (1940),5.169863417126231) We leave it to you to decide whether these recommendations make sense. Item recommendations Item recommendations are about answering the following question: for a certain item, what are the items most similar to it? Here, the precise definition of similarity is dependent on the model involved. In most cases, similarity is computed by comparing the vector representation of two items using some similarity measure. Common similarity measures include Pearson correlation and cosine similarity for real-valued vectors and Jaccard similarity for binary vectors. Generating similar movies for the MovieLens 100K dataset The current MatrixFactorizationModel API does not directly support item-to-item similarity computations. Therefore, we will need to create our own code to do this. We will use the cosine similarity metric, and we will use the jblas linear algebra library (a dependency of MLlib) to compute the required vector dot products. This is similar to how the existing predict and recommendProducts methods work, except that we will use cosine similarity as opposed to just the dot product. We would like to compare the factor vector of our chosen item with each of the other items, using our similarity metric. In order to perform linear algebra computations, we will first need to create a vector object out of the factor vectors, which are in the form of an Array[Double]. The JBLAS class, DoubleMatrix, takes an Array[Double] as the constructor argument as follows: import org.jblas.DoubleMatrix val aMatrix = new DoubleMatrix(Array(1.0, 2.0, 3.0)) aMatrix: org.jblas.DoubleMatrix = [1.000000; 2.000000; 3.000000] Note that using jblas, vectors are represented as a one-dimensional DoubleMatrix class, while matrices are a two-dimensional DoubleMatrix class. We will need a method to compute the cosine similarity between two vectors. Cosine similarity is a measure of the angle between two vectors in an n-dimensional space. It is computed by first calculating the dot product between the vectors and then dividing the result by a denominator, which is the norm (or length) of each vector multiplied together (specifically, the L2-norm is used in cosine similarity). In this way, cosine similarity is a normalized dot product. The cosine similarity measure takes on values between -1 and 1. A value of 1 implies completely similar, while a value of 0 implies independence (that is, no similarity). This measure is useful because it also captures negative similarity, that is, a value of -1 implies that not only are the vectors not similar, but they are also completely dissimilar. Let's create our cosineSimilarity function here: def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {   vec1.dot(vec2) / (vec1.norm2() * vec2.norm2()) } Note that we defined a return type for this function of Double. We are not required to do this, since Scala features type inference. However, it can often be useful to document return types for Scala functions. Let's try it out on one of our item factors for item 567. We will need to collect an item factor from our model; we will do this using the lookup method in a similar way that we did earlier to collect the ratings for a specific user. In the following lines of code, we also use the head function, since lookup returns an array of values, and we only need the first value (in fact, there will only be one value, which is the factor vector for this item). Since this will be an Array[Double], we will then need to create a DoubleMatrix object from it and compute the cosine similarity with itself: val itemId = 567 val itemFactor = model.productFeatures.lookup(itemId).head val itemVector = new DoubleMatrix(itemFactor) cosineSimilarity(itemVector, itemVector) A similarity metric should measure how close, in some sense, two vectors are to each other. Here, we can see that our cosine similarity metric tells us that this item vector is identical to itself, which is what we would expect: res113: Double = 1.0 Now, we are ready to apply our similarity metric to each item: val sims = model.productFeatures.map{ case (id, factor) =>  val factorVector = new DoubleMatrix(factor)   val sim = cosineSimilarity(factorVector, itemVector)   (id, sim) } Next, we can compute the top 10 most similar items by sorting out the similarity score for each item: // recall we defined K = 10 earlier val sortedSims = sims.top(K)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity }) In the preceding code snippet, we used Spark's top function, which is an efficient way to compute top-K results in a distributed fashion, instead of using collect to return all the data to the driver and sorting it locally (remember that we could be dealing with millions of users and items in the case of recommendation models). We need to tell Spark how to sort the (item id, similarity score) pairs in the sims RDD. To do this, we will pass an extra argument to top, which is a Scala Ordering object that tells Spark that it should sort by the value in the key-value pair (that is, sort by similarity). Finally, we can print the 10 items with the highest computed similarity metric to our given item: println(sortedSims.take(10).mkString("n")) You will see output like the following one: (567,1.0000000000000002) (1471,0.6932331537649621) (670,0.6898690594544726) (201,0.6897964975027041) (343,0.6891221044611473) (563,0.6864214133620066) (294,0.6812075443259535) (413,0.6754663844488256) (184,0.6702643811753909) (109,0.6594872765176396) Not surprisingly, we can see that the top-ranked similar item is our item. The rest are the other items in our set of items, ranked in order of our similarity metric. Inspecting the similar items Let's see what the title of our chosen movie is: println(titles(itemId)) Wes Craven's New Nightmare (1994) As we did for user recommendations, we can sense check our item-to-item similarity computations and take a look at the titles of the most similar movies. This time, we will take the top 11 so that we can exclude our given movie. So, we will take the numbers 1 to 11 in the list: val sortedSims2 = sims.top(K + 1)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity }) sortedSims2.slice(1, 11).map{ case (id, sim) => (titles(id), sim) }.mkString("n") You will see the movie titles and scores displayed similar to this output: (Hideaway (1995),0.6932331537649621) (Body Snatchers (1993),0.6898690594544726) (Evil Dead II (1987),0.6897964975027041) (Alien: Resurrection (1997),0.6891221044611473) (Stephen King's The Langoliers (1995),0.6864214133620066) (Liar Liar (1997),0.6812075443259535) (Tales from the Crypt Presents: Bordello of Blood (1996),0.6754663844488256) (Army of Darkness (1993),0.6702643811753909) (Mystery Science Theater 3000: The Movie (1996),0.6594872765176396) (Scream (1996),0.6538249646863378) Once again note that you might see quite different results due to random model initialization. Now that you have computed similar items using cosine similarity, see if you can do the same with the user-factor vectors to compute similar users for a given user. Evaluating the performance of recommendation models How do we know whether the model we have trained is a good model? We need to be able to evaluate its predictive performance in some way. Evaluation metrics are measures of a model's predictive capability or accuracy. Some are direct measures of how well a model predicts the model's target variable (such as Mean Squared Error), while others are concerned with how well the model performs at predicting things that might not be directly optimized in the model but are often closer to what we care about in the real world (such as Mean average precision). Evaluation metrics provide a standardized way of comparing the performance of the same model with different parameter settings and of comparing performance across different models. Using these metrics, we can perform model selection to choose the best-performing model from the set of models we wish to evaluate. Here, we will show you how to calculate two common evaluation metrics used in recommender systems and collaborative filtering models: Mean Squared Error and Mean average precision at K. Mean Squared Error The Mean Squared Error (MSE) is a direct measure of the reconstruction error of the user-item rating matrix. It is also the objective function being minimized in certain models, specifically many matrix-factorization techniques, including ALS. As such, it is commonly used in explicit ratings settings. It is defined as the sum of the squared errors divided by the number of observations. The squared error, in turn, is the square of the difference between the predicted rating for a given user-item pair and the actual rating. We will use our user 789 as an example. Let's take the first rating for this user from the moviesForUser set of Ratings that we previously computed: val actualRating = moviesForUser.take(1)(0) actualRating: org.apache.spark.mllib.recommendation.Rating = Rating(789,1012,4.0) We will see that the rating for this user-item combination is 4. Next, we will compute the model's predicted rating: val predictedRating = model.predict(789, actualRating.product) ... 14/04/13 13:01:15 INFO SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.025404 s predictedRating: Double = 4.001005374200248 We will see that the predicted rating is about 4, very close to the actual rating. Finally, we will compute the squared error between the actual rating and the predicted rating: val squaredError = math.pow(predictedRating - actualRating.rating, 2.0) squaredError: Double = 1.010777282523947E-6 So, in order to compute the overall MSE for the dataset, we need to compute this squared error for each (user, movie, actual rating, predicted rating) entry, sum them up, and divide them by the number of ratings. We will do this in the following code snippet. Note the following code is adapted from the Apache Spark programming guide for ALS at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html. First, we will extract the user and product IDs from the ratings RDD and make predictions for each user-item pair using model.predict. We will use the user-item pair as the key and the predicted rating as the value: val usersProducts = ratings.map{ case Rating(user, product, rating)  => (user, product)} val predictions = model.predict(usersProducts).map{     case Rating(user, product, rating) => ((user, product), rating) } Next, we extract the actual ratings and also map the ratings RDD so that the user-item pair is the key and the actual rating is the value. Now that we have two RDDs with the same form of key, we can join them together to create a new RDD with the actual and predicted ratings for each user-item combination: val ratingsAndPredictions = ratings.map{   case Rating(user, product, rating) => ((user, product), rating) }.join(predictions) Finally, we will compute the MSE by summing up the squared errors using reduce and dividing by the count method of the number of records: val MSE = ratingsAndPredictions.map{     case ((user, product), (actual, predicted)) =>  math.pow((actual - predicted), 2) }.reduce(_ + _) / ratingsAndPredictions.count println("Mean Squared Error = " + MSE) Mean Squared Error = 0.08231947642632852 It is common to use the Root Mean Squared Error (RMSE), which is just the square root of the MSE metric. This is somewhat more interpretable, as it is in the same units as the underlying data (that is, the ratings in this case). It is equivalent to the standard deviation of the differences between the predicted and actual ratings. We can compute it simply as follows: val RMSE = math.sqrt(MSE) println("Root Mean Squared Error = " + RMSE) Root Mean Squared Error = 0.2869137090247319 Mean average precision at K Mean average precision at K (MAPK) is the mean of the average precision at K (APK) metric across all instances in the dataset. APK is a metric commonly used in information retrieval. APK is a measure of the average relevance scores of a set of the top-K documents presented in response to a query. For each query instance, we will compare the set of top-K results with the set of actual relevant documents (that is, a ground truth set of relevant documents for the query). In the APK metric, the order of the result set matters, in that, the APK score would be higher if the result documents are both relevant and the relevant documents are presented higher in the results. It is, thus, a good metric for recommender systems in that typically we would compute the top-K recommended items for each user and present these to the user. Of course, we prefer models where the items with the highest predicted scores (which are presented at the top of the list of recommendations) are, in fact, the most relevant items for the user. APK and other ranking-based metrics are also more appropriate evaluation measures for implicit datasets; here, MSE makes less sense. In order to evaluate our model, we can use APK, where each user is the equivalent of a query, and the set of top-K recommended items is the document result set. The relevant documents (that is, the ground truth) in this case, is the set of items that a user interacted with. Hence, APK attempts to measure how good our model is at predicting items that a user will find relevant and choose to interact with. The code for the following average precision computation is based on https://github.com/benhamner/Metrics.  More information on MAPK can be found at https://www.kaggle.com/wiki/MeanAveragePrecision. Our function to compute the APK is shown here: def avgPrecisionK(actual: Seq[Int], predicted: Seq[Int], k: Int): Double = {   val predK = predicted.take(k)   var score = 0.0   var numHits = 0.0   for ((p, i) <- predK.zipWithIndex) {     if (actual.contains(p)) {       numHits += 1.0       score += numHits / (i.toDouble + 1.0)     }   }   if (actual.isEmpty) {     1.0   } else {     score / scala.math.min(actual.size, k).toDouble   } } As you can see, this takes as input a list of actual item IDs that are associated with the user and another list of predicted ids so that our estimate will be relevant for the user. We can compute the APK metric for our example user 789 as follows. First, we will extract the actual movie IDs for the user: val actualMovies = moviesForUser.map(_.product) actualMovies: Seq[Int] = ArrayBuffer(1012, 127, 475, 93, 1161, 286, 293, 9, 50, 294, 181, 1, 1008, 508, 284, 1017, 137, 111, 742, 248, 249, 1007, 591, 150, 276, 151, 129, 100, 741, 288, 762, 628, 124) We will then use the movie recommendations we made previously to compute the APK score using K = 10: val predictedMovies = topKRecs.map(_.product) predictedMovies: Array[Int] = Array(27, 497, 633, 827, 602, 849, 401, 584, 1035, 1014) val apk10 = avgPrecisionK(actualMovies, predictedMovies, 10) apk10: Double = 0.0 In this case, we can see that our model is not doing a very good job of predicting relevant movies for this user as the APK score is 0. In order to compute the APK for each user and average them to compute the overall MAPK, we will need to generate the list of recommendations for each user in our dataset. While this can be fairly intensive on a large scale, we can distribute the computation using our Spark functionality. However, one limitation is that each worker must have the full item-factor matrix available so that it can compute the dot product between the relevant user vector and all item vectors. This can be a problem when the number of items is extremely high as the item matrix must fit in the memory of one machine. There is actually no easy way around this limitation. One possible approach is to only compute recommendations for a subset of items from the total item set, using approximate techniques such as Locality Sensitive Hashing (http://en.wikipedia.org/wiki/Locality-sensitive_hashing). We will now see how to go about this. First, we will collect the item factors and form a DoubleMatrix object from them: val itemFactors = model.productFeatures.map { case (id, factor) => factor }.collect() val itemMatrix = new DoubleMatrix(itemFactors) println(itemMatrix.rows, itemMatrix.columns) (1682,50) This gives us a matrix with 1682 rows and 50 columns, as we would expect from 1682 movies with a factor dimension of 50. Next, we will distribute the item matrix as a broadcast variable so that it is available on each worker node: val imBroadcast = sc.broadcast(itemMatrix) 14/04/13 21:02:01 INFO MemoryStore: ensureFreeSpace(672960) called with curMem=4006896, maxMem=311387750 14/04/13 21:02:01 INFO MemoryStore: Block broadcast_21 stored as values to memory (estimated size 657.2 KB, free 292.5 MB) imBroadcast: org.apache.spark.broadcast.Broadcast[org.jblas.DoubleMatrix] = Broadcast(21) Now we are ready to compute the recommendations for each user. We will do this by applying a map function to each user factor within which we will perform a matrix multiplication between the user-factor vector and the movie-factor matrix. The result is a vector (of length 1682, that is, the number of movies we have) with the predicted rating for each movie. We will then sort these predictions by the predicted rating: val allRecs = model.userFeatures.map{ case (userId, array) =>   val userVector = new DoubleMatrix(array)   val scores = imBroadcast.value.mmul(userVector)   val sortedWithId = scores.data.zipWithIndex.sortBy(-_._1)   val recommendedIds = sortedWithId.map(_._2 + 1).toSeq   (userId, recommendedIds) } allRecs: org.apache.spark.rdd.RDD[(Int, Seq[Int])] = MappedRDD[269] at map at <console>:29 As we can see, we now have an RDD that contains a list of movie IDs for each user ID. These movie IDs are sorted in order of the estimated rating. Note that we needed to add 1 to the returned movie ids (as highlighted in the preceding code snippet), as the item-factor matrix is 0-indexed, while our movie IDs start at 1. We also need the list of movie IDs for each user to pass into our APK function as the actual argument. We already have the ratings RDD ready, so we can extract just the user and movie IDs from it. If we use Spark's groupBy operator, we will get an RDD that contains a list of (userid, movieid) pairs for each user ID (as the user ID is the key on which we perform the groupBy operation): val userMovies = ratings.map{ case Rating(user, product, rating) => (user, product) }.groupBy(_._1) userMovies: org.apache.spark.rdd.RDD[(Int, Seq[(Int, Int)])] = MapPartitionsRDD[277] at groupBy at <console>:21 Finally, we can use Spark's join operator to join these two RDDs together on the user ID key. Then, for each user, we have the list of actual and predicted movie IDs that we can pass to our APK function. In a manner similar to how we computed MSE, we will sum each of these APK scores using a reduce action and divide by the number of users (that is, the count of the allRecs RDD): val K = 10 val MAPK = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>   val actual = actualWithIds.map(_._2).toSeq   avgPrecisionK(actual, predicted, K) }.reduce(_ + _) / allRecs.count println("Mean Average Precision at K = " + MAPK) Mean Average Precision at K = 0.030486963254725705 Our model achieves a fairly low MAPK. However, note that typical values for recommendation tasks are usually relatively low, especially if the item set is extremely large. Try out a few parameter settings for lambda and rank (and alpha if you are using the implicit version of ALS) and see whether you can find a model that performs better based on the RMSE and MAPK evaluation metrics. Using MLlib's built-in evaluation functions While we have computed MSE, RMSE, and MAPK from scratch, and it a useful learning exercise to do so, MLlib provides convenience functions to do this for us in the RegressionMetrics and RankingMetrics classes. RMSE and MSE First, we will compute the MSE and RMSE metrics using RegressionMetrics. We will instantiate a RegressionMetrics instance by passing in an RDD of key-value pairs that represent the predicted and true values for each data point, as shown in the following code snippet. Here, we will again use the ratingsAndPredictions RDD we computed in our earlier example: import org.apache.spark.mllib.evaluation.RegressionMetrics val predictedAndTrue = ratingsAndPredictions.map { case ((user, product), (predicted, actual)) => (predicted, actual) } val regressionMetrics = new RegressionMetrics(predictedAndTrue) We can then access various metrics, including MSE and RMSE. We will print out these metrics here: println("Mean Squared Error = " + regressionMetrics.meanSquaredError) println("Root Mean Squared Error = " + regressionMetrics.rootMeanSquaredError) You will see that the output for MSE and RMSE is exactly the same as the metrics we computed earlier: Mean Squared Error = 0.08231947642632852 Root Mean Squared Error = 0.2869137090247319 MAP As we did for MSE and RMSE, we can compute ranking-based evaluation metrics using MLlib's RankingMetrics class. Similarly, to our own average precision function, we need to pass in an RDD of key-value pairs, where the key is an Array of predicted item IDs for a user, while the value is an array of actual item IDs. The implementation of the average precision at the K function in RankingMetrics is slightly different from ours, so we will get different results. However, the computation of the overall mean average precision (MAP, which does not use a threshold at K) is the same as our function if we select K to be very high (say, at least as high as the number of items in our item set): First, we will calculate MAP using RankingMetrics: import org.apache.spark.mllib.evaluation.RankingMetrics val predictedAndTrueForRanking = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>   val actual = actualWithIds.map(_._2)   (predicted.toArray, actual.toArray) } val rankingMetrics = new RankingMetrics(predictedAndTrueForRanking) println("Mean Average Precision = " + rankingMetrics.meanAveragePrecision) You will see the following output: Mean Average Precision = 0.07171412913757183 Next, we will use our function to compute the MAP in exactly the same way as we did previously, except that we set K to a very high value, say 2000: val MAPK2000 = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>   val actual = actualWithIds.map(_._2).toSeq   avgPrecisionK(actual, predicted, 2000) }.reduce(_ + _) / allRecs.count println("Mean Average Precision = " + MAPK2000) You will see that the MAP from our own function is the same as the one computed using RankingMetrics: Mean Average Precision = 0.07171412913757186 We will not cover cross validation in this article. However, note that the same techniques for cross-validation can be used to evaluate recommendation models, using the performance metrics such as MSE, RMSE, and MAP, which we covered in this section. Summary In this article, we used Spark's MLlib library to train a collaborative filtering recommendation model, and you learned how to use this model to make predictions for the items that a given user might have a preference for. We also used our model to find items that are similar or related to a given item. Finally, we explored common metrics to evaluate the predictive capability of our recommendation model. To learn more about Spark, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Fast Data Processing with Spark - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/fast-data-processing-spark-second-edition) Spark Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook) Resources for Article: Further resources on this subject: Reactive Programming And The Flux Architecture [article] Spark - Architecture And First Program [article] The Design Patterns Out There And Setting Up Your Environment [article]
Read more
  • 0
  • 1
  • 18029

article-image-magento-theme-development
Packt
24 Feb 2016
7 min read
Save for later

Magento Theme Development

Packt
24 Feb 2016
7 min read
In this article by Fernando J. Miguel, author of the book Magento 2 Development Essentials, we will learn the basics of theme development. Magento can be customized as per your needs because it is based on the Zend framework, adopting the Model-View-Controller (MVC) architecture as a software design pattern. When planning to create your own theme, the Magento theme process flow becomes a subject that needs to be carefully studied. Let's focus on the concepts that help you create your own theme. (For more resources related to this topic, see here.) The Magento base theme The Magento Community Edition (CE) version 1.9 comes with a new theme named rwd that implements the Responsive Web Design (RWD) practices. Magento CE's responsive theme uses a number of new technologies as follows: Sass/Compass: This is a CSS precompiler that provides a reusable CSS that can even be organized well. jQuery: This is used for customization of JavaScript in the responsive theme. jQuery operates in the noConflict() mode, so it doesn't conflict with Magento's existing JavaScript library. Basically, the folders that contain this theme are as follows: app/design/frontend/rwd skin/frontend/rwd The following image represents the folder structure: As you can see, all the files of the rwd theme are included in the app/design/frontend and skin/frontend folders: app/design/frontend: This folder stores all the .phtml visual files and .xml configurations files of all the themes. skin/frontend: This folder stores all JavaScript, CSS, and image files from their respective app/design/frontend themes folders. Inside these folders, you can see another important folder called base. The rwd theme uses some base theme features to be functional. How is it possible? Logically, Magento has distinct folders for every theme, but Magento is very smart to reuse code. Magento takes advantage of fall-back system. Let's check how it works. The fall-back system The frontend of Magento allows the designers to create new themes based on the basic theme, reusing the main code without changing its structure. The fall-back system allows us to create only the files that are necessary for the customization. To create the customization files, we have the following options: Create a new theme directory and write the entire new code Copy the files from the base theme and edit them as you wish The second option could be more productive for study purposes. You will learn basic structure by exercising the code edit. For example, let's say you want to change the header.phtml file. You can copy the header.html file from the app/design/frontend/base/default/template/page/html path to the app/design/frontend/custom_package/custom_theme/template/page/html path. In this example, if you activate your custom_theme on Magento admin panel, your custom_theme inherits all the structure from base theme, and applies your custom header.phtml on the theme. Magento packages and design themes Magento has the option to create design packages and themes as you saw on the previous example of custom_theme. This is a smart functionality because on same packages you can create more than one theme. Now, let's take a deep look at the main folders that manage the theme structure in Magento. The app/design structure In the app/design structure, we have the following folders: The folder details are as follows: adminhtml: In this folder, Magento keeps all the layout configuration files and .phtml structure of admin area. frontend: In this folder, Magento keeps all the theme's folders and their respective .phtml structure of site frontend. install: This folder stores all the files of installation Magento screen. The layout folder Let's take a look at the rwd theme folder: As you can see, the rwd is a theme folder and has a template folder called default. In Magento, you can create as many template folders as you wish. The layout folders allow you to define the structure of the Magento pages through the XML files. The layout XML files has the power to manage the behavior of your .phtml file: you can incorporate CSS or JavaScript to be loaded on specific pages. Every page on Magento is defined by a handle. A handle is a reference name that Magento uses to refer to a particular page. For example, the <cms_page> handle is used to control the layout of the pages in your Magento. In Magento, we have two main type of handles: Default handles: These manage the whole site Non-default handles: These manage specific parts of the site In the rwd theme, the .xml files are located in app/design/frontend/rwd/default/layout. Let's take a look at an .xml layout file example: This piece of code belongs to the page.xml layout file. We can see the <default> handle defining the .css and .js files that will be loaded on the page. The page.xml file has the same name as its respective folder in app/design/frontend/rwd/default/template/page. This is an internal Magento control. Please keep this in mind: Magento works with a predefined naming file pattern. Keeping this in your mind can avoid unnecessary errors. The template folder The template folder, taking rwd as a reference, is located at app/design/frontend/rwd/default/template. Every subdirectory of template controls a specific page of Magento. The template files are the .phtml files, a mix of HTML and PHP, and they are the layout structure files. Let's take a look at a page/1column.phtml example: The locale folder The locale folder has all the specific translation of the theme. Let's imagine that you want to create a specific translation file for the rwd theme. You can create a locale file at app/design/frontend/rwd/locale/en_US/translate.csv. The locale folder structure basically has a folder of the language (en_US), and always has the translate.csv filename. The app/locale folder in Magento is the main translation folder. You can take a look at it to better understand. But the locale folder inside the theme folder has priority in Magento loading. For example, if you want to create a Brazilian version of the theme, you have to duplicate the translate.csv file from app/design/frontend/rwd/locale/en_US/ to app/design/frontend/rwd/locale/pt_BR/. This will be very useful to those who use the theme and will have to translate it in the future. Creating new entries in translate If you want to create a new entry in your translate.csv, first of all put this code in your PHTML file: <?php echo $this->__('Translate test'); ?> In CSV file, you can put the translation in this format: 'Translate test', 'Translate test'. The SKIN structure The skin folder basically has the css and js files and images of the theme, and is located in skin/frontend/rwd/default. Remember that Magento has a filename/folder naming pattern. The skin folder named rwd will work with rwd theme folder. If Magento has rwd as a main theme and is looking for an image that is not in the skin folder, Magento will search this image in skin/base folder. Remember also that Magento has a fall-back system. It is keeping its search in the main themes folder to find the correct file. Take advantage of this! CMS blocks and pages Magento has a flexible theme system. Beyond Magento code customization, the admin can create blocks and content on Magento admin panel. CMS (Content Management System) pages and blocks on Magento give you the power to embed HTML code in your page. Summary In this article, we covered the basic concepts of Magento theme. These may be used to change the display of the website or its functionality. These themes are interchangeable with Magento installations. Resources for Article: Further resources on this subject: Preparing and Configuring Your Magento Website [article] Introducing Magento Extension Development [article] Installing Magento [article]
Read more
  • 0
  • 0
  • 14663

article-image-getting-started-react
Packt
24 Feb 2016
7 min read
Save for later

Getting Started with React

Packt
24 Feb 2016
7 min read
In this article by Vipul Amler and Prathamesh Sonpatki, author of the book ReactJS by Example- Building Modern Web Applications with React, we will learn how web development has seen a huge advent of Single Page Application (SPA) in the past couple of years. Early development was simple—reload a complete page to perform a change in the display or perform a user action. The problem with this was a huge round-trip time for the complete request to reach the web server and back to the client. Then came AJAX, which sent a request to the server, and could update parts of the page without reloading the current page. Moving in the same direction, we saw the emergence of the SPAs. Wrapping up the heavy frontend content and delivering it to the client browser just once, while maintaining a small channel for communication with the server based on any event; this is usually complemented by thin API on the web server. The growth in such apps has been complemented by JavaScript libraries and frameworks such as Ext JS, KnockoutJS, BackboneJS, AngularJS, EmberJS, and more recently, React and Polymer. (For more resources related to this topic, see here.) Let's take a look at how React fits in this ecosystem and get introduced to it in this article. What is React? ReactJS tries to solve the problem from the View layer. It can very well be defined and used as the V in any of the MVC frameworks. It's not opinionated about how it should be used. It creates abstract representations of views. It breaks down parts of the view in the Components. These components encompass both the logic to handle the display of view and the view itself. It can contain data that it uses to render the state of the app. To avoid complexity of interactions and subsequent render processing required, React does a full render of the application. It maintains a simple flow of work. React is founded on the idea that DOM manipulation is an expensive operation and should be minimized. It also recognizes that optimizing DOM manipulation by hand will result in a lot of boilerplate code, which is error-prone, boring, and repetitive. React solves this by giving the developer a virtual DOM to render to instead of the actual DOM. It finds difference between the real DOM and virtual DOM and conducts the minimum number of DOM operations required to achieve the new state. React is also declarative. When the data changes, React conceptually hits the refresh button and knows to only update the changed parts. This simple flow of data, coupled with dead simple display logic, makes development with ReactJS straightforward and simple to understand. Who uses React? If you've used any of the services such as Facebook, Instagram, Netflix, Alibaba, Yahoo, E-Bay, Khan-Academy, AirBnB, Sony, and Atlassian, you've already come across and used React on the Web. In just under a year, React has seen adoption from major Internet companies in their core products. In its first-ever conference, React also announced the development of React Native. React Native allows the development of mobile applications using React. It transpiles React code to the native application code, such as Objective-C for iOS applications. At the time of writing this, Facebook already uses React Native in its Groups iOS app. In this article, we will be following a conversation between two developers, Mike and Shawn. Mike is a senior developer at Adequate Consulting and Shawn has just joined the company. Mike will be mentoring Shawn and conducting pair programming with him. When Shawn meets Mike and ReactJS It's a bright day at Adequate Consulting. Its' also Shawn's first day at the company. Shawn had joined Adequate to work on its amazing products and also because it uses and develops exciting new technologies. After onboarding the company, Shelly, the CTO, introduced Shawn to Mike. Mike, a senior developer at Adequate, is a jolly man, who loves exploring new things. "So Shawn, here's Mike", said Shelly. "He'll be mentoring you as well as pairing with you on development. We follow pair programming, so expect a lot of it with him. He's an excellent help." With that, Shelly took leave. "Hey Shawn!" Mike began, "are you all set to begin?" "Yeah, all set! So what are we working on?" "Well we are about to start working on an app using https://openlibrary.org/. Open Library is collection of the world's classic literature. It's an open, editable library catalog for all the books. It's an initiative under https://archive.org/ and lists free book titles. We need to build an app to display the most recent changes in the record by Open Library. You can call this the Activities page. Many people contribute to Open Library. We want to display the changes made by these users to the books, addition of new books, edits, and so on, as shown in the following screenshot: "Oh nice! What are we using to build it?" "Open Library provides us with a neat REST API that we can consume to fetch the data. We are just going to build a simple page that displays the fetched data and format it for display. I've been experimenting and using ReactJS for this. Have you used it before?" "Nope. However, I have heard about it. Isn't it the one from Facebook and Instagram?" "That's right. It's an amazing way to define our UI. As the app isn't going to have much of logic on the server or perform any display, it is an easy option to use it." "As you've not used it before, let me provide you a quick introduction." "Have you tried services such as JSBin and JSFiddle before?" "No, but I have seen them." "Cool. We'll be using one of these, therefore, we don't need anything set up on our machines to start with." "Let's try on your machine", Mike instructed. "Fire up http://jsbin.com/?html,output" "You should see something similar to the tabs and panes to code on and their output in adjacent pane." "Go ahead and make sure that the HTML, JavaScript, and Output tabs are clicked and you can see three frames for them so that we are able to edit HTML and JS and see the corresponding output." "That's nice." "Yeah, good thing about this is that you don't need to perform any setups. Did you notice the Auto-run JS option? Make sure its selected. This option causes JSBin to reload our code and see its output so that we don't need to keep saying Run with JS to execute and see its output." "Ok." Requiring React library "Alright then! Let's begin. Go ahead and change the title of the page, to say, React JS Example. Next, we need to set up and we require the React library in our file." "React's homepage is located at http://facebook.github.io/react/. Here, we'll also locate the downloads available for us so that we can include them in our project. There are different ways to include and use the library. We can make use of bower or install via npm. We can also just include it as an individual download, directly available from the fb.me domain. There are development versions that are full version of the library as well as production version which is its minified version. There is also its version of add-on. We'll take a look at this later though." "Let's start by using the development version, which is the unminified version of the React source. Add the following to the file header:" <script src="http://fb.me/react-0.13.0.js"></script> "Done". "Awesome, let's see how this looks." <!DOCTYPE html> <html> <head> <script src="http://fb.me/react-0.13.0.js"></script> <meta charset="utf-8"> <title>React JS Example</title> </head> <body> </body> </html> Summary In this article, we started with React and built our first component. In the process we studied top level API of React for constructing components and elements. Resources for Article: Further resources on this subject: Create Your First React Element [article] An Introduction to ReactJs [article] An Introduction to Reactive Programming [article]
Read more
  • 0
  • 0
  • 13735

article-image-make-things-pretty-ggplot2
Packt
24 Feb 2016
30 min read
Save for later

Make Things Pretty with ggplot2

Packt
24 Feb 2016
30 min read
 The objective of this article is to provide you with a general overview of the plotting environments in R and of the most efficient way of coding your graphs in it. We will go through the most important Integrated Development Environment (IDE) available for R as well as the most important packages available for plotting data; this will help you to get an overview of what is available in R and how those packages are compared with ggplot2. Finally, we will dig deeper into the grammar of graphics, which represents the basic concepts on which ggplot2 was designed. But first, let's make sure that you have a working version of R on your computer. (For more resources related to this topic, see here.) Getting ggplot2 up and running You can download the most up-to-date version of R from the R project website (http://www.r-project.org/). There, you will find a direct connection to the Comprehensive R Archive Network (CRAN), a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. In addition to access to the CRAN servers, on the website of the R project, you may also find information about R, a few technical manuals, the R journal, and details about the packages developed for R and stored in the CRAN repositories. At the time of writing, the current version of R is 3.1.2. If you have already installed R on your computer, you can check the actual version with the R.Version() code, or for a more concise result, you can use the R.version.string code that recalls only part of the output of the previous function. Packages in R In the next few pages of this article, we will quickly go through the most important visualization packages available in R, so in order to try the code, you will also need to have additional packages as well as ggplot2 up and running in your R installation. In the basic R installation, you will already have the graphics package available and loaded in the session; the lattice package is already available among the standard packages delivered with the basic installation, but it is not loaded by default. ggplot2, on the other hand, will need to be installed. You can install and load a package with the following code: > install.packages(“ggplot2”) > library(ggplot2) Keep in mind that every time R is started, you will need to load the package you need with the library(name_of_the_package) command to be able to use the functions contained in the package. In order to get a list of all the packages installed on your computer, you can use the call to the library() function without arguments. If, on the other hand, you would like to have a list of the packages currently loaded in the workspace, you can use the search() command. One more function that can turn out to be useful when managing your library of packages is .libPaths(), which provides you with the location of your R libraries. This function is very useful to trace back the package libraries you are currently using, if any, in addition to the standard library of packages, which on Windows is located by default in a path of the kind C:/Program Files/R/R-3.1.2/library. The following list is a short recap of the functions just discussed: .libPaths()   # get library location library()   # see all the packages installed search()   # see the packages currently loaded Integrated Development Environment (IDE) You will definitely be able to run the code and the examples explained in the article directly from the standard R Graphical User Interface (GUI), especially if you are frequently working with R in more complex projects or simply if you like to keep an eye on the different components of your code, such as scripts, plots, and help pages, you may well think about the possibility of using an IDE. The number of specific IDEs that get integrated with R is still limited, but some of them are quite efficient, well-designed and open source. RStudio RStudio (http://www.rstudio.com/) is a very nice and advanced programming environment developed specifically for R, and this would be my recommended choice of IDE as the R programming environment in most cases. It is available for all the major platforms (Windows, Linux, and Mac OS X), and it can be run on a local machine, such as your computer, or even over the Web, using RStudio Server. With RStudio Server, you can connect a browser-based interface (the RStudio IDE) to a version of R running on a remote Linux server. RStudio allows you to integrate several useful functionalities, in particular if you use R for a more complex project. The way the software interface is organized allows you to keep an eye on the different activities you very often deal with in R, such as working on different scripts, overviewing the installed packages, as well as having easy access to the help pages and the plots generated. This last feature is particularly interesting for ggplot2 since in RStudio, you will be able to easily access the history of the plots created instead of visualizing only the last created plot, as is the case in the default R GUI. One other very useful feature of RStudio is code completion. You can, in fact, start typing a comment, and upon pressing the Tab key, the interface will provide you with functions matching what you have written . This feature will turn out to be very useful in ggplot2, so you will not necessarily need to remember all the functions and you will also have guidance for the arguments of the functions as well. In Figure 1.1, you can see a screenshot from the current version of RStudio (v 0.98.1091): Figure 1.1: This is a screenshot of RStudio on Windows 8 The environment is composed of four different areas: Scripting area: In this area you can open, create, and write the scripts. Console area: This area is the actual R console in which the commands are executed. It is possible to type commands directly here in the console or write them in a script and then run them on the console (I would recommend the last option). Workspace/History area: In this area, you can find a practical summary of all the objects created in the workspace in which you are working and the history of the typed commands. Visualization area: Here, you can easily load packages, open R help files, and, even more importantly, visualize plots. The RStudio website provides a lot of material on how to use the program, such as manuals, tutorials, and videos, so if you are interested, refer to the website for more details. Eclipse and StatET Eclipse (http://www.eclipse.org/) is a very powerful IDE that was mainly developed in Java and initially intended for Java programming. Subsequently, several extension packages were also developed to optimize the programming environment for other programming languages, such as C++ and Python. Thanks to its original objective of being a tool for advanced programming, this IDE is particularly intended to deal with very complex programming projects, for instance, if you are working on a big project folder with many different scripts. In these circumstances, Eclipse could help you to keep your programming scripts in order and have easy access to them. One drawback of such a development environment is probably its big size (around 200 MB) and a slightly slow-starting environment. Eclipse does not support interaction with R natively, so in order to be able to write your code and execute it directly in the R console, you need to add StatET to your basic Eclipse installation. StatET (http://www.walware.de/goto/statet) is a plugin for the Eclipse IDE, and it offers a set of tools for R coding and package building. More detailed information on how to install Eclipse and StatET and how to configure the connections between R and Eclipse/StatET can be found on the websites of the related projects. Emacs and ESS Emacs (http://www.gnu.org/software/emacs/) is a customizable text editor and is very popular, particularly in the Linux environment. Although this text editor appears with a very simple GUI, it is an extremely powerful environment, particularly thanks to the numerous keyboard shortcuts that allow interaction with the environment in a very efficient manner after getting some practice. Also, if the user interface of a typical IDE, such as RStudio, is more sophisticated and advanced, Emacs may be useful if you need to work with R on systems with a poor graphical interface, such as servers and terminal windows. Like Eclipse, Emacs does not support interfacing with R by default, so you will need to install an add-on package on your Emacs that will enable such a connection, Emacs Speaks Statistics (ESS). ESS (http://ess.r-project.org/) is designed to support the editing of scripts and interacting with various statistical analysis programs including R. The objective of the ESS project is to provide efficient text editor support to statistical software, which in some cases comes with a more or less defined GUI, but for which the real power of the language is only accessible through the original scripting language. The plotting environments in R R provides a complete series of options to realize graphics, which makes it quite advanced with regard to data visualization. Along the next few sections of this article, we will go through the most important R packages for data visualization by quickly discussing some high-level differences and analogies. If you already have some experience with other R packages for data visualization, in particular graphics or lattice, the following sections will provide you with some references and examples of how the code used in such packages appears in comparison with that used in ggplot2. Moreover, you will also have an idea of the typical layout of the plots created with a certain package, so you will be able to identify the tool used to realize the plots you will come across. The core of graphics visualization in R is within the grDevices package, which provides the basic structure of data plotting, such as the colors and fonts used in the plots. Such a graphic engine was then used as the starting point in the development of more advanced and sophisticated packages for data visualization, the most commonly used being graphics and grid. The graphics package is often referred to as the base or traditional graphics environment since, historically, it was the first package for data visualization available in R, and it provides functions that allow the generation of complete plots. The grid package, on the other hand, provides an alternative set of graphics tools. This package does not directly provide functions that generate complete plots, so it is not frequently used directly to generate graphics, but it is used in the development of advanced data visualization packages. Among the grid-based packages, the most widely used are lattice and ggplot2, although they are built by implementing different visualization approaches—Trellis plots in the case of lattice and the grammar of graphics in the case of ggplot2. We will describe these principles in more detail in the coming sections. A diagram representing the connections between the tools just mentioned is shown in Figure 1.2. Just keep in mind that this is not a complete overview of the packages available but simply a small snapshot of the packages we will discuss. Many other packages are built on top of the tools just mentioned, but in the following sections, we will focus on the most relevant packages used in data visualization, namely graphics, lattice, and, of course, ggplot2. If you would like to get a more complete overview of the graphics tools available in R, you can have a look at the web page of the R project summarizing such tools, http://cran.r-project.org/web/views/Graphics.html. Figure 1.2: This is an overview of the most widely used R packages for graphics In order to see some examples of plots in graphics, lattice and ggplot2, we will go through a few examples of different plots over the following pages. The objective of providing these examples is not to do an exhaustive comparison of the three packages but simply to provide you with a simple comparison of how the different codes as well as the default plot layouts appear for these different plotting tools. For these examples, we will use the Orange dataset available in R; to load it in the workspace, simply write the following code: >data(Orange) This dataset contains records of the growth of orange trees. You can have a look at the data by recalling its first lines with the following code: >head(Orange) You will see that the dataset contains three columns. The first one, Tree, is an ID number indicating the tree on which the measurement was taken, while age and circumference refer to the age in days and the size of the tree in millimeters, respectively. If you want to have more information about this data, you can have a look at the help page of the dataset by typing the following code: ?Orange Here, you will find the reference of the data as well as a more detailed description of the variables included. Standard graphics and grid-based graphics The existence of these two different graphics environments brings these questions  to most users' minds—which package to use and under which circumstances? For simple and basic plots, where the data simply needs to be represented in a standard plot type (such as a scatter plot, histogram, or boxplot) without any additional manipulation, then all the plotting environments are fairly equivalent. In fact, it would probably be possible to produce the same type of plot with graphics as well as with lattice or ggplot2. Nevertheless, in general, the default graphic output of ggplot2 or lattice will be most likely superior compared to graphics since both these packages are designed considering the principles of human perception deeply and to make the evaluation of data contained in plots easier. When more complex data should be analyzed, then the grid-based packages, lattice and ggplot2, present a more sophisticated support in the analysis of multivariate data. On the other hand, these tools require greater effort to become proficient because of their flexibility and advanced functionalities. In both cases, lattice and ggplot2, the package provides a full set of tools for data visualization, so you will not need to use grid directly in most cases, but you will be able to do all your work directly with one of those packages. Graphics and standard plots The graphics package was originally developed based on the experience of the graphics environment in R. The approach implemented in this package is based on the principle of the pen-on-paper model, where the plot is drawn in the first function call and once content is added, it cannot be deleted or modified. In general, the functions available in this package can be divided into high-level and low-level functions. High-level functions are functions capable of drawing the actual plot, while low-level functions are functions used to add content to a graph that was already created with a high-level function. Let's assume that we would like to have a look at how age is related to the circumference of the trees in our dataset Orange; we could simply plot the data on a scatter plot using the high-level function plot() as shown in the following code: plot(age~circumference, data=Orange) This code creates the graph in Figure 1.3. As you would have noticed, we obtained the graph directly with a call to a function that contains the variables to plot in the form of y~x, and the dataset to locate them. As an alternative, instead of using a formula expression, you can use a direct reference to x and y, using code in the form of plot(x,y). In this case, you will have to use a direct reference to the data instead of using the data argument of the function. Type in the following code: plot(Orange$circumference, Orange$age) The preceding code results in the following output: Figure 1.3: Simple scatterplot of the dataset Orange using graphics For the time being, we are not interested in the plot's details, such as the title or the axis, but we will simply focus on how to add elements to the plot we just created. For instance, if we want to include a regression line as well as a smooth line to have an idea of the relation between the data, we should use a low-level function to add the just-created additional lines to the plot; this is done with the lines() function: plot(age~circumference, data=Orange)   ###Create basic plot abline(lm(Orange$age~Orange$circumference), col=”blue”) lines(loess.smooth(Orange$circumference,Orange$age), col=”red”) The graph generated as the output of this code is shown in Figure 1.4: Figure 1.4: This is a scatterplot of the Orange data with a regression line (in blue) and a smooth line (in red) realized with graphics As illustrated, with this package, we have built a graph by first calling one function, which draws the main plot frame, and then additional elements were included using other functions. With graphics, only additional elements can be included in the graph without changing the overall plot frame defined by the plot() function. This ability to add several graphical elements together to create a complex plot is one of the fundamental elements of R, and you will notice how all the different graphical packages rely on this principle. If you are interested in getting other code examples of plots in graphics, there is also some demo code available in R for this package, and it can be visualized with demo(graphics). In the coming sections, you will find a quick reference to how you can generate a similar plot using graphics and ggplot2. As will be described in more detail later on, in ggplot2, there are two main functions to realize plots, ggplot() and qplot(). The function qplot() is a wrapper function that is designed to easily create basic plots with ggplot2, and it has a similar code to the plot() function of graphics. Due to its simplicity, this function is the easiest way to start working with ggplot2, so we will use this function in the examples in the following sections. The code in these sections also uses our example dataset Orange; in this way, you can run the code directly on your console and see the resulting output. Scatterplot with individual data points To generate the plot generated using graphics, use the following code: plot(age~circumference, data=Orange) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(circumference,age, data=Orange) The preceding code results in the following output: Scatterplots with the line of one tree To generate the plot using graphics, use the following code: plot(age~circumference, data=Orange[Orange$Tree==1,], type=”l”) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(circumference,age, data=Orange[Orange$Tree==1,], geom=”line”) The preceding code results in the following output: Scatterplots with the line and points of one tree To generate the plot using graphics, use the following code: plot(age~circumference, data=Orange[Orange$Tree==1,], type=”b”) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(circumference,age, data=Orange[Orange$Tree==1,], geom=c(“line”,”point”)) The preceding code results in the following output: Boxplot of orange dataset To generate the plot using graphics, use the following code: boxplot(circumference~Tree, data=Orange) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(Tree,circumference, data=Orange, geom=”boxplot”) The preceding code results in the following output: Boxplot with individual observations To generate the plot using graphics, use the following code: boxplot(circumference~Tree, data=Orange) points(circumference~Tree, data=Orange) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(Tree,circumference, data=Orange, geom=c(“boxplot”,”point”)) The preceding code results in the following output: Histogram of orange dataset To generate the plot using graphics, use the following code: hist(Orange$circumference) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(circumference, data=Orange, geom=”histogram”) The preceding code results in the following output: Histogram with reference line at median value in red To generate the plot using graphics, use the following code: hist(Orange$circumference) abline(v=median(Orange$circumference), col=”red”) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(circumference, data=Orange, geom=”histogram”)+geom_vline(xintercept = median(Orange$circumference), colour=”red”) The preceding code results in the following output: Lattice and the Trellis plots Along with with graphics, the base R installation also includes the lattice package. This package implements a family of techniques known as Trellis graphics, proposed by William Cleveland to visualize complex datasets with multiple variables. The objective of those design principles was to ensure the accurate and faithful communication of data information. These principles are embedded into the package and are already evident in the default plot design settings. One interesting feature of Trellis plots is the option of multipanel conditioning, which creates multiple plots by splitting the data on the basis of one variable. A similar option is also available in ggplot2, but in that case, it is called faceting. In lattice, we also have functions that are able to generate a plot with one single call, but once the plot is drawn, it is already final. Consequently, plot details as well as additional elements that need to be included in the graph, need to be specified already within the call to the main function. This is done by including all the specifications in the panel function argument. These specifications can be included directly in the main body of the function or specified in an independent function, which is then called; this last option usually generates more readable code, so this will be the approach used in the following examples. For instance, if we want to draw the same plot we just generated in the previous section with graphics, containing the age and circumference of trees and also the regression and smooth lines, we need to specify such elements within the function call. You may see an example of the code here; remember that lattice needs to be loaded in the workspace: require(lattice)              ##Load lattice if needed myPanel <- function(x,y){ panel.xyplot(x,y)            # Add the observations panel.lmline(x,y,col=”blue”)   # Add the regression panel.loess(x,y,col=”red”)      # Add the smooth line } xyplot(age~circumference, data=Orange, panel=myPanel) This code produces the plot in Figure 1.5: Figure 1.5: This is a scatter plot of the Orange data with the regression line (in blue) and the smooth line (in red) realized with lattice As you would have noticed, taking aside the code differences, the plot generated does not look very different from the one obtained with graphics. This is because we are not using any special visualization feature of lattice. As mentioned earlier, with this package, we have the option of multipanel conditioning, so let's take a look at this. Let's assume that we want to have the same plot but for the different trees in the dataset. Of course, in this case, you would not need the regression or the smooth line, since there will only be one tree in each plot window, but it could be nice to have the different observations connected. This is shown in the following code: myPanel <- function(x,y){ panel.xyplot(x,y, type=”b”) #the observations } xyplot(age~circumference | Tree, data=Orange, panel=myPanel) This code generates the graph shown in Figure 1.6: Figure 1.6: This is a scatterplot of the Orange data realized with lattice, with one subpanel representing the individual data of each tree. The number of trees in each panel is reported in the upper part of the plot area As illustrated, using the vertical bar |, we are able to obtain the plot conditional to the value of the variable Tree. In the upper part of the panels, you would notice the reference to the value of the conditional variable, which, in this case, is the column Tree. As mentioned before, ggplot2 offers this option too; we will see one example of that in the next section. In the next section, You would find a quick reference to how to convert a typical plot type from lattice to ggplot2. In this case, the examples are adapted to the typical plotting style of the lattice plots. Scatterplot with individual observations To plot the graph using lattice, use the following code: xyplot(age~circumference, data=Orange) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(circumference,age, data=Orange) The preceding code results in the following output: Scatterplot of orange dataset with faceting To plot the graph using lattice, use the following code: xyplot(age~circumference|Tree, data=Orange) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(circumference,age, data=Orange, facets=~Tree) The preceding code results in the following output: Faceting scatterplot with line and points To plot the graph using lattice, use the following code: xyplot(age~circumference|Tree, data=Orange, type=”b”) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(circumference,age, data=Orange, geom=c(“line”,”point”), facets=~Tree) The preceding code results in the following output: Scatterplots with grouping data To plot the graph using lattice, use the following code: xyplot(age~circumference, data=Orange, groups=Tree, type=”b”) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(circumference,age, data=Orange,color=Tree, geom=c(“line”,”point”)) The preceding code results in the following output: Boxplot of orange dataset To plot the graph using lattice, use the following code: bwplot(circumference~Tree, data=Orange) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(Tree,circumference, data=Orange, geom=”boxplot”) The preceding code results in the following output: Histogram of orange dataset To plot the graph using lattice, use the following code: histogram(Orange$circumference, type = “count”) To plot the graph using ggplot2, use the following code: qplot(circumference, data=Orange, geom=”histogram”) The preceding code results in the following output: Histogram with reference line at median value in red To plot the graph using lattice, use the following code: histogram(~circumference, data=Orange, type = “count”, panel=function(x,...){panel.histogram(x, ...);panel.abline(v=median(x), col=”red”)}) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(circumference, data=Orange, geom=”histogram”)+geom_vline(xintercept = median(Orange$circumference), colour=”red”) The preceding code results in the following output: ggplot2 and the grammar of graphics The ggplot2 package was developed by Hadley Wickham by implementing a completely different approach to statistical plots. As is the case with lattice, this package is also based on grid, providing a series of high-level functions that allow the creation of complete plots. The Grammar of Graphics by Leland Wilkinson. Briefly, The Grammar of Graphics assumes that a statistical graphic is a mapping of data to the aesthetic attributes and geometric objects used to represent data, such as points, lines, bars, and so on. Besides the aesthetic attributes, the plot can also contain statistical transformation or grouping of data. As in lattice, in ggplot2, we have the possibility of splitting data by a certain variable, obtaining a representation of each subset of data in an independent subplot; such representation in ggplot2 is called faceting. In a more formal way, the main components of the grammar of graphics are the data and its mapping, aesthetics, geometric objects, statistical transformations, scales, coordinates, and faceting: The data that must be visualized is mapped to aesthetic attributes, which define how the data should be perceived Geometric objects describe what is actually displayed on the plot, such as lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw Statistical transformations are applied to the data to group them; examples of statistical transformations would be the smooth line or the regression lines of the previous examples or the binning of the histograms Scales represent the connection between the aesthetic spaces and the actual values that should be represented. Scales may also be used to draw legends Coordinates represent the coordinate system in which the data is drawn Faceting, which we have already mentioned, is the grouping of data in subsets defined by a value of one variable In ggplot2, there are two main high-level functions capable of directly creating a plot, qplot(), and ggplot(); qplot() stands for quick plot, and it is a simple function that serves a purpose similar to that served by the plot() function in graphics. The ggplot()function, on the other hand, is a much more advanced function that allows the user to have more control of the plot layout and details. In our journey into the world of ggplot2, we will see some examples of qplot(), in particular when we go through the different kinds of graphs, but we will dig a lot deeper into ggplot() since this last function is more suited to advanced examples. If you have a look at the different forums based on R programming, there is quite a bit of discussion as to which of these two functions would be more convenient to use. My general recommendation would be that it depends on the type of graph you are drawing more frequently. For simple and standard plots, where only the data should be represented and only the minor modification of standard layouts are required, the qplot() function will do the job. On the other hand, if you need to apply particular transformations to the data or if you would just like to keep the freedom of controlling and defining the different details of the plot layout, I would recommend that you focus on ggplot(). As you will see, the code between these functions is not completely different since they are both based on the same underlying philosophy, but the way in which the options are set is quite different, so if you want to adapt a plot from one function to the other, you will essentially need to rewrite your code. If you just want to focus on learning only one of them, I would definitely recommend that you learn ggplot(). In the following code, you will see an example of a plot realized with ggplot2, where you can identify some of the components of the grammar of graphics. The example is realized with the ggplot() function, which allows a more direct comparison with the grammar of graphics, but coming just after the following code, you could also find the corresponding qplot() code useful. Both codes generate the graph depicted in Figure 1.7: require(ggplot2)                             ## Load ggplot2 data(Orange)                                 ## Load the data   ggplot(data=Orange,                          ## Data used   aes(x=circumference,y=age, color=Tree))+   ## Aesthetic geom_point()+                                ## Geometry stat_smooth(method=”lm”,se=FALSE)            ## Statistics   ### Corresponding code with qplot() qplot(circumference,age,data=Orange,         ## Data used   color=Tree,                                ## Aesthetic mapping   geom=c(“point”,”smooth”),method=”lm”,se=FALSE) This simple example can give you an idea of the role of each portion of code in a ggplot2 graph; you have seen how the main function body creates the connection between the data and the aesthetics we are interested to represent and how, on top of this, you add the components of the plot, as in this case, we added the geometry element of points and the statistical element of regression. You can also notice how the components that need to be added to the main function call are included using the + sign. One more thing worth mentioning at this point is that if you run just the main body function in the ggplot() function, you will get an error message. This is because this call is not able to generate an actual plot. The step during which the plot is actually created is when you include the geometric attribute, which, in this case is geom_point(). This is perfectly in line with the grammar of graphics since, as we have seen, the geometry represents the actual connection between the data and what is represented on the plot. This is the stage where we specify that the data should be represented as points; before that, nothing was specified about which plot we were interested in drawing. Figure 1.7: This is an example of plotting the Orange dataset with ggplot2 Summary To learn more about the similar technology, the following books/videos published by Packt Publishing (https://www.packtpub.com/) are recommended: ggplot2 Essentials (https://www.packtpub.com/big-data-and-business-intelligence/ggplot2-essentials) Video: Building Interactive Graphs with ggplot2 and Shiny (https://www.packtpub.com/big-data-and-business-intelligence/building-interactive-graphs-ggplot2-and-shiny-video) Resources for Article: Further resources on this subject: Refresher [article] Interactive Documents [article] Driving Visual Analyses Automobile Data (Python) [article]
Read more
  • 0
  • 0
  • 6191
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-game-world
Packt
23 Feb 2016
39 min read
Save for later

The Game World

Packt
23 Feb 2016
39 min read
In this article, we will cover the basics of creating immersive areas where players can walk around and interact, as well as some of the techniques used to manage those areas. This article will give you some practical tips and tricks of the spritesheet system introduced with Unity 4.3 and how to get it to work for you. Lastly, we will also have a cursory look at how shaders work in the 2D world and the considerations you need to keep in mind when using them. However, we won't be implementing shaders as that could be another book in itself. The following is the list of topics that will be covered in this article: Working with environments Looking at sprite layers Handling multiple resolutions An overview of parallaxing and effects Shaders in 2D – an overview (For more resources related to this topic, see here.) Backgrounds and layers Now that we have our hero in play, it would be nice to give him a place to live and walk around, so let's set up the home town and decorate it. Firstly, we are going to need some more assets. So, from the asset pack you downloaded earlier, grab the following assets from the Environments pack, place them in the AssetsSpritesEnvironment folder, and name them as follows: Name the ENVIRONMENTS STEAMPUNKbackground01.png file Assets SpritesEnvironmentbackground01 Name the ENVIRONMENTSSTEAMPUNKenvironmentalAssets.png file AssetsSpritesEnvironmentenvironmentalAssets Name the ENVIRONMENTSFANTASYenvironmentalAssets.png file Assets SpritesEnvironmentenvironmentalAssets2 To slice or not to slice It is always better to pack many of the same images on to a single asset/atlas and then use the Sprite Editor to define the regions on that texture for each sprite, as long as all the sprites on that sheet are going to get used in the same scene. The reason for this is when Unity tries to draw to the screen, it needs to send the images to draw to the graphics card; if there are many images to send, this can take some time. If, however, it is just one image, it is a lot simpler and more performant with only one file to send. There needs to be a balance; too large an image and the upload to the graphics card can take up too many resources, too many individual images and you have the same problem. The basic rule of thumb is as follows: If the background is a full screen background or large image, then keep it separately. If you have many images and all are for the same scene, then put them into a spritesheet/atlas. If you have many images but all are for different scenes, then group them as best you can—common items on one sheet and scene-specific items on different sheets. You'll have several spritesheets to use. You basically want to keep as much stuff together as makes sense and not send unnecessary images that won't get used to the graphics card. Find your balance. The town background First, let's add a background for the town using the AssetsSpritesEnvironmentbackground01 texture. It is shown in the following screenshot: With the background asset, we don't need to do anything else other than ensure that it has been imported as a sprite (in case your project is still in 3D mode), as shown in the following screenshot: The town buildings For the steampunk environmental assets (AssetsSpritesEnvironmentenvironmentalAssets) that are shown in the following screenshot, we need a bit more work; once these assets are imported, change the Sprite Mode to Multiple and load up the Sprite Editor using the Sprite Editor button. Next, click on the Slice button, leave the settings at their default options, and then click on the Slice button in the new window as shown in the following screenshot: Click on Apply and close the Sprite Editor. You will have four new sprite textures available as seen in the following screenshot: The extra scenery We saw what happens when you use a grid type split on a spritesheet and when the automatic split works well, so what about when it doesn't go so well? If we look at the Fantasy environment pack (AssetsSpritesEnvironmentenvironmentalAssets2), we will see the following: After you have imported it and run the Split in Sprite Editor, you will notice that one of the sprites does not get detected very well; altering the automatic split settings in this case doesn't help, so we need to do some manual manipulation as shown in the following screenshot: In the previous screenshot, you can see that just two of the rocks in the top-right sprite have been identified by the splicing routine. To fix this, just delete one of the selections and then expand the other manually using the selection points in the corner of the selection box (after clicking on the sprite box). Here's how it will look before the correction: After correction, you should see something like the following screenshot: This gives us some nice additional assets to scatter around our towns and give it a more homely feel, as shown in the following screenshot: Building the scene So, now that we have some nice assets to build with, we can start building our first town. Adding the town background Returning to the scene view, you should see the following: If, however, we add our town background texture (AssetsSpritesBackgroundsBackground.png) to the scene by dragging it to either the project hierarchy or the scene view, you will end up with the following: Be sure to set the background texture position appropriately once you add it to the scene; in this case, be sure the position of the transform is centered in the view at X = 0, Y = 0, Z = 0. Unity does have a tendency to set the position relative to where your 3D view is at the time of adding it—almost never where you want it. Our player has vanished! The reason for this is simple: Unity's sprite system has an ordering system that comes in two parts. Sprite sorting layers Sorting Layers (Edit | Project Settings | Tags and Layers) are a collection of sprites, which are bulked together to form a single group. Layers can be configured to be drawn in a specific order on the screen as shown in the following screenshot: Sprite sorting order Sprites within an individual layer can be sorted, allowing you to control the draw order of sprites within that layer. The sprite Inspector is used for this purpose, as shown in the following screenshot: Sprite's Sorting Layers should not be confused with Unity's rendering layers. Layers are a separate functionality used to control whether groups of game objects are drawn or managed together, whereas Sorting Layers control the draw order of sprites in a scene. So the reason our player is no longer seen is that it is behind the background. As they are both in the same layer and have the same sort order, they are simply drawn in the order that they are in the project hierarchy. Updating the scene Sorting Layers To resolve the update of the scene's Sorting Layers, let's organize our sprite rendering by adding some sprite Sorting Layers. So, open up the Tags and Layers inspector pane as shown in the following screenshot (by navigating to Edit | Project settings | Tags and Layers), and add the following Sorting Layers: Background Player Foreground GUI You can reorder the layers underneath the default anytime by selecting a row and dragging it up and down the sprite's Sorting Layers list. With the layers set up, we can now configure our game objects accordingly. So, set the Sorting Layer on our background01 sprite to the Background layer as shown in the following screenshot: Then, update the PlayerSprite layer to Player; our character will now be displayed in front of the background. You can just keep both objects on the same layer and set the Sort Order value appropriately, keeping the background to a Sort Order value of 0 and the player to 10, which will draw the player in front. However, as you add more items to the scene, things will get tricky quickly, so it is better to group them in a layer accordingly. Now when we return to the scene, our hero is happily displayed but he is seen hovering in the middle of our village. So let's fix that next by simply changing its position transform in the Inspector window. Setting the Y position transform to -2 will place our hero nicely in the middle of the street (provided you have set the pivot for the player sprite to bottom), as shown in the following screenshot: Feel free at this point to also add some more background elements such as trees and buildings to fill out the scene using the environment assets we imported earlier. Working with the camera If you try and move the player left and right at the moment, our hero happily bobs along. However, you will quickly notice that we run into a problem: the hero soon disappears from the edge of the screen. To solve this, we need to make the camera follow the hero. When creating new scripts to implement something, remember that just about every game that has been made with Unity has most likely implemented either the same thing or something similar. Most just get on with it, but others and the Unity team themselves are keen to share their scripts to solve these challenges. So in most cases, we will have something to work from. Don't just start a script from scratch (unless it is a very small one to solve a tiny issue) if you can help it; here's some resources to get you started: Unity sample projects: http://Unity3d.com/learn/tutorials/projects Unity Patterns: http://unitypatterns.com/ Unity wiki scripts section: http://wiki.Unity3d.com/index.php/Scripts (also check other stuff for detail) Once you become more experienced, it is better to just use these scripts as a reference and try to create your own and improve on them, unless they are from a maintained library such as https://github.com/nickgravelyn/UnityToolbag. To make the camera follow the players, we'll take the script from the Unity 2D sample and modify it to fit in our game. This script is nice because it also includes a Mario style buffer zone, which allows the players to move without moving the camera until they reach the edge of the screen. Create a new script called FollowCamera in the AssetsScripts folder, remove the Start and Update functions, and then add the following properties: using UnityEngine;   public class FollowCamera : MonoBehavior {     // Distance in the x axis the player can move before the   // camera follows.   public float xMargin = 1.5f;     // Distance in the y axis the player can move before the   // camera follows.   public float yMargin = 1.5f;     // How smoothly the camera catches up with its target   // movement in the x axis.   public float xSmooth = 1.5f;     // How smoothly the camera catches up with its target   // movement in the y axis.   public float ySmooth = 1.5f;     // The maximum x and y coordinates the camera can have.   public Vector2 maxXAndY;     // The minimum x and y coordinates the camera can have.   public Vector2 minXAndY;     // Reference to  the player's transform.   public Transform player; } The variables are all commented to explain their purpose, but we'll cover each as we use them. First off, we need to get the player object's position so that we can track the camera to it by discovering it from the object it is attached to. This is done by adding the following code in the Awake function: void Awake()     {         // Setting up the reference.         player = GameObject.Find("Player").transform;   if (player == null)   {     Debug.LogError("Player object not found");   }       } An alternative to discovering the player this way is to make the player property public and then assign it in the editor. There is no right or wrong way—just your preference. It is also a good practice to add some element of debugging to let you know if there is a problem in the scene with a missing reference, else all you will see are errors such as object not initialized or variable was null. Next, we need a couple of helper methods to check whether the player has moved near the edge of the camera's bounds as defined by the Max X and Y variables. In the following code, we will use the settings defined in the preceding code to control how close you can get to the end result:   bool CheckXMargin()     {         // Returns true if the distance between the camera and the   // player in the x axis is greater than the x margin.         return Mathf.Abs (transform.position.x - player.position.x) > xMargin;     }       bool CheckYMargin()     {         // Returns true if the distance between the camera and the   // player in the y axis is greater than the y margin.         return Mathf.Abs (transform.position.y - player.position.y) > yMargin;     } To finish this script, we need to check each frame when the scene is drawn to see whether the player is close to the edge and update the camera's position accordingly. Also, we need to check if the camera bounds have reached the edge of the screen and not move it beyond. Comparing Update, FixedUpdate, and LateUpdate There is usually a lot of debate about which update method should be used within a Unity game. To put it simply, the FixedUpdate method is called on a regular basis throughout the lifetime of the game and is generally used for physics and time sensitive code. The Update method, however, is only called after the end of each frame that is drawn to the screen, as the time taken to draw the screen can vary (due to the number of objects to be drawn and so on). So, the Update call ends up being fairly irregular. For more detail on the difference between Update and FixedUpdate see the Unity Learn tutorial video at http://unity3d.com/learn/tutorials/modules/beginner/scripting/update-and-fixedupdate. As the player is being moved by the physics system, it is better to update the camera in the FixedUpdate method: void FixedUpdate()     {         // By default the target x and y coordinates of the camera         // are it's current x and y coordinates.         float targetX = transform.position.x;         float targetY = transform.position.y;           // If the player has moved beyond the x margin...         if (CheckXMargin())             // the target x coordinate should be a Lerp between             // the camera's current x position and the player's  // current x position.             targetX = Mathf.Lerp(transform.position.x,  player.position.x, xSmooth * Time.fixedDeltaTime );           // If the player has moved beyond the y margin...         if (CheckYMargin())             // the target y coordinate should be a Lerp between             // the camera's current y position and the player's             // current y position.             targetY = Mathf.Lerp(transform.position.y,  player.position.y, ySmooth * Time. fixedDeltaTime );           // The target x and y coordinates should not be larger         // than the maximum or smaller than the minimum.         targetX = Mathf.Clamp(targetX, minXAndY.x, maxXAndY.x);         targetY = Mathf.Clamp(targetY, minXAndY.y, maxXAndY.y);           // Set the camera's position to the target position with         // the same z component.         transform.position =          new Vector3(targetX, targetY, transform.position.z);     } As they say, every game is different and how the camera acts can be different for every game. In a lot of cases, the camera should be updated in the LateUpdate method after all drawing, updating, and physics are complete. This, however, can be a double-edged sword if you rely on math calculations that are affected in the FixedUpdate method, such as Lerp. It all comes down to tweaking your camera system to work the way you need it to do. Once the script is saved, just attach it to the Main Camera element by dragging the script to it or by adding a script component to the camera and selecting the script. Finally, we just need to configure the script and the camera to fit our game size as follows: Set the orthographic Size of the camera to 2.7 and the Min X and Max X sizes to 5 and -5 respectively. The perils of resolution When dealing with cameras, there is always one thing that will trip us up as soon as we try to build for another platform—resolution. By default, the Unity player in the editor runs in the Free Aspect mode as shown in the following screenshot: The Aspect mode (from the Aspect drop-down) can be changed to represent the resolutions supported by each platform you can target. The following is what you get when you switch your build target to each platform: To change the build target, go into your project's Build Settings by navigating to File | Build Settings or by pressing Ctrl + Shift + B, then select a platform and click on the Switch Platform button. This is shown in the following screenshot: When you change the Aspect drop-down to view in one of these resolutions, you will notice how the aspect ratio for what is drawn to the screen changes by either stretching or compressing the visible area. If you run the editor player in full screen by clicking on the Maximize on Play button () and then clicking on the play icon, you will see this change more clearly. Alternatively, you can run your project on a target device to see the proper perspective output. The reason I bring this up here is that if you used fixed bounds settings for your camera or game objects, then these values may not work for every resolution, thereby putting your settings out of range or (in most cases) too undersized. You can handle this by altering the settings for each build or using compiler predirectives such as #if UNITY_METRO to force the default depending on the build (in this example, Windows 8). To read more about compiler predirectives, check the Unity documentation at http://docs.unity3d.com/Manual/PlatformDependentCompilation.html. A better FollowCamera script If you are only targeting one device/resolution or your background scrolls indefinitely, then the preceding manual approach works fine. However, if you want it to be a little more dynamic, then we need to know what resolution we are working in and how much space our character has to travel. We will perform the following steps to do this: We will change the min and max variables to private as we no longer need to configure them in the Inspector window. The code is as follows:   // The maximum x and y coordinates the camera can have.     private Vector2 maxXAndY;       // The minimum x and y coordinates the camera can have.     private Vector2 minXAndY; To work out how much space is available in our town, we need to interrogate the rendering size of our background sprite. So, in the Awake function, we add the following lines of code: // Get the bounds for the background texture - world       size     var backgroundBounds = GameObject.Find("background")      .renderer.bounds; In the Awake function, we work out our resolution and viewable space by interrogating the ViewPort method on the camera and converting it to the same coordinate type as the sprite. This is done using the following code:   // Get the viewable bounds of the camera in world     // coordinates     var camTopLeft = camera.ViewportToWorldPoint      (new Vector3(0, 0, 0));     var camBottomRight = camera.ViewportToWorldPoint      (new Vector3(1, 1, 0)); Finally, in the Awake function, we update the min and max values using the texture size and camera real-world bounds. This is done using the following lines of code: // Automatically set the min and max values     minXAndY.x = backgroundBounds.min.x - camTopLeft.x;     maxXAndY.x = backgroundBounds.max.x - camBottomRight.x; In the end, it is up to your specific implementation for the type of game you are making to decide which pattern works for your game. Transitioning and bounds So our camera follows our player, but our hero can still walk off the screen and keep going forever, so let us stop that from happening. Towns with borders As you saw in the preceding section, you can use Unity's camera logic to figure out where things are on the screen. You can also do more complex ray testing to check where things are, but I find these are overly complex unless you depend on that level of interaction. The simpler answer is just to use the native Box2D physics system to keep things in the scene. This might seem like overkill, but the 2D physics system is very fast and fluid, and it is simple to use. Once we add the physics components, Rigidbody 2D (to apply physics) and a Box Collider 2D (to detect collisions) to the player, we can make use of these components straight away by adding some additional collision objects to stop the player running off. To do this and to keep things organized, we will add three empty game objects (either by navigating to GameObject | Create Empty, or by pressing Ctrl + Shift +N) to the scene (one parent and two children) to manage these collision points, as shown in the following screenshot: I've named them WorldBounds (parent) and LeftBorder and RightBorder (children) for reference. Next, we will position each of the child game objects to the left- and right-hand side of the screen, as shown in the following screenshot: Next, we will add a Box Collider 2D to each border game object and increase its height just to ensure that it works for the entire height of the scene. I've set the Y value to 5 for effect, as shown in the following screenshot: The end result should look like the following screenshot with the two new colliders highlighted in green: Alternatively, you could have just created one of the children, added the box collider, duplicated it (by navigating to Edit | Duplicate or by pressing Ctrl + D), and moved it. If you have to create multiples of the same thing, this is a handy tip to remember. If you run the project now, then our hero can no longer escape this town on his own. However, as we want to let him leave, we can add a script to the new Boundary game object so that when the hero reaches the end of the town, he can leave. Journeying onwards Now that we have collision zones on our town's borders, we can hook into this by using a script to activate when the hero approaches. Create a new C# script called NavigationPrompt, clear its contents, and populate it with the following code: using UnityEngine;   public class NavigationPrompt : MonoBehavior {     bool showDialog;     void OnCollisionEnter2D(Collision2D col)   {     showDialog = true;   }     void OnCollisionExit2D(Collision2D col)   {     showDialog = false;   } } The preceding code gives us the framework of a collision detection script that sets a flag on and off if the character interacts with what the script is attached to, provided it has a physics collision component. Without it, this script would do nothing and it won't cause an error. Next, we will do something with the flag and display some GUI when the flag is set. So, add the following extra function to the preceding script: void OnGUI()     {       if (showDialog)       {         //layout start         GUI.BeginGroup(new Rect(Screen.width / 2 - 150, 50, 300,           250));           //the menu background box         GUI.Box(new Rect(0, 0, 300, 250), "");           // Information text         GUI.Label(new Rect(15, 10, 300, 68), "Do you want to           travel?");           //Player wants to leave this location         if (GUI.Button(new Rect(55, 100, 180, 40), "Travel"))         {           showDialog = false;             // The following line is commented out for now           // as we have nowhere to go :D           //Application.LoadLevel(1);}           //Player wants to stay at this location         if (GUI.Button(new Rect(55, 150, 180, 40), "Stay"))         {           showDialog = false;         }           //layout end         GUI.EndGroup();       }     } The function itself is very simple and only activates if the showDialog flag is set to true by the collision detection. Then, we will perform the following steps: In the OnGUI method, we set up a dialog window region with some text and two buttons. One button asks if the player wants to travel, which would load the next area (commented out for now as we only have one scene), and close the dialog. One button simply closes the dialog if the hero didn't actually want to leave. As we haven't stopped moving the player, the player can also do this by moving away. If you now add the NavigationPrompt script to the two world border (LeftBorder and RightBorder) game objects, this will result in the following simple UI whenever the player collides with the edges of our world: We can further enhance this by tagging or naming our borders to indicate a destination. I prefer tagging, as it does not interfere with how my scene looks in the project hierarchy; also, I can control what tags are available and prevent accidental mistyping. To tag a game object, simply select a Tag using the drop-down list in the Inspector when you select the game object in the scene or project. This is shown in the following screenshot: If you haven't set up your tags yet or just wish to add a new one, select Add Tag in the drop-down menu; this will open up the Tags and Layers window of Inspector. Alternatively, you can call up this window by navigating to Edit | Project Settings | Tags and layers in the menu. It is shown in the following screenshot: You can only edit or change user-defined tags. There are several other tags that are system defined. You can use these as well; you just cannot change, remove, or edit them. These include Player, Respawn, Finish, Editor Only, Main Camera, and GameController. As you can see from the preceding screenshot, I have entered two new tags called The Cave and The World, which are the two main exit points from our town. Unity also adds an extra item to the arrays in the editor. This helps you when you want to add more items; it's annoying when you want a fixed size but it is meant to help. When the project runs, however, the correct count of items will be exposed. Once these are set up, just return to the Inspector for the two borders, and set the right one to The World and the left to The Cave. Now, I was quite specific in how I named these tags, as you can now reuse these tags in the script to both aid navigation and also to notify the player where they are going. To do this, simply update the Do you want to travel to line to the following: //Information text GUI.Label(new Rect(15, 10, 300, 68), "Do you want to travel to " +   this.tag + "?"); Here, we have simply appended the dialog as it is presented to the user with the name of the destination we set in the tag. Now, we'll get a more personal message, as shown in the following screenshot: Planning for the larger picture Now for small games, the preceding implementation is fine; however, if you are planning a larger world with a large number of interactions, provide complex decisions to prevent the player continuing unless they are ready. As the following diagram shows, there are several paths the player can take and in some cases, these is only one way. Now, we could just build up the logic for each of these individually as shown in the screenshot, but it is better if we build a separate navigation system so that we have everything in one place; it's just easier to manage that way. This separation is a fundamental part of any good game design. Keeping the logic and game functionality separate makes it easier to maintain in the future, especially when you need to take internationalization into account (but we will learn more about that later). Now, we'll change to using a manager to handle all the world/scene transitions, and simplify the tag names we use as they won't need to be displayed. So, The Cave will be renamed as just Cave, and we will get the text to display from the navigation manager instead of the tag. So, by separating out the core decision making functionality out of the prompt script, we can build the core manager for navigation. Its primary job is to maintain where a character can travel and information about that destination. First, we'll update the tags we created earlier to simpler identities that we can use in our navigation manager (update The Cave to Cave01 and The World to World). Next, we'll create a new C# script called NavigationManager in our AssetsScripts folder, and then replace its contents with the following lines of code: public static class NavigationManager {       public static Dictionary<string,string> RouteInformation =     new Dictionary<string,string>()   {     { "World", "The big bad world"},     { "Cave", "The deep dark cave"},   };     public static string GetRouteInfo(string destination)   {     return RouteInformation.ContainsKey(destination) ?     RouteInformation[destination] : null;   }     public static bool CanNavigate(string destination)   {     return true;   }     public static void NavigateTo(string destination)   {     // The following line is commented out for now     // as we have nowhere to go :D     //Application.LoadLevel(destination);   } } Notice the ? and : operators in the following statement: RouteInformation.ContainsKey(destination) ?   RouteInformation[destination] : null; These operators are C# conditional operators. They are effectively the shorthand of the following: if(RouteInformation.ContainsKey(destination)) {   return RouteInformation[destination]; } else {   return null; } Shorter, neater, and much nicer, don't you think? For more information, see the MSDN C# page at http://bit.ly/csharpconditionaloperator. The script is very basic for now, but contains several following key elements that can be expanded to meet the design goals of your game: RouteInformation: This is a list of all the possible destinations in the game in a dictionary. A static list of possible destinations in the game, and it is a core part of the manager as it knows everywhere you can travel in the game in one place. GetRouteInfo: This is a basic information extraction function. A simple controlled function to interrogate the destination list. In this example, we just return the text to be displayed in the prompt, which allows for more detailed descriptions that we could use in tags. You could use this to provide alternate prompts depending on what the player is carrying and whether they have a lit torch, for example. CanNavigate: This is a test to see if navigation is possible. If you are going to limit a player's travel, you need a way to test if they can move, allowing logic in your game to make alternate choices if the player cannot. You could use a different system for this by placing some sort of block in front of a destination to limit choice (as used in the likes of Zelda), such as an NPC or rock. As this is only an example, we can always travel and add logic to control it if you wish. NavigateTo: This is a function to instigate navigation. Once a player can travel, you can control exactly what happens in the game: does navigation cause the next scene to load straight away (as in the script currently), or does the current scene fade out and then a traveling screen is shown before fading the next level in? Granted, this does nothing at present as we have nowhere to travel to. The script you will notice is different to the other scripts used so far, as it is a static class. This means it sits in the background, only exists once in the game, and is accessible from anywhere. This pattern is useful for fixed information that isn't attached to anything; it just sits in the background waiting to be queried. Later, we will cover more advanced types and classes to provide more complicated scenarios. With this class in place, we just need to update our previous script (and the tags) to make use of this new manager. Update the NavigationPrompt script as follows: Update the collision function to only show the prompt if we can travel. The code is as follows: void OnCollisionEnter2D(Collision2D col) {   //Only allow the player to travel if allowed   if (NavigationManager.CanNavigate(this.tag))   {     showDialog = true;   } } When the dialog shows, display the more detailed destination text provided by the manager for the intended destination. The code is as follows: //Dialog detail - updated to get better detail GUI.Label(new Rect(15, 10, 300, 68), "Do you want to travel   to " + NavigationManager.GetRouteInfo(this.tag) + "?"); If the player wants to travel, let the manager start the travel process. The code is as follows: //Player wants to leave this location if (GUI.Button(new Rect(55, 100, 180, 40), "Travel")) {   showDialog = false;   NavigationManager.NavigateTo(this.tag); } The functionality I've shown here is very basic and it is intended to make you think about how you would need to implement it for your game. With so many possibilities available, I could fill several articles on this kind of subject alone. Backgrounds and active elements A slightly more advanced option when building game worlds is to add a level of immersive depth to the scene. Having a static image to show the village looks good, especially when you start adding houses and NPCs to the mix; but to really make it shine, you should layer the background and add additional active elements to liven it up. We won't add them to the sample project at this time, but it is worth experimenting with in your own projects (or try adding it to this one)—it is a worthwhile effect to look into. Parallaxing If we look at the 2D sample provided by Unity, the background is split into several panes—each layered on top of one another and each moving at a different speed when the player moves around. There are also other elements such as clouds, birds, buses, and taxes driving/flying around, as shown in the following screenshot: Implementing these effects is very easy technically. You just need to have the art assets available. There are several scripts in the wiki I described earlier, but the one in Unity's own 2D sample is the best I've seen. To see the script, just download the Unity Projects: 2D Platformer asset from https://www.assetstore.unity3d.com/en/#!/content/11228, and check out the BackgroundParallax script in the AssetsScripts folder. The BackgroundParallax script in the platformer sample implements the following: An array of background images, which is layered correctly in the scene (which is why the script does not just discover the background sprites) A scaling factor to control how much the background moves in relation to the camera target, for example, the camera A reducing factor to offset how much each layer moves so that they all don't move as one (or else what is the point, might as well be a single image) A smoothing factor so that each background moves smoothly with the target and doesn't jump around Implementing this same model in your game would be fairly simple provided you have texture assets that could support it. Just replicate the structure used in the platformer 2D sample and add the script. Remember to update the FollowCamera script to be able to update the base background, however, to ensure that it can still discover the size of the main area. Foreground objects The other thing you can do to liven up your game is to add random foreground objects that float across your scene independently. These don't collide with anything and aren't anything to do with the game itself. They are just eye candy to make your game look awesome. The process to add these is also fairly simple, but it requires some more advanced Unity features such as coroutines, which we are not going to cover here. So, we will come back to these later. In short, if you examine the BackgroundPropSpawner.cs script from the preceding Unity platformer 2D sample, you will have to perform the following steps: Create/instantiate an object to spawn. Set a random position and direction for the object to travel. Update the object over its lifetime. Once it's out of the scene, destroy or hide it. Wait for a time, and then start again. This allows them to run on their own without impacting the gameplay itself and just adds that extra bit of depth. In some cases, I've seen particle effects are also used to add effect, but they are used sparingly. Shaders and 2D Believe it or not, all 2D elements (even in their default state) are drawn using a shader—albeit a specially written shader designed to light and draw the sprite in a very specific way. If you look at the player sprite in the inspector, you will see that it uses a special Material called Sprites-Default, as shown in the following screenshot: This section is purely meant to highlight all the shading options you have in the 2D system. Shaders have not changed much in this update except for the addition of some 2D global lighting found in the default sprite shader. For more detail on shaders in general, I suggest a dedicated Unity shader book such as https://www.packtpub.com/game-development/unity-shaders-and-effects-cookbook. Clicking on the button next to Material field will bring up the material selector, which also shows the two other built-in default materials, as shown in the following screenshot: However, selecting either of these will render your sprite invisible as they require a texture and lighting to work; they won't inherit from the Sprite Renderer texture. You can override this by creating your own material and assigning alternate sprite style shaders. To create a new material, just select the AssetsMaterials folder (this is not crucial, but it means we create the material in a sensible place in our project folder structure) and then right click on and select Create | Material. Alternatively, do the same using the project view's Edit... menu option, as shown in the following screenshot: This gives us a basic default Diffuse shader, which is fine for basic 3D objects. However, we also have two default sprite rendering shaders available. Selecting the shader dropdown gives us the screen shown in the following screenshot: Now, these shaders have the following two very specific purposes: Default: This shader inherits its texture from the Sprite Renderer texture to draw the sprite as is. This is a very basic functionality—just enough to draw the sprite. (It contains its own static lighting.) Diffuse: This shader is the same as the Default shader; it inherits the texture of Default, but it requires an external light source as it does not contain any lighting—this has to be applied separately. It is a slightly more advanced shader, which includes offsets and other functions. Creating one of these materials and applying it to the Sprite Renderer texture of a sprite will override its default constrained behavior. This opens up some additional shader options in the Inspector, as shown in the following screenshot: These options include the following: Sprite Texture: Although changing the Tiling and Offset values causes a warning to appear, they still display a function (even though the actual displayed value resets). Tint: This option allows changing the default light tint of the rendered sprite. It is useful to create different colored objects from the same sprite. Pixel snap: This option makes the rendered sprite crisper but narrows the drawn area. It is a trial and error feature (see the following sections for more information). Achieving pixel perfection in your game in Unity can be a challenge due to the number of factors that can affect it, such as the camera view size, whether the image texture is a Power Of Two (POT) size, and the import setting for the image. This is basically a trial and error game until you are happy with the intended result. If you are feeling adventurous, you can extend these default shaders (although this is out of the scope of this article). The full code for these shaders can be found at http://Unity3d.com/unity/download/archive. If you are writing your own shaders though, be sure to add some lighting to the scene; otherwise, they are just going to appear dark and unlit. Only the default sprite shader is automatically lit by Unity. Alternatively, you can use the default sprite shader as a base to create your new custom shader and retain the 2D basic lighting. Another worthy tip is to check out the latest version of the Unity samples (beta) pack. In it, they have added logic to have two sets of shaders in your project: one for mobile and one for desktop, and a script that will swap them out at runtime depending on the platform. This is very cool; check out on the asset store at https://www.assetstore.unity3d.com/#/content/14474 and the full review of the pack at http://darkgenesis.zenithmoon.com/unity3dsamplesbeta-anoverview/. Going further If you are the adventurous sort, try expanding your project to add the following: Add some buildings to the town Set up some entry points for a building and work that into your navigation system, for example, a shop Add some rocks to the scene and color each differently using a manual material, maybe even add a script to randomly set the pixel color in the shader instead of creating several materials Add a new scene for the cave using another environment background, and get the player to travel between them Summary This certainly has been a very busy article just to add a background to our scene, but working out how each scene will work is a crucial design element for the entire game; you have to pick a pattern that works for you and your end result once as changing it can be very detrimental (and a lot of work) in the future. In this article, we covered the following topics: Some more practice with the Sprite Editor and sprite slicer including some tips and tricks when it doesn't work (or you want to do it yourself) Some camera tips, tricks, and scripts An overview of sprite layers and sprite sorting Defining boundaries in scenes Scene navigation management and planning levels in your game Some basics of how shaders work for 2D For learning Unity 2D from basic you can refer to https://www.packtpub.com/game-development/learning-unity-2d-game-development-example. Resources for Article:   Further resources on this subject: Build a First Person Shooter [article] Let's Get Physical – Using GameMaker's Physics System [article] Using the Tiled map editor [article]
Read more
  • 0
  • 0
  • 26947

article-image-salt-configuration
Packt
23 Feb 2016
23 min read
Save for later

Salt Configuration

Packt
23 Feb 2016
23 min read
In this article, Joseph Hall, the author of the book Extending SaltStack, explains thatwhile setting static configuration is fine and well, it can be very useful to be able to supply that data from an external source. You'll learn about: Writing dynamic grains and external pillars Troubleshooting grains and pillars Writing and using SDB modules (For more resources related to this topic, see here.) Setting grains dynamically Grains hold variables that describe certain aspects of a Minion. This could be information about the operating system, hardware, network, and so on. It can also contain statically-defined user data, which is configured either in /etc/salt/minion or /etc/salt/grains. It is also possible to define grains dynamically using grains modules. Setting some basic grains Grains modules are interesting in that as long as the module is loaded, all public functions will be executed. As each function is executed, it will return a dictionary, which contains items to be merged into the Minion's grains. Let's go ahead and set up a new grains module to demonstrate. We'll prepend the names of the return data with a z so that it is easy to find. ''' Test module for Extending SaltStack This module should be saved as salt/grains/testdata.py ''' def testdata(): ''' Return some test data ''' return {'ztest1': True} Go ahead and save this file as salt/grains/testdata.py, and then use salt-call to display all of the grains, including this one: # salt-call --local grains.items local: ---------- ... virtual: physical zmqversion: 4.1.3 ztest1: True Keep in mind that you can also use grains.item to display only a single grain: # salt-call --local grains.item ztest local: ---------- ztest1: True It may not look like this module is very good, since this is still just static data that could be defined in the minion or grains files. But keep in mind that as with other modules, grains modules can be gated using a __virtual__() function. Let's go ahead and set this up, along with a flag of sorts that will determine whether or not this module will load in the first place: import os.path def __virtual__(): ''' Only load these grains if /tmp/ztest exists ''' if os.path.exists('/tmp/ztest'): return True return False Now run the following commands to see this in action: # salt-call --local grains.item ztest local: ---------- ztest: # touch /tmp/ztest # salt-call --local grains.item ztest local: ---------- ztest: True This is very useful for gating the return data from an entire module, whether dynamic or, as this module currently is, static. You may be wondering why the example checked for the existence of a file rather than checking the existing Minion configuration. This is to illustrate that the detection of certain system properties is likely to dictate how grains are set. If you want to just set a flag inside the minion file, you can pull it out of __opts__. Let's go ahead and add this to the __virtual__() function: def __virtual__(): ''' Only load these grains if /tmp/ztest exists ''' if os.path.exists('/tmp/ztest'): return True if __opts__.get('ztest', False): return True return False Remove the old flag, and set the new one: # rm /tmp/ztest # echo 'ztest: True' >> /etc/salt/minion # salt-call --local grains.item ztest local: ---------- ztest: True Let's set up this module to return dynamic data as well. Because YAML is so prevalent in Salt, let's go ahead and set up a function that returns the contents of a YAML file: import yaml import salt.utils def yaml_test(): ''' Return sample data from /etc/salt/test.yaml ''' with salt.utils.fopen('/etc/salt/yamltest.yaml', 'r') as fh_: return yaml.safe_load(fh_) Save your module, and then issue the following commands to see the result: # echo 'yamltest: True' > /etc/salt/yamltest.yaml # salt-call --local grains.item yamltest local: ---------- yamltest: True (Not) Cross-calling execution modules You may be tempted to try and cross-call an execution module from inside a grains module. Unfortunately, this won't work. The __virtual__() function in many execution modules relies heavily on grains. Allowing grains to cross-call execution modules before Salt has decided whether or not to even call the execution module in the first place would cause circular dependencies. Just remember: grains are loaded first, then pillars, and then execution modules. If you have code in which you plan to use two or more of these types of modules, consider setting up a library for it in the salt/utils/ directory. The final grains module With all of the code we've put together, the resulting module should look like the following: ''' Test module for Extending SaltStack. This module should be saved as salt/grains/testdata.py ''' import os.path import yaml import salt.utils def __virtual__(): ''' Only load these grains if /tmp/ztest exists ''' if os.path.exists('/tmp/ztest'): return True if __opts__.get('ztest', False): return True return False def testdata(): ''' Return some test data ''' return {'ztest1': True} def yaml_test(): ''' Return sample data from /etc/salt/test.yaml ''' with salt.utils.fopen('/etc/salt/yamltest.yaml', 'r') as fh_: return yaml.safe_load(fh_) Creating external pillars As you know, pillars are like grains but with a key difference: grains are defined on the Minion, while pillars are defined for individual Minions, from the Master. As far as users are concerned, there's not a whole lot of difference here, except that pillars must be mapped to targets on the Master using the top.sls file in pillar_roots. One such mapping might look like this: # cat /srv/pillar/top.sls base: '*': - test In this example, we would have a pillar called test defined, which might look like this: # cat /srv/pillar/test.sls test_pillar: True Dynamic pillars are still mapped in the top.sls file, but that's where the similarities end as far as configuration is concerned. Configuring external pillars Unlike dynamic grains, which will run so long as their __virtual__() function allows them to do so, pillars must be explicitly enabled in the master configuration file or, if running in local mode as we will be, in the minion configuration file. Let's go ahead and add the following lines to the end of /etc/salt/minion: ext_pillar: - test_pillar: True If we were testing this on the Master, we would have needed to restart the salt-master service. However, since we're testing in local mode on the Minion, this will not be required. Adding an external pillar We'll also need to create a simple external pillar to get started with. Go ahead and create salt/pillar/test_pillar.py with the following content: ''' This is a test external pillar ''' def ext_pillar(minion_id, pillar, config): ''' Return the pillar data ''' return {'test_pillar': minion_id} Save your work, and then test it to make sure it works: # salt-call --local pillar.item test_pillar local: ---------- test_pillar: dufresne Let's go over what's happened here. First off, we have a function called ext_pillar(). This function is required in all external pillars. It is also the only function that is required. Any others, whether named with a preceding underscore or not, will be private to this module. This function will always be passed three pieces of data. The first is the ID of the Minion that is requesting this pillar. You can see this in our example already: the minion_id parameter where the above example was run was dufresne. The second is a copy of the static pillars defined for this Minion. The third is an extra piece of data that was passed to this external pillar in the master (or in this case, minion) configuration file. Let's go ahead and update our pillar to show us what each component looks like. Change your ext_pillar() function to look like this: def ext_pillar(minion_id, pillar, command): ''' Return the pillar data ''' return {'test_pillar': { 'minion_id': minion_id, 'pillar': pillar, 'config': config, }} Save it, and then modify the ext_pillar configuration in your minion (or master) file: ext_pillar: - test_pillar: Alas, poor Yorik. I knew him, Horatio. Take a look at your pillar data again: # salt-call --local pillar.item test_pillar local: ---------- test_pillar: ---------- config: Alas, poor Yorik. I knew him, Horatio. minion_id: dufresne pillar: ---------- test_pillar: True You can see the test_pillar method that we referenced a couple of pages ago. And, of course, you can see the minion_id method, just like before. The important part here is the config. This example was chosen to make clear where the config argument came from. When an external pillar is added to the ext_pillar list, it is entered as a dictionary, with a single item as its value. The item that is specified can be a string, boolean, integer, or float. It cannot be a dictionary or a list. This argument is normally used to pass arguments to the pillar from the configuration file. For instance, the cmd_yaml pillar that ships with Salt uses it to define a command that is expected to return data in the YAML format, as follows: ext_pillar: - cmd_yaml: cat /etc/salt/testyaml.yaml If the only thing that your pillar requires is to be enabled; then, you can just set this to True and then ignore it. However, you must still set it! Salt will expect this data to be there, and you will receive an error like this if it is not: [CRITICAL] The "ext_pillar" option is malformed While minion_id, pillar, and config are all passed into the ext_pillar() function (in that order), Salt doesn't actually care what you call the variables in your function definition. You could call them emeril, mario, and alton if you wanted (not that you would). But whatever you call them, they must all still be there. Another external pillar Let's put together another external pillar so that it doesn't get confused with our first one. This one's job is to check the status of a web service. First, let's write our pillar code: ''' Get status from HTTP service in JSON format. This file should be saved as salt/pillar/http_status.py ''' import salt.utils.http def ext_pillar(minion_id, pillar, config): ''' Call a web service which returns status in JSON format ''' comps = config.split() key = comps[0] url = comps[1] status = salt.utils.http.query(url, decode=True) return {key: status['dict']} Save this file as salt/pillar/http_status.py. Then, go ahead and update your ext_pillar configuration to point to it. For now, we'll use GitHub's status URL: ext_pillar - http_status: github https://status.github.com/api/status.json Save the configuration, and then test the pillar: # salt-call --local pillar.item github local: ---------- github: ---------- last_updated: 2015-12-02T05:22:16Z status: good If you need to be able to check the status on multiple services, you can use the same external pillar multiple times but with different configurations. Try updating your ext_pillar definition to contain these two entries: ext_pillar - http_status: github https://status.github.com/api/status.json - http_status: github2 https://status.github.com/api/status.json Now, this can quickly become a problem. GitHub won't be happy with you if you're constantly hitting their status API. So, as nice as it is to get real-time status updates, you may want to do something to throttle your queries. Let's save the status in a file and return it from there. We will check the file's timestamp to make sure it doesn't get updated more than once a minute. Let's go ahead and update the entire external pillar: ''' Get status from HTTP service in JSON format. This file should be saved as salt/pillar/http_status.py ''' import json import time import datetime import os.path import salt.utils.http def ext_pillar(minion_id, # pylint: disable=W0613 pillar, # pylint: disable=W0613 config): ''' Return the pillar data ''' comps = config.split() key = comps[0] url = comps[1] refresh = False status_file = '/tmp/status-{0}.json'.format(key) if not os.path.exists(status_file): refresh = True else: stamp = os.path.getmtime(status_file) now = int(time.mktime(datetime.datetime.now().timetuple())) if now - 60 >= stamp: refresh = True if refresh: salt.utils.http.query(url, decode=True, decode_out=status_file) with salt.utils.fopen(status_file, 'r') as fp_: return {key: json.load(fp_)} Now we've set a flag called refresh, and the URL will only be hit when that flag is True. We've also defined a file that will cache the content obtained from that URL. The file will contain the name given to the pillar, so it will end up having a name like /tmp/status-github.json. The following two lines will retrieve the last modified time of the file and the current time in seconds: stamp = os.path.getmtime(status_file) now = int(time.mktime(datetime.datetime.now().timetuple())) And comparing the two, we can determine whether the file is more than 60 seconds old. If we wanted to make the pillar even more configurable, we could even move that 60 to the config parameter and pull it from comps[2]. Troubleshooting grains and pillars While writing grains and pillars, you may encounter some difficulties. Let's take a look at the most common problems you might have. Dynamic grains not showing up You may find that when you issue a grains.items command from the Master, your dynamic grains don't show up. This can be difficult to track down, because grains are evaluated on the Minion, and any errors aren't likely to make it back over the wire to you. When you find that dynamic grains aren't showing up as you expect, it's usually easiest to log in to the Minion directly to troubleshoot. Open up a shell and try issuing a salt-call command to see whether any errors manifest themselves. If they don't immediately, try adding --log-level=debug to your command to see whether any errors have been hiding at that level. Using a trace log level might also be necessary. External pillars not showing up External pillars can be a little more difficult to pick out. Using salt-call is effective for finding errors in grains because all of the code can be executed without starting up or contacting a service. But pillars come from the Master, unless you're running salt-call in local mode. If you are able to install your external pillar code on a Minion for testing, then the steps are the same as for checking for grains errors. But if you find yourself in a situation where the Master's environment cannot be duplicated on a Minion, you will need to use a different tactic: Stop the salt-master service on the Master and then start it back up in the foreground with a debug log level: # salt-master --log-level debug Then, open up another shell and check the pillars for an affected Minion: # salt <minionid> pillar.items Any errors in the pillar code should manifest themselves in the window with salt-master running in the foreground. Writing SDB modules SDB is a relatively new type of module, and is ripe for development. It stands for Simple Database, and it is designed to allow data to be simple to query, using a very short URI. The underlying configuration could be as complex as necessary so long as the URI that is used to query it is as simple as possible. Another design goal of SDB is that URIs can mask sensitive pieces of information to prevent them being stored directly inside a configuration file. For instance, passwords are often required for other types of modules, such as the mysql modules, but it is a poor practice to store passwords in files that are then stored inside a revision control system such as Git. Using SDB to look up passwords on the fly allows references to the passwords to be stored but not the passwords themselves. This makes it much safer to store files that reference sensitive data inside revision control systems. There is one supposed function that could be tempting to use SDB for: storing encrypted data on the Minion that cannot be read by the Master. It is possible to run agents on a Minion that require local authentication, such as typing in a password from the Minion's keyboard or using a hardware encryption device. SDB modules can be made that make use of these agents, and due to their very nature, the authentication credentials themselves cannot be retrieved by the Master. The problem is that the Master can access anything that a Minion subscribing to it can. While the data may be stored in an encrypted database on the Minion and while its transfer to the Master is certainly encrypted, once it gets to the Master, it can still be read in plaintext. Getting SDB data There are only two public functions that are used for SDB: get and set. And in truth, the only important one of these is get, since set can usually be done outside of Salt entirely. Let's go ahead and take a look at get. For our example, we'll create a module that reads a JSON file and then returns the requested key from it. First, let's set up our JSON file: { "user": "larry", "password": "123pass" } Save this file as /root/mydata.json. Then, edit the minion configuration file and add a configuration profile: myjson: driver: json json_file: /root/mydata.json With these two things in place, we're ready to start writing our module. JSON has a very simple interface, so there won't be much here: ''' SDB module for JSON This file should be saved as salt/sdb/json.py ''' from __future__ import absolute_import import salt.utils import json def get(key, profile=None): ''' Get a value from a JSON file ''' with salt.utils.fopen(profile['json_file'], 'r') as fp_: json_data = json.load(fp_) return json_data.get(key, None) You've probably noticed that we're added a couple of extra things outside of the necessary JSON code. First, we imported something called absolute_import. This is because this file is called json.py, and it's importing another library called json. Without absolute_import, the file would try to import itself and be unable to find the necessary functions from the actual json library. We've also imported salt.utils so that we can make use of the fopen() function that ships with Salt. This is a wrapper around Python's own open() built-in, which adds some extra error handling that Salt makes use of. The get() function takes two arguments: key and profile. The key argument refers to the key that will be used to access the data that we need, while profile is a copy of the profile data that we save as myjson in the minion configuration file. The SDB URI makes use of these two items. When we build this URI, it will be formatted like this: sdb://<profile_name>/<key> For instance, if we were to use the sdb execution module to retrieve the value of key1, our command would look like this: # salt-call --local sdb.get sdb://myjson/user local: larry With this module and profile in place, we can now add lines to the minion configuration (or to grains, pillars, or even the master configuration) that look like this: username: sdb://myjson/user password: sdb://myjson/password When a module that uses config.get comes across an SDB URI, it will automatically translate it on the fly to the appropriate data. Before we move on, let's update this function a little bit to perform some error handling. If the user makes a typo in the profile (such as json_fle instead of json_file), the file being referenced doesn't exist, or the JSON isn't formatted correctly, then this module will start spitting out traceback messages. Let's go ahead and handle all of this using Salt's own CommandExecutionError: from __future__ import absolute_import from salt.exceptions import CommandExecutionError import salt.utils import json def get(key, profile=None): ''' Get a value from a JSON file ''' try: with salt.utils.fopen(profile['json_file'], 'r') as fp_: json_data = json.load(fp_) return json_data.get(key, None) except IOError as exc: raise CommandExecutionError (exc) except KeyError as exc: raise CommandExecutionError ('{0} needs to be configured'.format(exc)) except ValueError as exc: raise CommandExecutionError ( 'There was an error with the JSON data: {0}'.format(exc) ) The IOError method will catch problems with a path that doesn't point to a real file. The KeyError method will catch errors with missing profile configuration (which would happen if one of the items were misspelled). The ValueError method will catch problems with an improperly formatted JSON file. This is what errors look like originally: Traceback (most recent call last): File "/usr/bin/salt-call", line 11, in <module> salt_call() File "/usr/lib/python2.7/site-packages/salt/scripts.py", line 333, in salt_call client.run() File "/usr/lib/python2.7/site-packages/salt/cli/call.py", line 58, in run caller.run() File "/usr/lib/python2.7/site-packages/salt/cli/caller.py", line 133, in run ret = self.call() File "/usr/lib/python2.7/site-packages/salt/cli/caller.py", line 196, in call ret['return'] = func(*args, **kwargs) File "/usr/lib/python2.7/site-packages/salt/modules/sdb.py", line 28, in get return salt.utils.sdb.sdb_get(uri, __opts__) File "/usr/lib/python2.7/site-packages/salt/utils/sdb.py", line 37, in sdb_get return loaded_db[fun](query, profile=profile) File "/usr/lib/python2.7/site-packages/salt/sdb/json_sdb.py", line 49, in get with salt.utils.fopen(profile['json_fil']) as fp_: KeyError: 'json_fil' Our code will turn them into something like this: Error running 'sdb.get': 'json_fil' needs to be configured Setting SDB data The function used for set may look strange because set is a Python built-in. This means that the function cannot be called set(); it must be called something else and then given an alias using the __func_alias__ dictionary. Let's now create a function that does nothing except returning the value to be set: __func_alias__ = { 'set_': 'set' } def set_(key, value, profile=None): ''' Set a key/value pair in a JSON file ''' return value This will be enough for your purposes with read-only data, but in our case, we're going to modify the JSON file. First, let's look at the arguments that are passed to our function. You already know that key points to the data to be referenced and that profile contains a copy of the profile data from the minion configuration file. And you can probably guess that value contains a copy of the data to be applied. The value doesn't change the actual URI; that will always be the same whether you're getting or setting data. The execution module itself is what accepts the data to be set and then sets it. You can see that with this command: # salt-call --local sdb.set sdb://myjson/password 321pass local: 321pass With that in mind, let's go ahead and make our module read the JSON file, apply the new value, and then write it back out again. For now, we'll skip error handling in order to make it easier to read: def set_(key, value, profile=None): ''' Set a key/value pair in a JSON file ''' with salt.utils.fopen(profile['json_file'], 'r') as fp_: json_data = json.load(fp_) json_data[key] = value with salt.utils.fopen(profile['json_file'], 'w') as fp_: json.dump(json_data, fp_) return get(key, profile) This function reads the JSON file as before, updates the specific value (creating it if necessary), and then writes the file back out. When it's finished, it returns the data using the get() function so that the user knows whether it was set properly. If it returns the wrong data, then the user will know that something went wrong. It won't necessarily tell them what went wrong, but it will raise a red flag. Let's add some error handling to help the user know what went wrong. We'll add in the error handling from the get() function now too: def set_(key, value, profile=None): # pylint: disable=W0613 ''' Set a key/value pair in a JSON file ''' try: with salt.utils.fopen(profile['json_file'], 'r') as fp_: json_data = json.load(fp_) except IOError as exc: raise CommandExecutionError (exc) except KeyError as exc: raise CommandExecutionError ('{0} needs to be configured'.format(exc)) except ValueError as exc: raise CommandExecutionError ( 'There was an error with the JSON data: {0}'.format(exc) ) json_data[key] = value try: with salt.utils.fopen(profile['json_file'], 'w') as fp_: json.dump(json_data, fp_) except IOError as exc: raise CommandExecutionError (exc) return get(key, profile) Because we did all of this error handling when reading the file, by the time we get to writing it back again, we already know that the path is value, the JSON is valid, and there are no profile errors. However, there could still be errors in saving the file. Try the following: # chattr +i /root/mydata.json # salt-call --local sdb.set sdb://myjson/password 456pass You'll get this output: Error running 'sdb.set': [Errno 13] Permission denied: '/root/mydata.json' We've changed the attribute of the file to make it immutable (read-only), and we can no longer write to the file. Without the IOError method, we would get an ugly traceback message just like before. Removing the immutable attribute will allow our function to run properly: # chattr -i /root/mydata.json # salt-call --local sdb.set sdb://myjson/password 456pass local: 456pass Summary The three areas of Salt configuration that can be hooked into using the loader system are dynamic grains, external pillars, and SDB. Grains are generated on the Minion, pillars are generated on the Master, and SDB URIs can be configured at either place. SDB modules allow configuration to be stored outside, but referenced from, the various parts of the Salt configuration. When accessed from execution modules, they are resolved on the Minion. When accessed from Salt Cloud, they are resolved on whichever system is running Salt Cloud. Resources for Article: Further resources on this subject: Introducing Salt [article] Diving into Salt Internals [article] Networking [article]
Read more
  • 0
  • 0
  • 4008

article-image-probability-r
Packt
23 Feb 2016
17 min read
Save for later

Probability of R?

Packt
23 Feb 2016
17 min read
It's time for us to put descriptive statistics down for the time being. It was fun for a while, but we're no longer content just determining the properties of observed data; now we want to start making deductions about data we haven't observed. This leads us to the realm of inferential statistics. In data analysis, probability is used to quantify uncertainty of our deductions about unobserved data. In the land of inferential statistics, probability reigns queen. Many regard her as a harsh mistress, but that's just a rumor. (For more resources related to this topic, see here.) Basic probability Probability measures the likeliness that a particular event will occur. When mathematicians (us, for now!) speak of an event, we are referring to a set of potential outcomes of an experiment, or trial, to which we can assign a probability of occurrence. Probabilities are expressed as a number between 0 and 1 (or as a percentage out of 100). An event with a probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur. The canonical example of probability at work is a coin flip. In the coin flip event, there are two outcomes: the coin lands on heads, or the coin lands on tails. Pretending that coins never land on their edge (they almost never do), those two outcomes are the only ones possible. The sample space (the set of all possible outcomes), therefore, is {heads, tails}. Since the entire sample space is covered by these two outcomes, they are said to be collectively exhaustive. The sum of the probabilities of collectively exhaustive events is always 1. In this example, the probability that the coin flip will yield heads or yield tails is 1; it is certain that the coin will land on one of those. In a fair and correctly balanced coin, each of those two outcomes is equally likely. Therefore, we split the probability equally among the outcomes: in the event of a coin flip, the probability of obtaining heads is 0.5, and the probability of tails is 0.5 as well. This is usually denoted as follows: The probability of a coin flip yielding either heads or tails looks like this: And the probability of a coin flip yielding both heads and tails is denoted as follows: The two outcomes, in addition to being collectively exhaustive, are also mutually exclusive. This means that they can never co-occur. This is why the probability of heads and tails is 0; it just can't happen. The next obligatory application of beginner probability theory is in the case of rolling a standard six-sided die. In the event of a die roll, the sample space is {1, 2, 3, 4, 5, 6}. With every roll of the die, we are sampling from this space. In this event, too, each outcome is equally likely, except now we have to divide the probability across six outcomes. In the following equation, we denote the probability of rolling a 1 as P(1): Rolling a 1 or rolling a 2 is not collectively exhaustive (we can still roll a 3, 4, 5, or 6), but they are mutually exclusive; we can't roll a 1 and 2. If we want to calculate the probability of either one of two mutually exclusive events occurring, we add the probabilities: While rolling a 1 or rolling a 2 aren't mutually exhaustive, rolling 1 and not rolling a 1 are. This is usually denoted in this manner: These two events—and all events that are both collectively exhaustive and mutually exclusive—are called complementary events. Our last pedagogical example in the basic probability theory is using a deck of cards. Our deck has 52 cards—4 for each number from 2 to 10 and 4 each of Jack, Queen, King, and Ace (no Jokers!). Each of these 4 cards belong to one suit, either a Heart, Club, Spade or Diamond. There are, therefore, 13 cards in each suit. Further, every Heart and Diamond card is colored red, and every Spade and Club are black. From this, we can deduce the following probabilities for the outcome of randomly choosing a card: What, then, is the probability of getting a black card and an Ace? Well, these events are conditionally independent, meaning that the probability of either outcome does not affect the probability of the other. In cases like these, the probability of event A and event B is the product of the probability of A and the probability of B. Therefore: Intuitively, this makes sense, because there are two black Aces out of a possible 52. What about the probability that we choose a red card and a Heart? These two outcomes are not conditionally independent, because knowing that the card is red has a bearing on the likelihood that the card is also a Heart. In cases like these, the probability of event A and B is denoted as follows: Where P(A|B) means the probability of A given B. For example, if we represent A as drawing a Heart and B as drawing a red card, P(A | B) means what's the probability of drawing a heart if we know that the card we drew was red?. Since a red card is equally likely to be a Heart or a Diamond, P(A|B) is 0.5. Therefore: In the preceding equation, we used the form P(B) P(A|B). Had we used the form P(A) P(B|A), we would have got the same answer: So, these two forms are equivalent: For kicks, let's divide both sides of the equation by P(B). That yields the following equivalence: This equation is known as Bayes' Theorem. This equation is very easy to derive, but its meaning and influence is profound. In fact, it is one of the most famous equations in all of mathematics. Bayes' Theorem has been applied to and proven useful in an enormous amount of different disciplines and contexts. It was used to help crack the German Enigma code during World War II, saving the lives of millions. It was also used recently, and famously, by Nate Silver to help correctly predict the voting patterns of 49 states in the 2008 US presidential election. At its core, Bayes' Theorem tells us how to update the probability of a hypothesis in light of new evidence. Due to this, the following formulation of Bayes' Theorem is often more intuitive: where H is the hypothesis and E is the evidence. Let's see an example of Bayes' Theorem in action! There's a hot new recreational drug on the scene called Allighate (or Ally for short). It's named as such because it makes its users go wild and act like an alligator. Since the effect of the drug is so deleterious, very few people actually take the drug. In fact, only about 1 in every thousand people (0.1%) take it. Frightened by fear-mongering late-night news, Daisy Girl, Inc., a technology consulting firm, ordered an Allighate testing kit for all of its 200 employees so that it could offer treatment to any employee who has been using it. Not sparing any expense, they bought the best kit on the market; it had 99% sensitivity and 99% specificity. This means that it correctly identified drug users 99 out of 100 times, and only falsely identified a non-user as a user once in every 100 times. When the results finally came back, two employees tested positive. Though the two denied using the drug, their supervisor, Ronald, was ready to send them off to get help. Just as Ronald was about to send them off, Shanice, a clever employee from the statistics department, came to their defense. Ronald incorrectly assumed that each of the employees who tested positive were using the drug with 99% certainty and, therefore, the chances that both were using it was 98%. Shanice explained that it was actually far more likely that neither employee was using Allighate. How so? Let's find out by applying Bayes' theorem! Let's focus on just one employee right now; let H be the hypothesis that one of the employees is using Ally, and E represent the evidence that the employee tested positive. We want to solve the left side of the equation, so let's plug in values. The first part of the right side of the equation, P(Positive Test | Ally User), is called the likelihood. The probability of testing positive if you use the drug is 99%; this is what tripped up Ronald—and most other people when they first heard of the problem. The second part, P(Ally User), is called the prior. This is our belief that any one person has used the drug before we receive any evidence. Since we know that only .1% of people use Ally, this would be a reasonable choice for a prior. Finally, the denominator of the equation is a normalizing constant, which ensures that the final probability in the equation will add up to one of all possible hypotheses. Finally, the value we are trying to solve, P(Ally user | Positive Test), is the posterior. It is the probability of our hypothesis updated to reflect new evidence. In many practical settings, computing the normalizing factor is very difficult. In this case, because there are only two possible hypotheses, being a user or not, the probability of finding the evidence of a positive test is given as follows: Which is: (.99 * .001) + (.01 * .999) = 0.01098 Plugging that into the denominator, our final answer is calculated as follows: Note that the new evidence, which favored the hypothesis that the employee was using Ally, shifted our prior belief from .001 to .09. Even so, our prior belief about whether an employee was using Ally was so extraordinarily low, it would take some very very strong evidence indeed to convince us that an employee was an Ally user. Ignoring the prior probability in cases like these is known as base-rate fallacy. Shanice assuaged Ronald's embarrassment by assuring him that it was a very common mistake. Now to extend this to two employees: the probability of any two employees both using the drug is, as we now know, .01 squared, or 1 million to one. Squaring our new posterior yields, we get .0081. The probability that both employees use Ally, even given their positive results, is less than 1%. So, they are exonerated. Sally is a different story, though. Her friends noticed her behavior had dramatically changed as of late—she snaps at co-workers and has taken to eating pencils. Her concerned cubicle-mate even followed her after work and saw her crawl into a sewer, not to emerge until the next day to go back to work. Even though Sally passed the drug test, we know that it's likely (almost certain) that she uses Ally. Bayes' theorem gives us a way to quantify that probability! Our prior is the same, but now our likelihood is pretty much as close to 1 as you can get - after all, how many non-Ally users do you think eat pencils and live in sewers? A tale of two interpretations Though it may seem strange to hear, there is actually a hot philosophical debate about what probability really is. Though there are others, the two primary camps into which virtually all mathematicians fall are the frequentist camp and the Bayesian camp. The frequentist interpretation describes probability as the relative likelihood of observing an outcome in an experiment when you repeat the experiment multiple times. Flipping a coin is a perfect example; the probability of heads converges to 50% as the number of times it is flipped goes to infinity. The frequentist interpretation of probability is inherently objective; there is a true probability out there in the world, which we are trying to estimate. The Bayesian interpretation, however, views probability as our degree of belief about something. Because of this, the Bayesian interpretation is subjective; when evidence is scarce, there are sometimes wildly different degrees of belief among different people. Described in this manner, Bayesianism may scare many people off, but it is actually quite intuitive. For example, when a meteorologist describes the probability of rain as 70%, people rarely bat an eyelash. But this number only really makes sense within a Bayesian framework because exact meteorological conditions are not repeatable, as is required by frequentist probability. Not simply a heady academic exercise, these two interpretations lead to different methodologies in solving problems in data analysis. Many times, both approaches lead to similar results. Though practitioners may strongly align themselves with one side over another, good statisticians know that there's a time and a place for both approaches. Though Bayesianism as a valid way of looking at probability is debated, Bayes theorem is a fact about probability and is undisputed and non-controversial. Sampling from distributions Observing the outcome of trials that involve a random variable, a variable whose value changes due to chance, can be thought of as sampling from a probability distribution—one that describes the likelihood of each member of the sample space occurring. That sentence probably sounds much scarier than it needs to be. Take a die roll for example. Figure 4.1: Probability distribution of outcomes of a die roll Each roll of a die is like sampling from a discrete probability distribution for which each outcome in the sample space has a probability of 0.167 or 1/6. This is an example of a uniform distribution, because all the outcomes are uniformly as likely to occur. Further, there are a finite number of outcomes, so this is a discrete uniform distribution (there also exist continuous uniform distributions). Flipping a coin is like sampling from a uniform distribution with only two outcomes. More specifically, the probability distribution that describes coin-flip events is called a Bernoulli distribution—it's a distribution describing only two events. Parameters We use probability distributions to describe the behavior of random variables because they make it easy to compute with and give us a lot of information about how a variable behaves. But before we perform computations with probability distributions, we have to specify the parameters of those distributions. These parameters will determine exactly what the distribution looks like and how it will behave. For example, the behavior of both a 6-sided die and a 12-sided die is modeled with a uniform distribution. Even though the behavior of both the dice is modeled as uniform distributions, the behavior of each is a little different. To further specify the behavior of each distribution, we detail its parameter; in the case of the (discrete) uniform distribution, the parameter is called n. A uniform distribution with parameter n has n equally likely outcomes of probability 1 / n. The n for a 6-sided die and a 12-sided die is 6 and 12 respectively. For a Bernoulli distribution, which describes the probability distribution of an event with only two outcomes, the parameter is p. Outcome 1 occurs with probability p, and the other outcome occurs with probability 1 - p, because they are collectively exhaustive. The flip of a fair coin is modeled as a Bernoulli distribution with p = 0.5. Imagine a six-sided die with one side labeled 1 and the other five sides labeled 2. The outcome of the die roll trials can be described with a Bernoulli distribution, too! This time, p = 0.16 (1/6). Therefore, the probability of not rolling a 1 is 5/6. The binomial distribution The binomial distribution is a fun one. Like our uniform distribution described in the previous section, it is discrete. When an event has two possible outcomes, success or failure, this distribution describes the number of successes in a certain number of trials. Its parameters are n, the number of trials, and p, the probability of success. Concretely, a binomial distribution with n=1 and p=0.5 describes the behavior of a single coin flip—if we choose to view heads as successes (we could also choose to view tails as successes). A binomial distribution with n=30 and p=0.5 describes the number of heads we should expect. Figure 4.2: A binomial distribution (n=30, p=0.5) On average, of course, we would expect to have 15 heads. However, randomness is the name of the game, and seeing more or fewer heads is totally expected. How can we use the binomial distribution in practice?, you ask. Well, let's look at an application. Larry the Untrustworthy Knave—who can only be trusted some of the time—gives us a coin that he alleges is fair. We flip it 30 times and observe 10 heads. It turns out that the probability of getting exactly 10 heads on 30 flips is about 2.8%*. We can use R to tell us the probability of getting 10 or fewer heads using the pbinom function:   > pbinom(10, size=30, prob=.5)   [1] 0.04936857 It appears as if the probability of this occurring, in a correctly balanced coin, is roughly 5%. Do you think we should take Larry at his word? *If you're interested The way we determined the probability of getting exactly 10 heads is by using the probability formula for Bernoulli trials. The probability of getting k successes in n trials is equal to: where p is the probability of getting one success and: The normal distribution When we described the normal distribution and how ubiquitous it is? The behavior of many random variables in real life is very well described by a normal distribution with certain parameters. The two parameters that uniquely specify a normal distribution are µ (mu) and σ (sigma). µ, the mean, describes where the distribution's peak is located and σ, the standard deviation, describes how wide or narrow the distribution is. Figure 4.3: Normal distributions with different parameters The distribution of heights of American females is approximately normally distributed with parameters µ= 65 inches and σ= 3.5 inches. Figure 4.4: Normal distributions with different parameters With this information, we can easily answer questions about how probable it is to choose, at random, US women of certain heights. We can't really answer the question What is the probability that we choose a person who is exactly 60 inches?, because virtually no one is exactly 60 inches. Instead, we answer questions about how probable it is that a random person is within a certain range of heights. What is the probability that a randomly chosen woman is 70 inches or taller? If you recall, the probability of a height within a range is the area under the curve, or the integral over that range. In this case, the range we will integrate looks like this: Figure 4.5: Area under the curve of the height distribution from 70 inches to positive infinity    > f <- function(x){ dnorm(x, mean=65, sd=3.5) }   > integrate(f, 70, Inf)   0.07656373 with absolute error < 2.2e-06 The preceding R code indicates that there is a 7.66% chance of randomly choosing a woman who is 70 inches or taller. Luckily for us, the normal distribution is so popular and well studied, that there is a function built into R, so we don't need to use integration ourselves.   > pnorm(70, mean=65, sd=3.5)   [1] 0.9234363  The pnorm function tells us the probability of choosing a woman who is shorter than 70 inches. If we want to find P (> 70 inches), we can either subtract this value by 1 (which gives us the complement) or use the optional argument lower.tail=FALSE. If you do this, you'll see that the result matches the 7.66% chance we arrived at earlier. Summary You can check out similar books published by Packt Publishing on R (https://www.packtpub.com/tech/r): Unsupervised Learning with R by Erik Rodríguez Pacheco (https://www.packtpub.com/big-data-and-business-intelligence/unsupervised-learning-r) R Data Science Essentials by Raja B. Koushik and Sharan Kumar Ravindran (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science-essentials) Resources for Article: Further resources on this subject: Dealing With A Mess [article] Navigating The Online Drupal Community [article] Design With Spring AOP [article]
Read more
  • 0
  • 0
  • 3191

article-image-dealing-mess
Packt
23 Feb 2016
59 min read
Save for later

Dealing with a Mess

Packt
23 Feb 2016
59 min read
Analyzing data in the real world often requires some know-how outside of the typical introductory data analysis curriculum. For example, rarely do we get a neatly formatted, tidy dataset with no errors, junk, or missing values. Rather, we often get messy, unwieldy datasets. What makes a dataset messy? Different people in different roles have different ideas about what constitutes messiness. Some regard any data that invalidates the assumptions of the parametric model as messy. Others see messiness in datasets with a grievously imbalanced number of observations in each category for a categorical variable. Some examples of things that I would consider messy are: Many missing values (NAs) Misspelled names in categorical variables Inconsistent data coding Numbers in the same column being in different units Mis-recorded data and data entry mistakes Extreme outliers Since there are an infinite number of ways that data can be messy, there's simply no chance of enumerating every example and their respective solutions. Instead, we are going to talk about two tools that help combat the bulk of the messiness issues that I cited just now. (For more resources related to this topic, see here.) Analysis with missing data Missing data is another one of those topics that are largely ignored in most introductory texts. Probably, part of the reason why this is the case is that many myths about analysis with missing data still abound. Additionally, some of the research into cutting-edge techniques is still relatively new. A more legitimate reason for its absence in introductory texts is that most of the more principled methodologies are fairly complicated—mathematically speaking. Nevertheless, the incredible ubiquity of problems related to missing data in real life data analysis necessitates some broaching of the subject. This section serves as a gentle introduction into the subject and one of the more effective techniques for dealing with it. A common refrain on the subject is something along the lines of the best way to deal with missing data is not to have any. It's true that missing data is a messy subject, and there are a lot of ways to do it wrong. It's important not to take this advice to the extreme, though. In order to bypass missing data problems, some have disallowed survey participants, for example, to go on without answering all the questions on a form. You can coerce the participants in a longitudinal study to not drop out, too. Don't do this. Not only is it unethical, it is also prodigiously counter-productive; there are treatments for missing data, but there are no treatments for bad data. The standard treatment to the problem of missing data is to replace the missing data with non-missing values. This process is called imputation. In most cases, the goal of imputation is not to recreate the lost completed dataset but to allow valid statistical estimates or inferences to be drawn from incomplete data. Because of this, the effectiveness of different imputation techniques can't be evaluated by their ability to most accurately recreate the data from a simulated missing dataset; they must, instead, be judged by their ability to support the same statistical inferences as would be drawn from the analysis on the complete data. In this way, filling in the missing data is only a step towards the real goal—the analysis. The imputed dataset is rarely considered the final goal of imputation. There are many different ways that missing data is dealt with in practice—some are good, some are not so good. Some are okay under certain circumstances, but not okay in others. Some involve missing data deletion, while some involve imputation. We will briefly touch on some of the more common methods. The ultimate goal of this article, though, is to get you started on what is often described as the gold-standard of imputation techniques: multiple imputation. Visualizing missing data In order to demonstrate the visualizing patterns of missing data, we first have to create some missing data. This will also be the same dataset that we perform analysis on later in the article. To showcase how to use multiple imputation for a semi-realistic scenario, we are going to create a version of the mtcars dataset with a few missing values: Okay, let's set the seed (for deterministic randomness), and create a variable to hold our new marred dataset. set.seed(2) miss_mtcars <- mtcars First, we are going to create seven missing values in drat (about 20 percent), five missing values in the mpg column (about 15 percent), five missing values in the cyl column, three missing values in wt (about 10 percent), and three missing values in vs: some_rows <- sample(1:nrow(miss_mtcars), 7) miss_mtcars$drat[some_rows] <- NA  some_rows <- sample(1:nrow(miss_mtcars), 5) miss_mtcars$mpg[some_rows] <- NA  some_rows <- sample(1:nrow(miss_mtcars), 5) miss_mtcars$cyl[some_rows] <- NA  some_rows <- sample(1:nrow(miss_mtcars), 3) miss_mtcars$wt[some_rows] <- NA  some_rows <- sample(1:nrow(miss_mtcars), 3) miss_mtcars$vs[some_rows] <- NA Now, we are going to create four missing values in qsec, but only for automatic cars: only_automatic <- which(miss_mtcars$am==0) some_rows <- sample(only_automatic, 4) miss_mtcars$qsec[some_rows] <- NA Now, let's take a look at the dataset: > miss_mtcars                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 Datsun 710          22.8   4 108.0  93 3.85    NA 18.61  1  1    4    1 Hornet 4 Drive      21.4   6 258.0 110   NA 3.215 19.44  1  0    3    1 Hornet Sportabout   18.7   8 360.0 175   NA 3.440 17.02  0  0    3    2 Valiant             18.1  NA 225.0 105   NA 3.460    NA  1  0    3    1 Great, now let's visualize the missingness. The first way we are going to visualize the pattern of missing data is by using the md.pattern function from the mice package (which is also the package that we are ultimately going to use for imputing our missing data). If you don't have the package already, install it. > library(mice) > md.pattern(miss_mtcars)    disp hp am gear carb wt vs qsec mpg cyl drat   12    1  1  1    1    1  1  1    1   1   1    1  0  4    1  1  1    1    1  1  1    1   0   1    1  1  2    1  1  1    1    1  1  1    1   1   0    1  1  3    1  1  1    1    1  1  1    1   1   1    0  1  3    1  1  1    1    1  0  1    1   1   1    1  1  2    1  1  1    1    1  1  1    0   1   1    1  1  1    1  1  1    1    1  1  1    1   0   1    0  2  1    1  1  1    1    1  1  1    0   1   0    1  2  1    1  1  1    1    1  1  0    1   1   0    1  2  2    1  1  1    1    1  1  0    1   1   1    0  2  1    1  1  1    1    1  1  1    0   1   0    0  3       0  0  0    0    0  3  3    4   5   5    7 27 A row-wise missing data pattern refers to the columns that are missing for each row. This function aggregates and counts the number of rows with the same missing data pattern. This function outputs a binary (0 and 1) matrix. Cells with a 1 represent non-missing data; 0s represent missing data. Since the rows are sorted in an increasing-amount-of-missingness order, the first row always refers to the missing data pattern containing the least amount of missing data. In this case, the missing data pattern with the least amount of missing data is the pattern containing no missing data at all. Because of this, the first row has all 1s in the columns that are named after the columns in the miss_mtcars dataset. The left-most column is a count of the number of rows that display the missing data pattern, and the right-most column is a count of the number of missing data points in that pattern. The last row contains a count of the number of missing data points in each column. As you can see, 12 of the rows contain no missing data. The next most common missing data pattern is the one with missing just mpg; four rows fit this pattern. There are only six rows that contain more than one missing value. Only one of these rows contains more than two missing values (as shown in the second-to-last row). As far as datasets with missing data go, this particular one doesn't contain much. It is not uncommon for some datasets to have more than 30 percent of its data missing. This data set doesn't even hit 3 percent. Now let's visualize the missing data pattern graphically using the VIM package. You will probably have to install this, too. library(VIM) aggr(miss_mtcars, numbers=TRUE) Figure11.1: The output of VIM's visual aggregation of missing data. The left plot shows the proportion on missing values for each column. The right plot depicts the prevalence of row-wise missing data patterns, like md.pattern At a glance, this representation shows us, effortlessly, that the drat column accounts for the highest proportion of missingness, column-wise, followed by mpg, cyl, qsec, vs, and wt. The graphic on the right shows us information similar to that of the output of md.pattern. This representation, though, makes it easier to tell if there is some systematic pattern of missingness. The blue cells represent non-missing data, and the red cells represent missing data. The numbers on the right of the graphic represent the proportion of rows displaying that missing data pattern. 37.5 percent of the rows contain no missing data whatsoever. Types of missing data The VIM package allowed us to visualize the missing data patterns. A related term, the missing data mechanism, describes the process that determines each data point's likelihood of being missing. There are three main categories of missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Discrimination based on missing data mechanism is crucial, since it informs us about the options for handling the missingness. The first mechanism, MCAR, occurs when data's missingness is unrelated to the data. This would occur, for example, if rows were deleted from a database at random, or if a gust of wind took a random sample of a surveyor's survey forms off into the horizon. The mechanism that governs the missingness of drat, mpg, cyl, wt, and vs' is MCAR, because we randomly selected elements to go missing. This mechanism, while being the easiest to work with, is seldom tenable in practice. MNAR, on the other hand, occurs when a variable's missingness is related to the variable itself. For example, suppose the scale that weighed each car had a capacity of only 3,700 pounds, and because of this, the eight cars that weighed more than that were recorded as NA. This is a classic example of the MNAR mechanism—it is the weight of the observation itself that is the cause for its being missing. Another example would be if during the course of trial of an anti-depressant drug, participants who were not being helped by the drug became too depressed to continue with the trial. At the end of the trial, when all the participants' level of depression is accessed and recorded, there would be missing values for participants whose reason for absence is related to their level of depression. The last mechanism, missing at random, is somewhat unfortunately named. Contrary to what it may sound like, it means there is a systematic relationship between the missingness of an outcome variable' and other observed variables, but not the outcome variable itself. This is probably best explained by the following example. Suppose that in a survey, there is a question about income level that, in its wording, uses a particular colloquialism. Due to this, a large number of the participants in the survey whose native language is not English couldn't interpret the question, and left it blank. If the survey collected just the name, gender, and income, the missing data mechanism of the question on income would be MNAR. If, however, the questionnaire included a question that asked if the participant spoke English as a first language, then the mechanism would be MAR. The inclusion of the Is English your first language? variable means that the missingness of the income question can be completely accounted for. The reason for the moniker missing at random is that when you control the relationship between the missing variable and the observed variable(s) it is related to (for example, What is your income? and Is English your first language? respectively), the data are missing at random. As another example, there is a systematic relationship between the am and qsec variables in our simulated missing dataset: qsecs were missing only for automatic cars. But within the group of automatic cars, the qsec variable is missing at random. Therefore, qsec 's mechanism is MAR; controlling for transmission type, qsec is missing at random. Bear in mind, though, if we removed am from our simulated dataset, qsec would become MNAR. As mentioned earlier, MCAR is the easiest type to work with because of the complete absence of a systematic relationship in the data's missingness. Many unsophisticated techniques for handling missing data rest on the assumption that the data are MCAR. On the other hand, MNAR data is the hardest to work with since the properties of the missing data that caused its missingness has to be understood quantifiably, and included in the imputation model. Though multiple imputations can handle the MNAR mechanisms, the procedures involved become more complicated and far beyond the scope of this text. The MCAR and MAR mechanisms allow us not to worry about the properties and parameters of the missing data. For this reason, may sometimes find MCAR or MAR missingness being referred to as ignorable missingness. MAR data is not as hard to work with as MNAR data, but it is not as forgiving as MCAR. For this reason, though our simulated dataset contains MCAR and MAR components, the mechanism that describes the whole data is MAR—just one MAR mechanism makes the whole dataset MAR. So which one is it? You may have noticed that the place of a particular dataset in the missing data mechanism taxonomy is dependent on the variables that it includes. For example, we know that the mechanism behind qsec is MAR, but if the dataset did not include am, it would be MNAR. Since we are the ones that created the data, we know the procedure that resulted in qsec 's missing values. If we weren't the ones creating the data—as happens in the real world—and the dataset did not contain the am column, we would just see a bunch of arbitrarily missing qsec values. This might lead us to believe that the data is MCAR. It isn't, though; just because the variable to which another variable's missingness is systematically related is non-observed, doesn't mean that it doesn't exist. This raises a critical question: can we ever be sure that our data is not MNAR? The unfortunate answer is no. Since the data that we need to prove or disprove MNAR is ipso facto missing, the MNAR assumption can never be conclusively disconfirmed. It's our job, as critically thinking data analysts, to ask whether there is likely an MNAR mechanism or not. Unsophisticated methods for dealing with missing data Here we are going to look at various types of methods for dealing with missing data: Complete case analysis This method, also called list-wise deletion, is a straightforward procedure that simply removes all rows or elements containing missing values prior to the analysis. In the univariate case—taking the mean of the drat column, for example—all elements of drat that are missing would simply be removed: > mean(miss_mtcars$drat) [1] NA > mean(miss_mtcars$drat, na.rm=TRUE) [1] 3.63 In a multivariate procedure—for example, linear regression predicting mpg from am, wt, and qsec—all rows that have a missing value in any of the columns included in the regression are removed: listwise_model <- lm(mpg ~ am + wt + qsec,                      data=miss_mtcars,                      na.action = na.omit) ## OR # complete.cases returns a boolean vector comp <- complete.cases(cbind(miss_mtcars$mpg,                              miss_mtcars$am,                              miss_mtcars$wt,                              miss_mtcars$qsec)) comp_mtcars <- mtcars[comp,] listwise_model <- lm(mpg ~ am + wt + qsec,                      data=comp_mtcars) Under an MCAR mechanism, a complete case analysis produces unbiased estimates of the mean, variance/standard deviation, and regression coefficients, which means that the estimates don't systematically differ from the true values on average, since the included data elements are just a random sampling of the recorded data elements. However, inference-wise, since we lost a number of our samples, we are going to lose statistical power and generate standard errors and confidence intervals that are bigger than they need to be. Additionally, in the multivariate regression case, note that our sample size depends on the variables that we include in the regression; more the variables, more is the missing data that we open ourselves up to, and more the rows that we are liable to lose. This makes comparing results across different models slightly hairy. Under an MAR or MNAR mechanism, list-wise deletion will produce biased estimates of the mean and variance. For example, if am were highly correlated with qsec, the fact that we are missing qsec only for automatic cars would significantly shift our estimates of the mean of qsec. Surprisingly, list-wise deletion produces unbiased estimates of the regression coefficients, even if the data is MNAR or MAR, as long as the relevant variables are included in the regression equations. For this reason, if there are relatively few missing values in a data set that is to be used in regression analysis, list-wise deletion could be an acceptable alternative to more principled approaches. Pairwise deletion Also called available-case analysis, this technique is (somewhat unfortunately) common when estimating covariance or correlation matrices. For each pair of variables, it only uses the cases that are non-missing for both. This often means that the number of elements used will vary from cell to cell of the covariance/correlation matrices. This can result in absurd correlation coefficients that are above 1, making the resulting matrices largely useless to methodologies that depend on them. Mean substitution Mean substitution, as the name suggests, replaces all the missing values with the mean of the available cases. For example: mean_sub <- miss_mtcars mean_sub$qsec[is.na(mean_sub$qsec)] <- mean(mean_sub$qsec,                                             na.rm=TRUE) # etc... Although this seemingly solves the problem of the loss of sample size in the list-wise deletion procedure, mean substitution has some very unsavory properties of it's own. Whilst mean substitution produces unbiased estimates of the mean of a column, it produces biased estimates of the variance, since it removes the natural variability that would have occurred in the missing values had they not been missing. The variance estimates from mean substitution will therefore be, systematically, too small. Additionally, it's not hard to see that mean substitution will result in biased estimates if the data are MAR or MNAR. For these reasons, mean substitution is not recommended under virtually any circumstance. Hot deck imputation Hot deck imputation is an intuitively elegant approach that fills in the missing data with donor values from another row in the dataset. In the least sophisticated formulation, a random non-missing element from the same dataset is shared with a missing value. In more sophisticated hot deck approaches, the donor value comes from a row that is similar to the row with the missing data. The multiple imputation techniques that we will be using in a later section of this article borrows this idea for one of its imputation methods. The term hot deck refers to the old practice of storing data in decks of punch cards. The deck that holds the donor value would be hot because it is the one that is currently being processed. Regression imputation This approach attempts to fill in the missing data in a column using regression to predict likely values of the missing elements using other columns as predictors. For example, using regression imputation on the drat column would employ a linear regression predicting drat from all the other columns in miss_mtcars. The process would be repeated for all columns containing missing data, until the dataset is complete. This procedure is intuitively appealing, because it integrates knowledge of the other variables and patterns of the dataset. This creates a set of more informed imputations. As a result, this produces unbiased estimates of the mean and regression coefficients under MCAR and MAR (so long as the relevant variables are included in the regression model. However, this approach is not without its problems. The predicted values of the missing data lie right on the regression line but, as we know, very few data points lie right on the regression line—there is usually a normally distributed residual (error) term. Due to this, regression imputation underestimates the variability of the missing values. As a result, it will result in biased estimates of the variance and covariance between different columns. However, we're on the right track. Stochastic regression imputation As far as unsophisticated approaches go, stochastic regression is fairly evolved. This approach solves some of the issues of regression imputation, and produces unbiased estimates of the mean, variance, covariance, and regression coefficients under MCAR and MAR. It does this by adding a random (stochastic) value to the predictions of regression imputation. This random added value is sampled from the residual (error) distribution of the linear regression—which, if you remember, is assumed to be a normal distribution. This restores the variability in the missing values (that we lost in regression imputation) that those values would have had if they weren't missing. However, as far as subsequent analysis and inference on the imputed dataset goes, stochastic regression results in standard errors and confidence intervals that are smaller than they should be. Since it produces only one imputed dataset, it does not capture the extent to which we are uncertain about the residuals and our coefficient estimates. Nevertheless, stochastic regression forms the basis of still more sophisticated imputation methods. There are two sophisticated, well-founded, and recommended methods of dealing with missing data. One is called the Expectation Maximization (EM) method, which we do not cover here. The second is called Multiple Imputation, and because it is widely considered the most effective method, it is the one we explore in this article. Multiple imputation The big idea behind multiple imputation is that instead of generating one set of imputed data with our best estimation of the missing data, we generate multiple versions of the imputed data where the imputed values are drawn from a distribution. The uncertainty about what the imputed values should be is reflected in the variation between the multiply imputed datasets. We perform our intended analysis separately with each of these m amount of completed datasets. These analyses will then yield m different parameter estimates (like regression coefficients, and so on). The critical point is that these parameter estimates are different solely due to the variability in the imputed missing values, and hence, our uncertainty about what the imputed values should be. This is how multiple imputation integrates uncertainty, and outperforms more limited imputation methods that produce one imputed dataset, conferring an unwarranted sense of confidence in the filled-in data of our analysis. The following diagram illustrates this idea: Figure 11.2: Multiple imputation in a nutshell So how does mice come up with the imputed values? Let's focus on the univariate case—where only one column contains missing data and we use all the other (completed) columns to impute the missing values—before generalizing to a multivariate case. mice actually has a few different imputation methods up its sleeve, each best suited for a particular use case. mice will often choose sensible defaults based on the data type (continuous, binary, non-binary categorical, and so on). The most important method is what the package calls the norm method. This method is very much like stochastic regression. Each of the m imputations is created by adding a normal "noise" term to the output of a linear regression predicting the missing variable. What makes this slightly different than just stochastic regression repeated m times is that the norm method also integrates uncertainty about the regression coefficients used in the predictive linear model. Recall that the regression coefficients in a linear regression are just estimates of the population coefficients from a random sample (that's why each regression coefficient has a standard error and confidence interval). Another sample from the population would have yielded slightly different coefficient estimates. If through all our imputations, we just added a normal residual term from a linear regression equation with the same coefficients, we would be systematically understating our uncertainty regarding what the imputed values should be. To combat this, in multiple imputation, each imputation of the data contains two steps. The first step performs stochastic linear regression imputation using coefficients for each predictor estimated from the data. The second step chooses slightly different estimates of these regression coefficients, and proceeds into the next imputation. The first step of the next imputation uses the slightly different coefficient estimates to perform stochastic linear regression imputation again. After that, in the second step of the second iteration, still other coefficient estimates are generated to be used in the third imputation. This cycle goes on until we have m multiply imputed datasets. How do we choose these different coefficient estimates at the second step of each imputation? Traditionally, the approach is Bayesian in nature; these new coefficients are drawn from each of the coefficients' posterior distribution, which describes credible values of the estimate using the observed data and uninformative priors. This is the approach that norm uses. There is an alternate method that chooses these new coefficient estimates from a sampling distribution that is created by taking repeated samples of the data (with replacement) and estimating the regression coefficients of each of these samples. mice calls this method norm.boot. The multivariate case is a little more hairy, since the imputation for one column depends on the other columns, which may contain missing data of their own. For this reason, we make several passes over all the columns that need imputing, until the imputation of all missing data in a particular column is informed by informed estimates of the missing data in the predictor columns. These passes over all the columns are called iterations. So that you really understand how this iteration works, let's say we are performing multiple imputation on a subset of miss_mtcars containing only mpg, wt and drat. First, all the missing data in all the columns are set to a placeholder value like the mean or a randomly sampled non-missing value from its column. Then, we visit mpg where the placeholder values are turned back into missing values. These missing values are predicted using the two-part procedure described in the univariate case. Then we move on to wt; the placeholder values are turned back into missing values, whose new values are imputed with the two-step univariate procedure using mpg and drat as predictors. Then this is repeated with drat. This is one iteration. On the next iteration, it is not the placeholder values that get turned back into random values and imputed but the imputed values from the previous iteration. As this repeats, we shift away from the starting values and the imputed values begin to stabilize. This usually happens within just a few iterations. The dataset at the completion of the last iteration is the first multiply imputed dataset. Each m starts the iteration process all over again. The default in mice is five iterations. Of course, you can increase this number if you have reason to believe that you need to. We'll discuss how to tell if this is necessary later in the section. Methods of imputation The method of imputation that we described for the univariate case, norm, works best for imputed values that follow an unconstrained normal distribution—but it could lead to some nonsensical imputations otherwise. For example, since the weights in wt are so close to 0 (because it's in units of a thousand pounds) it is possible for the norm method to impute a negative weight. Though this will no doubt balance out over the other m-1 multiply imputed datasets, we can combat this situation by using another method of imputation called predictive mean matching. Predictive mean matching (mice calls this pmm) works a lot like norm. The difference is that the norm imputations are then used to find the d closest values to the imputed value among the non-missing data in the column. Then, one of these d values is chosen as the final imputed value—d=3 is the default in mice. This method has a few great properties. For one, the possibility of imputing a negative value for wt is categorically off the table; the imputed value would have to be chosen from the set {1.513, 1.615, 1.835}, since these are the three lowest weights. More generally, any natural constraint in the data (lower or upper bounds, integer count data, numbers rounded to the nearest one-half, and so on) is respected with predictive mean matching, because the imputed values appear in the actual non-missing observed values. In this way, predictive mean matching is like hot-deck imputation. Predictive mean matching is the default imputation method in mice for numerical data, though it may be inferior to norm for small datasets and/or datasets with a lot of missing values. Many of the other imputation methods in mice are specially suited for one particular data type. For example, binary categorical variables use logreg by default; this is like norm but uses logistic regression to impute a binary outcome. Similarly, non-binary categorical data uses multinomial regression—mice calls this method polyreg. Multiple imputation in practice There are a few steps to follow and decisions to make when using this powerful imputation technique: Are the data MAR?: And be honest! If the mechanism is likely not MAR, then more complicated measures have to be taken. Are there any derived terms, redundant variables, or irrelevant variables in the data set?: Any of these types of variables will interfere with the regression process. Irrelevant variables—like unique IDs—will not have any predictive power. Derived terms or redundant variables—like having a column for weight in pounds and grams, or a column for area in addition to a length and width column—will similarly interfere with the regression step. Convert all categorical variables to factors, otherwise mice will not be able to tell that the variable is supposed to be categorical. Choose number of iterations and m: By default, these are both five. Using five iterations is usually okay—and we'll be able to tell if we need more. Five imputations are usually okay, too, but we can achieve more statistical power from more imputed datasets. I suggest setting m to 20, unless the processing power and time can't be spared. Choose an imputation method for each variable: You can stick with the defaults as long as you are aware of what they are and think they're the right fit.     Choose the predictors: Let mice use all the available columns as predictors as long as derived terms and redundant/irrelevant columns are removed. Not only does using more    predictors result in reduced bias, but it also increases the likelihood that the data is MAR.     Perform the imputations     Audit the imputations     Perform analysis with the imputations     Pool the results of the analyses Before we get down to it, let's call the mice function on our data frame with missing data, and use its default arguments, just to see what we shouldn't do and why: # we are going to set the seed and printFlag to FALSE, but # everything else will the default argument imp <- mice(miss_mtcars, seed=3, printFlag=FALSE) print(imp)   ------------------------------   Multiply imputed data set Call: mice(data = miss_mtcars, printFlag = FALSE, seed = 3) Number of multiple imputations:  5 Missing cells per column:  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb    5    5    0    0    7    3    4    3    0    0    0 Imputation methods:   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb "pmm" "pmm"    ""    "" "pmm" "pmm" "pmm" "pmm"    ""    ""    "" VisitSequence:  mpg  cyl drat   wt qsec   vs    1    2    5    6    7    8 PredictorMatrix:      mpg cyl disp hp drat wt qsec vs am gear carb mpg    0   1    1  1    1  1    1  1  1    1    1 cyl    1   0    1  1    1  1    1  1  1    1    1 disp   0   0    0  0    0  0    0  0  0    0    0  ... Random generator seed value:  3 The first thing we notice (on line four of the output) is that mice chose to create five multiply imputed datasets, by default. As we discussed, this isn't a bad default, but more imputation can only improve our statistical power (if only marginally); when we impute this data set in earnest, we will use m=20. The second thing we notice (on lines 8-10 of the output) is that it used predictive mean matching as the imputation method for all the columns with missing data. If you recall, predictive mean matching is the default imputation method for numeric columns. However, vs and cyl are binary categorical and non-binary categorical variables, respectively. Because we didn't convert them to factors, mice thinks these are just regular numeric columns. We'll have to fix this. The last thing we should notice here is the predictor matrix (starting on line 14). Each row and column of the predictor matrix refers to a column in the dataset to impute. If a cell contains a 1, it means that the variable referred to in the column is used as a predictor for the variable in the row. The first row indicates that all available attributes are used to help predict mpg with the exception of mpg itself. All the values in the diagonal are 0, because mice won't use an attribute to predict itself. Note that the disp, hp, am, gear, and carb rows all contain `0`s—this is because these variables are complete, and don't need to use any predictors. Since we thought carefully about whether there were any attributes that should be removed before we perform the imputation, we can use mice's default predictor matrix for this dataset. If there were any non-predictive attributes (like unique identifiers, redundant variables, and so on) we would have either had to remove them (easiest option), or instruct mice not to use them as predictors (harder). Let's now correct the issues that we've discussed. # convert categorical variables into factors miss_mtcars$vs <- factor(miss_mtcars$vs) miss_mtcars$cyl <- factor(miss_mtcars$cyl)   imp <- mice(miss_mtcars, m=20, seed=3, printFlag=FALSE) imp$method -------------------------------------       mpg       cyl      disp        hp      drat     "pmm" "polyreg"        ""        ""     "pmm"        wt      qsec        vs        am      gear     "pmm"     "pmm"  "logreg"        ""        ""      carb        "" Now mice has corrected the imputation method of cyl and vs to their correct defaults. In truth, cyl is a kind of discrete numeric variable called an ordinal variable, which means that yet another imputation method may be optimal for that attribute, but, for the sake of simplicity, we'll treat it as a categorical variable. Before we get to use the imputations in an analysis, we have to check the output. The first thing we need to check is the convergence of the iterations. Recall that for imputing data with missing values in multiple columns, multiple imputation requires iteration over all these columns a few times. At each iteration, mice produces imputations—and samples new parameter estimates from the parameters' posterior distributions—for all columns that need to be imputed. The final imputations, for each multiply imputed dataset m, are the imputed values from the final iteration. In contrast to when we used MCMC the convergence in mice is much faster; it usually occurs in just a few iterations. However, visually checking for convergence is highly recommended. We even check for it similarly; when we call the plot function on the variable that we assign the mice output to, it displays trace plots of the mean and standard deviation of all the variables involved in the imputations. Each line in each plot is one of the m imputations. plot(imp) Figure 11.3: A subset of the trace plots produced by plotting an object returned by a mice imputation As you can see from the preceding trace plot on imp, there are no clear trends and the variables are all overlapping from one iteration to the next. Put another way, the variance within a chain (there are m chains ) should be about equal to the variance between the chains. This indicates that convergence was achieved. If convergence was not achieved, you can increase the number of iterations that mice employs by explicitly specifying the maxit parameter to the mice function. To see an example of non-convergence, take a look at Figures 7 and 8 in the paper that describes this package written by the authors of the package' themselves. It is available at http://www.jstatsoft.org/article/view/v045i03 The next step is to make sure the imputed values are reasonable. In general, whenever we quickly review the results of something to see if they make sense, it is called a sanity test or sanity check. With the following line, we're going to display the imputed values for the five missing mpgs for the first six imputations: imp$imp$mpg[,1:6] ------------------------------------                       1    2    3    4    5    6 Duster 360         19.2 16.4 17.3 15.5 15.0 19.2 Cadillac Fleetwood 15.2 13.3 15.0 13.3 10.4 17.3 Chrysler Imperial  10.4 15.0 15.0 16.4 10.4 10.4 Porsche 914-2      27.3 22.8 21.4 22.8 21.4 15.5 Ferrari Dino       19.2 21.4 19.2 15.2 18.1 19.2 These sure look reasonable. A better method for sanity checking is to call densityplot on the variable that we assign the mice output to: densityplot(imp) Figure 11.4: Density plots of all the imputed values for mpg, drat, wt, and qsec. Each imputation has its own density curve in each quadrant This displays, for every attribute imputed, a density plot of the actual non-missing values (the thick line) and the imputed values (the thin lines). We are looking to see that the distributions are similar. Note that the density curve of the imputed values extend much higher than the observed values' density curve in this case. This is partly because we imputed so few variables that there weren't enough data points to properly smooth the density approximation. Height and non-smoothness notwithstanding, these density plots indicate no outlandish behavior among the imputed variables. We are now ready for the analysis phase. We are going to perform linear regression on each imputed dataset and attempt to model mpg as a function of am, wt, and qsec. Instead of repeating the analyses on each dataset manually, we can apply an expression to all the datasets at one time with the with function, as follows: imp_models <- with(imp, lm(mpg ~ am + wt + qsec)) We could take a peak at the estimated coefficients from each dataset using lapply on the analyses attribute of the returned object: lapply(imp_models$analyses, coef) --------------------------------- [[1]] (Intercept)          am          wt        qsec  18.1534095   2.0284014  -4.4054825   0.8637856   [[2]] (Intercept)          am          wt        qsec    8.375455    3.336896   -3.520882    1.219775   [[3]] (Intercept)          am          wt        qsec    5.254578    3.277198   -3.233096    1.337469 ......... Finally, let's pool the results of the analyses (with the pool function), and call summary on it: pooled_model <- pool(imp_models) summary(pooled_model) ----------------------------------                   est        se         t       df    Pr(>|t|) (Intercept)  7.049781 9.2254581  0.764166 17.63319 0.454873254 am           3.182049 1.7445444  1.824000 21.36600 0.082171407 wt          -3.413534 0.9983207 -3.419276 14.99816 0.003804876 qsec         1.270712 0.3660131  3.471765 19.93296 0.002416595                   lo 95     hi 95 nmis       fmi    lambda (Intercept) -12.3611281 26.460690   NA 0.3459197 0.2757138 am           -0.4421495  6.806247    0 0.2290359 0.1600952 wt           -5.5414268 -1.285641    3 0.4324828 0.3615349 qsec          0.5070570  2.034366    4 0.2736026 0.2042003 Though we could have performed the pooling ourselves using the equations that Donald Rubin outlined in his 1987 classic Multiple Imputation for Nonresponse in Surveys, it is less of a hassle and less error-prone to have the pool function do it for us. Readers who are interested in the pooling rules are encouraged to consult the aforementioned text. As you can see, for each parameter, pool has combined the coefficient estimate and standard errors, and calculated the appropriate degrees of freedom. These allow us to t-test each coefficient against the null hypothesis that the coefficient is equal to 0, produce p-values for the t-test, and construct confidence intervals. The standard errors and confidence intervals are wider than those that would have resulted from linear regression on a single imputed dataset, but that's because it is appropriately taking into account our uncertainty regarding what the missing values would have been. There are, at present time, a limited number of analyses that can be automatically pooled by mice—the most important being lm/glm. If you recall, though, the generalized linear model is extremely flexible, and can be used to express a wide array of different analyses. By extension, we could use multiple imputation for not only linear regression but logistic regression, Poisson regression, t-tests, ANOVA, ANCOVA, and more. Analysis with unsanitized data Very often, there will be errors or mistakes in data that can severely complicate analyses—especially with public data or data outside of your organization. For example, say there is a stray comma or punctuation mark in a column that was supposed to be numeric. If we aren't careful, R will read this column as character, and subsequent analysis may, in the best case scenario, fail; it is also possible, however, that our analysis will silently chug along, and return an unexpected result. This will happen, for example, if we try to perform linear regression using the punctuation-containing-but-otherwise-numeric column as a predictor, which will compel R to convert it into a factor thinking that it is a categorical variable. In the worst-case scenario, an analysis with unsanitized data may not error out or return nonsensical results, but return results that look plausible but are actually incorrect. For example, it is common (for some reason) to encode missing data with 999 instead of NA; performing a regression analysis with 999 in a numeric column can severely adulterate our linear models, but often not enough to cause clearly inappropriate results. This mistake may then go undetected indefinitely. Some problems like these could, rather easily, be detected in small datasets by visually auditing the data. Often, however, mistakes like these are notoriously easy to miss. Further, visual inspection is an untenable solution for datasets with thousands of rows and hundreds of columns. Any sustainable solution must off-load this auditing process to R. But how do we describe aberrant behavior to R so that it can catch mistakes on its own? The package assertr seeks to do this by introducing a number of data checking verbs. Using assertr grammar, these verbs (functions) can be combined with subjects (data) in different ways to express a rich vocabulary of data validation tasks. More prosaically, assertr provides a suite of functions designed to verify the assumptions about data early in the analysis process, before any time is wasted computing on bad data. The idea is to provide as much information as you can about how you expect the data to look upfront so that any deviation from this expectation can be dealt with immediately. Given that the assertr grammar is designed to be able to describe a bouquet of error-checking routines, rather than list all the functions and functionalities that the package provides, it would be more helpful to visit particular use cases. Two things before we start. First, make sure you install assertr. Second, bear in mind that all data verification verbs in assertr take a data frame to check as their first argument, and either (a) returns the same data frame if the check passes, or (b) produces a fatal error. Since the verbs return a copy of the chosen data frame if the check passes, the main idiom in assertr involves reassignment of the returning data frame after it passes the check. a_dataset <- CHECKING_VERB(a_dataset, ....) Checking for out-of-bounds data It's common for numeric values in a column to have a natural constraint on the values that it should hold. For example, if a column represents a percent of something, we might want to check if all the values in that column are between 0 and 1 (or 0 and 100). In assertr, we typically use the within_bounds function in conjunction with the assert verb to ensure that this is the case. For example, if we added a column to mtcars that represented the percent of heaviest car's weight, the weight of each car is: library(assertr) mtcars.copy <- mtcars   mtcars.copy$Percent.Max.Wt <- round(mtcars.copy$wt /                                     max(mtcars.copy$wt),                                     2)   mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),                      Percent.Max.Wt) within_bounds is actually a function that takes the lower and upper bounds and returns a predicate, a function that returns TRUE or FALSE. The assert function then applies this predicate to every element of the column specified in the third argument. If there are more than three arguments, assert will assume there are more columns to check. Using within_bounds, we can also avoid the situation where NA values are specified as "999", as long as the second argument in within_bounds is less than this value. within_bounds can take other information such as whether the bounds should be inclusive or exclusive, or whether it should ignore the NA values. To see the options for this, and all the other functions in assertr, use the help function on them. Let's see an example of what it looks like when the assert function fails: mtcars.copy$Percent.Max.Wt[c(10,15)] <- 2 mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),                       Percent.Max.Wt) ------------------------------------------------------------ Error: Vector 'Percent.Max.Wt' violates assertion 'within_bounds' 2 times (e.g. [2] at index 10) We get an informative error message that tells us how many times the assertion was violated, and the index and value of the first offending datum. With assert, we have the option of checking a condition on multiple columns at the same time. For example, none of the measurements in iris can possibly be negative. Here's how we might make sure our dataset is compliant: iris <- assert(iris, within_bounds(0, Inf),                Sepal.Length, Sepal.Width,                Petal.Length, Petal.Width)   # or simply "-Species" because that # will include all columns *except* Species iris <- assert(iris, within_bounds(0, Inf),                -Species) On occasion, we will want to check elements for adherence to a more complicated pattern. For example, let's say we had a column that we knew was either between -10 and -20, or 10 and 20. We can check for this by using the more flexible verify verb, which takes a logical expression as its second argument; if any of the results in the logical expression is FALSE, verify will cause an error. vec <- runif(10, min=10, max=20) # randomly turn some elements negative vec <- vec * sample(c(1, -1), 10,                     replace=TRUE)   example <- data.frame(weird=vec)   example <- verify(example, ((weird < 20 & weird > 10) |                               (weird < -10 & weird > -20)))   # or   example <- verify(example, abs(weird) < 20 & abs(weird) > 10) # passes   example$weird[4] <- 0 example <- verify(example, abs(weird) < 20 & abs(weird) > 10) # fails ------------------------------------- Error in verify(example, abs(weird) < 20 & abs(weird) > 10) :   verification failed! (1 failure) Checking the data type of a column By default, most of the data import functions in R will attempt to guess the data type for each column at the import phase. This is usually nice, because it saves us from tedious work. However, it can backfire when there are, for example, stray punctuation marks in what are supposed to be numeric columns. To verify this, we can use the assert function with the is.numeric base function: iris <- assert(iris, is.numeric, -Species) We can use the is.character and is.logical functions with assert, too. An alternative method that will disallow the import of unexpected data types is to specify the data type that each column should be at the data import phase with the colClasses optional argument: iris <- read.csv("PATH_TO_IRIS_DATA.csv",                  colClasses=c("numeric", "numeric",                               "numeric", "numeric",                               "character")) This solution comes with the added benefit of speeding up the data import process, since R doesn't have to waste time guessing each column's data type. Checking for unexpected categories Another data integrity impropriety that is, unfortunately, very common is the mislabeling of categorical variables. There are two types of mislabeling of categories that can occur: an observation's class is mis-entered/mis-recorded/mistaken for that of another class, or the observation's class is labeled in a way that is not consistent with the rest of the labels. To see an example of what we can do to combat the former case, read assertr's vignette. The latter case covers instances where, for example, the species of iris could be misspelled (such as "versicolour", "verginica") or cases where the pattern established by the majority of class names is ignored ("iris setosa", "i. setosa", "SETOSA"). Either way, these misspecifications prove to be a great bane to data analysts for several reasons. For example, an analysis that is predicated upon a two-class categorical variable (for example, logistic regression) will now have to contend with more than two categories. Yet another way in which unexpected categories can haunt you is by producing statistics grouped by different values of a categorical variable; if the categories were extracted from the main data manually—with subset, for example, as opposed to with by, tapply, or aggregate—you'll be missing potentially crucial observations. If you know what categories you are expecting from the start, you can use the in_set function, in concert with assert, to confirm that all the categories of a particular column are squarely contained within a predetermined set. # passes iris <- assert(iris, in_set("setosa", "versicolor",                             "virginica"), Species)   # mess up the data iris.copy <- iris # We have to make the 'Species' column not # a factor ris.copy$Species <- as.vector(iris$Species) iris.copy$Species[4:9] <- "SETOSA" iris.copy$Species[135] <- "verginica" iris.copy$Species[95] <- "i. versicolor"   # fails iris.copy <- assert(iris.copy, in_set("setosa", "versicolor",                                       "virginica"), Species) ------------------------------------------- Error: Vector 'Species' violates assertion 'in_set' 8 times (e.g. [SETOSA] at index 4) If you don't know the categories that you should be expecting, a priori, the following incantation, which will tell you how many rows each category contains, may help you identify the categories that are either rare or misspecified: by(iris.copy, iris.copy$Species, nrow) Checking for outliers, entry errors, or unlikely data points Automatic outlier detection (sometimes known as anomaly detection) is something that a lot of analysts scoff at and view as a pipe dream. Though the creation of a routine that automagically detects all erroneous data points with 100 percent specificity and precision is impossible, unmistakably mis-entered data points and flagrant outliers are not hard to detect even with very simple methods. In my experience, there are a lot of errors of this type. One simple way to detect the presence of a major outlier is to confirm that every data point is within some n number of standard deviations away from the mean of the group. assertr has a function, within_n_sds—in conjunction with the insist verb—to do just this; if we wanted to check that every numeric value in iris is within five standard deviations of its respective column's mean, we could express so thusly: iris <- insist(iris, within_n_sds(5), -Species) An issue with using standard deviations away from the mean (z-scores) for detecting outliers is that both the mean and standard deviation are influenced heavily by outliers; this means that the very thing we are trying to detect is obstructing our ability to find it. There is a more robust measure for finding central tendency and dispersion than the mean and standard deviation: the median and median absolute deviation. The median absolute deviation is the median of the absolute value of all the elements of a vector subtracted by the vector's median. assertr has a sister to within_n_sds, within_n_mads, that checks every element of a vector to make sure it is within n median absolute deviations away from its column's median. iris <- insist(iris, within_n_mads(4), -Species) iris$Petal.Length[5] <- 15 iris <- insist(iris, within_n_mads(4), -Species) --------------------------------------------- Error: Vector 'Petal.Length' violates assertion 'within_n_mads' 1 time (value [15] at index 5) In my experience, within_n_mads can be an effective guard against illegitimate univariate outliers if n is chosen carefully. The examples here have been focusing on outlier identification in the univariate case—across one dimension at a time. Often, there are times where an observation is truly anomalous but it wouldn't be evident by looking at the spread of each dimension individually. assertr has support for this type of multivariate outlier analysis, but a full discussion of it would require a background outside the scope of this text. Chaining assertions The check assertr aims to make the checking of assumptions so effortless that the user never feels the need to hold back any implicit assumption. Therefore, it's expected that the user uses multiple checks on one data frame. The usage examples that we've seen so far are really only appropriate for one or two checks. For example, a usage pattern such as the following is clearly unworkable: iris <- CHECKING_CONSTRUCT4(CHECKING_CONSTRUCT3(CHECKING_CONSTRUCT2(CHECKING_CONSTRUCT1(this, ...), ...), ...), ...) To combat this visual cacophony, assertr provides direct support for chaining multiple assertions by using the "piping" construct from the magrittr package. The pipe operator of magrittr', %>%, works as follows: it takes the item on the left-hand side of the pipe and inserts it (by default) into the position of the first argument of the function on the right-hand side. The following are some examples of simple magrittr usage patterns: library(magrittr) 4 %>% sqrt              # 2 iris %>% head(n=3)      # the first 3 rows of iris iris <- iris %>% assert(within_bounds(0, Inf), -Species) Since the return value of a passed assertr check is the validated data frame, you can use the magrittr pipe operator to tack on more checks in a way that lends itself to easier human understanding. For example: iris <- iris %>%   assert(is.numeric, -Species) %>%   assert(within_bounds(0, Inf), -Species) %>%   assert(in_set("setosa", "versicolor", "virginica"), Species) %>%   insist(within_n_mads(4), -Species)   # or, equivalently   CHECKS <- . %>%   assert(is.numeric, -Species) %>%   assert(within_bounds(0, Inf), -Species) %>%   assert(in_set("setosa", "versicolor", "virginica"), Species) %>%   insist(within_n_mads(4), -Species)   iris <- iris %>% CHECKS When chaining assertions, I like to put the most integral and general one right at the top. I also like to put the assertions most likely to be violated right at the top so that execution is terminated before any more checks are run. There are many other capabilities built into the assertr multivariate outlier checking. For more information about these, read the package's vignette, (vignette("assertr")). On the magrittr side, besides the forward-pipe operator, this package sports some other very helpful pipe operators. Additionally, magrittr allows the substitution at the right side of the pipe operator to occur at locations other than the first argument. For more information about the wonderful magrittr package, read its vignette. Other messiness As we discussed in this article's preface, there are countless ways that a dataset may be messy. There are many other messy situations and solutions that we couldn't discuss at length here. In order that you, dear reader, are not left in the dark regarding custodial solutions, here are some other remedies which you may find helpful along your analytics journey: OpenRefine Though OpenRefine (formerly Google Refine) doesn't have anything to do with R per se, it is a sophisticated tool for working with and for cleaning up messy data. Among its numerous, sophisticated capabilities is the capacity to auto-detect misspelled or mispecified categories and fix them at the click of a button. Regular expressions Suppose you find that there are commas separating every third digit of the numbers in a numeric column. How would you remove them? Or suppose you needed to strip a currency symbol from values in columns that hold monetary values so that you can compute with them as numbers. These, and vastly more complicated text transformations, can be performed using regular expressions (a formal grammar for specifying the search patterns in text) and associate R functions like grep and sub. Any time spent learning regular expressions will pay enormous dividends over your career as an analyst, and there are many great, free tutorials available on the web for this purpose. tidyr There are a few different ways in which you can represent the same tabular dataset. In one form—called long, narrow, stacked, or entity-attribute-value model—each row contains an observation ID, a variable name, and the value of that variable. For example:             member  attribute  value 1     Ringo Starr  birthyear   1940 2  Paul McCartney  birthyear   1942 3 George Harrison  birthyear   1943 4     John Lennon  birthyear   1940 5     Ringo Starr instrument  Drums 6  Paul McCartney instrument   Bass 7 George Harrison instrument Guitar 8     John Lennon instrument Guitar In another form (called wide or unstacked), each of the observation's variables are stored in each column:             member birthyear instrument 1 George Harrison      1943     Guitar 2     John Lennon      1940     Guitar 3  Paul McCartney      1942       Bass 4     Ringo Starr      1940      Drums If you ever need to convert between these representations, (which is a somewhat common operation, in practice) tidyr is your tool for the job. Exercises The following are a few exercises for you to strengthen your grasp over the concepts learned in this article: Normally, when there is missing data for a question such as "What is your income?", we strongly suspect an MNAR mechanism, because we live in a dystopia that equates wealth with worth. As a result, the participants with the lowest income may be embarrassed to answer that question. In the relevant section, we assumed that because the question was poorly worded and we could account for whether English was the first language of the participant, the mechanism is MAR. If we were wrong about this reason, and it was really because the lower income participants were reticent to admit their income, what would the missing data mechanism be now? If, however, the differences in income were fully explained by whether English was the first language of the participant, what would the missing data mechanism be in that case? Find a dataset on the web with missing data. What does it use to denote that data is missing? Think about that dataset's missing data mechanism. Is there a chance that this data is MNAR? Find a freely available government dataset on the web. Read the dataset's description, and think about what assumptions you might make about the data when planning a certain analysis. Translate these into actual code so that R can check them for you. Were there any deviations from your expectations? When two autonomous individuals decide to voluntarily trade, the transaction can be in both parties' best interests. Does it necessarily follow that a voluntary trade between nations benefits both states? Why or why not? Summary "Messy data"—no matter what definition you use—present a huge roadblock for people who work with data. This article focused on two of the most notorious and prolific culprits: missing data and data that has not been cleaned or audited for quality. On the missing data side, you learned how to visualize missing data patterns, and how to recognize different types of missing data. You saw a few unprincipled ways of tackling the problem, and learned why they were suboptimal solutions. Multiple imputation, so you learned, addresses the shortcomings of these approaches and, through its usage of several imputed data sets, correctly communicates our uncertainty surrounding the imputed values. On unsanitized data, we saw that the, perhaps, optimal solution (visually auditing the data) was untenable for moderately sized datasets or larger. We discovered that the grammar of the package assertr provides a mechanism to offload this auditing process to R. You now have a few assertr checking "recipes" under your belt for some of the more common manifestations of the mistakes that plague data that has not been scrutinized. You can check out similar books published by Packt Publishing on R (https://www.packtpub.com/tech/r): Unsupervised Learning with R by Erik Rodríguez Pacheco (https://www.packtpub.com/big-data-and-business-intelligence/unsupervised-learning-r) R Data Science Essentials by Raja B. Koushik and Sharan Kumar Ravindran (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science-essentials) Resources for Article: Further resources on this subject: Debugging The Scheduler In Oracle 11G Databases [article] Looking Good – The Graphical Interface [article] Being Offline [article]
Read more
  • 0
  • 0
  • 2752
article-image-introduction-clustering-and-unsupervised-learning
Packt
23 Feb 2016
16 min read
Save for later

Introduction to Clustering and Unsupervised Learning

Packt
23 Feb 2016
16 min read
The act of clustering, or spotting patterns in data, is not much different from spotting patterns in groups of people. In this article, you will learn: The ways clustering tasks differ from the classification tasks How clustering defines a group, and how such groups are identified by k-means, a classic and easy-to-understand clustering algorithm The steps needed to apply clustering to a real-world task of identifying marketing segments among teenage social media users Before jumping into action, we'll begin by taking an in-depth look at exactly what clustering entails. (For more resources related to this topic, see here.) Understanding clustering Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. It does this without having been told how the groups should look ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data. Without advance knowledge of what comprises a cluster, how can a computer possibly know where one group ends and another begins? The answer is simple. Clustering is guided by the principle that items inside a cluster should be very similar to each other, but very different from those outside. The definition of similarity might vary across applications, but the basic idea is always the same—group the data so that the related elements are placed together. The resulting clusters can then be used for action. For instance, you might find clustering methods employed in the following applications: Segmenting customers into groups with similar demographics or buying patterns for targeted marketing campaigns Detecting anomalous behavior, such as unauthorized network intrusions, by identifying patterns of use falling outside the known clusters Simplifying extremely large datasets by grouping features with similar values into a smaller number of homogeneous categories Overall, clustering is useful whenever diverse and varied data can be exemplified by a much smaller number of groups. It results in meaningful and actionable data structures that reduce complexity and provide insight into patterns of relationships. Clustering as a machine learning task Clustering is somewhat different from the classification, numeric prediction, and pattern detection tasks we examined so far. In each of these cases, the result is a model that relates features to an outcome or features to other features; conceptually, the model describes the existing patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label that has been inferred entirely from the relationships within the data. For this reason, you will, sometimes, see the clustering task referred to as unsupervised classification because, in a sense, it classifies unlabeled examples. The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related—for instance, it might return the groups A, B, and C—but it's up to you to apply an actionable and meaningful label. To see how this impacts the clustering task, let's consider a hypothetical example. Suppose you were organizing a conference on the topic of data science. To facilitate professional networking and collaboration, you planned to seat people in groups according to one of three research specialties: computer and/or database science, math and statistics, and machine learning. Unfortunately, after sending out the conference invitations, you realize that you had forgotten to include a survey asking which discipline the attendee would prefer to be seated with. In a stroke of brilliance, you realize that you might be able to infer each scholar's research specialty by examining his or her publication history. To this end, you begin collecting data on the number of articles each attendee published in computer science-related journals and the number of articles published in math or statistics-related journals. Using the data collected for several scholars, you create a scatterplot: As expected, there seems to be a pattern. We might guess that the upper-left corner, which represents people with many computer science publications but few articles on math, could be a cluster of computer scientists. Following this logic, the lower-right corner might be a group of mathematicians. Similarly, the upper-right corner, those with both math and computer science experience, may be machine learning experts. Our groupings were formed visually; we simply identified clusters as closely grouped data points. Yet in spite of the seemingly obvious groupings, we unfortunately have no way to know whether they are truly homogeneous without personally asking each scholar about his/her academic specialty. The labels we applied required us to make qualitative, presumptive judgments about the types of people that would fall into the group. For this reason, you might imagine the cluster labels in uncertain terms, as follows: Rather than defining the group boundaries subjectively, it would be nice to use machine learning to define them objectively. This might provide us with a rule in the form if a scholar has few math publications, then he/she is a computer science expert. Unfortunately, there's a problem with this plan. As we do not have data on the true class value for each point, a supervised learning algorithm would have no ability to learn such a pattern, as it would have no way of knowing what splits would result in homogenous groups. On the other hand, clustering algorithms use a process very similar to what we did by visually inspecting the scatterplot. Using a measure of how closely the examples are related, homogeneous groups can be identified. In the next section, we'll start looking at how clustering algorithms are implemented. This example highlights an interesting application of clustering. If you begin with unlabeled data, you can use clustering to create class labels. From there, you could apply a supervised learner such as decision trees to find the most important predictors of these classes. This is called semi-supervised learning. The k-means clustering algorithm The k-means algorithm is perhaps the most commonly used clustering method. Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques. If you understand the simple principles it uses, you will have the knowledge needed to understand nearly any clustering algorithm in use today. Many such methods are listed on the following site, the CRAN Task View for clustering at http://cran.r-project.org/web/views/Cluster.html. As k-means has evolved over time, there are many implementations of the algorithm. One popular approach is described in : Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979; 28:100-108. Even though clustering methods have advanced since the inception of k-means, this is not to imply that k-means is obsolete. In fact, the method may be more popular now than ever. The following table lists some reasons why k-means is still used widely: Strengths Weaknesses Uses simple principles that can be explained in non-statistical terms Highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings Performs well enough under many real-world use cases Not as sophisticated as more modern clustering algorithms Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters Requires a reasonable guess as to how many clusters naturally exist in the data Not ideal for non-spherical clusters or clusters of widely varying density The k-means algorithm assigns each of the n examples to one of the k clusters, where k is a number that has been determined ahead of time. The goal is to minimize the differences within each cluster and maximize the differences between the clusters. Unless k and n are extremely small, it is not feasible to compute the optimal clusters across all the possible combinations of examples. Instead, the algorithm uses a heuristic process that finds locally optimal solutions. Put simply, this means that it starts with an initial guess for the cluster assignments, and then modifies the assignments slightly to see whether the changes improve the homogeneity within the clusters. We will cover the process in depth shortly, but the algorithm essentially involves two phases. First, it assigns examples to an initial set of k clusters. Then, it updates the assignments by adjusting the cluster boundaries according to the examples that currently fall into the cluster. The process of updating and assigning occurs several times until changes no longer improve the cluster fit. At this point, the process stops and the clusters are finalized. Due to the heuristic nature of k-means, you may end up with somewhat different final results by making only slight changes to the starting conditions. If the results vary dramatically, this could indicate a problem. For instance, the data may not have natural groupings or the value of k has been poorly chosen. With this in mind, it's a good idea to try a cluster analysis more than once to test the robustness of your findings. To see how the process of assigning and updating works in practice, let's revisit the case of the hypothetical data science conference. Though this is a simple example, it will illustrate the basics of how k-means operates under the hood. Using distance to assign and update clusters As with k-NN, k-means treats feature values as coordinates in a multidimensional feature space. For the conference data, there are only two features, so we can represent the feature space as a two-dimensional scatterplot as depicted previously. The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers. These centers are the catalyst that spurs the remaining examples to fall into place. Often, the points are chosen by selecting k random examples from the training dataset. As we hope to identify three clusters, according to this method, k = 3 points will be selected at random. These points are indicated by the star, triangle, and diamond in the following diagram: It's worth noting that although the three cluster centers in the preceding diagram happen to be widely spaced apart, this is not always necessarily the case. Since they are selected at random, the three centers could have just as easily been three adjacent points. As the k-means algorithm is highly sensitive to the starting position of the cluster centers, this means that random chance may have a substantial impact on the final set of clusters. To address this problem, k-means can be modified to use different methods for choosing the initial centers. For example, one variant chooses random values occurring anywhere in the feature space (rather than only selecting among the values observed in the data). Another option is to skip this step altogether; by randomly assigning each example to a cluster, the algorithm can jump ahead immediately to the update phase. Each of these approaches adds a particular bias to the final set of clusters, which you may be able to use to improve your results. In 2007, an algorithm called k-means++ was introduced, which proposes an alternative method for selecting the initial cluster centers. It purports to be an efficient way to get much closer to the optimal clustering solution while reducing the impact of random chance. For more information, refer to Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. 2007:1027–1035. After choosing the initial cluster centers, the other examples are assigned to the cluster center that is nearest according to the distance function. You will remember that we studied distance functions while learning about k-Nearest Neighbors. Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used. Recall that if n indicates the number of features, the formula for Euclidean distance between example x and example y is: For instance, if we are comparing a guest with five computer science publications and one math publication to a guest with zero computer science papers and two math papers, we could compute this in R as follows: > sqrt((5 - 0)^2 + (1 - 2)^2) [1] 5.09902 Using this distance function, we find the distance between each example and each cluster center. The example is then assigned to the nearest cluster center. Keep in mind that as we are using distance calculations, all the features need to be numeric, and the values should be normalized to a standard range ahead of time. As shown in the following diagram, the three cluster centers partition the examples into three segments labeled Cluster A, Cluster B, and Cluster C. The dashed lines indicate the boundaries for the Voronoi diagram created by the cluster centers. The Voronoi diagram indicates the areas that are closer to one cluster center than any other; the vertex where all the three boundaries meet is the maximal distance from all three cluster centers. Using these boundaries, we can easily see the regions claimed by each of the initial k-means seeds: Now that the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase. The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the average position of the points currently assigned to that cluster. The following diagram illustrates how as the cluster centers shift to the new centroids, the boundaries in the Voronoi diagram also shift and a point that was once in Cluster B (indicated by an arrow) is added to Cluster A: As a result of this reassignment, the k-means algorithm will continue through another update phase. After shifting the cluster centroids, updating the cluster boundaries, and reassigning points into new clusters (as indicated by arrows), the figure looks like this: Because two more points were reassigned, another update must occur, which moves the centroids and updates the cluster boundaries. However, because these changes result in no reassignments, the k-means algorithm stops. The cluster assignments are now final: The final clusters can be reported in one of the two ways. First, you might simply report the cluster assignments such as A, B, or C for each example. Alternatively, you could report the coordinates of the cluster centroids after the final update. Given either reporting method, you are able to define the cluster boundaries by calculating the centroids or assigning each example to its nearest cluster. Choosing the appropriate number of clusters In the introduction to k-means, we learned that the algorithm is sensitive to the randomly-chosen cluster centers. Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected. Similarly, k-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting k to be very large will improve the homogeneity of the clusters, and at the same time, it risks overfitting the data. Ideally, you will have a priori knowledge (a prior belief) about the true groupings and you can apply this information to choosing the number of clusters. For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards. In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited. Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example, the number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list. Extending this idea to another business case, if the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals. Without any prior knowledge, one rule of thumb suggests setting k equal to the square root of (n / 2), where n is the number of examples in the dataset. However, this rule of thumb is likely to result in an unwieldy number of clusters for large datasets. Luckily, there are other statistical methods that can assist in finding a suitable k-means cluster set. A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k. As illustrated in the following diagrams, the homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will also continue to decrease with more clusters. As you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find k so that there are diminishing returns beyond that point. This value of k is known as the elbow point because it looks like an elbow. There are numerous statistics to measure homogeneity and heterogeneity within the clusters that can be used with the elbow method (the following information box provides a citation for more detail). Still, in practice, it is not always feasible to iteratively test a large number of k values. This is in part because clustering large datasets can be fairly time consuming; clustering the data repeatedly is even worse. Regardless, applications requiring the exact optimal set of clusters are fairly rare. In most clustering applications, it suffices to choose a k value based on convenience rather than strict performance requirements. For a very thorough review of the vast assortment of cluster performance measures, refer to: Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of Intelligent Information Systems. 2001; 17:107-145. The process of setting k itself can sometimes lead to interesting insights. By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries. Groups that are more tightly clustered will change a little, while less homogeneous groups will form and disband over time. In general, it may be wise to spend little time worrying about getting k exactly right. The next example will demonstrate how even a tiny bit of subject-matter knowledge borrowed from a Hollywood film can be used to set k such that actionable and interesting clusters are found. As clustering is unsupervised, the task is really about what you make of it; the value is in the insights you take away from the algorithm's findings. Summary This article covered only the fundamentals of clustering. As a very mature machine learning method, there are many variants of the k-means algorithm as well as many other clustering algorithms that bring unique biases and heuristics to the task. Based on the foundation in this article, you will be able to understand and apply other clustering methods to new problems. To learn more about different machine learning techniques, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r) Mastering Scientific Computing with R (https://www.packtpub.com/application-development/mastering-scientific-computing-r) R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science) Resources for Article:   Further resources on this subject: Displaying SQL Server Data using a Linq Data Source [article] Probability of R? [article] Working with Commands and Plugins [article]
Read more
  • 0
  • 0
  • 18351

article-image-push-your-data-web
Packt
22 Feb 2016
27 min read
Save for later

Push your data to the Web

Packt
22 Feb 2016
27 min read
This article covers the following topics: An introduction to the Shiny app framework Creating your first Shiny app The connection between the server file and the user interface The concept of reactive programming Different types of interface layouts, widgets, and Shiny tags How to create a dynamic user interface Ways to share your Shiny applications with others How to deploy Shiny apps to the web (For more resources related to this topic, see here.) Introducing Shiny – the app framework The Shiny package delivers a powerful framework to build fully featured interactive Web applications just with R and RStudio. Basic Shiny applications typically consist of two components: ~/shinyapp |-- ui.R |-- server.R While the ui.R function represents the appearance of the user interface, the server.R function contains all the code for the execution of the app. The look of the user interface is based on the famous Twitter bootstrap framework, which makes the look and layout highly customizable and fully responsive. In fact, you only need to know R and how to use the shiny package to build a pretty web application. Also, a little knowledge of HTML, CSS, and JavaScript may help. If you want to check the general possibilities and what is possible with the Shiny package, it is advisable to take a look at the inbuilt examples. Just load the library and enter the example name: library(shiny) runExample("01_hello") As you can see, running the first example opens the Shiny app in a new window. This app creates a simple histogram plot where you can interactively change the number of bins. Further, this example allows you to inspect the corresponding ui.R and server.R code files. There are currently eleven inbuilt example apps: 01_hello 02_text 03_reactivity 04_mpg 05_sliders 06_tabsets 07_widgets 08_html 09_upload 10_download 11_timer These examples focus mainly on the user interface possibilities and elements that you can create with Shiny. Creating a new Shiny web app with RStudio RStudio offers a fast and easy way to create the basis of every new Shiny app. Just click on New Project and select the New Directory option in the newly opened window: After that, click on the Shiny Web Application field: Give your new app a name in the next step, and click on Create Project: RStudio will then open a ready-to-use Shiny app by opening a prefilled ui.R and server.R file: You can click on the now visible Run App button in the right corner of the file pane to display the prefilled example application. Creating your first Shiny application In your effort to create your first Shiny application, you should first create or consider rough sketches for your app. Questions that you might ask in this context are, What do I want to show? How do I want it to show?, and so on. Let's say we want to create an application that allows users to explore some of the variables of the mtcars dataset. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Sketching the final app We want the user of the app to be able to select one out of the three variables of the dataset that gets displayed in a histogram. Furthermore, we want users to get a summary of the dataset under the main plot. So, the following figure could be a rough project sketch: Constructing the user interface for your app We will reuse the already opened ui.R file from the RStudio example, and adapt it to our needs. The layout of the ui.R file for your first app is controlled by nested Shiny functions and looks like the following lines: library(shiny) shinyUI(pageWithSidebar( headerPanel("My First Shiny App"), sidebarPanel( selectInput(inputId = "variable", label = "Variable:", choices = c ("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( plotOutput("carsPlot"), verbatimTextOutput ("carsSummary") ) )) Creating the server file The server file holds all the code for the execution of the application: library(shiny) library(datasets) shinyServer(function(input, output) { output$carsPlot <- renderPlot({ hist(mtcars[,input$variable], main = "Histogram of mtcars variables", xlab = input$variable) }) output$carsSummary <- renderPrint({ summary(mtcars[,input$variable]) }) }) The final application After changing the ui.R and the server.R files according to our needs, just hit the Run App button and the final app opens in a new window: As planned in the app sketch, the app offers the user a drop-down menu to choose the desired variable on the left side, and shows a histogram and data summary of the selected variable on the right side. Deconstructing the final app into its components For a better understanding of the Shiny application logic and the interplay of the two main files, ui.R and server.R, we will disassemble your first app again into its individual parts. The components of the user interface We have divided the user interface into three parts: After loading the Shiny library, the complete look of the app gets defined by the shinyUI() function. In our app sketch, we chose a sidebar look; therefore, the shinyUI function holds the argument, pageWithSidebar(): library(shiny) shinyUI(pageWithSidebar( ... The headerPanel() argument is certainly the simplest component, since usually only the title of the app will be stored in it. In our ui.R file, it is just a single line of code: library(shiny) shinyUI(pageWithSidebar( titlePanel("My First Shiny App"), ... The sidebarPanel() function defines the look of the sidebar, and most importantly, handles the input of the variables of the chosen mtcars dataset: library(shiny) shinyUI(pageWithSidebar( titlePanel("My First Shiny App"), sidebarPanel( selectInput(inputId = "variable", label = "Variable:", choices = c ("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), ... Finally, the mainPanel() function ensures that the output is displayed. In our case, this is the histogram and the data summary for the selected variables: library(shiny) shinyUI(pageWithSidebar( titlePanel("My First Shiny App"), sidebarPanel( selectInput(inputId = "variable", label = "Variable:", choices = c ("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( plotOutput("carsPlot"), verbatimTextOutput ("carsSummary") ) )) The server file in detail While the ui.R file defines the look of the app, the server.R file holds instructions for the execution of the R code. Again, we use our first app to deconstruct the related server.R file into its main important parts. After loading the needed libraries, datasets, and further scripts, the function, shinyServer(function(input, output) {} ), defines the server logic: library(shiny) library(datasets) shinyServer(function(input, output) { The marked lines of code that follow translate the inputs of the ui.R file into matching outputs. In our case, the server side output$ object is assigned to carsPlot, which in turn was called in the mainPanel() function of the ui.R file as plotOutput(). Moreover, the render* function, in our example it is renderPlot(), reflects the type of output. Of course, here it is the histogram plot. Within the renderPlot() function, you can recognize the input$ object assigned to the variables that were defined in the user interface file: library(shiny) library(datasets) shinyServer(function(input, output) { output$carsPlot <- renderPlot({ hist(mtcars[,input$variable], main = "Histogram of mtcars variables", xlab = input$variable) }) ... In the following lines, you will see another type of the render function, renderPrint() , and within the curly braces, the actual R function, summary(), with the defined input variable: library(shiny) library(datasets) shinyServer(function(input, output) { output$carsPlot <- renderPlot({ hist(mtcars[,input$variable], main = "Histogram of mtcars variables", xlab = input$variable) }) output$carsSummary <- renderPrint({ summary(mtcars[,input$variable]) }) }) There are plenty of different render functions. The most used are as follows: renderPlot: This creates normal plots renderPrin: This gives printed output types renderUI: This gives HTML or Shiny tag objects renderTable: This gives tables, data frames, and matrices renderText: This creates character strings Every code outside the shinyServer() function runs only once on the first launch of the app, while all the code in between the brackets and before the output functions runs as often as a user visits or refreshes the application. The code within the output functions runs every time a user changes the widget that belongs to the corresponding output. The connection between the server and the ui file As already inspected in our decomposed Shiny app, the input functions of the ui.R file are linked with the output functions of the server file. The following figure illustrates this again: The concept of reactivity Shiny uses a reactive programming model, and this is a big deal. By applying reactive programming, the framework is able to be fast, efficient, and robust. Briefly, changing the input in the user interface, Shiny rebuilds the related output. Shiny uses three reactive objects: Reactive source Reactive conductor Reactive endpoint For simplicity, we use the formal signs of the RStudio documentation: The implementation of a reactive source is the reactive value; that of a reactive conductor is a reactive expression; and the reactive endpoint is also called the observer. The source and endpoint structure As taught in the previous section, the defined input of the ui.R links is the output of the server.R file. For simplicity, we use the code from our first Shiny app again, along with the introduced formal signs: ... output$carsPlot <- renderPlot({ hist(mtcars[,input$variable], main = "Histogram of mtcars variables", xlab = input$variable) }) ... The input variable, in our app these are the Horsepower; Miles per Gallon, and Number of Carburetors choices, represents the reactive source. The histogram called carsPlot stands for the reactive endpoint. In fact, it is possible to link the reactive source to numerous reactive endpoints, and also conversely. In our Shiny app, we also connected the input variable to our first and second output—carsSummary: ... output$carsPlot <- renderPlot({ hist(mtcars[,input$variable], main = "Histogram of mtcars variables", xlab = input$variable) }) output$carsSummary <- renderPrint({ summary(mtcars[,input$variable]) }) ... To sum it up, this structure ensures that every time a user changes the input, the output refreshes automatically and accordingly. The purpose of the reactive conductor The reactive conductor differs from the reactive source and the endpoint is so far that this reactive type can be dependent and can have dependents. Therefore, it can be placed between the source, which can only have dependents and the endpoint, which in turn can only be dependent. The primary function of a reactive conductor is the encapsulation of heavy and difficult computations. In fact, reactive expressions are caching the results of these computations. The following graph displays a possible connection of the three reactive types: In general, reactivity raises the impression of a logically working directional system; after input, the output occurs. You get the feeling that an input pushes information to an output. But this isn't the case. In reality, it works vice versa. The output pulls the information from the input. And this all works due to sophisticated server logic. The input sends a callback to the server, which in turn informs the output that pulls the needed value from the input and shows the result to the user. But of course, for a user, this all feels like an instant updating of any input changes, and overall, like a responsive app's behavior. Of course, we have just touched upon the main aspects of reactivity, but now you know what's really going on under the hood of Shiny. Discovering the scope of the Shiny user interface After you know how to build a simple Shiny application, as well as how reactivity works, let us take a look at the next step: the various resources to create a custom user interface. Furthermore, there are nearly endless possibilities to shape the look and feel of the layout. As already mentioned, the entire HTML, CSS, and JavaScript logic and functions of the layout options are based on the highly flexible bootstrap framework. And, of course, everything is responsive by default, which makes it possible for the final application layout to adapt to the screen of any device. Exploring the Shiny interface layouts Currently, there are four common shinyUI () page layouts: pageWithSidebar() fluidPage() navbarPage() fixedPage() These page layouts can be, in turn, structured with different functions for a custom inner arrangement structure of the page layout. In the following sections, we are introducing the most useful inner layout functions. As an example, we will use our first Shiny application again. The sidebar layout The sidebar layout, where the sidebarPanel() function is used as the input area, and the mainPanel() function as the output, just like in our first Shiny app. The sidebar layout uses the pageWithSidebar() function: library(shiny) shinyUI(pageWithSidebar( headerPanel("The Sidebar Layout"), sidebarPanel( selectInput(inputId = "variable", label = "This is the sidebarPanel", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( tags$h2("This is the mainPanel"), plotOutput("carsPlot"), verbatimTextOutput("carsSummary") ) )) When you only change the first three functions, you can create exactly the same look as the application with the fluidPage() layout. This is the sidebar layout with the fluidPage() function: library(shiny) shinyUI(fluidPage( titlePanel("The Sidebar Layout"), sidebarLayout( sidebarPanel( selectInput(inputId = "variable", label = "This is the sidebarPanel", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( tags$h2("This is the mainPanel"), plotOutput("carsPlot"), verbatimTextOutput("carsSummary") ) ) ))   The grid layout The grid layout is where rows are created with the fluidRow() function. The input and output are made within free customizable columns. Naturally, a maximum of 12 columns from the bootstrap grid system must be respected. This is the grid layout with the fluidPage () function and a 4-8 grid: library(shiny) shinyUI(fluidPage( titlePanel("The Grid Layout"), fluidRow( column(4, selectInput(inputId = "variable", label = "Four-column input area", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), column(8, tags$h3("Eight-column output area"), plotOutput("carsPlot"), verbatimTextOutput("carsSummary") ) ) )) As you can see from inspecting the previous ui.R file, the width of the columns is defined within the fluidRow() function, and the sum of these two columns adds up to 12. Since the allocation of the columns is completely flexible, you can also create something like the grid layout with the fluidPage() function and a 4-4-4 grid: library(shiny) shinyUI(fluidPage( titlePanel("The Grid Layout"), fluidRow( column(4, selectInput(inputId = "variable", label = "Four-column input area", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), column(4, tags$h5("Four-column output area"), plotOutput("carsPlot") ), column(4, tags$h5("Another four-column output area"), verbatimTextOutput("carsSummary") ) ) )) The tabset panel layout The tabsetPanel() function can be built into the mainPanel() function of the aforementioned sidebar layout page. By applying this function, you can integrate several tabbed outputs into one view. This is the tabset layout with the fluidPage() function and three tab panels: library(shiny) shinyUI(fluidPage( titlePanel("The Tabset Layout"), sidebarLayout( sidebarPanel( selectInput(inputId = "variable", label = "Select a variable", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( tabsetPanel( tabPanel("Plot", plotOutput("carsPlot")), tabPanel("Summary", verbatimTextOutput("carsSummary")), tabPanel("Raw Data", dataTableOutput("tableData")) ) ) ) )) After changing the code to include the tabsetPanel() function, the three tabs with the tabPanel() function display the respective output. With the help of this layout, you are no longer dependent on representing several outputs among themselves. Instead, you can display each output in its own tab, while the sidebar does not change. The position of the tabs is flexible and can be assigned to be above, below, right, and left. For example, in the following code file detail, the position of the tabsetPanel() function was assigned as follows: ... mainPanel( tabsetPanel(position = "below", tabPanel("Plot", plotOutput("carsPlot")), tabPanel("Summary", verbatimTextOutput("carsSummary")), tabPanel("Raw Data", tableOutput("tableData")) ) ) ... The navlist panel layout The navlistPanel() function is similar to the tabsetPanel() function, and can be seen as an alternative if you need to integrate a large number of tabs. The navlistPanel() function also uses the tabPanel() function to include outputs: library(shiny) shinyUI(fluidPage( titlePanel("The Navlist Layout"), navlistPanel( "Discovering The Dataset", tabPanel("Plot", plotOutput("carsPlot")), tabPanel("Summary", verbatimTextOutput("carsSummary")), tabPanel("Another Plot", plotOutput("barPlot")), tabPanel("Even A Third Plot", plotOutput("thirdPlot"), "More Information", tabPanel("Raw Data", tableOutput("tableData")), tabPanel("More Datatables", tableOutput("moreData")) ) ))   The navbar page as the page layout In the previous examples, we have used the page layouts, fluidPage() and pageWithSidebar(), in the first line. But, especially when you want to create an application with a variety of tabs, sidebars, and various input and output areas, it is recommended that you use the navbarPage() layout. This function makes use of the standard top navigation of the bootstrap framework: library(shiny) shinyUI(navbarPage("The Navbar Page Layout", tabPanel("Data Analysis", sidebarPanel( selectInput(inputId = "variable", label = "Select a variable", choices = c("Horsepower" = "hp", "Miles per Gallon" = "mpg", "Number of Carburetors" = "carb"), selected = "hp") ), mainPanel( plotOutput("carsPlot"), verbatimTextOutput("carsSummary") ) ), tabPanel("Calculations" … ), tabPanel("Some Notes" … ) )) Adding widgets to your application After inspecting the most important page layouts in detail, we now look at the different interface input and output elements. By adding these widgets, panels, and other interface elements to an application, we can further customize each page layout. Shiny input elements Already, in our first Shiny application, we got to know a typical Shiny input element: the selection box widget. But, of course, there are a lot more widgets with different types of uses. All widgets can have several arguments; the minimum setup is to assign an inputId, which instructs the input slot to communicate with the server file, and a label to communicate with a widget. Each widget can also have its own specific arguments. As an example, we are looking at the code of a slider widget. In the previous screenshot are two versions of a slider; we took the slider range for inspection: sliderInput(inputId = "sliderExample", label = "Slider range", min = 0, max = 100, value = c(25, 75)) Besides the mandatory arguments, inputId and label, three more values have been added to the slider widget. The min and max arguments specify the minimum and maximum values that can be selected. In our example, these are 0 and 100. A numeric vector was assigned to the value argument, and this creates a double-ended range slider. This vector must logically be within the set minimum and maximum values. Currently, there are more than twenty different input widgets, which in turn are all individually configurable by assigning to them their own set of arguments. A brief overview of the output elements As we have seen, the output elements in the ui.R file are connected to the rendering functions in the server file. The mainly used output elements are: htmlOutput imageOutput plotOutput tableOutput textOutput verbatimTextOutput downloadButton Due to their unambiguous naming, the purpose of these elements should be clear. Individualizing your app even further with Shiny tags Although you don't need to know HTML to create stunning Shiny applications, you have the option to create highly customized apps with the usage of raw HTML or so-called Shiny tags. To add raw HTML, you can use the HTML() function. We will focus on Shiny tags in the following list. Currently, there are over a 100 different Shiny tag objects, which can be used to add text styling, colors, different headers, visual and audio, lists, and many more things. You can use these tags by writing tags $tagname. Following is a brief list of useful tags: tags$h1: This is first level header; of course you can also use the known h1 -h6 tags$hr: This makes a horizontal line, also known as a thematic break tags$br: This makes a line break, a popular way to add some space tags$strong = This makes the text bold tags$div: This makes a division of text with a uniform style tags$a: This links to a webpage tags$iframe: This makes an inline frame for embedding possibilities The following ui.R file and corresponding screenshot show the usage of Shiny tags by an example: shinyUI(fluidPage( fluidRow( column(6, tags$h3("Customize your app with Shiny tags!"), tags$hr(), tags$a(href = "http://www.rstudio.com", "Click me"), tags$hr() ), column(6, tags$br(), tags$em("Look - the R project logo"), tags$br(), tags$img(src = "http://www.r-project.org/Rlogo.png") ) ), fluidRow( column(6, tags$strong("We can even add a video"), tags$video(src = "video.mp4", type = "video/mp4", autoplay = NA, controls = NA) ), column(6, tags$br(), tags$ol( tags$li("One"), tags$li("Two"), tags$li("Three")) ) ) ))   Creating dynamic user interface elements We know how to build completely custom user interfaces with all the bells and whistles. But all the introduced types of interface elements are fixed and static. However, if you need to create dynamic interface elements, Shiny offers three ways to achieve it: The conditinalPanel() function: The renderUI() function The use of directly injected JavaScript code In the following section, we only show how to use the first two ways, because firstly, they are built into the Shiny package, and secondly, the JavaScript method is indicated as experimental. Using conditionalPanel The condtionalPanel() functions allow you to show or hide interface elements dynamically, and is set in the ui.R file. The dynamic of this function is achieved by JavaScript expressions, but as usual in the Shiny package, all you need to know is R programming. The following example application shows how this function works for the ui.R file: library(shiny) shinyUI(fluidPage( titlePanel("Dynamic Interface With Conditional Panels"), column(4, wellPanel( sliderInput(inputId = "n", label= "Number of points:", min = 10, max = 200, value = 50, step = 10) )), column(5, "The plot below will be not displayed when the slider value", "is less than 50.", conditionalPanel("input.n >= 50", plotOutput("scatterPlot", height = 300) ) ) )) The following example application shows how this function works for the Related server.R file: library(shiny) shinyServer(function(input, output) { output$scatterPlot <- renderPlot({ x <- rnorm(input$n) y <- rnorm(input$n) plot(x, y) }) }) The code for this example application was taken from the Shiny gallery of RStudio (http://shiny.rstudio.com/gallery/conditionalpanel-demo.html). As readable in both code files, the defined function, input.n, is the linchpin for the dynamic behavior of the example app. In the conditionalPanel() function, it is defined that inputId="n" must have a value of 50 or higher, while the input and output of the plot will work as already defined. Taking advantage of the renderUI function The renderUI() function is hooked, contrary to the previous model, to the server file to create a dynamic user interface. We have already introduced different render output functions in this article. The following example code shows the basic functionality using the ui.R file: # Partial example taken from the Shiny documentation numericInput("lat", "Latitude"), numericInput("long", "Longitude"), uiOutput("cityControls") The following example code shows the basic functionality of the Related sever.R file: # Partial example output$cityControls <- renderUI({ cities <- getNearestCities(input$lat, input$long) checkboxGroupInput("cities", "Choose Cities", cities) }) As described, the dynamic of this method gets defined in the renderUI() process as an output, which then gets displayed through the uiOutput() function in the ui.R file. Sharing your Shiny application with others Typically, you create a Shiny application not only for yourself, but also for other users. There are a two main ways to distribute your app; either you let users download your application, or you deploy it on the web. Offering a download of your Shiny app By offering the option to download your final Shiny application, other users can run your app locally. Actually, there are four ways to deliver your app this way. No matter which way you choose, it is important that the user has installed R and the Shiny package on his/her computer. Gist Gist is a public code sharing pasteboard from GitHub. To share your app this way, it is important that both the ui.R file and the server.R file are in the same Gist and have been named correctly. Take a look at the following screenshot: There are two options to run apps via Gist. First, just enter runGist("Gist_URL") in the console of RStudio; or second, just use the Gist ID and place it in the shiny::runGist("Gist_ID") function. Gist is a very easy way to share your application, but you need to keep in mind that your code is published on a third-party server. GitHub The next way to enable users to download your app is through a GitHub repository: To run an application from GitHub, you need to enter the command, shiny::runGitHub ("Repository_Name", "GitHub_Account_Name"), in the console: Zip file There are two ways to share a Shiny application by zip file. You can either let the user download the zip file over the web, or you can share it via email, USB stick, memory card, or any other such device. To download a zip file via the Web, you need to type runUrl ("Zip_File_URL") in the console: Package Certainly, a much more labor-intensive but also publically effective way is to create a complete R package for your Shiny application. This especially makes sense if you have built an extensive application that may help many other users. Another advantage is the fact that you can also publish your application on CRAN. Later in the book, we will show you how to create an R package. Deploying your app to the web After showing you the ways users can download your app and run it on their local machines, we will now check the options to deploy Shiny apps to the web. Shinyapps.io http://www.shinyapps.io/ is a Shiny app- hosting service by RStudio. There is a free-to- use account package, but it is limited to a maximum of five applications, 25 so-called active hours, and the apps are branded with the RStudio logo. Nevertheless, this service is a great way to publish one's own applications quickly and easily to the web. To use http://www.shinyapps.io/ with RStudio, a few R packages and some additional operating system software needs to be installed: RTools (If you use Windows) GCC (If you use Linux) XCode Command Line Tools (If you use Mac OS X) The devtools R package The shinyapps package Since the shinyapps package is not on CRAN, you need to install it via GitHub by using the devtools package: if (!require("devtools")) install.packages("devtools") devtools::install_github("rstudio/shinyapps") library(shinyapps) When everything that is needed is installed ,you are ready to publish your Shiny apps directly from the RStudio IDE. Just click on the Publish icon, and in the new window you will need to log in to your http://www.shinyapps.io/ account once, if you are using it for the first time. All other times, you can directly create a new Shiny app or update an existing app: After clicking on Publish, a new tab called Deploy opens in the console pane, showing you the progress of the deployment process. If there is something set incorrectly, you can use the deployment log to find the error: When the deployment is successful, your app will be publically reachable with its own web address on http://www.shinyapps.io/.   Setting up a self-hosted Shiny server There are two editions of the Shiny Server software: an open source edition and the professional edition. The open source edition can be downloaded for free and you can use it on your own server. The Professional edition offers a lot more features and support by RStudio, but is also priced accordingly. Diving into the Shiny ecosystem Since the Shiny framework is such an awesome and powerful tool, a lot of people, and of course, the creators of RStudio and Shiny have built several packages around it that are enormously extending the existing functionalities of Shiny. These almost infinite possibilities of technical and visual individualization, which are possible by deeply checking the Shiny ecosystem, would certainly go beyond the scope of this article. Therefore, we are presenting only a few important directions to give a first impression. Creating apps with more files In this article, you have learned how to build a Shiny app consisting of two files: the server.R and the ui.R. To include every aspect, we first want to point out that it is also possible to create a single file Shiny app. To do so, create a file called app.R. In this file, you can include both the server.R and the ui.R file. Furthermore, you can include global variables, data, and more. If you build larger Shiny apps with multiple functions, datasets, options, and more, it could be very confusing if you do all of it in just one file. Therefore, single-file Shiny apps are a good idea for simple and small exhibition apps with a minimal setup. Especially for large Shiny apps, it is recommended that you outsource extensive custom functions, datasets, images, and more into your own files, but put them into the same directory as the app. An example file setup could look like this: ~/shinyapp |-- ui.R |-- server.R |-- helper.R |-- data |-- www |-- js |-- etc   To access the helper file, you just need to add source("helpers.R") into the code of your server.R file. The same logic applies to any other R files. If you want to read in some data from your data folder, you store it in a variable that is also in the head of your server.R file, like this: myData &lt;- readRDS("data/myDataset.rds") Expanding the Shiny package As said earlier, you can expand the functionalities of Shiny with several add-on packages. There are currently ten packages available on CRAN with different inbuilt functions to add some extra magic to your Shiny app. shinyAce: This package makes available Ace editor bindings to enable a rich text-editing environment within Shiny. shinybootstrap2: The latest Shiny package uses bootstrap 3; so, if you built your app with bootstrap 2 features, you need to use this package. shinyBS: This package adds the additional features of the original Twitter Bootstraptheme, such as tooltips, modals, and others, to Shiny. shinydashboard: This packages comes from the folks at RStudio and enables the user to create stunning and multifunctional dashboards on top of Shiny. shinyFiles: This provides functionality for client-side navigation of the server side file system in Shiny apps. shinyjs: By using this package, you can perform common JavaScript operations in Shiny applications without having to know any JavaScript. shinyRGL: This package provides Shiny wrappers for the RGL package. This package exposes RGL's ability to export WebGL visualization in a shiny-friendly format. shinystan: This package is, in fact, not a real add-on. Shinystan is a fantastic full-blown Shiny application to give users a graphical interface for Markov chain Monte Carlo simulations. shinythemes: This packages gives you the option of changing the whole look and feel of your application by using different inbuilt bootstrap themes. shinyTree: This exposes bindings to jsTree—a JavaScript library that supports interactive trees—to enable rich, editable trees in Shiny. Of course, you can find a bunch of other packages with similar or even more functionalities, extensions, and also comprehensive Shiny apps on GitHub. Summary To learn more about Shiny, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Learning Shiny (https://www.packtpub.com/application-development/learning-shiny) Mastering Machine Learning with R (https://www.packtpub.com/big-data-and-business-intelligence/mastering-machine-learning-r) Mastering Data Analysis with R (https://www.packtpub.com/big-data-and-business-intelligence/mastering-data-analysis-r)
Read more
  • 0
  • 0
  • 15036

article-image-working-commands-and-plugins
Packt
22 Feb 2016
26 min read
Save for later

Working with Commands and Plugins

Packt
22 Feb 2016
26 min read
In this article written by Tom Ryder, author of the book Nagios Core Administration Cookbook, Second Edition, we will cover the following topics: Installing a plugin Removing a plugin Creating a new command Customizing an existing command (For more resources related to this topic, see here.) Introduction Nagios Core is perhaps best thought of as a monitoring framework and less as a monitoring tool.Its modular design allows any kind of program that returns appropriate values based on some kind of check as a check_command option for a host or service. This is where the concepts of commands and pluginscome into play. For Nagios Core, a plugin is any program that can be used to gather information about a host or service. To ensure that a host is responding to ping requests, we'd use a plugin, such as check_ping,which when run against a hostname or address—whether by Nagios Core or not—returns a status code to whatever called it, based on whether a response was received to the pingrequest within a certain period of time. This status code and any accompanying message is what Nagios Core uses to establish the state that a host or service is in. Plugins are generally just like any other program on a Unix-like system; they can be run from the command line, are subject to permissions and owner restrictions, can be written in any language, can read variables from their environment, and can take parameters and options to modify how they work. Most importantly, they are entirely separate from Nagios Core itself (even if programmed by the same people), and the way that they're used by the application can be changed. To allow for additional flexibility in how plugins are used, Nagios Core uses these programs according to the terms of a command definition. A command for a specific plugin defines the way in which that plugin is used, including its location in the filesystem, any parameters that should be passed to it, and any other options. In particular, parameters and options often include thresholds for the WARNINGand CRITICAL states. Nagios Core is usually downloaded and installed alongside a set of plugins called Nagios Plugins, available at https://nagios-plugins.org/, which this article assumes you have installed. These plugins were chosen because they cover the most common needs for a monitoring infrastructure quite well as a set, including checks for common services, such as web services, mail services, DNS services, and others as well as more generic checks, such as whether a TCP or UDP port is accessible and open on a server. It's possible that for most, if not all, of our monitoring needs, we won't need any other plugins—but if we do, Nagios Core makes it possible to use existing plugins in novel ways using custom command definitions, adding third-party plugins written by contributors on the Nagios Exchange website or even writing custom plugins ourselves from scratch in some special cases. Installing a plugin In this recipe, we'll install a custom plugin that we retrieved from Nagios Exchange onto a Nagios Core server so that we can use it in a Nagios Core command, and hence check a service with it. Getting ready You should have a Nagios Core 4.0 or newer server running with a few hosts and services configured already, and you should have found an appropriate plugin to install to solve some particular monitoring needs. Your Nagios Core server should have Internet connectivity to allow you to download the plugin directly from the website. In this example, we'll use check_rsync,which is available on the Web at https://exchange.nagios.org/directory/Plugins/Network-Protocols/Rsync/check_rsync/details. This particular plugin is quite simple,consisting of a single Perlscript with only very basic dependencies. If you want to install this script as an example,the server will also need to have a Perl interpreter installed, for example, in /usr/bin/perl. This example will also include directly testing a server running an rsync(1)daemon called troy.example.net. How to do it... We can download and install a new plugin using the following steps: Copy the URL for the download link for the most recent version of the check_rsync plugin. Navigate to the plugins directory for the Nagios Core server. The default location is /usr/local/nagios/libexec: # cd /usr/local/nagios/libexec Download the plugin using the wget command into a file called check_rsync. It's important to enclose the URL in quotes: # wget 'https://exchange.nagios.org/components/com_mtree/attachment. php?link_id=307&cf_id=29' -O check_rsync Make the plugin executable using the chmod(1) and chown(1) commands: # chown root.nagios check_rsync # chmod 0770 check_rsync Run the plugin directly with no arguments to check that it runs and to get usage instructions. It's a good idea to test it as the nagios user using the su(8) or sudo(8) commands:# sudo -s -u nagios$ ./check_rsync Usage: check_rsync -H <host> [-p <port>] [-m <module>[,<user>,<password>] [-m <module>[,<user>,<password>]...]] Try running the plugin directly against a host running rsync(1) to check whether it works and reports a status: $ ./check_rsync -H troy.example.net The output normally starts with the status determined, with any extra information after a colon: OK: Rsync is up If all of this works, then the plugin is now installed and working correctly. How it works... Because Nagios Core plugins are programs in themselves, all that installing a plugin really amounts to is saving a program or script into an appropriate directory, in this case, /usr/local/nagios/libexec, where all the other plugins live. It's then available to be used the same way as any other plugin. The next step once the plugin is working is defining a command for it in the Nagios Core configuration so that it can be used to monitor hosts and/or services. This can be done with the Creating a new commandrecipe in this article. There's more... If we inspect the Perl script, we can see a little bit of how it works. It works like any other Perl script except perhaps for the fact that its return valuesare defined in a hash called %ERRORS,and the return values it chooses depend on what happens when it tries to check the rsync(1)process. This is the most important part of implementing a plugin for Nagios Core. Installation procedures for different plugins vary. In particular, many plugins are written in languages like C, and hence, they need to be compiled. One such plugin is the popular check_nrpe plugin.Rather than simply being saved into a directory and made executable, these sorts of plugins often follow the usual pattern of configuration, compilation, and installation: $ ./configure $ make # make install For many plugins that are built in this style, the final step of make installwill often install the compiled plugin into the appropriate directory for us. In general, if instructions are included with the plugin, it pays to read them to see how best to install it. See also The Removing a plugin recipe The Creating a new command recipe Removing a plugin In this recipe, we'll remove a plugin that we no longer need as part of our Nagios Core installation. Perhaps it's not working correctly, the service it monitors is no longer available, or there are security or licensing concerns with its usage. Getting ready You should have a Nagios Core 4.0 or newer server running with a few hosts and services configured already and have a plugin that you would like to remove from the server. In this instance, we'll remove the now unneeded check_rsync plugin from our Nagios Core server. How to do it... We can remove a plugin from our Nagios Core instance using the following steps: Remove any part of the configuration that uses the plugin, including the hosts or services that use it for check_command and command definitions that refer to the program. As an example, the following definition for a command would no longer work after we remove the check_rsync plugin: define command { command_name check_rsync command_line $USER1$/check_rsync -H $HOSTADDRESS$ } Using a tool, such as grep(1), can be a good way to find mentions of the command and plugin: # grep -R check_rsync /usr/local/nagios/etc Change the directory on the Nagios Core server to wherever the plugins are kept. The default location is /usr/local/nagios/libexec: # cd /usr/local/nagios/libexec Delete the plugin with the rm(1) command: # rm check_rsync Validate the configuration and restart the Nagios Core server: # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg # /etc/init.d/nagios reload How it works... Nagios Core plugins are simply external programs that the server uses to perform checks of hosts and services. If a plugin is no longer needed, all that we need to do is remove references to it in our configuration, if any, and delete it from /usr/local/nagios/libexec. There's more... There's not usually any harm in leaving the plugin's program on the server even if Nagios Core isn't using it. It doesn't slow anything down or cause any other problems, and it may be needed later. Nagios Core plugins are generally quite small programs and should not really cause disk space concerns on a modern server. See also The Installing a plugin recipe The Creating a new command recipe Creating a new command In this recipe, we'll create a new command for a plugin that was just installed into the /usr/local/nagios/libexecdirectory in the Nagios Core server. This will define the way in which Nagios Core should use the plugin, and thereby allow it to be used as part of a service definition. Getting ready You should have a Nagios Core 4.0 or newer server running with a few hosts and services configured already and have a plugin installed for which you'd like to define a new command so that you can use it as part of a service definition. In this instance, we'll define a command for an installed check_rsyncplugin. How to do it... We can define a new command in our configuration as follows: Change to the directory containing the objects configuration for Nagios Core. The default location is /usr/local/nagios/etc/objects: # cd /usr/local/nagios/etc/objects Edit the commands.cfg file: # vi commands.cfg At the bottom of the file, add the following command definition: define command {     command_name  check_rsync     command_line  $USER1$/check_rsync -H $HOSTADDRESS$ } Validate the configuration and restart the Nagios Core server: # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg # /etc/init.d/nagios reload If the validation passed and the server restarted successfully, we should now be able to use the check_rsync command in a service definition. How it works... The configuration we added to the commands.cfgfile in the preceding steps defines a new command called check_rsync,which specifies a method for using the plugin of the same name to monitor a service. This enables us to use check_rsyncas a value for the check_commanddirective in a service declaration, which might look like this: define service {     use                  generic-service     host_name            troy.example.net     service_description  RSYNC     check_command        check_rsync } Only two directives are required for command definitions, and we've defined both: command_name: This defines the unique name with which we can reference the command when we use it in host or service definitions command_line: This defines the command line that should be executed by Nagios Core to make the appropriate check This particular command line also uses the following two macros: $USER1$: This expands to /usr/local/nagios/libexec, the location of the plugin binaries, including check_rsync. This is defined in the sample configuration in the /usr/local/nagios/etc/resource.cfg file. $HOSTADDRESS$: This expands to the address of any host for which this command is used as a host or service definition. So, if we used the command in a service, checking the rsync(1) server on troy.example.net, the completed command might look like this: $ /usr/local/nagios/libexec/check_rsync -H troy.example.net We can run this straight from the command line ourselves as the nagios userto see what kind of results it returns: $ /usr/local/nagios/libexec/check_rsync -H troy.example.net OK: Rsync is up There's more... A plugin can be used for more than one command. If we had a particular rsync(1) module, which we wanted to check named backup, we could write another command called check_rsync_backupas follows: define command {     command_name  check_rsync_backup     command_line  $USER1$/check_rsync -H $HOSTADDRESS$ -m backup } Alternatively, if one or more of our rsync(1) servers were running on an alternate port, say, port 5873, we could define a separate command check_rsync_altport for that: define command {     command_name  check_rsync_altport     command_line  $USER1$/check_rsync -H $HOSTADDRESS$ -p 5873 } Commands can thus be defined as precisely as we need them to be. We explore this in more detail in the Customizing an existing commandrecipe in this article. See also The Installing a plugin recipe The Customizing an existing command recipe Customizing an existing command In this recipe, we'll customize an existing command definition. There are a number of reasons why you might want to do this, but a common one is if a check is overzealous, sending notifications for the WARNING orCRITICALstates, which aren't actually terribly worrisome, or on the other hand, if a check is too "forgiving" and doesn't flag hosts or services as having problems when it would actually be appropriate to do so. Another reason is to account for peculiarities in your own network. For example, if you run HTTPdaemons on a large number of hosts in your hosts on the alternative port 8080 that you need to check, it would be convenient to have a check_http_altportcommand available. We can do this by copying and altering the definition for the vanilla check_httpcommand. Getting ready You should have a Nagios Core 4.0 or newer server running with a few hosts and services configured already. You should also already be familiar with the relationship between services, commands, and plugins. How to do it... We can customize an existing command definition as follows: Change to the directory containing the objects configuration for Nagios Core. The default location is /usr/local/nagios/etc/objects: # cd /usr/local/nagios/etc/objects Edit the commands.cfg or whichever file is an appropriate location for the check_http command: # vi commands.cfg Find the definition for the check_http command. In a default Nagios Core configuration, it should look something like this: # 'check_http' command_definition define command {     command_name  check_http     command_line  $USER1$/check_http -I $HOSTADDRESS$ $ARG1$ } Copy this definition into a new definition directly under it and alter it to look like the following, renaming the command and adding a new option to its command line: # 'check_http_altport' command_definition define command {     command_name  check_http_altport     command_line  $USER1$/check_http -I $HOSTADDRESS$ -p 8080 $ARG1$ } Validate the configuration and restart the Nagios Core server: # /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg # /etc/init.d/nagios reload If the validation passed and the server restarted successfully, we should now be able to use the check_http_altportcommand, which is based on the original check_httpcommand, in a service definition. How it works... The configuration we added to the commands.cfgfile in the preceding steps reproduces the command definition for check_http,but changes it in two ways: It renames the command from check_http to check_http_alt, which is necessary to distinguish the commands from one another. Command names in Nagios Core, like host names, must be unique. It adds the -p 8080 option to the command line call, specifying that when the call to check_http is made, the check will be made using TCP port 8080 rather than the default value for TCP port 80. The check_http_altcommand can now be used as a check command in the same way a check_httpcommand can be used. For example, a service definition that checks whether the sparta.example.nethost is running an HTTP daemon on port 8080 might look something like this: define service {     use                  generic-service     host_name            sparta.example.net     service_description  HTTP_8080     check_command        check_http_alt } There's more... This recipe's title implies that we should customize the existing commands by editing them in-place, and indeed, this works fine if we really do want to do things this way. Instead of copying the command definition, we can just add -p 8080 or any other customization to the command line and change the original command. However, this is bad practice in most cases, mostly because it can break existing monitoring and can be potentially confusing to other administrators of the Nagios Core server. If we have a special case for monitoring, in this case, checking a nonstandard port for HTTP, then it's wise to create a whole new command based on the existing one with the customisations we need. Particularly if you share monitoring configuration duties with someone else on your team, changing the command can break the monitoring for anyone who had set up the services using the check_http command beforeyou changed it, meaning that their checks would all start failing because port 8080 would be checked instead. There is no limit to the number of commands you can define, so you can be very liberal in defining as many alternative commands as you need. It's a good idea to give them instructive names that say something about what they do as well as to add explanatory comments to the configuration file. You can add a comment to the file by prefixing it with a # character: # # 'check_http_altport' command_definition. This is to keep track of the # servers that have administrative panels running on an alternative port # to confer special privileges to a separate instance of Apache HTTPD # that we don't want to confer to the one for running public-facing # websites. # define command {     command_name  check_http_altport     command_line  $USER1$/check_http -H $HOSTADDRESS$ -p 8080 $ARG1$ } See also The Creating a new command recipe Writing a new plugin from scratch Even given the very useful standard plugins in the Nagios Plugins set and the large number of custom plugins available on Nagios Exchange, occasionally, as our monitoring setup becomes more refined, we may well find that there is some service or property of a host that we would like to check, but for which there doesn't seem to be any suitable plugin available. Every network is different, and sometimes, the plugins that others have generously donated their time to make for the community don't quite cover all your bases. Generally, the more specific your monitoring requirements get, the less likely it is for there to be a plugin available that does exactly what you need. In this example, we'll deal with a very particular problem that we'll assume can't be dealt with effectively by any known Nagios Core plugins, and we'll write one ourselves using Perl. Here's the example problem. Our Linux security team wants to be able to automatically check whether any of our servers are running kernels that have known exploits. However, they're not worried about every vulnerable kernel, only certain ones. They have provided us with the version numbers of three kernels that have small vulnerabilities that they're not particularly worried about but that do need patching, and one they're extremely worried about. Let's say the minor vulnerabilities are in the kernels with version numbers 2.6.19, 2.6.24, and 3.0.1. The serious vulnerability is in the kernel with version number 2.6.39. Note that these version numbers in this case are arbitrary and don't necessarily reflect any real kernel vulnerabilities! The team could log in to all of the servers individually to check them, but the servers are of varying ages and access methods, and they are managed by different people. They would also have to check manually more than once because it's possible that a naive administrator could upgrade to a kernel that's known to be vulnerable in an older release, and they also might want to add other vulnerable kernel numbers for checking later on. So, the team have asked us to solve the problem with Nagios Core monitoring, and we've decided that the best way to do it is to write our own plugin, check_vuln_kernel, thatchecks the output of uname(1)for a kernel version string, and then does the following: If it's one of the slightly vulnerable kernels, it will return a WARNING state so that we can let the security team know that they should address it when they're next able to. If it's the highly vulnerable kernel version, it will return a CRITICAL state so that the security team knows that a patched kernel needs to be installed immediately. If uname(1) gives an error or output we don't understand, it will return an UNKNOWN state, alerting the team to a bug in the plugin or possibly more serious problems with the server. Otherwise, it returns an OK state, confirming that the kernel is not known to be a vulnerable one. Finally, in the Nagios Core monitoring, they want to be able to see at a glance what the kernel version is and whether it's vulnerable or not. For the purposes of this example, we'll only monitor the Nagios Core server; however, via NRPE, we'd be able to install this plugin on the other servers that require this monitoring, they'll work just fine here as well. While this problem is very specific, we'll approach it in a very general way, which you'll be able to adapt to any solution where it's required for a Nagios plugin to: Run a command and pull its output into a variable. Check the output for the presence or absence of certain patterns. Return an appropriate status based on those tests. All that this means is that if you're able to do this, you'll be able to monitor anything effectively from Nagios Core! Getting ready You should have a Nagios Core 4.0 or newer server running with a few hosts and services configured already. You should also already be familiar with the relationship between services, commands, and plugins. You should have Perl installed, at least version 5.10. This will include the required POSIX module. You should also have the Perl modules Nagios::Plugin(or Monitoring::Plugin) andReadonly installed. On Debian-like systems, you can install this with the following: # apt-get install libnagios-plugin-perl libreadonly-perl On RPM-based systems, such as CentOS or Fedora Core, the following command should work: # yum install perl-Nagios-Plugin perl-Readonly This will be a rather long recipe that ties in a lot of Nagios Core concepts. You should be familiar with all the following concepts: Defining new hosts and services and how they relate to one another Defining new commands and how they relate to the plugins they call Installing, testing, and using Nagios Core plugins Some familiarity with Perl would also be helpful, but it is not required. We'll include comments to explain what each block of code is doing in the plugin. How to do it... We can write, test, and implement our example plugin as follows: Change to the directory containing the plugin binaries for Nagios Core. The default location is /usr/local/nagios/libexec: # cd /usr/local/nagios/libexec Start editing a new file called check_vuln_kernel: # vi check_vuln_kernel Include the following code in it. Take note of the comments, which explain what each block of code is doing. #!/usr/bin/env perl   # Use strict Perl style use strict; use warnings; use utf8;   # Require at least Perl v5.10 use 5.010;   # Require a few modules, including Nagios::Plugin use Nagios::Plugin; use POSIX; use Readonly;   # Declare some constants with patterns that match bad kernels Readonly::Scalar my $CRITICAL_REGEX => qr/^2[.]6[.]39[^d]/msx; Readonly::Scalar my $WARNING_REGEX =>   qr/^(?:2[.]6[.](?:19|24)|3[.]0[.]1)[^d]/msx;   # Run POSIX::uname() to get the kernel version string my @uname   = uname(); my $version = $uname[2];   # Create a new Nagios::Plugin object my $np = Nagios::Plugin->new();   # If we couldn't get the version, bail out with UNKNOWN if ( !$version ) {     $np->nagios_die('Could not read kernel version     string'); }   # Exit with CRITICAL if the version string matches the critical pattern if ( $version =~ $CRITICAL_REGEX ) {     $np->nagios_exit( CRITICAL, $version ); }   # Exit with WARNING if the version string matches the warning pattern if ( $version =~ $WARNING_REGEX ) {     $np->nagios_exit( WARNING, $version ); }   # Exit with OK if neither of the patterns matched $np->nagios_exit( OK, $version ); Make the plugin owned by the nagios group and executable with chmod(1): # chown root.nagios check_vuln_kernel # chmod 0770 check_vuln_kernel Run the plugin directly to test it: # sudo -s -u nagios $ ./check_vuln_kernel VULN_KERNEL OK: 3.16.0-4-amd64 We should now be able to use the plugin in a command, and hence in a service check just like any other command. How it works... The code we added in the new plugin file, check_vuln_kernel,earlier is actually quite simple: It runs Perl's POSIX uname implementation to get the version number of the kernel If that didn't work, it exits with the UNKNOWN status If the version number matches anything in a pattern containing critical version numbers, it exits with the CRITICAL status If the version number matches anything in a pattern containing warning version numbers, it exits with the WARNING status Otherwise, it exits with the OK status It also prints the status as a string along with the kernel version number, if it was able to retrieve one. We might set up a command definition for this plugin, as follows: define command {     command_name  check_vuln_kernel     command_line  $USER1$/check_vuln_kernel } In turn, we might set up a service definition for that command, as follows: define service {     use                  local-service     host_name            localhost     service_description  VULN_KERNEL     check_command        check_vuln_kernel } If the kernel was not vulnerable, the service's appearance in the web interface might be something like this: However, if the monitoring server itself happened to be running a vulnerable kernel, it might look more like this (and send consequent notifications, if configured to do so): There's more... This may be a simple plugin, but its structure can be generalised to all sorts of monitoring tasks. If we can figure out the correct logic to return the status we want in an appropriate programming language, then we can write a plugin to do basically anything. A plugin like this can just as effectively be written in C or for improved performance, but we'll assume for simplicity's sake that high performance for the plugin is not required, we can instead use a language that's better suited for quick ad hoc scripts like this one, in this case, Perl. The utils.shfile,also in /usr/local/nagios/libexec, allows us to write in shell script if we'd prefer that. If you prefer Python, the nagiosplugin library should meet your needs for both Python 2 and Python 3. Ruby users may like the nagiosplugin gem. If you write a plugin that you think could be generally useful for the Nagios community at large, consider putting it under a free software license and submitting it to the Nagios Exchange so that others can benefit from your work. Community contribution and support is what has made Nagios Core such a great monitoring platform in such wide use. Any plugin you publish in this way should confirm to the Nagios Plugin Development Guidelines. At the time of writing, these are available at https://nagios-plugins.org/doc/guidelines.html. You may find older Nagios Core plugins written in Perl using the utils.pm file instead of Nagios::Plugin or Monitoring::Plugin. This will work fine, but Nagios::Plugin is recommended, as it includes more functionality out of the box and tends to be easier to use. See also The Creating a new command recipe The Customizing an existing command recipe Summary In this article, we learned about how to install a custom plugin that we retrieved from Nagios Exchange onto a Nagios Core server so that we can use it in a Nagios Core command, removing a plugin that we no longer need as part of our Nagios Core installation, creating new command, writing and customizing commands. Resources for Article: Further resources on this subject: An Introduction To NODE.JS Design Patterns [article] Developing A Basic Site With NODE.JS And EXPRESS [article] Creating Our First App With IONIC [article]
Read more
  • 0
  • 0
  • 5735
article-image-training-neural-networks-efficiently-using-keras
Packt
22 Feb 2016
9 min read
Save for later

Training neural networks efficiently using Keras

Packt
22 Feb 2016
9 min read
In this article, we will take a look at Keras, one of the most recently developed libraries to facilitate neural network training. The development on Keras started in the early months of 2015; as of today, it has evolved into one of the most popular and widely used libraries that are built on top of Theano, and allows us to utilize our GPU to accelerate neural network training. One of its prominent features is that it's a very intuitive API, which allows us to implement neural networks in only a few lines of code. Once you have Theano installed, you can install Keras from PyPI by executing the following command from your terminal command line: (For more resources related to this topic, see here.) pip install Keras For more information about Keras, please visit the official website at http://keras.io. To see what neural network training via Keras looks like, let's implement a multilayer perceptron to classify the handwritten digits from the MNIST dataset. The MNIST dataset can be downloaded from http://yann.lecun.com/exdb/mnist/ in four parts as listed here: train-images-idx3-ubyte.gz: These are training set images (9912422 bytes) train-labels-idx1-ubyte.gz: These are training set labels (28881 bytes) t10k-images-idx3-ubyte.gz: These are test set images (1648877 bytes) t10k-labels-idx1-ubyte.gz: These are test set labels (4542 bytes) After downloading and unzipped the archives, we place the files into a directory mnist in our current working directory, so that we can load the training as well as the test dataset using the following function: import os import struct import numpy as np def load_mnist(path, kind='train'): """Load MNIST data from `path`""" labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind) images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind) with open(labels_path, 'rb') as lbpath: magic, n = struct.unpack('>II', lbpath.read(8)) labels = np.fromfile(lbpath, dtype=np.uint8) with open(images_path, 'rb') as imgpath: magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16)) images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784) return images, labels X_train, y_train = load_mnist('mnist', kind='train') print('Rows: %d, columns: %d' % (X_train.shape[0], X_train.shape[1])) Rows: 60000, columns: 784 X_test, y_test = load_mnist('mnist', kind='t10k') print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1])) Rows: 10000, columns: 784 On the following pages, we will walk through the code examples for using Keras step by step, which you can directly execute from your Python interpreter. However, if you are interested in training the neural network on your GPU, you can either put it into a Python script, or download the respective code from the Packt Publishing website. In order to run the Python script on your GPU, execute the following command from the directory where the mnist_keras_mlp.py file is located: THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python mnist_keras_mlp.py To continue with the preparation of the training data, let's cast the MNIST image array into 32-bit format: >>> import theano >>> theano.config.floatX = 'float32' >>> X_train = X_train.astype(theano.config.floatX) >>> X_test = X_test.astype(theano.config.floatX) Next, we need to convert the class labels (integers 0-9) into the one-hot format. Fortunately, Keras provides a convenient tool for this: >>> from keras.utils import np_utils >>> print('First 3 labels: ', y_train[:3]) First 3 labels: [5 0 4] >>> y_train_ohe = np_utils.to_categorical(y_train) >>> print('nFirst 3 labels (one-hot):n', y_train_ohe[:3]) First 3 labels (one-hot): [[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]] Now, we can get to the interesting part and implement a neural network. However, we will replace the logistic units in the hidden layer with hyperbolic tangent activation functions, replace the logistic function in the output layer with softmax, and add an additional hidden layer. Keras makes these tasks very simple, as you can see in the following code implementation: >>> from keras.models import Sequential >>> from keras.layers.core import Dense >>> from keras.optimizers import SGD >>> np.random.seed(1) >>> model = Sequential() >>> model.add(Dense(input_dim=X_train.shape[1], ... output_dim=50, ... init='uniform', ... activation='tanh')) >>> model.add(Dense(input_dim=50, ... output_dim=50, ... init='uniform', ... activation='tanh')) >>> model.add(Dense(input_dim=50, ... output_dim=y_train_ohe.shape[1], ... init='uniform', ... activation='softmax')) >>> sgd = SGD(lr=0.001, decay=1e-7, momentum=.9) >>> model.compile(loss='categorical_crossentropy', optimizer=sgd) First, we initialize a new model using the Sequential class to implement a feedforward neural network. Then, we can add as many layers to it as we like. However, since the first layer that we add is the input layer, we have to make sure that the input_dim attribute matches the number of features (columns) in the training set (here, 768). Also, we have to make sure that the number of output units (output_dim) and input units (input_dim) of two consecutive layers match. In the preceding example, we added two hidden layers with 50 hidden units plus 1 bias unit each. Note that bias units are initialized to 0 in fully connected networks in Keras. This is in contrast to the MLP implementation, where we initialized the bias units to 1, which is a more common (not necessarily better) convention. Finally, the number of units in the output layer should be equal to the number of unique class labels—the number of columns in the one-hot encoded class label array. Before we can compile our model, we also have to define an optimizer. In the preceding example, we chose a stochastic gradient descent optimization. Furthermore, we can set values for the weight decay constant and momentum learning to adjust the learning rate at each epoch. Lastly, we set the cost (or loss) function to categorical_crossentropy. The (binary) cross-entropy is just the technical term for the cost function in logistic regression, and the categorical cross-entropy is its generalization for multi-class predictions via softmax. After compiling the model, we can now train it by calling the fit method. Here, we are using mini-batch stochastic gradient with a batch size of 300 training samples per batch. We train the MLP over 50 epochs, and we can follow the optimization of the cost function during training by setting verbose=1. The validation_split parameter is especially handy, since it will reserve 10 percent of the training data (here, 6,000 samples) for validation after each epoch, so that we can check if the model is overfitting during training. >>> model.fit(X_train, ... y_train_ohe, ... nb_epoch=50, ... batch_size=300, ... verbose=1, ... validation_split=0.1, ... show_accuracy=True) Train on 54000 samples, validate on 6000 samples Epoch 0 54000/54000 [==============================] - 1s - loss: 2.2290 - acc: 0.3592 - val_loss: 2.1094 - val_acc: 0.5342 Epoch 1 54000/54000 [==============================] - 1s - loss: 1.8850 - acc: 0.5279 - val_loss: 1.6098 - val_acc: 0.5617 Epoch 2 54000/54000 [==============================] - 1s - loss: 1.3903 - acc: 0.5884 - val_loss: 1.1666 - val_acc: 0.6707 Epoch 3 54000/54000 [==============================] - 1s - loss: 1.0592 - acc: 0.6936 - val_loss: 0.8961 - val_acc: 0.7615 […] Epoch 49 54000/54000 [==============================] - 1s - loss: 0.1907 - acc: 0.9432 - val_loss: 0.1749 - val_acc: 0.9482 Printing the value of the cost function is extremely useful during training, since we can quickly spot whether the cost is decreasing during training and stop the algorithm earlier if otherwise to tune the hyperparameters values. To predict the class labels, we can then use the predict_classes method to return the class labels directly as integers: >>> y_train_pred = model.predict_classes(X_train, verbose=0) >>> print('First 3 predictions: ', y_train_pred[:3]) >>> First 3 predictions: [5 0 4] Finally, let's print the model accuracy on training and test sets: >>> train_acc = np.sum( ... y_train == y_train_pred, axis=0) / X_train.shape[0] >>> print('Training accuracy: %.2f%%' % (train_acc * 100)) Training accuracy: 94.51% >>> y_test_pred = model.predict_classes(X_test, verbose=0) >>> test_acc = np.sum(y_test == y_test_pred, ... axis=0) / X_test.shape[0] print('Test accuracy: %.2f%%' % (test_acc * 100)) Test accuracy: 94.39% Note that this is just a very simple neural network without optimized tuning parameters. If you are interested in playing more with Keras, please feel free to further tweak the learning rate, momentum, weight decay, and number of hidden units. Although Keras is great library for implementing and experimenting with neural networks, there are many other Theano wrapper libraries that are worth mentioning. A prominent example is Pylearn2 (http://deeplearning.net/software/pylearn2/), which has been developed in the LISA lab in Montreal. Also, Lasagne (https://github.com/Lasagne/Lasagne) may be of interest to you if you prefer a more minimalistic but extensible library, that offers more control over the underlying Theano code. Summary We caught a glimpse of the most beautiful and most exciting algorithms in the whole machine learning field: artificial neural networks. I can recommend you to follow the works of the leading experts in this field, such as Geoff Hinton (http://www.cs.toronto.edu/~hinton/), Andrew Ng (http://www.andrewng.org), Yann LeCun (http://yann.lecun.com), Juergen Schmidhuber (http://people.idsia.ch/~juergen/), and Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy), just to name a few. To learn more about material design, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Building Machine Learning Systems with Python (https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python) Neural Network Programming with Java (https://www.packtpub.com/networking-and-servers/neural-network-programming-java) Resources for Article: Further resources on this subject: Python Data Analysis Utilities [article] Machine learning and Python – the Dream Team [article] Adding a Spark to R [article]
Read more
  • 0
  • 0
  • 3708

article-image-publication-apps
Packt
22 Feb 2016
10 min read
Save for later

Publication of Apps

Packt
22 Feb 2016
10 min read
Ever wondered if you could prepare and publish an app on Google Play and you needed a short article on how you could get this done quickly? Here it is! Go ahead, read this piece of article, and you'll be able to get your app running on Google Play. (For more resources related to this topic, see here.) Preparing to publish You probably don't want to upload any of the apps from this book, so the first step is to develop an app that you want to publish. Head over to https://play.google.com/apps/publish/ and follow the instructions to get a Google Play developer account. This was $25 at the time of writing and is a one-time charge with no limit on the number of apps you can publish. Creating an app icon Exactly how to design an icon is beyond the remit of this book. But, simply put, you need to create a nice image for each of the Android screen density categorizations. This is easier than it sounds. Design one nice app icon in your favorite drawing program and save it as a .png file. Then, visit http://romannurik.github.io/AndroidAssetStudio/icons-launcher.html. This will turn your single icon into a complete set of icons for every single screen density. Warning! The trade-off for using this service is that the website will collect your e-mail address for their own marketing purposes. There are many sites that offer a similar free service. Once you have downloaded your .zip file from the preceding site, you can simply copy the res folder from the download into the main folder within the project explorer. All icons at all densities have now been updated. Preparing the required resources When we log into Google Play to create a new listing in the store, there is nothing technical to handle, but we do need to prepare quite a few images that we will need to upload. Prepare upto 8 screenshots for each device type (a phone/tablet/TV/watch) that your app is compatible with. Don't crop or pad these images. Create a 512 x 512 pixel image that will be used to show off your app icon on the Google Play store. You can prepare your own icon, or the process of creating app icons that we just discussed will have already autogenerated icons for you. You also need to create three banner graphics, which are as follows: 1024 x 500 180 x 120 320 x 180 These can be screenshots, but it is usually worth taking a little time to create something a bit more special. If you are not artistically minded, you can place a screenshot inside some quite cool device art and then simply add a background image. You can generate some device art at https://developer.android.com/distribute/tools/promote/device-art.html. Then, just add the title or feature of your app to the background. The following banner was created with no skill at all, just with a pretty background purchased for $10 and the device art tool I just mentioned: Also, consider creating a video of your app. Recording video of your Android device is nearly impossible unless your device is rooted. I cannot recommend you to root your device; however, there is a tool called ARC (App Runtime for Chrome) that enables you to run APK files on your desktop. There is no debugging output, but it can run a demanding app a lot more smoothly than the emulator. It will then be quite simple to use a free, open source desktop capture program such as OBS (Open Broadcaster Software) to record your app running within ARC. You can learn more about ARC at https://developer.chrome.com/apps/getstarted_arc and about OBS at https://obsproject.com/. Building the publishable APK file What we are doing in this section is preparing the file that we will upload to Google Play. The format of the file we will create is .apk. This type of file is often referred to as an APK. The actual contents of this file are the compiled class files, all the resources that we've added, and the files and resources that Android Studio has autogenerated. We don't need to concern ourselves with the details, as we just need to follow these steps. The steps not only create the APK, but they also create a key and sign your app with the key. This process is required and it also protects the ownership of your app:   Note that this is not the same thing as copy protection/digital rights management. In Android Studio, open the project that you want to publish and navigate to Build | Generate Signed APK and a pop-up window will open, as shown: In the Generate Signed APK window, click on the Create new button. After this, you will see the New Key Store window, as shown in the following screenshot: In the Key store path field, browse to a location on your hard disk where you would like to keep your new key, and enter a name for your key store. If you don't have a preference, simply enter keys and click on OK. Add a password and then retype it to confirm it. Next, you need to choose an alias and type it into the Alias field. You can treat this like a name for your key. It can be any word that you like. Now, enter another password for the key itself and type it again to confirm. Leave Validity (years) at its default value of 25. Now, all you need to do is fill out your personal/business details. This doesn't need to be 100% complete as the only mandatory field is First and Last Name. Click on the OK button to continue. You will be taken back to the Generate Signed APK window with all the fields completed and ready to proceed, as shown in the following window: Now, click on Next to move to the next screen: Choose where you would like to export your new APK file and select release for the Build Type field. Click on Finish and Android Studio will build the shiny new APK into the location you've specified, ready to be uploaded to the App Store. Taking a backup of your key store in multiple safe places! The key store is extremely valuable. If you lose it, you will effectively lose control over your app. For example, if you try to update an app that you have on Google Play, it will need to be signed by the same key. Without it, you would not be able to update it. Think of the chaos if you had lots of users and your app needed a database update, but you had to issue a whole new app because of a lost key store. As we will need it quite soon, locate the file that has been built and ends in the .apk extension. Publishing the app Log in to your developer account at https://play.google.com/apps/publish/. From the left-hand side of your developer console, make sure that the All applications tab is selected, as shown: On the top right-hand side corner, click on the Add new application button, as shown in the next screenshot: Now, we have a bit of form filling to do, and you will need all the images from the Preparing to publish section that is near the start of the chapter. In the ADD NEW APPLICATION window shown next, choose a default language and type the title of your application: Now, click on the Upload APK button and then the Upload your first APK button and browse to the APK file that you built and signed in. Wait for the file to finish uploading: Now, from the inner left-hand side menu, click on Store Listing: We are faced with a fair bit of form filling here. If, however, you have all your images to hand, you can get through this in about 10 minutes. Almost all the fields are self-explanatory, and the ones that aren't have helpful tips next to the field entry box. Here are a few hints and tips to make the process smooth and produce a good end result: In the Full description and Short description fields, you enter the text that will be shown to potential users/buyers of your app. Be sure to make the description as enticing and exciting as you can. Mention all the best features in a clear list, but start the description with one sentence that sums up your app and what it does. Don't worry about the New content rating field as we will cover that in a minute. If you haven't built your app for tablet/phone devices, then don't add images in these tabs. If you have, however, make sure that you add a full range of images for each because these are the only images that the users of this type of device will see. When you have completed the form, click on the Save draft button at the top-right corner of the web page. Now, click on the Content rating tab and you can answer questions about your app to get a content rating that is valid (and sometimes varied) across multiple countries. The last tab you need to complete is the Pricing and Distribution tab. Click on this tab and choose the Paid or Free distribution button. Then, enter a price if you've chosen Paid. Note that if you choose Free, you can never change this. You can, however, unpublish it. If you chose Paid, you can click on Auto-convert prices now to set up equivalent pricing for all currencies around the world. In the DISTRIBUTE IN THESE COUNTRIES section, you can select countries individually or check the SELECT ALL COUNTRIES checkbox, as shown in the next screenshot:   The next six options under the Device categories and User programs sections in the context of what you have learned in this book should all be left unchecked. Do read the tips to find out more about Android Wear, Android TV, Android Auto, Designed for families, Google Play for work, and Google Play for education, however. Finally, you must check two boxes to agree with the Google consent guidelines and US export laws. Click on the Publish App button in the top-right corner of the web page and your app will soon be live on Google Play. Congratulations. Summary You can now start building Android apps. Don't run off and build the next Evernote, Runtatstic, or Angry Birds just yet. Head over to our book, Android Programming for Beginners: https://www.packtpub.com/application-development/android-programming-beginners. Here are a few more books that you can check out to learn more about Android: Android Studio Cookbook (https://www.packtpub.com/application-development/android-studio-cookbook) Learning Android Google Maps (https://www.packtpub.com/application-development/learning-android-google-maps) Android 6 Essentials (https://www.packtpub.com/application-development/android-6-essentials) Android Sensor Programming By Example (https://www.packtpub.com/application-development/android-sensor-programming-example) Resources for Article: Further resources on this subject: Saying Hello to Unity and Android[article] Android and iOS Apps Testing at a Glance[article] Testing with the Android SDK[article]
Read more
  • 0
  • 0
  • 13775
Modal Close icon
Modal Close icon