
How-To Tutorials


Creating Your Own Node Module

Soham Kamani
18 Apr 2016
6 min read
Node.js has a great community and one of the best package managers I have ever seen. One of the reasons npm is so great is that it encourages you to make small, composable modules, which usually have just one responsibility. Many of the larger, more complex Node modules are built by composing smaller ones. As of this writing, npm hosts over 219,897 packages. One of the reasons this community is so vibrant is that it is ridiculously easy to make your own node module. This post goes through the steps to create your own node module, as well as some best practices to follow while doing so.

Prerequisites and Installation

node and npm are a given. Additionally, you should also configure your npm author details:

```
npm set init.author.name "My Name"
npm set init.author.email "your@email.com"
npm set init.author.url "http://your-website.com"
npm adduser
```

These are the details that will show up on npmjs.org once you publish.

Hello World

The reason I say creating a node module is ridiculously easy is that you only need two files to create the most basic version of one. First up, create a package.json file inside a new folder by running the npm init command. This will ask you to choose a name. Of course, the name you are thinking of might already exist in the npm registry, so to check for this, run the command npm owner ls module_name, where module_name is replaced by the name you want to check. If it exists, you will get information about the owners:

```
$ npm owner ls forever
indexzero <charlie.robbins@gmail.com>
bradleymeck <bradley.meck@gmail.com>
julianduque <julianduquej@gmail.com>
jeffsu <me@jeffsu.com>
jcrugzz <jcrugzz@gmail.com>
```

If your name is free, you will get an error message similar to this:

```
$ npm owner ls does_not_exist
npm ERR! owner ls Couldn't get owner data does_not_exist
npm ERR! Darwin 14.5.0
npm ERR! argv "node" "/usr/local/bin/npm" "owner" "ls" "does_not_exist"
npm ERR! node v0.12.4
npm ERR! npm v2.10.1
npm ERR! code E404
npm ERR! 404 Registry returned 404 GET on https://registry.npmjs.org/does_not_exist
npm ERR! 404
npm ERR! 404 'does_not_exist' is not in the npm registry.
npm ERR! 404 You should bug the author to publish it (or use the name yourself!)
npm ERR! 404
npm ERR! 404 Note that you can also install from a
npm ERR! 404 tarball, folder, http url, or git url.
npm ERR! Please include the following file with any support request:
npm ERR! /Users/sohamchetan/Documents/jekyll-blog/npm-debug.log
```

After setting up package.json, add a JavaScript file:

```js
module.exports = function () {
  return 'Hello World!';
};
```

And that's it! Now execute npm publish, and your node module will be published to npmjs.org. Anyone can now install your node module by running npm install --save module_name, where module_name is the "name" property contained in package.json, and use it like this:

```js
var someModule = require('module_name');
console.log(someModule()); // This will output "Hello World!"
```

Dependencies

As stated before, you will rarely find large-scale node modules that do not depend on other, smaller modules. This is because npm encourages modularity and composability. To add dependencies to your own module, simply install them. For example, one of the most depended-upon packages is lodash, a utility library. To add it, run the following command:

```
npm install --save lodash
```

Now you can use lodash everywhere in your module by requiring it, and when someone else installs your module, they get lodash pulled in along with it as well.
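To make this concrete, here is a minimal sketch (not from the original post; the function and its behavior are made up for illustration) of an index.js that uses the lodash dependency installed above:

```js
// index.js - a minimal sketch of a module that uses its lodash dependency.
var _ = require('lodash');

// Export a single function, as before: return the given
// list of names with duplicates removed.
module.exports = function (names) {
  return _.uniq(names);
};
```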
Additionally, you will want some modules purely for development and not for distribution. These are dev dependencies, and they can be installed with the npm install --save-dev command. Dev dependencies will not be installed when someone else installs your node module.

Configuring package.json

The package.json file contains all the metadata for your node module. A few fields are filled out automatically (like dependencies or devDependencies during npm installs). There are a few more fields in package.json that you should consider filling out so that your node module is best fitted to its purpose:

"main": The relative path of the entry point of your module. Whatever is assigned to module.exports in this file is exported when someone requires your module. By default, this is the index.js file.
"keywords": An array of keywords describing your module. Quite helpful when others from the community are searching for something that your module happens to solve.
"license": I normally publish all my packages with an "MIT" license because of its openness and popularity in the open source community.
"version": This is pretty crucial because you cannot publish a node module with the same version twice. Normally, semver versioning should be followed.

If you want to know more about the different properties you can set in package.json, there's a great interactive guide you can check out.

Using Yeoman Generators

Although it's really simple to make a basic node module, it can be quite a task to make something substantial using just an index.js and a package.json file. In these cases, there's a lot more to do, such as:

Writing and running tests
Setting up a CI tool like Travis
Measuring code coverage
Installing standard dev dependencies for testing

Fortunately, there are many Yeoman generators to help you bootstrap your project. Check out generator-nm for setting up a basic project structure for a simple node module. If writing in ES6 is more your style, you can take a look at generator-nm-es6. These generators set up your project structure, complete with a testing framework and CI integration, so that you don't have to spend all your time writing boilerplate code.

About the Author

Soham Kamani is a full-stack web developer and electronics hobbyist. He is especially interested in JavaScript, Python, and IoT.


Setting up a Build Chain with Grunt

Packt
18 Apr 2016
24 min read
In this article by Bass Jobsen, author of the book Sass and Compass Designer's Cookbook, you will learn about the following topics:

Installing Grunt
Installing Grunt plugins
Utilizing the Gruntfile.js file
Adding a configuration definition for a plugin
Adding the Sass compiler task

(For more resources related to this topic, see here.)

This article introduces you to the Grunt Task Runner and the features it offers to make your development workflow a delight. Grunt is a JavaScript task runner that is installed and managed via npm, the Node.js package manager. You will learn how to take advantage of its plugins to set up your own flexible and productive workflow, which will enable you to compile your Sass code. Although there are many applications available for compiling Sass, Grunt is a more flexible, versatile, and cross-platform tool that will allow you to automate many development tasks, including Sass compilation. It can not only automate the Sass compilation tasks, but also wrap other mundane jobs, such as linting, minifying, and cleaning your code, into tasks and run them automatically for you. By the end of this article, you will be comfortable using Grunt and its plugins to establish a flexible workflow when working with Sass, which makes Grunt a vital part of your toolchain. You will then be shown how to combine Grunt's plugins to establish a workflow for compiling Sass in real time. Grunt thus becomes a tool you can use to automate integration testing, deployments, builds, and development.

Finally, by understanding the automation process, you will also learn how to use alternative tools, such as Gulp. Gulp is a JavaScript task runner for Node.js and is relatively new in comparison to Grunt, so Grunt has more plugins and wider community support, although the Gulp community is currently growing fast. The biggest difference between Grunt and Gulp is that Gulp does not save intermediary files, but pipes these files' content in memory to the next stream. A stream enables you to pass some data through a function, which will modify the data and then pass the modified data to the next function. In many situations, Gulp requires fewer configuration settings, so some people find Gulp more intuitive and easier to learn. In this article, Grunt has been chosen to demonstrate how to run a task runner; this choice does not mean that you have to prefer Grunt in your own project. Both task runners can run all the tasks described in this article; simply choose the one that suits you best. A separate recipe in the book briefly demonstrates how to compile your Sass code with Gulp.

In this article, you should enter your commands at the command prompt: Linux users should open a terminal, Mac users should run Terminal.app, and Windows users should use the cmd command for command-line usage.

Installing Grunt

Grunt is essentially a Node.js module; therefore, it requires Node.js to be installed. The goal of this recipe is to show you how to install Grunt on your system and set up your project.

Getting ready

Installing Grunt requires both Node.js and npm. Node.js is a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications, and npm is a package manager for Node.js. You can download the Node.js source code or a prebuilt installer for your platform at https://nodejs.org/en/download/. Notice that npm is bundled with Node. Also, read the instructions at https://github.com/npm/npm#super-easy-install.

How to do it...
After installing Node.js and npm, installing Grunt is as simple as running a single command, regardless of the operating system that you are using. Just open the command line or the Terminal and execute the following command:

```
npm install -g grunt-cli
```

That's it! This command will install Grunt globally and make it accessible anywhere on your system. Run the grunt --version command in the command prompt in order to confirm that Grunt has been successfully installed. If the installation is successful, you should see the version of Grunt in the Terminal's output:

```
grunt --version
grunt-cli v0.1.11
```

After installing Grunt, the next step is to set it up for your project:

Make a folder on your desktop and call it workflow. Then, navigate to it and run the npm init command to initialize the setup process:

```
mkdir workflow && cd $_ && npm init
```

Press Enter for all the questions and accept the defaults. You can change these settings later. This should create a file called package.json that will contain some information about the project and the project's dependencies. In order to add Grunt as a dependency, install the Grunt package as follows:

```
npm install grunt --save-dev
```

Now, if you look at the package.json file, you should see that Grunt is added to the list of dependencies:

```
..."devDependencies": {
  "grunt": "~0.4.5"
}
```

In addition, you should see an extra folder created. Called node_modules, it will contain Grunt and other modules that you will install later in this article.

How it works...

In the preceding section, you installed Grunt (grunt-cli) with the -g option. The -g option installs Grunt globally on your system. Global installation requires superuser or administrator rights on most systems. You need to run only the globally installed packages from the command line. Everything that you will use with the require() function in your programs should be installed locally in the root of your project. Local installation makes it possible to solve your project's specific dependencies. More information about global versus local installation of npm modules can be found at https://www.npmjs.org/doc/faq.html.

There's more...

Node package managers are available for a wide range of operating systems, including Windows, OS X, Linux, SunOS, and FreeBSD. A complete list of package managers can be found at https://github.com/joyent/node/wiki/Installing-Node.js-via-package-manager. Notice that these package managers are not maintained by the Node.js core team. Instead, each package manager has its own maintainer.

See also

The npm Registry is a public collection of packages of open source code for Node.js, frontend web apps, mobile apps, robots, routers, and countless other needs of the JavaScript community. You can find the npm Registry at https://www.npmjs.org/. Also, notice that you do not have to use task runners to create build chains. Keith Cirkel wrote about how to use npm as a build tool at http://blog.keithcirkel.co.uk/how-to-use-npm-as-a-build-tool/.

Installing Grunt plugins

Grunt plugins are the heart of Grunt. Every plugin serves a specific purpose and can also work together with other plugins. In order to use Grunt to set up your Sass workflow, you need to install several plugins. You can find more information about these plugins in this recipe's How it works... section.

Getting ready

Before you install the plugins, you should first create some basic files and folders for the project. You should install Grunt and create a package.json file for your project.
Also, create an index.html file to inspect the results in your browser. Two empty folders should be created too: the scss folder will contain your Sass code and the css folder will contain the compiled CSS code. Navigate to the root of the project, repeat the steps from the Installing Grunt recipe of this article, and create the additional files and directories that you are going to work with throughout the article. In the end, you should end up with a folder and file structure containing the package.json file, the index.html file, and the empty scss and css folders in the project root.

How to do it...

Grunt plugins are essentially Node.js modules that can be installed and added to the package.json file in the list of dependencies using npm. To do this, follow the ensuing steps:

Navigate to the root of the project and run the following command, as described in the Installing Grunt recipe of this article:

```
npm init
```

Install the modules using npm, as follows:

```
npm install \
grunt-contrib-sass \
load-grunt-tasks \
grunt-postcss --save-dev
```

Notice the single space before the backslash in each line. For example, on the second line, grunt-contrib-sass, there is a space before the backslash at the end of the line. The space characters are necessary because they act as separators. The backslash at the end is used to continue the command on the next line.

The npm install command will download all the plugins and place them in the node_modules folder in addition to including them in the package.json file. The next step is to include these plugins in the Gruntfile.js file.

How it works...

Grunt plugins can be installed and added to the package.json file using the npm install command followed by the names of the plugins separated by a space, and the --save-dev flag:

```
npm install nameOfPlugin1 nameOfPlugin2 --save-dev
```

The --save-dev flag adds the plugin names and a tilde version range to the list of dependencies in the package.json file so that the next time you need to install the plugins, all you need to do is run the npm install command. This command looks for the package.json file in the directory from which it was called, and it will automatically download all the specified plugins. This makes porting workflows very easy; all it takes is copying the package.json file and running the npm install command. Finally, the package.json file contains a JSON object with metadata. It is also worth explaining the long command that you have used to install the plugins in this recipe. This command installs the plugins whose names are continued on the next line by the backslash. It is essentially equivalent to the following:

```
npm install grunt-contrib-sass --save-dev
npm install load-grunt-tasks --save-dev
npm install grunt-postcss --save-dev
```

As you can see, this is very repetitive. However, both approaches yield the same results; it is up to you to choose the one that you feel more comfortable with. The node_modules folder contains all the plugins that you install with npm. Every time you run npm install name-of-plugin, the plugin is downloaded and placed in this folder. If you need to port your workflow, you do not need to copy all the contents of the folder. In addition, if you are using a version control system, such as Git, you should add the node_modules folder to the .gitignore file so that the folder and its subdirectories are ignored.

There's more...

Each Grunt plugin also has its own metadata set in a package.json file, so plugins can have different dependencies.
For instance, the grunt-contrib-sass plugin, as described in the Adding the Sass compiler task recipe, declares its dependencies as follows:

```
"dependencies": {
  "async": "^0.9.0",
  "chalk": "^0.5.1",
  "cross-spawn": "^0.2.3",
  "dargs": "^4.0.0",
  "which": "^1.0.5"
}
```

Besides the dependencies described previously, this task also requires you to have Ruby and Sass installed. In the following list, you will find the plugins used in this article, followed by a brief description:

load-grunt-tasks: This loads all the plugins listed in the package.json file
grunt-contrib-sass: This compiles Sass files into CSS code
grunt-postcss: This enables you to apply one or more postprocessors to your compiled CSS code

CSS postprocessors enable you to change your CSS code after compilation. In addition to installing plugins, you can remove them as well. You can remove a plugin using the npm uninstall name-of-plugin command, where name-of-plugin is the name of the plugin that you wish to remove. For example, if a line in the list of dependencies of your package.json file contains "grunt-concurrent": "~0.4.2",, then you can remove it using the following command:

```
npm uninstall grunt-concurrent
```

Then, you just need to make sure to remove the name of the plugin from your package.json file so that it is not loaded by the load-grunt-tasks plugin the next time you run a Grunt task. Running the npm prune command after removing the items from the package.json file will also remove the plugins; the prune command removes extraneous packages that are not listed in the parent package's dependencies list.

See also

More information on the npm version syntax can be found at https://www.npmjs.org/doc/misc/semver.html. Also, see http://caniuse.com/ for more information on the Can I Use database.

Utilizing the Gruntfile.js file

The Gruntfile.js file is the main configuration file for Grunt that handles all the tasks and task configurations. All the tasks and plugins are loaded using this file. In this recipe, you will create this file and learn how to load Grunt plugins using it.

Getting ready

First, you need to install Node and Grunt, as described in the Installing Grunt recipe of this article. You will also have to install some Grunt plugins, as described in the Installing Grunt plugins recipe of this article.

How to do it...

Once you have installed Node and Grunt, follow these steps:

In your Grunt project directory (the folder that contains the package.json file), create a new file, save it as Gruntfile.js, and add the following lines to it:

```js
module.exports = function(grunt) {
  grunt.initConfig({
    pkg: grunt.file.readJSON('package.json'),
    //Add the Tasks configurations here.
  });
  // Define Tasks here
};
```

This is the simplest form of the Gruntfile.js file. The next step is to load the plugins that you installed in the Installing Grunt plugins recipe. Add the following line at the end of your Gruntfile.js file:

```js
grunt.loadNpmTasks('grunt-sass');
```

In the preceding line of code, grunt-sass is the name of the plugin you want to load. That is all it takes to load all the necessary plugins. The next step is to add the configurations for each task to the Gruntfile.js file.

How it works...

Any Grunt plugin can be loaded by adding a line of JavaScript to the Gruntfile.js file, as follows:

```js
grunt.loadNpmTasks('name-of-module');
```

This line should be added every time a new plugin is installed so that Grunt can access the plugin's functions.
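For instance, with the task plugins installed in the Installing Grunt plugins recipe, the relevant part of the Gruntfile.js file might look like the following sketch (load-grunt-tasks is an exception and is loaded differently, as shown in the There's more... section below):

```js
module.exports = function(grunt) {
  grunt.initConfig({
    pkg: grunt.file.readJSON('package.json')
    //Add the Tasks configurations here.
  });

  // One loadNpmTasks call per installed task plugin.
  grunt.loadNpmTasks('grunt-contrib-sass');
  grunt.loadNpmTasks('grunt-postcss');
};
```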
However, it is tedious to load every single plugin that you install. In addition, you will soon notice that, as your project grows, the number of configuration lines will increase as well. The Gruntfile.js file should be written in JavaScript or CoffeeScript. Grunt tasks rely on configuration data defined in a JSON object passed to the grunt.initConfig method. JavaScript Object Notation (JSON) is an alternative to XML used for data exchange. JSON describes name-value pairs written as "name": "value". All the JSON data is separated by commas, with JSON objects written inside curly brackets and JSON arrays inside square brackets. Each object can hold more than one name/value pair, and each array can hold one or more objects. You can also group tasks into one task; you can alias a group of tasks using the following line of code:

```js
grunt.registerTask('alias', ['task1', 'task2']);
```

There's more...

Instead of loading all the required Grunt plugins one by one, you can load them automatically with the load-grunt-tasks plugin. You can install it by using the following command in the root of your project:

```
npm install load-grunt-tasks --save-dev
```

Then, add the following line at the very beginning of your Gruntfile.js file after module.exports:

```js
require('load-grunt-tasks')(grunt);
```

Now, your Gruntfile.js file should look like this:

```js
module.exports = function(grunt) {
  require('load-grunt-tasks')(grunt);
  grunt.initConfig({
    pkg: grunt.file.readJSON('package.json'),
    //Add the Tasks configurations here.
  });
  // Define Tasks here
};
```

The load-grunt-tasks plugin loads all the plugins specified in the package.json file. It simply loads the plugins that begin with the grunt- prefix or any pattern that you specify. This plugin will also read dependencies, devDependencies, and peerDependencies in your package.json file and load the Grunt tasks that match the provided patterns. A pattern to load specifically chosen plugins can be added as a second parameter. You can load, for instance, all the grunt-contrib tasks with the following code in your Gruntfile.js file:

```js
require('load-grunt-tasks')(grunt, {pattern: 'grunt-contrib-*'});
```

See also

Read more about the load-grunt-tasks module at https://github.com/sindresorhus/load-grunt-tasks.

Adding a configuration definition for a plugin

Any Grunt task needs a configuration definition. The configuration definitions are usually added to the Gruntfile.js file itself and are very easy to set up. In addition, it is very convenient to define and work with them because they are all written in the JSON format. This makes it very easy to spot the configurations in a plugin's documentation examples and add them to your Gruntfile.js file. In this recipe, you will learn how to add the configuration for a Grunt task.

Getting ready

For this recipe, you will first need to create a basic Gruntfile.js file and install the plugin you want to configure. If you want to install the grunt-example plugin, you can install it using the following command in the root of your project:

```
npm install grunt-example --save-dev
```

How to do it...

Once you have created the basic Gruntfile.js file (also refer to the Utilizing the Gruntfile.js file recipe of this article), follow this step:

A simple form of the task configuration is shown in the following code. Start by adding it to your Gruntfile.js file, wrapped inside grunt.initConfig({}):

```js
example: {
  subtask: {
    files: {
      "stylesheets/main.css": "sass/main.scss"
    }
  }
}
```

How it works...
If you look closely at the task configuration, you will notice the files field that specifies which files are going to be operated on. The files field is a very standard field that appears in almost all Grunt plugins, simply because many tasks require some or many file manipulations.

There's more...

The Don't Repeat Yourself (DRY) principle can be applied to your Grunt configuration too. First, define the name and the path at the beginning of the Gruntfile.js configuration (as a property of the object passed to grunt.initConfig), as follows:

```js
app: {
  dev: "app/dev"
}
```

Using templates is key to avoiding hardcoded values and inflexible configurations. In addition, you should have noticed that the template is used with the <%= %> delimiter to expand the value of the development directory:

```js
"<%= app.dev %>/css/main.css": "<%= app.dev %>/scss/main.scss"
```

The <%= %> delimiter essentially executes inline JavaScript and replaces values, as you can see in the following code:

```js
"app/dev/css/main.css": "app/dev/scss/main.scss"
```

So, put simply, the value defined in the app object at the top of the Gruntfile.js file is evaluated and replaced. If you decide to change the name of your development directory, for example, all you need to do is change the app variable that is defined at the top of your Gruntfile.js file. Finally, it is also worth mentioning that the value for the template does not necessarily have to be a string and can be a JavaScript literal.

See also

You can read more about templates in the Templates section of Grunt's documentation at http://gruntjs.com/configuring-tasks#templates.

Adding the Sass compiler task

The Sass task is the core task that you will need for your Sass development. It has several features and options, but at the heart of it is the Sass compiler, which can compile your Sass files into CSS. By the end of this recipe, you will have a good understanding of this plugin, how to add it to your Gruntfile.js file, and how to take advantage of it. In this recipe, the grunt-contrib-sass plugin will be used; this plugin compiles your Sass code by using Ruby Sass. If you prefer to compile Sass into CSS with node-sass (LibSass), you should use the grunt-sass plugin instead.

Getting ready

The only requirement for this recipe is to have the grunt-contrib-sass plugin installed and loaded in your Gruntfile.js file. If you have not installed this plugin in the Installing Grunt plugins recipe of this article, you can do so using the following command in the root of your project:

```
npm install grunt-contrib-sass --save-dev
```

You should also install grunt locally by running the following command:

```
npm install grunt --save-dev
```

Finally, your project should have the files and directories described in the Installing Grunt plugins recipe of this article.

How to do it...

An example of the Sass task configuration is shown in the following code. Start by adding it to your Gruntfile.js file, wrapped inside the grunt.initConfig({}) code. Now, your Gruntfile.js file should look as follows:

```js
module.exports = function(grunt) {
  grunt.initConfig({
    //Add the Tasks configurations here.
    sass: {
      dist: {
        options: {
          style: 'expanded'
        },
        files: {
          'stylesheets/main.css': 'sass/main.scss' // 'destination': 'source'
        }
      }
    }
  });
  grunt.loadNpmTasks('grunt-contrib-sass');
  // Define Tasks here
  grunt.registerTask('default', ['sass']);
};
```

Then, run the following command in your console:

```
grunt sass
```

The preceding command will create a new stylesheets/main.css file. Also, notice that the stylesheets/main.css.map file has been created automatically: the Sass compiler task creates CSS source maps for debugging your code by default.

How it works...

In addition to setting up the task configuration, you should run the Grunt command to test the Sass task. When you run the grunt sass command, Grunt will look for a configuration called sass in the Gruntfile.js file. Once it finds it, it will run the task with some default options if they are not explicitly defined. A successful task run ends with the following message:

```
Done, without errors.
```

There's more...

There are several other options that you can include in the Sass task. An option can also be set at the global Sass task level, so that the option is applied in all the subtasks of Sass. In addition to options, Grunt also provides targets for every task to allow you to set different configurations for the same task. In other words, if, for example, you need to have two different versions of the Sass task with different source and destination folders, you can easily use two different targets. Adding and executing targets is very easy. Adding more builds just follows the JSON notation, as shown here:

```js
sass: {                                        // Task
  dev: {                                       // Target
    options: {                                 // Target options
      style: 'expanded'
    },
    files: {                                   // Dictionary of files
      'stylesheets/main.css': 'sass/main.scss' // 'destination': 'source'
    }
  },
  dist: {
    options: {
      style: 'expanded',
      sourcemap: 'none'
    },
    files: {
      'stylesheets/main.min.css': 'sass/main.scss'
    }
  }
}
```

In the preceding example, two builds are defined. The first one is named dev and the second is called dist. Each of these targets belongs to the Sass task, but they use different options and different folders for the source and the compiled Sass code. Moreover, you can run a particular target using grunt sass:nameOfTarget, where nameOfTarget is the name of the target that you are trying to use. So, for example, if you need to run the dist target, you will have to run the grunt sass:dist command in your console. However, if you need to run both targets, you can simply run grunt sass and it will run both targets sequentially. As already mentioned, the grunt-contrib-sass plugin compiles your Sass code by using Ruby Sass, and you should use the grunt-sass plugin to compile Sass to CSS with node-sass (LibSass).
To switch to the grunt-sass plugin, you will have to install it locally first by running the following command in your console:

```
npm install grunt-sass
```

Then, replace grunt.loadNpmTasks('grunt-contrib-sass'); with grunt.loadNpmTasks('grunt-sass'); in the Gruntfile.js file. The basic options for grunt-contrib-sass and grunt-sass are very similar, so you will hardly have to change the options for the Sass task when switching to grunt-sass. Finally, notice that grunt-contrib-sass also has an option to turn Compass on.

See also

Please refer to Grunt's documentation for a full list of options, which is available at https://github.com/gruntjs/grunt-contrib-sass#options. Also, read Grunt's documentation for more details about configuring your tasks and targets at http://gruntjs.com/configuring-tasks#task-configuration-and-targets.

Summary

In this article, you studied installing Grunt, installing Grunt plugins, utilizing the Gruntfile.js file, adding a configuration definition for a plugin, and adding the Sass compiler task.


Configuring Redmine

Packt
18 Apr 2016
15 min read
In this article by Andriy Lesyuk, author of Mastering Redmine, when talking about the web interface (that is, not system files), all of the global configuration of Redmine can be done on the Settings page of the Administration menu. This is actually the page that this article is based around. Some settings on this page, however, depend on special system files or third-party tools that need to be installed, and these are the other things that we will discuss. You might expect to see detailed explanations for all the administration settings here, but instead, we will review in detail only a few of them, as I believe that the others do not need to be explained or can easily be tested. So generally, we will focus on hard-to-understand settings and those settings that need to be configured additionally in some special way or have some obscurities. So, why should you read this article if you are not an administrator? Some features of Redmine are available only if they have been configured, so by reading this article, you will learn what extra features exist and get an idea of how to enable them.

In this article, we will cover the following topics:

The first thing to fix
The general settings
Authentication

(For more resources related to this topic, see here.)

The first thing to fix

A fresh Redmine installation has only one user account, and it has administrator privileges. This account is exactly the same by default on all Redmine installations. That's why it is extremely important to change its credentials immediately after you complete the installation, especially for Redmine instances that can be accessed publicly. The administrator credentials can be changed on the Users page of the Administration menu. To do this, click on the admin link to open the user form. In this form, you should specify a new password in the Password and Confirmation fields. Also, it's recommended that you change the login to something different. Additionally, consider specifying your own e-mail address instead of admin@example.net and, at the least, changing the First name and Last name.

The general settings

Everything that is possible to configure at the global level (as opposed to the project level) can be found under the Administration link in the top-left menu. Of course, this link is available only for administrators. If you click on the Administration link, you will get the list of available administration pages on the sidebar to the right. Most of them are for managing Redmine objects, such as projects and trackers; we will be discussing only the general, system-wide configuration. Most of the settings that we are going to review are compiled on the Settings page. As all of these settings can't fit on a single page, Redmine organizes them into tabs. We will discuss the Authentication, Email notifications, Incoming emails, and Repositories tabs in the next sections.

The General tab

So let's start with the General tab.
Settings in this tab control the general behavior of Redmine. Application title is the name of the website that is shown at the top of non-project pages, and Welcome text is displayed on the start page of Redmine. The Objects per page options setting specifies how many objects users will be able to see on a page, while Search results per page and Days displayed on project activity control the number of objects that are shown on the search results and activity pages correspondingly. The Protocol setting specifies the preferred protocol that will be used in links to the website. Wiki history compression controls whether the history of Wiki changes should be compressed to save space. Finally, Maximum number of items in Atom feeds sets the limit for the number of items that are returned in an Atom feed. Additionally, the General tab contains settings that I want to discuss in detail.

The Cache formatted text setting

Redmine supports text formatting through the lightweight markup languages Textile and Markdown. While conversion of text from such a language to HTML is quite fast, in some circumstances you may want to cache the resulting HTML. If that is the case, the Cache formatted text checkbox is what you need. When this setting is enabled, all Textile or Markdown content that is larger than 2 KB will be cached. The cached HTML will be refreshed only when changes are made to the source text, so you should take this into account if you are using a Wiki extension that generates dynamic content (such as my WikiNG plugin). Unless performance is extremely critical for you, you should leave this checkbox unchecked.

Other settings tips

Here are some other tips for the General tab:

The value of the Host name and path setting will be used to generate URLs in the e-mail messages that are sent to users, so it's important to specify a proper value here.
For Text formatting, select the markup language that is best for you. It's also possible to select none here, but I would not recommend doing so.

The Display tab

As its name suggests, this tab contains settings related to the look and feel of Redmine. Using the Theme setting, users can choose a theme for the Redmine interface. The Default language setting specifies which language will be used for the interface if Redmine fails to determine the language of the user: for users who are not logged in, Redmine attempts to use the preferred language of the user's browser (which can be disabled with the Force default language for anonymous users setting), and for logged-in users, it uses the language chosen by the user in their profile (which can be disabled with the Force default language for logged-in users setting). By default, the user's language also affects the start day of the week and the date and time formats; this can be changed with the Start calendars on, Date format, and Time format settings correspondingly. The display format of the user name is controlled by the Users display format setting. Finally, the Thumbnails size (in pixels) setting specifies the size of thumbnail images in pixels. Now let's check what the rest of the settings mean.

The Use Gravatar user icons setting

Once I used a WordPress form to leave a comment on someone's blog. That form asked me to specify my first name, my last name, my e-mail address, and the text. After submitting it, I was surprised to see my photo near the comment. That's what Gravatar does.
Gravatar stands for Globally Recognized Avatar. It's a web service that allows you to assign an image to each user's e-mail address. Third-party sites can then fetch the corresponding image by supplying a hash of the user's e-mail address. The Use Gravatar user icons setting enables this behavior for Redmine. Having this option checked is a good idea (unless potential users of your Redmine installation may be unable to access the Internet because, for example, Redmine is going to be used on an isolated intranet).

The Default Gravatar image setting

What happens if a Gravatar is not available for the user's e-mail? In such cases, the Gravatar service returns a default image, which depends on the Default Gravatar image setting. The six available themes for the default avatar image are:

None: The default image, which is shown if no other theme is selected
Wavatars: A generated face with differing features and background
Identicons: A geometric pattern
Monster IDs: A generated monster image with different colors, face, and so on
Retro: A generated 8-bit, arcade-style pixelated face
Mystery man: A simple, cartoon-style silhouetted outline of a person

For all of these themes, except Mystery man and None, Gravatar generates an avatar image that is based on the hash of the user's e-mail and is therefore unique to it.

The Redmine Local Avatars plugin

Consider installing the Redmine Local Avatars plugin by Andrew Chaika, Luca Pireddu, and Ricardo Santos if you prefer users to upload their avatars directly onto Redmine: https://github.com/thorin/redmine_local_avatars. This plugin will also let your users take their pictures with web cameras.

The Display attachment thumbnails setting

If the Display attachment thumbnails setting is enabled, all image attachments, no matter what object (for example, a Wiki page or an issue) they are attached to, will also be shown under the attachment list as clickable thumbnails. If the user clicks on such a thumbnail, the full-size image is opened.

The Redmine Lightbox 2 plugin

In pure Redmine, full-size images are opened in the same browser window. To open them in a lightbox, you can use the Lightbox 2 plugin that was created by Genki Zhang and Tobias Fischer: https://github.com/paginagmbh/redmine_lightbox2.

Note that in order for the Display attachment thumbnails setting to work, you must have ImageMagick's convert tool installed.

The API tab

In addition to the web interface that is intended for humans, Redmine comes with a special REST application programming interface (API) that is intended for third-party applications. For example, the Redmine REST API is used by the Redmine Mylyn Connector for Eclipse and by RedmineApp for iPhone. This interface can be enabled and configured under the API tab of the Settings page. Let's check what these settings mean:

If you need to support the integration of third-party tools, you should turn on the Redmine REST API using the Enable REST web service checkbox. It is safe to keep this setting disabled if you are not using any external Redmine tools.
The Redmine API can also be used via JavaScript in the web browser, but not if the API client (that is, a website that runs JavaScript) is on a different domain. In such cases, to bypass the browser's same-origin policy, the API client may use a technique called JSONP. As this technique is considered insecure, it must be explicitly enabled using the Enable JSONP support setting. In most cases, you should leave this option disabled.
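To give an idea of what a third-party client does once the REST web service is enabled, here is a hedged sketch of listing issues from browser-side JavaScript; the host name and the API key are placeholders, while the /issues.json path and the X-Redmine-API-Key header are standard parts of the Redmine REST API:

```js
// A minimal sketch: fetch the first five issues from a Redmine instance
// over the REST API. Replace the host and the API key (shown on the
// user's account page once the REST web service is enabled).
fetch('https://redmine.example.com/issues.json?limit=5', {
  headers: { 'X-Redmine-API-Key': 'your-api-key-here' }
})
  .then(function (response) { return response.json(); })
  .then(function (data) {
    // data.issues is an array of issue objects.
    data.issues.forEach(function (issue) {
      console.log(issue.id, issue.subject);
    });
  });
```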
The Files tab

The Files tab contains settings related to file display and attachment. Here, Allowed extensions and Disallowed extensions can be used to restrict file uploads by file extension: you can use the former setting to allow certain extensions only, or the latter to forbid certain extensions only. The Maximum size of text files displayed inline and Maximum number of diff lines displayed settings control the amount of file content that can be displayed. The remaining settings are used more often:

You may need to change the Maximum attachment size setting to a larger value (which is specified in kB). Project files (releases) are attachments as well, so if you expect your users to upload large files, consider increasing this setting.
The value of the Attachments and repositories encodings option is used to convert commit messages to UTF-8.

Authentication

There are two pages in Redmine intended for configuring authentication. The first one is the Authentication tab on the Settings page, and the second one is the special LDAP authentication page, which can be found in the Administration menu. Let's discuss these pages in detail.

The Authentication tab

The next tab in the administration settings is Authentication. If the Authentication required setting is enabled, users won't be able to see the content of your Redmine installation without logging in first. The Autologin setting can be used to let your users keep themselves logged in for some period of time using their browsers. The Self-registration setting controls how user accounts are activated (the manual account activation option means that accounts must be enabled by administrators). The Allow users to delete their own account setting controls whether users are able to delete their accounts. The Minimum password length setting specifies the minimum size of the password in characters, and the Require password change after setting can be used to force users to change their passwords periodically. The Lost password setting controls whether users are able to restore their passwords when they, for example, have forgotten them. And finally, the Maximum number of additional email addresses setting specifies the number of additional e-mail addresses a user account may have. After a user logs in, Redmine opens a user session. The lifetime of such a session is controlled by the Session maximum lifetime setting (the value disabled means that the session lasts forever). A session can also be automatically terminated if the user has not been active for some time, which is controlled by the Session inactivity timeout setting (the value disabled means that the session never expires). Now, let's discuss the very special setting that we skipped.

The Allow OpenID login and registration setting

If you are running a public website with open registration, you perhaps know (or you will learn, if you want your Redmine installation to be public and open for user registration) that users do not like to register on each new site. This is understandable, as they do not want to create another password to remember or share their existing password with a new and therefore untrusted website. Besides, it's also a matter of sharing the e-mail address and, sometimes, remembering another login. That's when OpenID comes in handy.
OpenID is an open-standard authentication protocol in which authentication (password verification) is performed by an OpenID provider. This popular protocol is currently supported by many companies, such as Yahoo!, PayPal, AOL, LiveJournal, IBM, VeriSign, and WordPress. In other words, servers of such companies can act as OpenID providers, and therefore users can log in to Redmine using the accounts they have on these companies' websites if the Allow OpenID login and registration setting is enabled. Google used to support OpenID too, but it shut this down recently in favor of the OAuth 2.0-based OpenID Connect authentication protocol. Despite the use of "OpenID" in its name, OpenID Connect is very different from OpenID. So, if your Redmine installation is (or is going to be) public, consider enabling this setting. But note that to log in using this protocol, your users will need to specify an OpenID URL (the URL of the OpenID provider) in addition to their Login and Password on the Redmine login form.

LDAP authentication

Just as OpenID is convenient for public sites, to authenticate external users, LDAP is convenient for private sites, to authenticate corporate users. Like OpenID, LDAP is a standard that describes how to authenticate against a special LDAP directory server, and it is widely used by many applications, such as MediaWiki, Apache, JIRA, Samba, SugarCRM, and so on. Also, as LDAP is an open protocol, it is supported by some other directory servers, such as Microsoft Active Directory and Apple Open Directory. For this reason, it is often used by companies as a centralized user directory and an authentication server. To allow users to authenticate against an LDAP server, you should add it to the list of supported authentication modes on the LDAP authentication page, which is available in the Administration menu. To add a mode, click on the New authentication mode link; this will open the authentication mode form. If the On-the-fly user creation option is checked, user accounts will be created automatically when users log in to the system for the first time. If this option is not checked, users will have to be added manually beforehand. Also, if you check this option, you need to specify all the attributes in the Attributes box, as they are going to be used to import user details from the LDAP server. Check with your LDAP server administrator to find out what values should be used in this form. In Redmine, LDAP authentication can be performed against many LDAP servers. Every such server is represented as an authentication source in the authentication mode list that has just been mentioned. The corresponding source can also be seen in a user's profile and can even be changed to the internal Redmine authentication if needed.

Summary

I guess you have become a bit tired of all those general details, installations, configurations, integrations, and so on. In this article, we covered the first thing to fix after installation, the general settings, and authentication, focusing on the hard-to-understand settings and on those that need to be configured in some special way or have some obscurities.


Web Server Development

Packt
15 Apr 2016
24 min read
In this article by Holger Brunn, Alexandre Fayolle, and Daniel Eufémio Gago Reis, the authors of the book Odoo Development Cookbook, we discuss web server development in Odoo. In this article, we'll cover the following topics:

Make a path accessible from the network
Restrict access to web accessible paths
Consume parameters passed to your handlers
Modify an existing handler
Using the RPC API

(For more resources related to this topic, see here.)

Introduction

We'll introduce the basics of the web server part of Odoo in this article. Note that this article covers the fundamental pieces. All of Odoo's web request handling is driven by the Python library werkzeug (http://werkzeug.pocoo.org). While the complexity of werkzeug is mostly hidden by Odoo's convenient wrappers, it is an interesting read to see how things work under the hood.

Make a path accessible from the network

In this recipe, we'll see how to make a URL of the form http://yourserver/path1/path2 accessible to users. This can either be a web page or a path returning arbitrary data to be consumed by other programs. In the latter case, you would usually use the JSON format to consume parameters and to offer your data.

Getting ready

We'll make use of a ready-made library.book model. We want to allow any user to query the full list of books. Furthermore, we want to provide the same information to programs via a JSON request.

How to do it...

We'll need to add controllers, which go into a folder called controllers by convention.

Add a controllers/main.py file with the HTML version of our page:

```python
from openerp import http
from openerp.http import request


class Main(http.Controller):
    @http.route('/my_module/books', type='http', auth='none')
    def books(self):
        records = request.env['library.book'].sudo().search([])
        result = '<html><body><table><tr><td>'
        result += '</td></tr><tr><td>'.join(
            records.mapped('name'))
        result += '</td></tr></table></body></html>'
        return result
```

Add a function to serve the same information in the JSON format:

```python
    @http.route('/my_module/books/json', type='json', auth='none')
    def books_json(self):
        records = request.env['library.book'].sudo().search([])
        return records.read(['name'])
```

Add the file controllers/__init__.py:

```python
from . import main
```

Add the controllers to your addon's __init__.py:

```python
from . import controllers
```

After restarting your server, you can visit /my_module/books in your browser and be presented with a flat list of book names. To test the JSON-RPC part, you'll have to craft a JSON request. A simple way to do that is to use the following command line, which shows the output on the command line:

```
curl -i -X POST -H "Content-Type: application/json" -d "{}" localhost:8069/my_module/books/json
```

If you get 404 errors at this point, you probably have more than one database available on your instance. In this case, it's impossible for Odoo to determine which database is meant to serve the request. Use the --db-filter='^yourdatabasename$' parameter to force Odoo to use the exact database you installed the module in. Now the path should be accessible.

How it works...

The two crucial parts here are that our controller is derived from openerp.http.Controller and that the methods we use to serve content are decorated with openerp.http.route. Inheriting from openerp.http.Controller registers the controller with Odoo's routing system in a similar way as models are registered by inheriting from openerp.models.Model; Controller has a meta class that takes care of this.
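The same JSON endpoint can also be exercised from browser-side JavaScript instead of curl. The following is a sketch under the assumptions above (it mirrors the curl call, sending an empty JSON object as the body, and should be run from a page served by the same Odoo host to avoid same-origin restrictions):

```js
// A sketch mirroring the curl test above: POST an empty JSON body to the
// JSON route and log the result. Odoo wraps the handler's return value in
// a JSON-RPC response, so the book records end up in data.result.
fetch('http://localhost:8069/my_module/books/json', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: '{}'
})
  .then(function (response) { return response.json(); })
  .then(function (data) {
    console.log(data.result);
  });
```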
In general, paths handled by your addon should start with your addon's name to avoid name clashes. Of course, if you extend some addon's functionality, you'll use that addon's name.

openerp.http.route

The route decorator allows us to tell Odoo that a method is to be web accessible in the first place, and the first parameter determines on which path it is accessible. Instead of a string, you can also pass a list of strings, in case you use the same function to serve multiple paths. The type argument defaults to http and determines what type of request is to be served. While, strictly speaking, JSON is HTTP, declaring the second function as type='json' makes life a lot easier, because Odoo then handles type conversions itself. Don't worry about the auth parameter for now; it is addressed in the recipe Restrict access to web accessible paths.

Return values

Odoo's treatment of the functions' return values is determined by the type argument of the route decorator. For type='http', we usually want to deliver some HTML, so the first function simply returns a string containing it. An alternative is to use request.make_response(), which gives you control over the headers to send in the response. So, to indicate when our page was updated the last time, we might change the last line in books() to the following:

```python
        return request.make_response(
            result, [
                ('Last-modified', email.utils.formatdate(
                    (
                        fields.Datetime.from_string(
                            request.env['library.book'].sudo()
                            .search([], order='write_date desc', limit=1)
                            .write_date) -
                        datetime.datetime(1970, 1, 1)
                    ).total_seconds(),
                    usegmt=True)),
            ])
```

This code sends a Last-modified header along with the HTML we generated, telling the browser when the list was modified for the last time. We extract this information from the write_date field of the library.book model. In order for the preceding snippet to work, you'll have to add some imports to the top of the file:

```python
import email
import datetime

from openerp import fields
```

You can also create a werkzeug Response object manually and return that, but there's little gain for the effort.

Generating HTML manually is nice for demonstration purposes, but you should never do this in production code. Always use templates as appropriate and return them by calling request.render(). This will give you localization for free and make your code better by separating business logic from the presentation layer. Also, templates provide you with functions to escape data before outputting HTML. The preceding code is vulnerable to cross-site scripting attacks if a user manages to slip a script tag into a book name, for example.

For a JSON request, simply return the data structure you want to hand over to the client; Odoo takes care of serialization. For this to work, you should restrict yourself to data types that are JSON serializable, which are roughly dictionaries, lists, strings, floats, and integers.

openerp.http.request

The request object is a static object referring to the currently handled request, which contains everything you need to take useful action. Most important is the property request.env, which contains an Environment object that is just the same as self.env in models. This environment is bound to the current user, which is none in the preceding example because we used auth='none'. The lack of a user is also why we have to sudo() all our calls to model methods in the example code. If you're used to web development, you'll expect session handling, which is perfectly correct.
Use request.session for an OpenERPSession object (which is quite a thin wrapper around werkzeug's Session object), and request.session.sid to access the session ID. To store session values, just treat request.session as a dictionary:

```python
request.session['hello'] = 'world'
request.session.get('hello')
```

Note that storing data in the session is no different from using global variables. Use it only if you must; that is usually the case for multi-request actions, such as a checkout in the website_sale module. And also in this case, handle all functionality concerning sessions in your controllers, never in your modules.

There's more...

The route decorator can have some extra parameters to customize its behavior further. By default, all HTTP methods are allowed, and Odoo intermingles the parameters passed. Using the methods parameter, you can pass a list of methods to accept, which usually would be either ['GET'] or ['POST']. To allow cross-origin requests (browsers block AJAX and some other types of requests to domains other than where the script was loaded from, for security and privacy reasons), set the cors parameter to * to allow requests from all origins, or to some URI to restrict requests to ones originating from that URI. If this parameter is unset, which is the default, the Access-Control-Allow-Origin header is not set, leaving you with the browser's standard behavior. In our example, we might want to set it on /my_module/books/json in order to allow scripts pulled from other websites to access the list of books. By default, Odoo protects certain types of requests from an attack known as cross-site request forgery by passing a token along with every request. If you want to turn that off, set the parameter csrf to False, but note that this is a bad idea in general.

See also

If you host multiple Odoo databases on the same instance and each database has different web accessible paths on possibly multiple domain names per database, the standard regular expressions in the --db-filter parameter might not be enough to force the right database for every domain. In that case, use the community module dbfilter_from_header from https://github.com/OCA/server-tools in order to configure the database filters at the proxy level. To see how using templates makes modularity possible, see the recipe Modify an existing handler later in the article.

Restrict access to web accessible paths

We'll explore the three authentication mechanisms Odoo provides for routes in this recipe. We'll define routes with different authentication mechanisms in order to show their differences.

Getting ready

As we extend code from the previous recipe, we'll also depend on the library.book model, so you should get its code correct in order to proceed.
How to do it… Define handlers in controllers/main.py: Add a path that shows all books: @http.route('/my_module/all-books', type='http', auth='none') def all_books(self): records = request.env['library.book'].sudo().search([]) result = '<html><body><table><tr><td>' result += '</td></tr><tr><td>'.join( records.mapped('name')) result += '</td></tr></table></body></html>' return result Add a path that shows all books and indicates which was written by the current user, if any: @http.route('/my_module/all-books/mark-mine', type='http', auth='public') def all_books_mark_mine(self): records = request.env['library.book'].sudo().search([]) result = '<html><body><table>' for record in records: result += '<tr>' if record.author_ids & request.env.user.partner_id: result += '<th>' else: result += '<td>' result += record.name if record.author_ids & request.env.user.partner_id: result += '</th>' else: result += '</td>' result += '</tr>' result += '</table></body></html>' return result Add a path that shows the current user's books: @http.route('/my_module/all-books/mine', type='http', auth='user') def all_books_mine(self): records = request.env['library.book'].search([ ('author_ids', 'in', request.env.user.partner_id.ids), ]) result = '<html><body><table><tr><td>' result += '</td></tr><tr><td>'.join( records.mapped('name')) result += '</td></tr></table></body></html>' return result With this code, the paths /my_module/all_books and /my_module/all_books/mark_mine look the same for unauthenticated users, while a logged in user sees her books in a bold font on the latter path. The path /my_module/all-books/mine is not accessible at all for unauthenticated users. If you try to access it without being authenticated, you'll be redirected to the login screen in order to do so. How it works… The difference between authentication methods is basically what you can expect from the content of request.env.user. For auth='none', the user record is always empty, even if an authenticated user is accessing the path. Use this if you want to serve content that has no dependencies on users, or if you want to provide database agnostic functionality in a server wide module. The value auth='public' sets the user record to a special user with XML ID, base.public_user, for unauthenticated users, and to the user's record for authenticated ones. This is the right choice if you want to offer functionality to both unauthenticated and authenticated users, while the authenticated ones get some extras, as demonstrated in the preceding code. Use auth='user' to be sure that only authenticated users have access to what you've got to offer. With this method, you can be sure request.env.user points to some existing user. There's more… The magic for authentication methods happens in the ir.http model from the base addon. For whatever value you pass to the auth parameter in your route, Odoo searches for a function called _auth_method_<yourvalue> on this model, so you can easily customize this by inheriting this model and declaring a method that takes care of your authentication method of choice. 
As an example, we provide an authentication method base_group_user which enforces a currently logged in user who is a member of the group with XML ID, base.group_user: from openerp import exceptions, http, models from openerp.http import request class IrHttp(models.Model): _inherit = 'ir.http' def _auth_method_base_group_user(self): self._auth_method_user() if not request.env.user.has_group('base.group_user'): raise exceptions.AccessDenied() Now you can say auth='base_group_user' in your decorator and be sure that users running this route's handler are members of this group. With a little trickery you can extend this to auth='groups(xmlid1,…)', the implementation of this is left as an exercise to the reader, but is included in the example code. Consume parameters passed to your handlers It's nice to be able to show content, but it's better to show content as a result of some user input. This recipe will demonstrate the different ways to receive this input and react to it. As the recipes before, we'll make use of the library.book model. How to do it… First, we'll add a route that expects a traditional parameter with a book's ID to show some details about it. Then, we'll do the same, but we'll incorporate our parameter into the path itself: Add a path that expects a book's ID as parameter: @http.route('/my_module/book_details', type='http', auth='none') def book_details(self, book_id): record = request.env['library.book'].sudo().browse( int(book_id)) return u'<html><body><h1>%s</h1>Authors: %s' % ( record.name, u', '.join(record.author_ids.mapped( 'name')) or 'none', ) Add a path where we can pass the book's ID in the path @http.route("/my_module/book_details/<model('library.book') :book>", type='http', auth='none') def book_details_in_path(self, book): return self.book_details(book.id) If you point your browser to /my_module/book_details?book_id=1, you should see a detail page of the book with ID 1. If this doesn't exist, you'll receive an error page. The second handler allows you to go to /my_module/book_details/1 and view the same page. How it works… By default, Odoo (actually werkzeug) intermingles with GET and POST parameters and passes them as keyword argument to your handler. So by simply declaring your function as expecting a parameter called book_id, you introduce this parameter as either GET (the parameter in the URL) or POST (usually passed by forms with your handler as action) parameter. Given that we didn't add a default value for this parameter, the runtime will raise an error if you try to access this path without setting the parameter. The second example makes use of the fact that in a werkzeug environment, most paths are virtual anyway. So we can simply define our path as containing some input. In this case, we say we expect the ID of a library.book as the last component of the path. The name after the colon is the name of a keyword argument. Our function will be called with this parameter passed as keyword argument. Here, Odoo takes care of looking up this ID and delivering a browse record, which of course only works if the user accessing this path has appropriate permissions. Given that book is a browse record, we can simply recycle the first example's function by passing book.id as parameter book_id to give out the same content. There's more… Defining parameters within the path is a functionality delivered by werkzeug, which is called converters. 
The model converter is added by Odoo, which also defines the converter, models, that accepts a comma separated list of IDs and passes a record set containing those IDs to your handler. The beauty of converters is that the runtime coerces the parameters to the expected type, while you're on your own with normal keyword parameters. These are delivered as strings and you have to take care of the necessary type conversions yourself, as seen in the first example. Built-in werkzeug converters include int, float, and string, but also more intricate ones such as path, any, or uuid. You can look up their semantics at http://werkzeug.pocoo.org/docs/0.11/routing/#builtin-converters. See also Odoo's custom converters are defined in ir_http.py in the base module and registered in the _get_converters method of ir.http. As an exercise, you can create your own converter that allows you to visit the /my_module/book_details/Odoo+cookbook page to receive the details of this book (if you added it to your library before). Modify an existing handler When you install the website module, the path /website/info displays some information about your Odoo instance. In this recipe, we override this in order to change this information page's layout, but also to change what is displayed. Getting ready Install the website module and inspect the path /website/info. Now craft a new module that depends on website and uses the following code. How to do it… We'll have to adapt the existing template and override the existing handler: Override the qweb template in a file called views/templates.xml: <?xml version="1.0" encoding="UTF-8"?> <odoo> <template id="show_website_info" inherit_id="website.show_website_info"> <xpath expr="//dl[@t-foreach='apps']" position="replace"> <table class="table"> <tr t-foreach="apps" t-as="app"> <th> <a t-att-href="app.website"> <t t-esc="app.name" /></a> </th> <td><t t-esc="app.summary" /></td> </tr> </table> </xpath> </template> </odoo> Override the handler in a file called controllers/main.py: from openerp import http from openerp.addons.website.controllers.main import Website class Website(Website): @http.route() def website_info(self): result = super(Website, self).website_info() result.qcontext['apps'] = result.qcontext[ 'apps'].filtered( lambda x: x.name != 'website') return result Now when visiting the info page, we'll only see a filtered list of installed applications, and in a table as opposed to the original definition list. How it works In the first step, we override an existing QWeb template. In order to find out which that is, you'll have to consult the code of the original handler. Usually, it will end with the following command line, which tells you that you need to override template.name: return request.render('template.name', values) In our case, the handler uses a template called website.info, but this one is extended immediately by another template called website.show_website_info, so it's more convenient to override this one. Here, we replace the definition list showing installed apps with a table. In order to override the handler method, we must identify the class that defines the handler, which is openerp.addons.website.controllers.main.Website in this case. We import the class to be able to inherit from it. Now we override the method and change the data passed to the response. Note that what the overridden handler returns is a Response object and not a string of HTML as the previous recipes did for the sake of brevity. 
This object contains a reference to the template to be used and the values accessible to the template, but is only evaluated at the very end of the request. In general, there are three ways to change an existing handler: If it uses a QWeb template, the simplest way of changing it is to override the template. This is the right choice for layout changes and small logic changes. QWeb templates get a context passed, which is available in the response as the field qcontext. This usually is a dictionary where you can add or remove values to suit your needs. In the preceding example, we filter the list of apps to only contain apps which have a website set. If the handler receives parameters, you could also preprocess those in order to have the overridden handler behave the way you want. There's more… As seen in the preceding section, inheritance with controllers works slightly differently than model inheritance: You actually need a reference to the base class and use Python inheritance on it. Don't forget to decorate your new handler with the @http.route decorator; Odoo uses it as a marker for which methods are exposed to the network layer. If you omit the decorator, you actually make the handler's path inaccessible. The @http.route decorator itself behaves similarly to field declarations: every value you don't set will be derived from the decorator of the function you're overriding, so we don't have to repeat values we don't want to change. After receiving a response object from the function you override, you can do a lot more than just changing the QWeb context: You can add or remove HTTP headers by manipulating response.headers. If you want to render an entirely different template, you can set response.template. To detect if a response is based on QWeb in the first place, query response.is_qweb. The resulting HTML code is available by calling response.render(). Using the RPC API One of Odoo's strengths is its interoperability, which is helped by the fact that basically any functionality is available via JSON-RPC 2.0 and XMLRPC. In this recipe, we'll explore how to use both of them from client code. This interface also enables you to integrate Odoo with any other application. Making functionality available via any of the two protocols on the server side is explained in the There's more section of this recipe. We'll query a list of installed modules from the Odoo instance, so that we could show a list as the one displayed in the previous recipe in our own application or website. 
How to do it… The following code is not meant to run within Odoo, but as simple scripts: First, we query the list of installed modules via XMLRPC: #!/usr/bin/env python2 import xmlrpclib db = 'odoo9' user = 'admin' password = 'admin' uid = xmlrpclib.ServerProxy( 'http://localhost:8069/xmlrpc/2/common') .authenticate(db, user, password, {}) odoo = xmlrpclib.ServerProxy( 'http://localhost:8069/xmlrpc/2/object') installed_modules = odoo.execute_kw( db, uid, password, 'ir.module.module', 'search_read', [[('state', '=', 'installed')], ['name']], {'context': {'lang': 'fr_FR'}}) for module in installed_modules: print module['name'] Then we do the same with JSONRPC: import json import urllib2 db = 'odoo9' user = 'admin' password = 'admin' request = urllib2.Request( 'http://localhost:8069/web/session/authenticate', json.dumps({ 'jsonrpc': '2.0', 'params': { 'db': db, 'login': user, 'password': password, }, }), {'Content-type': 'application/json'}) result = urllib2.urlopen(request).read() result = json.loads(result) session_id = result['result']['session_id'] request = urllib2.Request( 'http://localhost:8069/web/dataset/call_kw', json.dumps({ 'jsonrpc': '2.0', 'params': { 'model': 'ir.module.module', 'method': 'search_read', 'args': [ [('state', '=', 'installed')], ['name'], ], 'kwargs': {'context': {'lang': 'fr_FR'}}, }, }), { 'X-Openerp-Session-Id': session_id, 'Content-type': 'application/json', }) result = urllib2.urlopen(request).read() result = json.loads(result) for module in result['result']: print module['name'] Both code snippets will print a list of installed modules, and because they pass a context that sets the language to French, the list will be in French if there are no translations available. How it works… Both snippets call the function search_read, which is very convenient because you can specify a search domain on the model you call, pass a list of fields you want to be returned, and receive the result in one request. In older versions of Odoo, you had to call search first to receive a list of IDs and then call read to actually read the data. search_read returns a list of dictionaries, with the keys being the names of the fields requested and the values the record's data. The ID field will always be transmitted, no matter if you requested it or not. Now, we need to look at the specifics of the two protocols. XMLRPC The XMLRPC API expects a user ID and a password for every call, which is why we need to fetch this ID via the method authenticate on the path /xmlrpc/2/common. If you already know the user's ID, you can skip this step. As soon as you know the user's ID, you can call any model's method by calling execute_kw on the path /xmlrpc/2/object. This method expects the database you want to execute the function on, the user's ID and password for authentication, then the model you want to call your function on, and then the function's name. The next two mandatory parameters are a list of positional arguments to your function, and a dictionary of keyword arguments. JSONRPC Don't be distracted by the size of the code example, that's because Python doesn't have built in support for JSONRPC. As soon as you've wrapped the urllib calls in some helper functions, the example will be as concise as the XMLRPC one. As JSONRPC is stateful, the first thing we have to do is to request a session at /web/session/authenticate. This function takes the database, the user's name, and their password. 
The crucial part here is that we record the session ID Odoo created, which we pass in the header X-Openerp-Session-Id to /web/dataset/call_kw. From there the function behaves the same as execute_kw in the XMLRPC example: we need to pass a model name and a function to call on it, then positional and keyword arguments. There's more… Both protocols allow you to call basically any function of your models. In case you don't want a function to be available via either interface, prepend its name with an underscore – Odoo won't expose those functions as RPC calls. Furthermore, you need to take care that your parameters, as well as the return values, are serializable for the protocol. To be sure, restrict yourself to scalar values, dictionaries, and lists. As you can do roughly the same with both protocols, it's up to you which one to use. This decision should be mainly driven by what your platform supports best. In a web context, you're generally better off with JSON, because Odoo allows JSON handlers to pass a CORS header conveniently (see the Make a path accessible from the network recipe for details). This is rather difficult with XMLRPC. Summary In this article, we saw how to get started with the web server architecture. Later on, we covered routes and controllers, their authentication, how handlers consume parameters, and how to use the RPC API, namely JSON-RPC and XML-RPC. Resources for Article: Further resources on this subject: Advanced React [article] Remote Authentication [article] ASP.Net Site Performance: Improving JavaScript Loading [article]

Finding Patterns in the Noise – Clustering and Unsupervised Learning

Packt
15 Apr 2016
17 min read
In this article by, Joseph J, author of Mastering Predictive Analytics with Python, we will cover one of the natural questions to ask about a dataset is if it contains groups. For example, if we examine financial markets as a time series of prices over time, are there groups of stocks that behave similarly over time? Likewise, in a set of customer financial transactions from an e-commerce business, are there user accounts distinguished by patterns of similar purchasing activity? By identifying groups using the methods described in this article, we can understand the data as a set of larger patterns rather than just individual points. These patterns can help in making high-level summaries at the outset of a predictive modeling project, or as an ongoing way to report on the shape of the data we are modeling. Likewise, the groupings produced can serve as insights themselves, or they can provide starting points for the models. For example, the group to which a datapoint is assigned can become a feature of this observation, adding additional information beyond its individual values. Additionally, we can potentially calculate statistics (such as mean and standard deviation) for other features within these groups, which may be more robust as model features than individual entries. (For more resources related to this topic, see here.) In contrast to the methods, grouping or clustering algorithms are known as unsupervised learning, meaning we have no response, such as a sale price or click-through rate, which is used to determine the optimal parameters of the algorithm. Rather, we identify similar datapoints, and as a secondary analysis might ask whether the clusters we identify share a common pattern in their responses (and thus suggest the cluster is useful in finding groups associated with the outcome we are interested in). The task of finding these groups, or clusters, has a few common ingredients that vary between algorithms. One is a notion of distance or similarity between items in the dataset, which will allow us to compare them. A second is the number of groups we wish to identify; this can be specified initially using domain knowledge, or determined by running an algorithm with different choices of initial groups to identify the best number of groups that describes a dataset, as judged by numerical variance within the groups. Finally, we need a way to measure the quality of the groups we've identified; this can be done either visually or through the statistics that we will cover. In this article we will dive into: How to normalize data for use in a clustering algorithm and to compute similarity measurements for both categorical and numerical data How to use k-means to identify an optimal number of clusters by examining the loss function How to use agglomerative clustering to identify clusters at different scales Using affinity propagation to automatically identify the number of clusters in a dataset How to use spectral methods to cluster data with nonlinear boundaries Similarity and distance The first step in clustering any new dataset is to decide how to compare the similarity (or dissimilarity) between items. Sometimes the choice is dictated by what kinds of similarity we are trying to measure, in others it is restricted by the properties of the dataset. 
In the following we illustrate several kinds of distance for numerical, categorical, time series, and set-based data—while this list is not exhaustive, it should cover many of the common use cases you will encounter in business analysis. We will also cover normalizations that may be needed for different data types prior to running clustering algorithms. Numerical distances Let's begin by looking at an example contained in the wine.data file. It contains a set of chemical measurements that describe the properties of different kinds of wines, and the class of quality (I-III) to which the wine is assigned (Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation, Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.). Open the file in an iPython notebook and look at the first few rows: Notice that in this dataset we have no column descriptions. We need to parse these from the dataset description file wine.data. With the following code, we generate a regular expression that will match a header name (we match a pattern where a number followed by a parenthesis has a column name after it, as you can see in the list of column names listed in the file), and add these to an array of column names along with the first column, which is the class label of the wine (whether it belongs to category I-III). We then assign this list to the dataframe column names: Now that we have appended the column names, we can look at a summary of the dataset: How can we calculate a similarity between wines based on this data? One option would be to consider each of the wines as a point in a thirteen-dimensional space specified by its dimensions (for example, each of the properties other than the class). Since the resulting space has thirteen dimensions, we can't directly visualize the datapoints using a scatterplot to see if they are nearby, but we can calculate distances just the same as with a more familiar 2- or 3-dimensional space using the Euclidean distance formula, which is simply the length of the straight line between two points. This formula for this length can be used whether the points are in a 2-dimensional plot or a more complex space such as this example, and is given by: Here aand bare rows of the dataset and nis the number of columns. One feature of the Euclidean distance is that columns whose scale is much different from others can distort it. In our example, the values describing the magnesium content of each wine are ~100 times greater than the magnitude of features describing the alcohol content or ash percentage. If we were to calculate the distance between these datapoints, it would largely be determined by the magnesium concentration (as even small differences on this scale overwhelmingly determine the value of the distance calculation), rather than any of its other properties. While this might sometimes be desirable, in most applications we do not favour one feature over another and want to give equal weight to all columns. To get a fair distance comparison between these points, we need to normalize the columns so that they fall into the same numerical range (have similar maxima and minima values). We can do so using the scale()function in scikit-learn:   This function will subtract the mean value of a column from each element and then divide each point by the standard deviation of the column. 
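The scaling step itself appears only as a screenshot in the source material; a minimal sketch of what that call looks like is given below. The variable name wine_df and the assumption that the class label sits in the first column are ours:

import pandas as pd
from sklearn.preprocessing import scale

# drop the class label, keep the 13 chemical measurements
features = wine_df.iloc[:, 1:]

# scale() returns a plain numpy array, so wrap it back into a DataFrame
# before calling describe()
scaled = pd.DataFrame(scale(features), columns=features.columns)
print(scaled.describe())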
This normalization centers each column at 0 with variance 1, and in the case of normally distributed data this would make a standard normal distribution. Also note that the scale() function returns a numpy dataframe, which is why we must call dataframe on the output to use the pandas function describe(). Now that we've scaled the data, we can calculate Euclidean distances between the points: We've now converted our dataset of 178 rows and 13 columns into a square matrix, giving the distance between each of these rows. In other words, row I, column j in this matrix represents the Euclidean distance between rows I and j in our dataset. This 'distance matrix' is the input we will use for clustering inputs in the following section. If we just want to get a visual sense of how the points compare to each other, we could use multidimensional scaling (MDS)—Modern Multidimensional Scaling - Theory and Applications Borg, I., Groenen P., Springer Series in Statistics (1997), Nonmetric multidimensional scaling: a numerical method, Kruskal, J. Psychometrika, 29 (1964), and Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Kruskal, J. Psychometrika, 29, (1964)—to create a visualization. Multidimensional scaling attempts to find the set of lower dimensional coordinates (here, two dimensions) that best represents the distances in the higher dimensions of a dataset (here, the pairwise Euclidean distances we calculated from the 13 dimensions). It does this by minimizing the coordinates (x, y) according to the strain function: Strain(x1…..xn) = (1 – Sum(ijdij*<xi,xj>)2/Sum(ij(dij**2)Sumij<xi,x,j>**2))1/2 Where d are the distances we've calculated between points. In other words, we find coordinates that best capture the variation in the distance through the variation in dot product the coordinates. We can then plot the resulting coordinates, using the wine class to label points in the diagram. Note that the coordinates themselves have no interpretation (in fact, they could change each time we run the algorithm). Rather, it is the relative position of points that we are interested in: Given that there are many ways we could have calculated the distance between datapoints, is the Euclidean distance a good choice here? Visually, based on the multidimensional scaling plot, we can see there is separation between the classes based on the features we've used to calculate distance, so conceptually it appears that this is a reasonable choice in this case. However, the decision also depends on what we are trying to compare; if we are interested in detecting wines with similar attributes in absolute values, then it is a good metric. However, what if we're not interested so much in the absolute composition of the wine, but whether its variables follow similar trends among wines with different alcohol contents? In this case, we wouldn't be interested in the absolute difference in values, but rather the correlationbetween the columns. This sort of comparison is common for time series, which we turn to next. Correlations and time series For time series data, we are often concerned with whether the patterns between series exhibit the same variation over time, rather than their absolute differences in value. For example, if we were to compare stocks, we might want to identify groups of stocks whose prices move up and down in similar patterns over time. The absolute price is of less interest than this pattern of increase and decrease. 
Let's look at an example of the Dow Jones industrial average over time (Brown, M. S., Pelosi, M., and Dirska, H. (2013). Dynamic-radius Species-conserving Genetic Algorithm for the Financial Forecasting of Dow Jones Index Stocks and Machine Learning and Data Mining in Pattern Recognition, 7988, 27-41.): This data contains the daily stock price (for 6 months) for a set of 30 stocks. Because all of the numerical values (the prices) are on the same scale, we won't normalize this data as with the wine dimensions. We notice two things about this data. First, the closing price per week (the variable we will use to calculate correlation) is presented as a string. Second, the date is not in the current format for plotting. We will process both columns to fix this, converting the columns to a float and datetime object, respectively: With this transformation, we can now make a pivot table to place the closing prices for week as columns and individual stocks as rows: As we can see, we only need columns 2 and onwards to calculate correlations between rows. Let's calculate the correlation between these time series of stock prices by selecting the second column to end columns of the data frame, calculating the pairwise correlations distance metric, and visualizing it using MDS, as before: It is important to note that the Pearson coefficient, which we've calculated here, is a measure of linearcorrelation between these time series. In other words, it captures the linear increase (or decrease) of the trend in one price relative to another, but won't necessarily capture nonlinear trends. We can see this by looking at the formula for the Pearson correlation, which is given by: P(a,b) = cov(a,b)/sd(a)/sd(b) = Sum(a-mean(b))*Sum(b-mean(b))/Sqrt(Sum(a-mean(a))2* Sqrt(Sum(b-mean(b)) This value varies from 1 (highly correlated) to -1 (inversely correlated), with 0 representing no correlation (such as a cloud of points). You might recognize the numerator of this equation as the covariance, which is a measure of how much two datasets, a and b, vary with one another. You can understand this by considering that the numerator is maximized when corresponding points in both datasets are above or below their mean value. However, whether this accurately captures the similarity in the data depends upon the scale. In data that is distributed in regular intervals between a maximum and minimum, with roughly the same difference between consecutive values (which is essentially how a trend line appears), it captures this pattern well. However, consider a case in which the data is exponentially distributed, with orders of magnitude differences between the minimum and maximum, and the difference between consecutive datapoints also varyies widely. Here, the Pearson correlation would be numerically dominated by only the largest terms, which might or might not represent the overall similarity in the data. This numerical sensitivity also occurs in the numerator, which represents the product of the standard deviations of both datasets. Thus, the value of the correlation is maximized when the variation in the two datasets is roughly explained by the product of their individual variations; there is no 'left over' variation between the datasets that is not explained by their respective standard deviations. 
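Written out without the formatting loss, the Pearson coefficient is r(a, b) = sum((a_i - mean(a)) * (b_i - mean(b))) / (sqrt(sum((a_i - mean(a))^2)) * sqrt(sum((b_i - mean(b))^2))). The pairwise-correlation-plus-MDS step described above is likewise shown only as a screenshot in the source; a rough sketch of it might look like the following, where the prices pivot table and its column layout are assumptions on our part:

import numpy as np
from sklearn.manifold import MDS

# one row per ticker, one column per weekly closing price
weekly = prices.iloc[:, 1:].astype(float)

# Pearson correlation between every pair of stocks, turned into a distance
corr = np.corrcoef(weekly.values)
corr_dist = 1.0 - corr

# two-dimensional embedding of the precomputed distances, as before
coords = MDS(n_components=2, dissimilarity='precomputed',
             random_state=0).fit_transform(corr_dist)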
Looking at the first two stocks in this dataset, this assumption of linearity appears to be a valid one for comparing datapoints: In addition to verifying that these stocks have a roughly linear correlation, this command introduces some new functions in pandas you may find useful. The first is iloc, which allows you to select indexed rows from a dataframe. The second is transpose, which inverts the rows and columns. Here, we select the first two rows, transpose, and then select all rows (prices) after the first (since the first is the Ticker symbol) Despite the trend we see in this example, we could imagine a nonlinear trend between prices. In these cases, it might be better to measure, not the linear correlation of the prices themselves, but whether the high prices for one stock coincide with another. In other words, the rank of market days by price should be the same, even if the prices are nonlinearly related. We can also calculate this rank correlation, also known as the Spearman's Rho, using scipy, with the following formula: Rho(a,b) = 6 * sum(d^2) / n (n2-1) Where n is the number of datapoints in each of two sets a and b, and d is the difference in ranks between each pair of datapoints ai and bi. Because we only compare the ranks of the data, not their actual values, this measure can capture variations up and down between two datasets, even if they vary over wide numerical ranges. Let's see if plotting the results using the Spearman correlation metric generates any differences in the pairwise distance of the stocks: The Spearman correlation distances, based on the x and y axes, appear closer to each other, suggesting from the perspective of rank correlation that the time series appear more similar. Though they differ in their assumptions about how the two compared datasets are distributed numerically, Pearson and Spearman correlations share the requirement that the two sets are of the same length. This is usually a reasonable assumption, and will be true of most of the examples we consider in this book. However, for cases where we wish to compare time series of unequal lengths, we can use Dynamic Time Warping (DTW). Conceptually, the idea of DTW is to warp one time series to align with a second, by allowing us to open gaps in either dataset so that it becomes the same size as the second. What the algorithm needs to resolve is where the most similar areas of the two series are, so that gaps can be places in the appropriate locations. In the simplest implementation, DTW consists of the following steps: For a dataset a of length n and a dataset n of length m, construct a matrix m by n. Set the top row and the leftmost column of this matrix to both be infinity. For each point i in set a, and point j in set b, compare their similarity using a cost function. To this cost function, add the minimum of the element (i-1, j-1), (i-1, j), and (j-1, i)—moving up and left, left, or up). These conceptually represent the costs of opening a gap in one of the series, versus aligning the same element in both. At the end of step 3, we will have traced the minimum cost path to align the two series, and the DTW distance will be represented by the bottommost corner of the matrix, (n.m). A negative aspect of this algorithm is that step 3 involves computing a value for every element of series a and b. For large time series or large datasets, this can be computationally prohibitive. 
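As an aside, the rank-correlation formula quoted above lost some of its formatting; in its usual statement it reads rho(a, b) = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)). As for DTW, the three steps just listed translate almost directly into a naive implementation, and writing it out makes the cost of step 3 obvious: every cell of an n-by-m matrix must be filled in. The sketch below is illustrative only; it uses a plain absolute-difference cost, whereas the examples in this article rely on the fastdtw package:

import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    # steps 1 and 2: an (n+1) x (m+1) matrix with an "infinite" border
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # step 3: local cost of aligning a_i with b_j
            # add the cheapest of: aligning both points, or opening a gap in either series
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    # the bottom-right corner holds the DTW distance between the two series
    return cost[n, m]

This brute-force version is fine for short series, but its run time grows with the product of the two lengths, which is exactly the cost issue the optimized variants mentioned next work around.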
While a full discussion of algorithmic improvements is beyond the scope of our present examples, we refer interested readers to FastDTW (which we will use in our example) and SparseDTW as examples of improvements that can be evaluated using many fewer calculations (Al-Naymat, G., Chawla, S., & Taheri, J. (2012), SparseDTW: A Novel Approach to Speed up Dynamic Time Warping and Stan Salvador and Philip Chan, FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space. KDD Workshop on Mining Temporal and Sequential Data, pages 70-80, 20043). We can use the FastDTW algorithm to compare the stocks data as well, and to plot the resulting coordinates. First we will compare pairwise each pair of stocks and record their DTW distance in a matrix: For computational efficiency (because the distance between i and j equals the distance between stocks j and i), we calculate only the upper triangle of this matrix. We then add the transpose (for example, the lower triangle) to this result to get the full distance matrix. Finally, we can use MDS again to plot the results: Compared to the distribution of coordinates along the x and y axis for Pearson correlation and rank correlation, the DTW distances appear to span a wider range, picking up more nuanced differences between the time series of stock prices. Now that we've looked at numerical and time series data, as a last example let's examine calculating similarity in categorical datasets. Summary In this section, we learned how to identify groups of similar items in a dataset, an exploratory analysis that we might frequently use as a first step in deciphering new datasets. We explored different ways of calculating the similarity between datapoints and described what kinds of data these metrics might best apply to. We examined both divisive clustering algorithms, which split the data into smaller components starting from a single group, and agglomerative methods, where every datapoint starts as its own cluster. Using a number of datasets, we showed examples where these algorithms will perform better or worse, and some ways to optimize them. We also saw our first (small) data pipeline, a clustering application in PySpark using streaming data. Resources for Article: Further resources on this subject: Python Data Structures[article] Big Data Analytics[article] Data Analytics[article]

Getting Started with Force.com

Packt
15 Apr 2016
17 min read
In this article by Siddhesh Kabe, the author of the book Salesforce Platform App Builder Certification Handbook, will introduce you to the Force.com platform. We will understand the life cycle of an application build using Force.com. We will define the multi-tenant architecture and understand how it will impact the data of the organization stored on the cloud. And finally, we will build our first application on Force.com. We will cover the following topics in this article: The multi-tenant architecture of Force.com Understanding the Force.com platform Application development on the Force.com platform Discussing the maintenance and releases schedule by Salesforce.com Types of Force.com applications Discussing the scenarios when to use point-and-click customization and when to use code Talking about Salesforce.com identity Developer resources So, let's get started and step into the cloud. (For more resources related to this topic, see here.) The cloud computing model of Force.com Force.com is a cloud computing platform used to build enterprise applications. The end user does not have to worry about networks, hardware, software licenses, or any other things. The data saved is completely secure in the cloud. The following features of Force.com make it a 100 percent cloud-based system: The multi-tenant architecture: The multi-tenant architecture is a way of serving multiple clients on the single software instance. Each client gets their own full version of the software configuration and data. They cannot utilize the other instance resources. The software is virtually partitioned into different instances. The basic structure of the multi-tenant architecture is shown in the following image: Just like how tenants in a single building share the resources of electricity and water, in the multi-tenant system, tenants share common resources and databases. In a multi-tenant system, such as Salesforce.com, different organizations use the same shared database system that is separated by a secure virtual partition. Special programs keep the data separated and make sure that no single organization monopolizes on the resources. Automatic upgrades: In a cloud computing system, all the new updates are automatically released to its subscribers. Any developments or customizations made during the previous version are automatically updated to the latest version without any manual modification to the code. This results in all instances of Salesforce staying up to date and on the same version. Subscription model: Force.com is distributed under the subscription model. The user can purchase a few licenses and build the system. After the system is up and successful, further user licenses can be purchased from Salesforce. This model ensures that there no large start up fees and we pay as we go, which adds fixed, predictable costs in the future. The subscription model can be visualized like the electricity distribution system. We pay for whatever electricity we use and not the complete generator and the infrastructure. Scalability: The multi-tenant kernel is already tested and running for many users simultaneously. If the organization is growing, there is always room for scaling the application with new users without worrying about the load balancing and data limitation. Force.com provides data storage per user, which means that the data storage increases with the number of users added to the organization. Upgrades and maintenance: Force.com releases three updated versions every year. 
The new releases consist of feature updates to Salesforce.com and the Force.com platform with selected top ideas from IdeaExchange. IdeaExchange is the community of Salesforce users where the users submit ideas and the community votes for them. The most popular ideas are considered by Salesforce in their next release. All the instances hosted on the servers are upgraded with no additional cost. The Salesforce maintenance outage during a major release is only 5 minutes. The sandboxes are upgraded early so there can be testing for compatibility with the new release. The new releases are backward compatible with previous releases, thus the old code will work with new versions. The upgrades are taken care of by Force.com and the end user gets the latest version running application. Understanding the new model of Salesforce1 platform In the earlier edition of the book, we discussed the Force.com platform in detail. In the last couple of years, Salesforce has introduced the new Salesforce1 platform. It encompasses all the existing features of the Force.com platform but also includes the new powerful tools for mobile development. The new Salesforce1 platform is build mobile first and all the existing features of cloud development are automatically available for mobiles. From Winter 16, Salesforce has also introduced the lighting experience. The lighting experience is another extension to the existing platform. It provides a brand new set of design and development library that lets developers build applications that work on mobiles as well as web. Let's take a detailed look at the services that form the platform offered by Force.com. The following section provides us with an overview of the Force.com platform. Force.com platform Force.com is the world's first cloud application development platform where end users can build, share, and run an application directly on the cloud. While most cloud computing systems provide the ability to deploy the code from the local machine, Force.com gives us the feature to directly write the code in cloud. The Force.com platform runs in a hosted multi-tenant environment, which gives the end users freedom to build their custom application without hardware purchases, database maintenance, and maintaining a software license. Salesforce.com provides the following main products: Sales force Automation, Sales Cloud Service and Support Center, Service Cloud The Exact Target Marketing Cloud Collaboration Cloud, Chatter The following screenshot shows the Force.com platform: The application built on Force.com is automatically hosted on the cloud platform. It can be used separately (without the standard Sales, Service, and Marketing cloud) or can be used in parallel with the existing Salesforce application. The users can access the application using a browser from any mobile, computer, tablet, and any of the operating system such as Windows, UNIX, Mac, and so on, giving them complete freedom of location. For a complete list of supported browsers, visit https://help.salesforce.com/apex/HTViewHelpDoc?id=getstart_browser_overview.htm. Model-View-Controller architecture The most efficient way to build an enterprise application is to clearly separate out the model, that is, data, the code, that is, controller, and the UI, that is, the View. By separating the three, we can make sure that each area is handled by an expert. The business logic is separated from the backend database and the frontend user interface.  
It is also easy to upgrade a part of the system without disturbing the entire structure. The following diagram illustrates the model-view-controller of Force.com: We will be looking in detail at each layer in the MVC architecture in the subsequent article. Key technology behind the Force.com platform Force.com is a hosted multi-tenant service used to build a custom cloud computing application. It is a 100 percent cloud platform where we pay no extra cost for the hardware and network. Any application built on Force.com is directly hosted on the cloud and can be accessed using a simple browser from a computer or a mobile. The Force.com platform runs on some basic key technologies. The multi-tenant kernel The base of the platform forms a multi-tenant kernel where all users share a common code base and physical infrastructure. The multiple tenants, who are hosted on a shared server, share the resources under governor limits to prevent a single instance monopolizing the resources. The custom code and data are separated by software virtualization and users cannot access each other's code. The multi-tenant kernel ensures that all the instances are updated to the latest version of the software simultaneously. The updates are applied automatically without any patches or software download. The multi-tenant architecture is already live for one million users. This helps developers easily scale the applications from one to a million users with little or no modification at all. The following image illustrates the multi-tenant architecture: Traditional software systems are hosted on a single-tenant system, usually a client-server-based enterprise application. With the multi-tenant architecture, the end user does not have to worry about the hardware layer or software upgrade and patches. The software system deployed over the Internet can be accessed using a browser from any location possible, even wide ranges of mobile devices. The multi-tenant architecture also allows the applications to be low cost, quick to deploy, and open to innovation. Other examples of software using the multi-tenant architecture are webmail systems, such as www.Gmail.com, www.Yahoo.com, and online storage systems, such as www.Dropbox.com, or note-taking applications, such as Evernote, Springpad, and so on. Force.com metadata Force.com is entirely metadata-driven. The metadata is defined in XML and can be extracted and imported. We will look into metadata in detail later in this article. Force.com Webservice API The data and the metadata stored on the Force.com server can be accessed programmatically through the Webservice API. This enables the developers to extend the functionality to virtually any language, operating system, and platform possible. The web services are based on open web standards, such as SOAP XML and JSON REST, and are directly compatible with other technologies, such as .Net, JAVA, SAP, and Oracle. We can easily integrate the Force.com application with the current business application without rewriting the entire code. Apex and Visualforce Apex is the world's first on-demand language introduced by Salesforce. It is an object-oriented language very similar to C# or JAVA. Apex is specially designed to process bulk data for business applications. Apex is used to write the controller in the MVC architecture. Salesforce Object Query Language (SOQL) gives developers an easy and declarative query language that can fetch and process a large amount of data in an easy, human-readable query language. 
For those who have used other relational database systems, such as Oracle, SQL Server, and so on, it is similar to SQL but does not support advance capabilities, such as joins. Apex and SOQL together give the developers powerful tools to manage the data and processes of their application, leaving the rest of the overhead on the Force.com platform. The following screenshot shows the page editor for Visualforce. It is easy to use and splits a page into two parts: the one at the bottom is for development and the above half shows the output: Visualforce is an easy to use, yet a powerful framework used to create rich user interfaces, thus extending the standard tabs and forms to any kind of interfaces imaginable. Visualforce ultimately renders into HTML, and hence, we can use any HTML code alongside the Visualforce markup to create a powerful and rich UI to manage business applications. Apart from the UI, Visualforce provides a very easy and direct access to the server-side data and metadata from Apex. The powerful combination of a rich UI with access to the Salesforce metadata makes Visualforce the ultimate solution to build powerful business applications on Salesforce. As the Salesforce.com Certified Force.com Developer Certification does not include Apex and Visualforce, we won't be going into detail about Apex and Visualforce in this book. The developer Console The developer console is an Integrated Development Environment (IDE) for tools to help write code, run tests, and debug the system. The developer console provides an editor for writing code. It also provides a UI to monitor and debug Unit test classes, as shown in the following screenshot: AppExchange AppExchange is the directory of applications build on the Force.com platform. Developers can choose to submit their developed applications on AppExchange. The applications extend the functionality of Force.com beyond CRM with many ready-made business applications available to download and use. AppExchange is available at http://appexchange.salesforce.com. Force.com sites Using Force.com sites or site.com, we can build public facing websites that use the existing Salesforce data and browser technologies, such as HTML, JavaScript, CSS, Angular JS, Bootstrap, and so on. The sites can have an external login for sensitive data or a no login public portal that can be linked to the corporate website as well. Site.com helps in creating websites using drag-and-drop controls. The user with less or no HTML knowledge can build websites using the site.com editor. Force.com development Like any other traditional software development process, the Force.com platform offers tools used to define data, business process, logic, and rich UI for the business application. Many of these tools are built-in, point-and-click tools simplified for users without any development skills. Any user with no programming knowledge can build applications suitable to their business on Force.com. The point-and-click tools are easy to use, but they have limitations and control. To extend the platform beyond these limitations, we use Apex and Visualforce. 
Let's now compare the tools used for traditional software development and Force.com:   JAVA Dot Net Force.com Building the database Oracle, MS-Access, SQL, or any third-party database setup Oracle, MS-Access, SQL, or any third-party database setup Salesforce metadata (now database.com) Connection to the database JDBC   Ado.net Salesforce metadata API Developing the IDE NetBeans, Eclipse, and so on Visual Studio Online Page Editor and App Setup, Force.com IDE, Mavens Mate, and Aside.io Controlled environment for development and testing Local servers, remote test servers Local servers, remote test servers Force.com real time sandboxes Force.com metadata Everything on Force.com such as data models, objects, forms, tabs, and workflows are defined by metadata. The definitions or metadata are made in XML and can be extracted and imported. The metadata-driven development also helps users with no prior development experience to build business applications without any need to code. We can define the objects, tabs, and forms in the UI using point-and-click. All the changes made to the metadata in App-Setup are tracked. Alternatively, the developers can customize every part of Salesforce using XML flies that control the organization's metadata. The files are downloaded using the Eclipse IDE or Force.com IDE. To customize metadata on Salesforce UI, go to Setup | Build: As Force.com Developer Certification is about using point-and-click, we will be going into the setup details in the coming article. Metadata API The metadata API provides easy access to the organization data, business logic, and the user interface. We can modify the metadata in a controlled test organization called the sandbox. Finally, the tested changes can be deployed to a live production environment edition. The production environment is the live environment that is used by the users and contains live data. The production instance does not allow developers to code in them directly; this ensures that only debugged and tested code reaches the live organization. Online page editor and the Eclipse Force.com IDE Force.com provides a built-in online editor that helps edit the Visualforce pages in real time. The online editor can be enabled by checking the Development Mode checkbox on the user profile, as shown in the following screenshot: The online page editor splits the screen into two parts with live code in the bottom half and the final page output in the top half. Force.com also provides an inline editor for editing the Apex code in the browser itself. Force.com IDE is an IDE built over eclipse. It provides an easy environment to write code and also offline saving. It also comes with a schema browser and a query generator, which helps in generating simple queries (select statements) by selecting fields and objects. The code is auto synced with the organization. Sandboxes Force.com provides a real-time environment to develop, test, and train people in the organization. It is a safe and isolated environment where any changes made will not affect the production data or application. These sandboxes are used to experiment on new features without disturbing the live production organization. Separation of test and dev instances also ensures that only the tested and verified code reaches the production organization. There are three types of sandboxes: Developer sandbox: This environment is specially used to code and test the environment by a single developer. 
Just like the configuration-only sandbox, this also copies the entire customization of the production organization, excluding the data. The added feature of a developer sandbox is that it allows Apex and Visualforce coding also. Developer pro sandbox: Developer pro sandboxes are similar to developer sandboxes but with larger storage. This sandbox is mostly used to handle more developer and quality assurance tasks. With a larger sandbox, we can store more data and run more efficient tasks. Partial copy sandbox: This is used as a testing environment. This environment copies the full metadata of the production environment and a subset of production data that can be set using a template. Full copy sandbox: This copies the entire production organization and all its data records, documents, and attachments. This is usually used to develop and test a new application until it is ready to be shared with the users. Full copy sandbox has the same IDs of the records as that of production only when it has been freshly created. Force.com application types There are some common types of applications that are required to automate an enterprise process. They are as follows: Content centric applications: These applications enable organizations to share and version content across different levels. They consist of file sharing systems, versioning systems, and content management systems. Transaction centric applications: These applications focus on the transaction. They are applications, such as banking systems, online payment systems, and so on. Process centric applications: These applications focus on automating the business process in the organization such as a bug tracking system, procurement process, approval process, and so on. Force.com is suited to build these kinds of applications. Data centric applications: These applications are built around a powerful database. Many of the organizations use spreadsheets for these applications. Some examples include CRM, HRM, and so on. Force.com is suited to build these kinds of applications. Developing on the Force.com platform There are two ways of development on Force.com: one way is to use point-and-click without a single line of coding, called the declarative development. The other way is to develop an application using code, called programmatic development. Let's take a look at the two types of development in detail. Declarative development The declarative type of development is done by point-and-click using a browser. We use ready-to-use components and modify their configuration to build applications. We can add new objects, define their standard views, and create input forms with simple point-and-link with no coding knowledge. The declarative framework allows rapid development and deployment of applications. The declarative development also follows the MVC architecture in development. The MVC components in the declarative development using Force.com are mentioned in the following table: Model View Controller Objects Fields Relationships   Applications Tabs Page layouts Record types Workflow rules Validation rules Assignment rules   Summary In this article, we became familiar with the Force.com platform. We have seen the life cycle of an application build using Force.com. We saw the multi-tenant architecture and how it is different from the web hosting server. We have a fresh new developer account, and now in further article, we will be using it to build an application on Force.com. 
Resources for Article: Further resources on this subject: Custom Coding with Apex [article] Auto updating child records in Process Builder [article] Configuration in Salesforce CRM [article]

Building Our First Poky Image for the Raspberry Pi

Packt
14 Apr 2016
12 min read
In this article by Pierre-Jean Texier, the author of the book Yocto for Raspberry Pi, we cover the basic concepts of the Poky workflow. Using the Linux command line, we will go through the different steps to download, configure, and prepare the Poky Raspberry Pi environment, and generate an image that can be used by the target.

(For more resources related to this topic, see here.)

Installing the required packages for the host system

The steps necessary for the configuration of the host system depend on the Linux distribution used. It is advisable to use one of the Linux distributions maintained and supported by Poky, to avoid wasting time and energy in setting up the host system. Currently, the Yocto project is supported on the following distributions:

Ubuntu 12.04 (LTS)
Ubuntu 13.10
Ubuntu 14.04 (LTS)
Fedora release 19 (Schrödinger's Cat)
Fedora release 21
CentOS release 6.4
CentOS release 7.0
Debian GNU/Linux 7.0 (Wheezy)
Debian GNU/Linux 7.1 (Wheezy)
Debian GNU/Linux 7.2 (Wheezy)
Debian GNU/Linux 7.3 (Wheezy)
Debian GNU/Linux 7.4 (Wheezy)
Debian GNU/Linux 7.5 (Wheezy)
Debian GNU/Linux 7.6 (Wheezy)
openSUSE 12.2
openSUSE 12.3
openSUSE 13.1

Even if your distribution is not listed here, it does not mean that Poky will not work, but the outcome cannot be guaranteed. If you want more information about Linux distributions, you can visit this link: http://www.yoctoproject.org/docs/current/ref-manual/ref-manual.html

Poky on Ubuntu

The following list shows the required packages by function, given a supported Ubuntu or Debian Linux distribution. The dependencies for a compatible environment include:

Download tools: wget and git-core
Decompression tools: unzip and tar
Compilation tools: gcc-multilib, build-essential, and chrpath
String-manipulation tools: sed and gawk
Document-management tools: texinfo, xsltproc, docbook-utils, fop, dblatex, and xmlto
Patch-management tools: patch and diffstat

Here is the command to type on a headless system:

$ sudo apt-get install gawk wget git-core diffstat unzip texinfo gcc-multilib build-essential chrpath

Poky on Fedora

If you want to use Fedora, you just have to type this command:

$ sudo yum install gawk make wget tar bzip2 gzip python unzip perl patch diffutils diffstat git cpp gcc gcc-c++ glibc-devel texinfo chrpath ccache perl-Data-Dumper perl-Text-ParseWords perl-Thread-Queue socat

Downloading the Poky metadata

After installing all the necessary packages, it is time to download the Poky sources. This is done through the git tool, as follows:

$ git clone git://git.yoctoproject.org/poky (branch master)

Another method is to download a tar.bz2 file directly from this repository: https://www.yoctoproject.org/downloads

To avoid hazardous and problematic manipulations, it is strongly recommended to create and switch to a specific local branch. Use these commands:

$ cd poky
$ git checkout daisy -b work_branch

Downloading the Raspberry Pi BSP metadata

At this stage, we only have the base of the reference system (Poky), and we have no support for the Broadcom BCM SoC.
Basically, the BSP proposed by Poky only offers the following targets:

$ ls meta/conf/machine/*.conf
beaglebone.conf edgerouter.conf genericx86-64.conf genericx86.conf mpc8315e-rdb.conf

This is in addition to those provided by OE-Core:

$ ls meta/conf/machine/*.conf
qemuarm64.conf qemuarm.conf qemumips64.conf qemumips.conf qemuppc.conf qemux86-64.conf qemux86.conf

In order to generate a compatible system for our target, download the specific layer (the BSP layer) for the Raspberry Pi:

$ git clone git://git.yoctoproject.org/meta-raspberrypi

If you want to learn more about git scm, you can visit the official website: http://git-scm.com/

Now we can verify whether we have the configuration metadata for our platform (the raspberrypi.conf file):

$ ls meta-raspberrypi/conf/machine/*.conf
raspberrypi.conf

This screenshot shows the meta-raspberrypi folder:

The examples and code presented in this article use Yocto Project version 1.7 and Poky version 12.0. For reference, the codename is Dizzy.

Now that we have our environment freshly downloaded, we can proceed with its initialization and the configuration of our image through various configuration files.

The oe-init-build-env script

As can be seen in the screenshot, the Poky directory contains a script named oe-init-build-env. This is a script for the configuration/initialization of the build environment. It is not intended to be executed but must be "sourced". Its job, among other things, is to initialize a certain number of environment variables and place you in the build directory given as its argument. The script must be run as shown here:

$ source oe-init-build-env [build-directory]

Here, build-directory is an optional parameter for the name of the directory where the environment is set up (for example, we can use several build directories in a single Poky source tree); if it is not given, it defaults to build. The build-directory folder is the place where we perform the builds. In order to standardize the steps, we will use the following command throughout to initialize our environment:

$ source oe-init-build-env rpi-build
### Shell environment set up for builds. ###
You can now run 'bitbake <target>'
Common targets are:
    core-image-minimal
    core-image-sato
    meta-toolchain
    adt-installer
    meta-ide-support
You can also run generated qemu images with a command like 'runqemu qemux86'

When we initialize a build environment, it creates a directory (the conf directory) inside rpi-build. This folder contains two important files:

local.conf: It contains parameters to configure BitBake behavior.
bblayers.conf: It lists the different layers that BitBake takes into account. This list is assigned to the BBLAYERS variable.

Editing the local.conf file

The local.conf file under rpi-build/conf/ can configure every aspect of the build process. It is through this file that we can choose the target machine (the MACHINE variable), the distribution (the DISTRO variable), the type of package (the PACKAGE_CLASSES variable), and the host configuration (PARALLEL_MAKE, for example). The minimal set of variables we have to change from the default is the following:

BB_NUMBER_THREADS ?= "${@oe.utils.cpu_count()}"
PARALLEL_MAKE ?= "-j ${@oe.utils.cpu_count()}"
MACHINE ?= "raspberrypi"

The BB_NUMBER_THREADS variable determines the number of tasks that BitBake will perform in parallel (tasks under Yocto; we're not necessarily talking about compilation).
By default, in build/conf/local.conf, this variable is initialized with ${@oe.utils.cpu_count()}, corresponding to the number of cores detected on the host system (/proc/cpuinfo).

The PARALLEL_MAKE variable corresponds to the -j option of make, which specifies the number of processes that GNU Make can run in parallel on a compilation task. Again, it is the number of cores present that defines the default value used.

The MACHINE variable is where we set the target machine we wish to build for (defined in a .conf file; in our case, it is raspberrypi.conf).

Editing the bblayers.conf file

Now, we still have to add the specific layer for our target. This will have the effect of making the recipes from this layer available to our build. Therefore, we should edit the build/conf/bblayers.conf file:

# LAYER_CONF_VERSION is increased each time build/conf/bblayers.conf
# changes incompatibly
LCONF_VERSION = "6"
BBPATH = "${TOPDIR}"
BBFILES ?= ""
BBLAYERS ?= " \
  /home/packt/RASPBERRYPI/poky/meta \
  /home/packt/RASPBERRYPI/poky/meta-yocto \
  /home/packt/RASPBERRYPI/poky/meta-yocto-bsp \
  "
BBLAYERS_NON_REMOVABLE ?= " \
  /home/packt/RASPBERRYPI/poky/meta \
  /home/packt/RASPBERRYPI/poky/meta-yocto \
  "

Add the meta-raspberrypi layer, so that the file becomes:

# LAYER_CONF_VERSION is increased each time build/conf/bblayers.conf
# changes incompatibly
LCONF_VERSION = "6"
BBPATH = "${TOPDIR}"
BBFILES ?= ""
BBLAYERS ?= " \
  /home/packt/RASPBERRYPI/poky/meta \
  /home/packt/RASPBERRYPI/poky/meta-yocto \
  /home/packt/RASPBERRYPI/poky/meta-yocto-bsp \
  /home/packt/RASPBERRYPI/poky/meta-raspberrypi \
  "
BBLAYERS_NON_REMOVABLE ?= " \
  /home/packt/RASPBERRYPI/poky/meta \
  /home/packt/RASPBERRYPI/poky/meta-yocto \
  "

Naturally, you have to adapt the absolute path (/home/packt/RASPBERRYPI here) depending on your own installation.

Building the Poky image

At this stage, we have to look at the available images and check whether they are compatible with our platform (.bb files).

Choosing the image

Poky provides several predesigned image recipes that we can use to build our own binary image. We can check the list of available images by running the following command from the poky directory:

$ ls meta*/recipes*/images/*.bb

All the recipes provide images which are, in essence, sets of unpacked and configured packages, generating a filesystem that we can use on actual hardware (for further information about the different images, you can visit http://www.yoctoproject.org/docs/latest/mega-manual/mega-manual.html#ref-images). Here is a small representation of the available images:

To these we can add the images proposed by meta-raspberrypi:

$ ls meta-raspberrypi/recipes-core/images/*.bb
rpi-basic-image.bb rpi-hwup-image.bb rpi-test-image.bb

Here is an explanation of the images:

rpi-hwup-image.bb: This is an image based on core-image-minimal.
rpi-basic-image.bb: This is an image based on rpi-hwup-image.bb, with some added features (a splash screen).
rpi-test-image.bb: This is an image based on rpi-basic-image.bb, which includes some packages present in meta-raspberrypi.

We will take one of these three recipes for the rest of this article. Note that these files (.bb) describe recipes, like all the others. These are organized logically, and here, we have the ones for creating an image for the Raspberry Pi.
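These image recipes can also serve as a starting point for your own. As a minimal sketch, assuming a custom layer that is listed in bblayers.conf (the recipe name and the extra package names below are only illustrative, not part of meta-raspberrypi):

# my-rpi-image.bb: example image recipe built on top of rpi-basic-image
# Reuse the whole definition of the basic image
require recipes-core/images/rpi-basic-image.bb

# Add a few extra packages on top of it (package names are examples)
IMAGE_INSTALL += "nano openssh"

Once the layer containing it is registered, bitbake my-rpi-image would build it in exactly the same way as the stock images.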
Running BitBake

At this point, what remains for us is to start the build engine BitBake, which will parse all the recipes that make up the image passed as a parameter (as an initial example, we can take rpi-basic-image):

$ bitbake rpi-basic-image
Loading cache: 100% |############################################################| ETA:  00:00:00
Loaded 1352 entries from dependency cache.
NOTE: Resolving any missing task queue dependencies

Build Configuration:
BB_VERSION        = "1.25.0"
BUILD_SYS         = "x86_64-linux"
NATIVELSBSTRING   = "Ubuntu-14.04"
TARGET_SYS        = "arm-poky-linux-gnueabi"
MACHINE           = "raspberrypi"
DISTRO            = "poky"
DISTRO_VERSION    = "1.7"
TUNE_FEATURES     = "arm armv6 vfp"
TARGET_FPU        = "vfp"
meta
meta-yocto
meta-yocto-bsp    = "master:08d3f44d784e06f461b7d83ae9262566f1cf09e4"
meta-raspberrypi  = "master:6c6f44136f7e1c97bc45be118a48bd9b1fef1072"

NOTE: Preparing RunQueue
NOTE: Executing SetScene Tasks
NOTE: Executing RunQueue Tasks

Once launched, BitBake begins by browsing all the (.bb and .bbclass) files that the environment provides access to and stores the information in a cache. Because the parser of BitBake is parallelized, the first execution will always be a little longer, because it has to build the cache (only about a few seconds longer). However, subsequent executions will be almost instantaneous, because BitBake will load the cache. As we can see from the previous command, before executing the task list, BitBake displays a trace that details the versions used (target, version, OS, and so on). Finally, BitBake starts the execution of the tasks and shows us the progress. Depending on your setup, you can go drink some coffee or even eat some pizza. Usually after this, if all goes well, you will be pleased to see that the tmp/ subdirectory of the build directory (rpi-build) has been populated. The build directory (rpi-build) contains about 20 GB after the creation of the image.

After a few hours of baking, we can rejoice with the result and the creation of the system image for our target:

$ ls rpi-build/tmp/deploy/images/raspberrypi/*sdimg
rpi-basic-image-raspberrypi.rpi-sdimg

This is the file that we will use to create our bootable SD card.

Creating a bootable SD card

Now that our environment is complete, you can create a bootable SD card with the following command (remember to change /dev/sdX to the proper device name and be careful not to kill your hard disk by selecting the wrong device name):

$ sudo dd if=rpi-basic-image-raspberrypi.rpi-sdimg of=/dev/sdX bs=1M

Once the copying is complete, you can check whether the operation was successful using the following command (look at mmcblk0):

$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
mmcblk0     179:0    0   3,7G  0 disk
├─mmcblk0p1 179:1    0    20M  0 part /media/packt/raspberrypi
└─mmcblk0p2 179:2    0   108M  0 part /media/packt/f075d6df-d8b8-4e85-a2e4-36f3d4035c3c

You can also look at the left-hand side of your interface:
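Since dd silently overwrites whatever device it is given, it can be worth wrapping the command above with a couple of safety steps; a small sketch, where /dev/sdX is again only a placeholder:

# Double-check which device really is the SD card before writing
lsblk

# Make sure none of the card's partitions are mounted
sudo umount /dev/sdX1 /dev/sdX2

# Write the image, then flush the buffers before removing the card
sudo dd if=rpi-basic-image-raspberrypi.rpi-sdimg of=/dev/sdX bs=1M
sync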
Booting the image on the Raspberry Pi

This is surely the most anticipated moment of this article: the moment where we boot our Raspberry Pi with a fresh Poky image.

You just have to insert your SD card in the slot, connect the HDMI cable to your monitor, and connect the power supply (it is also recommended to use a mouse and a keyboard to shut down the device, unless you plan on just pulling the plug and possibly corrupting the boot partition). After connecting the power supply, you should see the Raspberry Pi splash screen:

The login for the Yocto/Poky distribution is root.

Summary

In this article, we learned the steps needed to set up Poky and get our first image built. We ran that image on the Raspberry Pi, which gave us a good overview of the available capabilities.

Resources for Article:

Further resources on this subject: Programming on Raspbian [article] Working with a Webcam and Pi Camera [article] Creating a Supercomputer [article]

Step Detector and Step Counter Sensors

Packt
14 Apr 2016
13 min read
In this article by Varun Nagpal, author of the book Android Sensor Programming By Example, we will focus on learning about the use of the step detector and step counter sensors. These sensors are very similar to each other and are used to count steps. Both sensors are based on a common hardware sensor, which internally uses the accelerometer, but Android still treats them as logically separate sensors. Both of these sensors are highly battery optimized and consume very little power. Now, let's look at each individual sensor in detail.

(For more resources related to this topic, see here.)

The step counter sensor

The step counter sensor is used to get the total number of steps taken by the user since the last reboot (power on) of the phone. When the phone is restarted, the value of the step counter sensor is reset to zero. In the onSensorChanged() method, the number of steps is given by event.values[0]; although it's a float value, the fractional part is always zero. The event timestamp represents the time at which the last step was taken. This sensor is especially useful for those applications that don't want to run in the background and maintain the history of steps themselves. This sensor works in batches and in continuous mode. If we specify 0 or no latency in the SensorManager.registerListener() method, then it works in continuous mode; otherwise, if we specify any latency, then it groups the events in batches and reports them at the specified latency. For prolonged usage of this sensor, it's recommended to use the batch mode, as it saves power. The step counter uses the on-change reporting mode, which means it reports the event as soon as there is a change in the value.

The step detector sensor

The step detector sensor triggers an event each time a step is taken by the user. The value reported in the onSensorChanged() method is always one, the fractional part being always zero, and the event timestamp is the time when the user's foot hit the ground. The step detector sensor has very low latency in reporting the steps, which is generally within 1 to 2 seconds. The step detector sensor has lower accuracy and produces more false positives compared to the step counter sensor. The step counter sensor is more accurate, but has more latency in reporting the steps, as it uses this extra time after each step to remove any false positive values. The step detector sensor is recommended for those applications that want to track steps in real time and want to maintain their own history of each and every step with its timestamp.

Time for action – using the step counter sensor in activity

Now, you will learn how to use the step counter sensor with a simple example. The good thing about the step counter is that, unlike other sensors, your app doesn't need to tell the sensor when to start counting the steps and when to stop counting them. It automatically starts counting as soon as the phone is powered on.
For using it, we just have to register the listener with the sensor manager and then unregister it after using it. In the following example, we will show the total number of steps taken by the user since the last reboot (power on) of the phone in the Android activity. We created a PedometerActivity and implemented it with the SensorEventListener interface, so that it can receive the sensor events. We initiated the SensorManager and Sensor object of the step counter and also checked the sensor availability in the OnCreate() method of the activity. We registered the listener in the onResume() method and unregistered it in the onPause() method as a standard practice. We used a TextView to display the total number of steps taken and update its latest value in the onSensorChanged() method. public class PedometerActivity extends Activity implements SensorEventListener{ private SensorManager mSensorManager; private Sensor mSensor; private boolean isSensorPresent = false; private TextView mStepsSinceReboot; @Override protected void onCreate(Bundle savedInstanceState) { super.onCreate(savedInstanceState); setContentView(R.layout.activity_pedometer); mStepsSinceReboot = (TextView)findViewById(R.id.stepssincereboot); mSensorManager = (SensorManager) this.getSystemService(Context.SENSOR_SERVICE); if(mSensorManager.getDefaultSensor(Sensor.TYPE_STEP_COUNTER) != null) { mSensor = mSensorManager.getDefaultSensor(Sensor.TYPE_STEP_COUNTER); isSensorPresent = true; } else { isSensorPresent = false; } } @Override protected void onResume() { super.onResume(); if(isSensorPresent) { mSensorManager.registerListener(this, mSensor, SensorManager.SENSOR_DELAY_NORMAL); } } @Override protected void onPause() { super.onPause(); if(isSensorPresent) { mSensorManager.unregisterListener(this); } } @Override public void onSensorChanged(SensorEvent event) { mStepsSinceReboot.setText(String.valueOf(event.values[0])); } Time for action – maintaining step history with step detector sensor The Step counter sensor works well when we have to deal with the total number of steps taken by the user since the last reboot (power on) of the phone. It doesn't solve the purpose when we have to maintain history of each and every step taken by the user. The Step counter sensor may combine some steps and process them together, and it will only update with an aggregated count instead of reporting individual step detail. For such cases, the step detector sensor is the right choice. In our next example, we will use the step detector sensor to store the details of each step taken by the user, and we will show the total number of steps for each day, since the application was installed. Our next example will consist of three major components of Android, namely service, SQLite database, and activity. Android service will be used to listen to all the individual step details using the step counter sensor when the app is in the background. All the individual step details will be stored in the SQLite database and finally the activity will be used to display the list of total number of steps along with dates. Let's look at the each component in detail. The first component of our example is PedometerListActivity. We created a ListView in the activity to display the step count along with dates. Inside the onCreate() method of PedometerListActivity, we initiated the ListView and ListAdaptor required to populate the list. 
Another important task that we do in the onCreate() method is starting the service (StepsService.class), which will listen to all the individual steps' events. We also make a call to the getDataForList() method, which is responsible for fetching the data for ListView. public class PedometerListActivity extends Activity{ private ListView mSensorListView; private ListAdapter mListAdapter; @Override protected void onCreate(Bundle savedInstanceState) { super.onCreate(savedInstanceState); setContentView(R.layout.activity_main); mSensorListView = (ListView)findViewById(R.id.steps_list); getDataForList(); mListAdapter = new ListAdapter(); mSensorListView.setAdapter(mListAdapter); Intent mStepsIntent = new Intent(getApplicationContext(), StepsService.class); startService(mStepsIntent); } In our example, the DateStepsModel class is used as a POJO (Plain Old Java Object) class, which is a handy way of grouping logical data together, to store the total number of steps and date. We also use the StepsDBHelper class to read and write the steps data in the database (discussed further in the next section). Inside the getDataForList() method, we initiated the object of the StepsDBHelper class and call the readStepsEntries() method of the StepsDBHelper class, which returns ArrayList of the DateStepsModel objects containing the total number of steps along with dates after reading from database. The ListAdapter class is used for populating the values for ListView, which internally uses ArrayList of DateStepsModel as the data source. The individual list item is the string, which is the concatenation of date and the total number of steps. class DateStepsModel { public String mDate; public int mStepCount; } private StepsDBHelper mStepsDBHelper; private ArrayList<DateStepsModel> mStepCountList; public void getDataForList() { mStepsDBHelper = new StepsDBHelper(this); mStepCountList = mStepsDBHelper.readStepsEntries(); } private class ListAdapter extends BaseAdapter{ private TextView mDateStepCountText; @Override public int getCount() { return mStepCountList.size(); } @Override public Object getItem(int position) { return mStepCountList.get(position); } @Override public long getItemId(int position) { return position; } @Override public View getView(int position, View convertView, ViewGroup parent) { if(convertView==null){ convertView = getLayoutInflater().inflate(R.layout.list_rows, parent, false); } mDateStepCountText = (TextView)convertView.findViewById(R.id.sensor_name); mDateStepCountText.setText(mStepCountList.get(position).mDate + " - Total Steps: " + String.valueOf(mStepCountList.get(position).mStepCount)); return convertView; } } The second component of our example is StepsService, which runs in the background and listens to the step detector sensor until the app is uninstalled. We implemented this service with the SensorEventListener interface so that it can receive the sensor events. We also initiated theobjects of StepsDBHelper, SensorManager, and the step detector sensor inside the OnCreate() method of the service. We only register the listener when the step detector sensor is available on the device. A point to note here is that we never unregistered the listener because we expect our app to log the step information indefinitely until the app is uninstalled. Both step detector and step counter sensors are very low on battery consumptions and are highly optimized at the hardware level, so if the app really requires, it can use them for longer durations without affecting the battery consumption much. 
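If an app wants to keep the power cost even lower over long sessions, the listener can also be registered in batch mode by passing a maximum report latency, as mentioned earlier for the step counter. A small sketch, assuming API level 19 or higher and using the field names of the service shown next; the 10-second latency is only an illustrative value:

// Batch registration: the hardware may buffer events for up to 10 seconds
mSensorManager.registerListener(this, mStepDetectorSensor,
        SensorManager.SENSOR_DELAY_NORMAL,   // sampling period hint
        10 * 1000 * 1000);                   // maxReportLatencyUs, in microseconds

With a non-zero latency, step events can be delivered together in a burst, which wakes the application processor less often.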
We get a step detector sensor callback in the onSensorChanged() method whenever the operating system detects a step, and when it does, we call the createStepsEntry() method of the StepsDBHelper class to store the step information in the database.

public class StepsService extends Service implements SensorEventListener {

  private SensorManager mSensorManager;
  private Sensor mStepDetectorSensor;
  private StepsDBHelper mStepsDBHelper;

  @Override
  public void onCreate() {
    super.onCreate();
    mSensorManager = (SensorManager) this.getSystemService(Context.SENSOR_SERVICE);
    if (mSensorManager.getDefaultSensor(Sensor.TYPE_STEP_DETECTOR) != null) {
      mStepDetectorSensor = mSensorManager.getDefaultSensor(Sensor.TYPE_STEP_DETECTOR);
      mSensorManager.registerListener(this, mStepDetectorSensor, SensorManager.SENSOR_DELAY_NORMAL);
      mStepsDBHelper = new StepsDBHelper(this);
    }
  }

  @Override
  public int onStartCommand(Intent intent, int flags, int startId) {
    return Service.START_STICKY;
  }

  @Override
  public void onSensorChanged(SensorEvent event) {
    mStepsDBHelper.createStepsEntry();
  }

  @Override
  public void onAccuracyChanged(Sensor sensor, int accuracy) {
    // Not used, but required by the SensorEventListener interface
  }

  @Override
  public IBinder onBind(Intent intent) {
    // This service is started, not bound
    return null;
  }
}

The last component of our example is the SQLite database. We created a StepsDBHelper class and extended it from the SQLiteOpenHelper abstract utility class provided by the Android framework to easily manage database operations. In the class, we created a database called StepsDatabase, which is automatically created on the first object creation of the StepsDBHelper class by the onCreate() method. This database has one table, StepsSummary, which consists of only three columns (id, stepscount, and creationdate). The first column, id, is the unique integer identifier for each row of the table and is incremented automatically on creation of every new row. The second column, stepscount, is used to store the total number of steps taken for each date. The third column, creationdate, is used to store the date in the mm/dd/yyyy string format. Inside the createStepsEntry() method, we first check whether there is an existing step count for the current date, and if we find one, then we read the existing step count of the current date and update the step count by incrementing it by 1. If there is no step count for the current date, then we assume that it is the first step of the current date and we create a new entry in the table with the current date and a step count value of 1. The createStepsEntry() method is called from onSensorChanged() of the StepsService class whenever a new step is detected by the step detector sensor.
public class StepsDBHelper extends SQLiteOpenHelper { private static final int DATABASE_VERSION = 1; private static final String DATABASE_NAME = "StepsDatabase"; private static final String TABLE_STEPS_SUMMARY = "StepsSummary"; private static final String ID = "id"; private static final String STEPS_COUNT = "stepscount"; private static final String CREATION_DATE = "creationdate";//Date format is mm/dd/yyyy private static final String CREATE_TABLE_STEPS_SUMMARY = "CREATE TABLE " + TABLE_STEPS_SUMMARY + "(" + ID + " INTEGER PRIMARY KEY AUTOINCREMENT," + CREATION_DATE + " TEXT,"+ STEPS_COUNT + " INTEGER"+")"; StepsDBHelper(Context context) { super(context, DATABASE_NAME, null, DATABASE_VERSION); } @Override public void onCreate(SQLiteDatabase db) { db.execSQL(CREATE_TABLE_STEPS_SUMMARY); } public boolean createStepsEntry() { boolean isDateAlreadyPresent = false; boolean createSuccessful = false; int currentDateStepCounts = 0; Calendar mCalendar = Calendar.getInstance(); String todayDate = String.valueOf(mCalendar.get(Calendar.MONTH))+"/" + String.valueOf(mCalendar.get(Calendar.DAY_OF_MONTH))+"/"+String.valueOf(mCalendar.get(Calendar.YEAR)); String selectQuery = "SELECT " + STEPS_COUNT + " FROM " + TABLE_STEPS_SUMMARY + " WHERE " + CREATION_DATE +" = '"+ todayDate+"'"; try { SQLiteDatabase db = this.getReadableDatabase(); Cursor c = db.rawQuery(selectQuery, null); if (c.moveToFirst()) { do { isDateAlreadyPresent = true; currentDateStepCounts = c.getInt((c.getColumnIndex(STEPS_COUNT))); } while (c.moveToNext()); } db.close(); } catch (Exception e) { e.printStackTrace(); } try { SQLiteDatabase db = this.getWritableDatabase(); ContentValues values = new ContentValues(); values.put(CREATION_DATE, todayDate); if(isDateAlreadyPresent) { values.put(STEPS_COUNT, ++currentDateStepCounts); int row = db.update(TABLE_STEPS_SUMMARY, values, CREATION_DATE +" = '"+ todayDate+"'", null); if(row == 1) { createSuccessful = true; } db.close(); } else { values.put(STEPS_COUNT, 1); long row = db.insert(TABLE_STEPS_SUMMARY, null, values); if(row!=-1) { createSuccessful = true; } db.close(); } } catch (Exception e) { e.printStackTrace(); } return createSuccessful; } The readStepsEntries() method is called from PedometerListActivity to display the total number of steps along with the date in the ListView. The readStepsEntries() method reads all the step counts along with their dates from the table and fills the ArrayList of DateStepsModelwhich is used as a data source for populating the ListView in PedometerListActivity. public ArrayList<DateStepsModel> readStepsEntries() { ArrayList<DateStepsModel> mStepCountList = new ArrayList<DateStepsModel>(); String selectQuery = "SELECT * FROM " + TABLE_STEPS_SUMMARY; try { SQLiteDatabase db = this.getReadableDatabase(); Cursor c = db.rawQuery(selectQuery, null); if (c.moveToFirst()) { do { DateStepsModel mDateStepsModel = new DateStepsModel(); mDateStepsModel.mDate = c.getString((c.getColumnIndex(CREATION_DATE))); mDateStepsModel.mStepCount = c.getInt((c.getColumnIndex(STEPS_COUNT))); mStepCountList.add(mDateStepsModel); } while (c.moveToNext()); } db.close(); } catch (Exception e) { e.printStackTrace(); } return mStepCountList; } What just happened? We created a small pedometer utility app that maintains the step history along with dates using the steps detector sensor. We used PedometerListActivityto display the list of the total number of steps along with their dates. StepsServiceis used to listen to all the steps detected by the step detector sensor in the background. 
And finally, the StepsDBHelperclass is used to create and update the total step count for each date and to read the total step counts along with dates from the database. Resources for Article: Further resources on this subject: Introducing the Android UI [article] Building your first Android Wear Application [article] Mobile Phone Forensics – A First Step into Android Forensics [article]

Remote Authentication

Packt
14 Apr 2016
9 min read
When setting up a Linux system, security is supposed to be an important part of all the stages. A good knowledge of the fundamentals of Linux is essential to implement a good security policy on the machine. In this article by Tajinder Pal Singh Kalsi, author of the book, Practical Linux Security Cookbook, we will discuss the following topics: Remote server / Host access using SSH SSH root login disable or enable Key based Login into SSH for restricting remote access (For more resources related to this topic, see here.) Remote server / host access using SSH SSH or Secure Shell is a protocol which is used to log onto remote systems securely and is the most used method for accessing remote Linux systems. Getting ready To see how to use SSH, we need two Ubuntu systems. One will be used as server and the other as client. How to do it… To use SSH we can use freely available software called—OpenSSH. Once the software is installed it can be used by the command ssh, on the Linux system. We will see how to use this tool in detail. If the software to use SSH is not already installed we have to install it on both the server and the client system. The command to install the tool on the server system is: sudo apt-get install openssh-server The output obtained will be as follows: Next we need to install the client version of the software: sudo apt-get install openssh-client The output obtained will be as follows: For latest versions ssh service starts running as soon as the software is installed. If it is not running by default, we can start the service by using the command: sudo service ssh start The output obtained will be as follows: Now if we want to login from the client system to the server system, the command will be as follows: ssh remote_ip_address Here remote_ip_address refers to the IP address of the server system. Also this command assumes that the username on the client machine is the same as that on the server machine: ssh remote_ip_address If we want to login for different user, the command will be as follows: ssh username@remote_ip_address The output obtained will be as follows: Next we need to configure SSH to use it as per our requirements. The main configuration file for sshd in Ubuntu is located at /etc/ssh/sshd_config. Before making any changes to the original version of this file, create a backup using the command: sudo cp /etc/ssh/sshd_config{,.bak} The configuration file defines the default settings for SSH on the server system. When we open the file in any editor, we can see that the default port declaration on which the sshd server listens for the incoming connections is 22. We can change this to any non-standard port to secure the server from random port scans, hence making it more secure. Suppose we change the port to 888, then next time the client wants to connect to the SSH server, the command will be as follows: ssh -p port_numberremote_ip_address The output obtained will be as follows: As we can see when we run the command without specifying the port number, the connection is refused. Next when we mention the correct port number, the connection is established. How it works… SSH is used to connect a client program to a SSH server. On one system we install the openssh-server package to make it the SSH server and on the other system we install the openssh-client package to use it as client. Now keeping the SSH service running on the server system, we try to connect to it through the client. 
We use the configuration file of SSH to change settings such as the default port for connecting.

SSH root login disable or enable

Linux systems have a root account, which is enabled by default. Letting unauthorized users gain SSH root access is a bad idea, because it gives an attacker access to the complete system. We can disable or enable root login for SSH as per our requirements, to reduce the chances of an attacker getting access to the system.

Getting ready

We need two Linux systems, to be used as server and client. On the server system, install the openssh-server package, as shown in the preceding recipe.

How to do it…

First we will see how to disable SSH root login, and then we will see how to enable it again.

First, open the main configuration file of SSH, /etc/ssh/sshd_config, in any editor:

sudo nano /etc/ssh/sshd_config

Now look for the line that reads as follows:

PermitRootLogin yes

Change the value yes to no. Then save and close the file:

PermitRootLogin no

The output obtained will be as follows:

Once done, restart the SSH daemon service using the command as shown here:

Now let's try to log in as root. We should get a Permission denied error, as root login has been disabled:

Now, whenever we want to log in as root, first we have to log in as a normal user. After that, we can use the su command and switch to the root user. So, the user accounts that are not listed in the /etc/sudoers file will not be able to switch to the root user, and the system will be more secure:

Now, if we want to enable SSH root login again, we just need to edit the /etc/ssh/sshd_config file again and change the option from no back to yes:

PermitRootLogin yes

The output obtained will be as follows:

Then restart the service again by using the command:

Now if we try to log in as root again, it will work:

How it works…

When we try to connect to a remote system using SSH, the remote system checks its configuration file at /etc/ssh/sshd_config, and according to the details mentioned in this file, it decides whether the connection should be allowed or refused. When we change the value of PermitRootLogin, the behavior changes accordingly.

There's more…

Suppose we have many user accounts on the system; then we may need to edit the /etc/ssh/sshd_config file in such a way that remote access is allowed only for a few mentioned users:

sudo nano /etc/ssh/sshd_config

Add the line:

AllowUsers tajinder user1

Now restart the ssh service:

sudo service ssh restart

Now, when we try to log in with user1, the login is successful. However, when we try to log in with user2, which is not added in the /etc/ssh/sshd_config file, the login fails and we get the error Permission denied, as shown here:
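Before moving on to key-based authentication, it can help to see the directives from the previous recipes gathered in one place. A hedged excerpt of /etc/ssh/sshd_config, where the port number and usernames are only examples:

# Listen on a non-standard port instead of 22, as discussed earlier
Port 888

# Refuse direct root logins over SSH
PermitRootLogin no

# Allow remote logins only for these accounts
AllowUsers tajinder user1

As before, the changes take effect only after the SSH service is restarted with sudo service ssh restart.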
Key-based login into SSH for restricting remote access

Even though SSH login is protected by the password of the user account, we can make it more secure by using key-based authentication for SSH.

Getting ready

To see how key-based authentication works, we need two Linux systems (in our example, both are Ubuntu systems). One should have the OpenSSH server package installed on it.

How to do it...

To use key-based authentication, we need to create a pair of keys: a private key and a public key.

On the client or local system, execute the following command to generate the SSH key pair:

ssh-keygen -t rsa

The output obtained will be as follows:

While creating the key, we can accept the default values or change them as we wish. It will also ask for a passphrase, for which you can choose anything, or else leave it blank.

The key pair will be created in the ~/.ssh/ directory. Change to this directory and then use the command ls -l to see the details of the key files:

We can see that the id_rsa file can be read and written only by the owner. This permission ensures that the file is kept secure.

Now we need to copy the public key file to the remote SSH server. To do so, we run the command:

ssh-copy-id 192.168.1.101

The output obtained will be as follows:

An SSH session will be started, and it will prompt you for the user account's password. Once the correct password has been entered, the key will be copied to the remote server.

Once the public key has been successfully copied to the remote server, try to log in to the server again using the ssh 192.168.1.101 command:

We can see that now we are not prompted for the user account's password. Since we had configured a passphrase for the SSH key, it has been asked for. Otherwise, we would have been logged in to the system without being asked for a password.

How it works...

When we create the SSH key pair and move the public key to the remote system, it works as an authentication method for connecting to the remote system. If the public key present on the remote system matches the public key generated by the local system, and the local system has the private key to complete the key pair, the login happens. Otherwise, if any key file is missing, login is not allowed.

Summary

Linux security is a massive subject and everything cannot be covered in just one article. Still, Practical Linux Security Cookbook will give you a lot of recipes for securing your machine. It can be referred to as a practical guide for administrators and can help them configure a more secure machine.

Resources for Article:

Further resources on this subject: Wireless Attacks in Kali Linux [article] Creating a VM using VirtualBox - Ubuntu Linux [article] Building tiny Web-applications in Ruby using Sinatra [article]

Using the Registry and xlsxwriter modules

Packt
14 Apr 2016
12 min read
In this article by Chapin Bryce and Preston Miller, the authors of Learning Python for Forensics, we will learn about the features offered by the Registry and xlswriter modules. (For more resources related to this topic, see here.) Working with the Registry module The Registry module, developed by Willi Ballenthin, can be used to obtain keys and values from registry hives. Python provides a built-in registry module called _winreg; however, this module only works on Windows machines. The _winreg module interacts with the registry on the system running the module. It does not support opening external registry hives. The Registry module allows us to interact with the supplied registry hives and can be run on non-Windows machines. The Registry module can be downloaded from https://github.com/williballenthin/python-registry. Click on the releases section to see a list of all the stable versions and download the latest version. For this article, we use version 1.1.0. Once the archived file is downloaded and extracted, we can run the included setup.py file to install the module. In a command prompt, execute the following code in the module's top-level directory as shown: python setup.py install This should install the Registry module successfully on your machine. We can confirm this by opening the Python interactive prompt and typing import Registry. We will receive an error if the module is not installed successfully. With the Registry module installed, let's begin to learn how we can leverage this module for our needs. First, we need to import the Registry class from the Registry module. Then, we use the Registry function to open the registry object that we want to query. Next, we use the open() method to navigate to our key of interest. In this case, we are interested in the RecentDocs registry key. This key contains recent active files separated by extension as shown: >>> from Registry import Registry >>> reg = Registry.Registry('NTUSER.DAT') >>> recent_docs = reg.open('SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs') If we print therecent_docs variable, we can see that it contains 11 values with five subkeys, which may contain additional values and subkeys. Additionally, we can use thetimestamp() method to see the last written time of the registry key. >>> print recent_docs Registry Key CMI-CreateHive{B01E557D-7818-4BA7-9885-E6592398B44E}SoftwareMicrosoftWindowsCurrentVersionExplorerRecentDocs with 11 values and 5 subkeys >>> print recent_docs.timestamp() # Last Written Time 2012-04-23 09:34:12.099998 We can iterate over the values in the recent_docs key using the values() function in a for loop. For each value, we can access the name(), value(), raw_data(), value_type(), and value_type_str() methods. The value() and raw_data() represent the data in different ways. We will use the raw_data() function when we want to work with the underlying binary data and use the value() function to gather an interpreted result. The value_type() and value_type_str() functions display a number or string that identify the type of data, such as REG_BINARY, REG_DWORD, REG_SZ, and so on. >>> for i, value in enumerate(recent_docs.values()): ... print '{}) {}: {}'.format(i, value.name(), value.value()) ... 0) MRUListEx: ???? 1) 0: myDocument.docx 2) 4: oldArchive.zip 3) 2: Salaries.xlsx ... Another useful feature of the Registry module is the means provided for querying for a certain subkey or value. This is provided by the subkey(), value(), or find_key() functions. 
A RegistryKeyNotFoundException is generated when a subkey is not present while using the subkey() function: >>> if recent_docs.subkey('.docx'): ... print 'Found docx subkey.' ... Found docx subkey. >>> if recent_docs.subkey('.1234abcd'): ... print 'Found 1234abcd subkey.' ... Registry.Registry.RegistryKeyNotFoundException: ... The find_key() function takes a path and can find a subkey through multiple levels. The subkey() and value() functions only search child elements. We can use these functions to confirm that a key or value exists before trying to navigate to them. If a particular key or value cannot be found, a custom exception from the Registry module is raised. Be sure to add error handling to catch this error and also alert the user that the key was not discovered. With the Registry module, finding keys and their values becomes straightforward. However, when the values are not strings and are instead binary data we have to rely on another module to make sense of the mess. For all binary needs, the struct module is an excellent candidate. Read also: Tools for Working with Excel and Python Creating Spreadsheets with the xlsxwriter Module Xlsxwriter is a useful third-party module that writes Excel output. There are a plethora of Excel-supported modules for Python, but we chose this module because it was highly robust and well-documented. As the name suggests, this module can only be used to write Excel spreadsheets. The xlsxwriter module supports cell and conditional formatting, charts, tables, filters, and macros among others. Adding data to a spreadsheet Let's quickly create a script called simplexlsx.v1.py for this example. On lines 1 and 2 we import the xlsxwriter and datetime modules. The data we are going to be plotting, including the header column is stored as nested lists in the school_data variable. Each list is a row of information that we want to store in the output excel sheet, with the first element containing the column names. 001 import xlsxwriter 002 from datetime import datetime 003 004 school_data = [['Department', 'Students', 'Cumulative GPA', 'Final Date'], 005 ['Computer Science', 235, 3.44, datetime(2015, 07, 23, 18, 00, 00)], 006 ['Chemistry', 201, 3.26, datetime(2015, 07, 25, 9, 30, 00)], 007 ['Forensics', 99, 3.8, datetime(2015, 07, 23, 9, 30, 00)], 008 ['Astronomy', 115, 3.21, datetime(2015, 07, 19, 15, 30, 00)]] The writeXLSX() function, defined on line 11, is responsible for writing our data in to a spreadsheet. First, we must create our Excel spreadsheet using the Workbook() function supplying the desired name of the file. On line 13, we create a worksheet using the add_worksheet() function. This function can take the desired title of the worksheet or use the default name 'Sheet N', where N is the specific sheet number. 011 def writeXLSX(data): 012 workbook = xlsxwriter.Workbook('MyWorkbook.xlsx') 013 main_sheet = workbook.add_worksheet('MySheet') The date_format variable stores a custom number format that we will use to display our datetime objects in the desired format. On line 17, we begin to enumerate through our data to write. The conditional on line 18 is used to handle the header column which is the first list encountered. We use the write() function and supply a numerical row and column. Alternatively, we can also use the Excel notation, i.e. A1. 
015 date_format = workbook.add_format({'num_format': 'mm/dd/yy hh:mm:ss AM/PM'}) 016 017 for i, entry in enumerate(data): 018 if i == 0: 019 main_sheet.write(i, 0, entry[0]) 020 main_sheet.write(i, 1, entry[1]) 021 main_sheet.write(i, 2, entry[2]) 022 main_sheet.write(i, 3, entry[3]) The write() method will try to write the appropriate type for an object when it can detect the type. However, we can use different write methods to specify the correct format. These specialized writers preserve the data type in Excel so that we can use the appropriate data type specific Excel functions for the object. Since we know the data types within the entry list, we can manually specify when to use the general write() function or the specific write_number() function. 023 else: 024 main_sheet.write(i, 0, entry[0]) 025 main_sheet.write_number(i, 1, entry[1]) 026 main_sheet.write_number(i, 2, entry[2]) For the fourth entry in the list, thedatetime object, we supply the write_datetime() function with our date_format defined on line 15. After our data is written to the workbook, we use the close() function to close and save our data. On line 32, we call the writeXLSX() function passing it to the school_data list we built earlier. 027 main_sheet.write_datetime(i, 3, entry[3], date_format) 028 029 workbook.close() 030 031 032 writeXLSX(school_data) A table of write functions and the objects they preserve is presented below. Function Supported Objects write_string str write_number int, float, long write_datetime datetime objects write_boolean bool write_url str When the script is invoked at the Command Line, a spreadsheet called MyWorkbook.xlsx is created. When we convert this to a table, we can sort it according to any of our values. Had we failed to preserve the data types values such as our dates might be identified as non-number types and prevent us from sorting them appropriately. Building a table Being able to write data to an Excel file and preserve the object type is a step-up over CSV, but we can do better. Often, the first thing an examiner will do with an Excel spreadsheet is convert the data into a table and begin the frenzy of sorting and filtering. We can convert our data range to a table. In fact, writing a table with xlsxwriter is arguably easier than writing each row individually. The following code will be saved into the file simplexlsx.v2.py. For this iteration, we have removed the initial list in the school_data variable that contained the header information. Our new writeXLSX() function writes the header separately. 004 school_data = [['Computer Science', 235, 3.44, datetime(2015, 07, 23, 18, 00, 00)], 005 ['Chemistry', 201, 3.26, datetime(2015, 07, 25, 9, 30, 00)], 006 ['Forensics', 99, 3.8, datetime(2015, 07, 23, 9, 30, 00)], 007 ['Astronomy', 115, 3.21, datetime(2015, 07, 19, 15, 30, 00)]] Lines 10 through 14 are identical to the previous iteration of the function. Representing our table on the spreadsheet is accomplished on line 16. 010 def writeXLSX(data): 011 workbook = xlsxwriter.Workbook('MyWorkbook.xlsx') 012 main_sheet = workbook.add_worksheet('MySheet') 013 014 date_format = workbook.add_format({'num_format': 'mm/dd/yy hh:mm:ss AM/PM'}) The add_table() function takes multiple arguments. First, we pass a string representing the top-left and bottom-right cells of the table in Excel notation. We use the length variable, defined on line 15, to calculate the necessary length of our table. 
The second argument is a little more confusing; this is a dictionary with two keys, named data and columns. The data key has a value of our data variable, which is perhaps poorly named in this case. The columns key defines each row header and, optionally, its format, as seen on line 19: 015 length = str(len(data) + 1) 016 main_sheet.add_table(('A1:D' + length), {'data': data, 017 'columns': [{'header': 'Department'}, {'header': 'Students'}, 018 {'header': 'Cumulative GPA'}, 019 {'header': 'Final Date', 'format': date_format}]}) 020 workbook.close() In lesser lines than the previous example, we've managed to create a more useful output built as a table. Now our spreadsheet has our specified data already converted into a table and ready to be sorted. There are more possible keys and values that can be supplied during the construction of a table. Please consult the documentation at (http://xlsxwriter.readthedocs.org) for more details on advanced usage. This process is simple when we are working with nested lists representing each row of a worksheet. Data structures not in the specified format require a combination of both methods demonstrated in our previous iterations to achieve the same effect. For example, we can define a table to span across a certain number of rows and columns and then use the write() function for those cells. However, to prevent unnecessary headaches we recommend keeping data in nested lists. Creating charts with Python Lastly, let's create a chart with xlsxwriter. The module supports a variety of different chart types including: line, scatter, bar, column, pie, and area. We use charts to summarize the data in meaningful ways. This is particularly useful when working with large data sets, allowing examiners to gain a high level of understanding of the data before getting into the weeds. Let's modify the previous iteration yet again to display a chart. We will save this modified file as simplexlsx.v3.py. On line 21, we are going to create a variable called department_grades. This variable will be our chart object created by the add_chart()method. For this method, we pass in a dictionary specifying keys and values[SS4] . In this case, we specify the type of the chart to be a column chart. 021 department_grades = workbook.add_chart({'type':'column'}) On line 22, we use theset_title() function and again pass it in a dictionary of parameters. We set the name key equal to our desired title. At this point, we need to tell the chart what data to plot. We do this with the add_series() function. Each category key maps to the Excel notation specifying the horizontal axis data. The vertical axis is represented by the values key. With the data to plot specified, we use theinsert_chart() function to plot the data in the spreadsheet. We give this function a string of the cell to plot the top-left of the chart and then the chart object itself. 022 department_grades.set_title({'name':'Department and Grade distribution'}) 023 department_grades.add_series({'categories':'=MySheet!$A$2:$A$5', 'values':'=MySheet!$C$2:$C$5'}) 024 main_sheet.insert_chart('A8', department_grades) 025 workbook.close() Running this version of the script will convert our data into a table and generate a column chart comparing departments by their grades. We can clearly see that, unsurprisingly, the Forensic Science department has the highest GPA earners in the school's program. This information is easy enough to eyeball for such a small data set. 
However, when working with data orders of larger magnitude, creating summarizing graphics can be particularly useful to understand the big picture. Be aware that there is a great deal of additional functionality in the xlsxwriter module that we will not use in our script. This is an extremely powerful module and we recommend it for any operation that requires writing Excel spreadsheets. Summary In this article, we began with introducing the Registry module and how it is used to obtain keys and values from registry hives. Next, we dealt with various aspects of spreadsheets, such as cells, tables, and charts using the xlswriter module. Resources for Article: Further resources on this subject: Test all the things with Python [article] An Introduction to Python Lists and Dictionaries [article] Python Data Science Up and Running [article]

Probabilistic Graphical Models in R

Packt
14 Apr 2016
18 min read
In this article, David Bellot, author of the book Learning Probabilistic Graphical Models in R, explains that among all the predictions that were made about the 21st century, we may not have expected that we would collect such a formidable amount of data about everything, every day, and everywhere in the world. The past years have seen an incredible explosion of data collection about our world and lives, and technology is the main driver of what we can certainly call a revolution. We live in the age of information. However, collecting data is nothing if we don't exploit it and try to extract knowledge out of it.

At the beginning of the 20th century, with the birth of statistics, the world was all about collecting data and making statistics. Back then, the only reliable tools were pencil and paper and, of course, the eyes and ears of the observers. Scientific observation was still in its infancy, despite the prodigious developments of the 19th century.

(For more resources related to this topic, see here.)

More than a hundred years later, we have computers, electronic sensors, and massive data storage, and we are able to store huge amounts of data continuously, not only about our physical world but also about our lives, mainly through the use of social networks, the Internet, and mobile phones. Moreover, the density of our storage technology has increased so much that we can, nowadays, store months if not years of data in a very small volume that can fit in the palm of our hand.

Among all the tools and theories that have been developed to analyze, understand, and manipulate data, probability and statistics became among the most used. In this field, we are interested in a special, versatile, and powerful class of models called probabilistic graphical models (PGMs, for short). A probabilistic graphical model is a tool to represent beliefs and uncertain knowledge about facts and events using probabilities. It is also one of the most advanced machine learning techniques nowadays and has many industrial success stories. PGMs can deal with our imperfect knowledge about the world, because our knowledge is always limited. We can't observe everything, and we can't represent the entire universe in a computer. We are intrinsically limited as human beings, and so are our computers. With probabilistic graphical models, we can build simple learning algorithms or complex expert systems. With new data, we can improve these models and refine them as much as we can, and we can also infer new information or make predictions about unseen situations and events.

Probabilistic graphical models, seen from the point of view of mathematics, are a way to represent a probability distribution over several variables, which is called a joint probability distribution. In a PGM, such knowledge between variables can be represented with a graph, that is, nodes connected by edges with a specific meaning associated with them. Let's consider an example from the medical world: how to diagnose a cold. This is only an example and by no means medical advice; it is oversimplified for the sake of clarity. We define several random variables, such as the following:

Se: This means the season of the year
N: This means that the nose is blocked
H: This means the patient has a headache
S: This means that the patient regularly sneezes
C: This means that the patient coughs
Cold: This means the patient has a cold

Because each of the symptoms can exist at different degrees, it is natural to represent them as random variables. For example, if the patient's nose is a bit blocked, we will assign a probability of, say, 60% to this variable, that is, P(N=blocked)=0.6 and P(N=not blocked)=0.4.

In this example, the probability distribution P(Se,N,H,S,C,Cold) will require 4 × 2^5 = 128 values in total (4 values for the season and 2 values for each of the other random variables). It's quite a lot, and honestly, it's quite difficult to determine things such as the probability that the nose is not blocked, the patient has a headache, the patient sneezes, and so on. However, we can say that a headache is not directly related to cough or a blocked nose, except when the patient has a cold. Indeed, the patient can have a headache for many other reasons. Moreover, we can say that the season has quite a direct effect on sneezing, a blocked nose, or cough, but less or no direct effect on headache. In a probabilistic graphical model, we will represent these dependency relationships with a graph, as follows, where each random variable is a node in the graph, and each relationship is an arrow between two nodes:
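The figure itself is not reproduced here, but, as a hedged illustration, one factorization consistent with the dependencies just described (which arrows the original figure actually draws is an assumption on our part) is:

P(Se,N,H,S,C,Cold) = P(Se) P(Cold) P(N|Se,Cold) P(S|Se,Cold) P(C|Se,Cold) P(H|Cold)

Counting the entries of each conditional table, such a factorization needs far fewer parameters than the 128 values of the full joint distribution, which is exactly the saving the graph is meant to express.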
For example, if the patient's nose is a bit blocked, we will assign a probability of, say, 60% to this variable, that is, P(N=blocked)=0.6 and P(N=not blocked)=0.4.

In this example, the probability distribution P(Se,N,H,S,C,Cold) will require 4 * 2^5 = 128 values in total (4 values for the season and 2 values for each of the other five random variables). It's quite a lot, and honestly, it's quite difficult to determine things such as the probability that the nose is not blocked, the patient has a headache, the patient sneezes, and so on. However, we can say that a headache is not directly related to a cough or a blocked nose, except when the patient has a cold. Indeed, the patient can have a headache for many other reasons. Moreover, we can say that the season has quite a direct effect on sneezing, a blocked nose, or a cough, but little or no direct effect on headaches. In a probabilistic graphical model, we will represent these dependency relationships with a graph, as follows, where each random variable is a node in the graph and each relationship is an arrow between two nodes:

In the graph that follows, there is a direct correspondence between each node and each variable of the probabilistic graphical model, and also a direct correspondence between the arrows and the way we can simplify the joint probability distribution in order to make it tractable.

Using a graph as a model to simplify a complex (and sometimes complicated) distribution presents numerous benefits:

As we observed in the previous example, and in general when we model a problem, the random variables interact directly with only a small subset of the other random variables. Therefore, this promotes more compact and tractable models.
The knowledge and dependencies represented in a graph are easy to understand and communicate.
The graph induces a compact representation of the joint probability distribution, and it is easy to make computations with.
Algorithms to draw inferences and learn can use graph theory and the associated algorithms to improve and facilitate all the inference and learning algorithms. Compared to the raw joint probability distribution, using a PGM will speed up computations by several orders of magnitude.

The junction tree algorithm

The junction tree algorithm is one of the main algorithms for doing inference on PGMs. Its name arises from the fact that, before doing the numerical computations, we transform the graph of the PGM into a tree with a set of properties that allow for efficient computation of posterior probabilities. One of the main aspects is that this algorithm will compute not only the posterior distribution of the variables in the query, but also the posterior distribution of all the other variables that are not observed. Therefore, for the same computational price, one can have any posterior distribution. Implementing a junction tree algorithm is a complex task but, fortunately, several R packages contain a full implementation, for example, gRain.

Let's say we have several variables A, B, C, D, E, and F. We will consider, for the sake of simplicity, that each variable is binary so that we won't have too many values to deal with.
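Before writing any code, a quick back-of-the-envelope check in R shows why such factorizations matter (this snippet is only an illustration and is not part of the book's code):

# the cold example: one 4-valued variable (season) and five binary ones
4 * 2^5                 # 128 entries in the full joint table

# the full joint over our six binary variables A to F
2^6                     # 64 entries

# the factorized tables defined below are far smaller:
# P(F), P(C|F), P(E|F), P(A|C), P(D|E), P(B|A,D)
2 + 4 + 4 + 4 + 4 + 8   # 26 entries in total

# and the gap explodes with the number of variables
2^c(10, 20, 30)         # 1024, 1048576, 1073741824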
We will assume the following factorization:

P(A,B,C,D,E,F) = P(F) P(C|F) P(E|F) P(A|C) P(D|E) P(B|A,D)

This is represented by the following graph:

We first start by loading the gRain package into R:

library(gRain)

Then, we create our set of random variables from A to F:

val = c("true", "false")
F = cptable(~F, values=c(10,90), levels=val)
C = cptable(~C|F, values=c(10,90,20,80), levels=val)
E = cptable(~E|F, values=c(50,50,30,70), levels=val)
A = cptable(~A|C, values=c(50,50,70,30), levels=val)
D = cptable(~D|E, values=c(60,40,70,30), levels=val)
B = cptable(~B|A:D, values=c(60,40,70,30,20,80,10,90), levels=val)

The cptable function creates a conditional probability table, which is a factor for discrete variables. The probabilities associated with each variable are purely subjective and only serve the purpose of the example.

The next step is to compute the junction tree. In most packages, computing the junction tree is done by calling one function, because the algorithm just does everything at once:

plist = compileCPT(list(F,E,C,A,D,B))
plist

Also, we check whether the list of variables is correctly compiled into a probabilistic graphical model, and we obtain the following from the previous code:

CPTspec with probabilities:
 P( F )
 P( E | F )
 P( C | F )
 P( A | C )
 P( D | E )
 P( B | A D )

This is indeed the factorization of our distribution, as stated earlier. If we want to check further, we can look at the conditional probability tables of a few variables:

print(plist$F)
print(plist$B)

F
 true false
  0.1   0.9

, , D = true
       A
B       true false
  true   0.6   0.7
  false  0.4   0.3

, , D = false
       A
B       true false
  true   0.2   0.1
  false  0.8   0.9

The second output is a bit more complex but, if you look carefully, you will see that you have two distributions, P(B|A,D=true) and P(B|A,D=false), which is a more readable presentation of P(B|A,D).

We finally create the graph and run the junction tree algorithm by calling this:

jtree = grain(plist)

Again, when we check the result, we obtain:

jtree
Independence network: Compiled: FALSE Propagated: FALSE
  Nodes: chr [1:6] "F" "E" "C" "A" "D" "B"

We only need to compute the junction tree once. Then, all queries can be computed with the same junction tree. Of course, if you change the graph, then you need to recompute the junction tree. Let's perform a few queries:

querygrain(jtree, nodes=c("F"), type="marginal")
$F
F
 true false
  0.1   0.9

Of course, if you ask for the marginal distribution of F, you will obtain the initial conditional probability table, because F has no parents.

querygrain(jtree, nodes=c("C"), type="marginal")
$C
C
 true false
 0.19  0.81

This is more interesting because it computes the marginal of C, while we only stated the conditional distribution of C given F. We didn't need an algorithm as complex as the junction tree to compute such a small marginal. We saw the variable elimination algorithm earlier, and that would be enough too. But if you ask for the marginal of B, then variable elimination will not work because of the loop in the graph.
However, the junction tree will give the following:

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

And we can ask for more complex distributions, such as the joint distribution of B and A:

querygrain(jtree, nodes=c("A","B"), type="joint")
       B
A           true    false
  true  0.309272 0.352728
  false 0.169292 0.168708

In fact, any combination can be queried, such as A, B, and C:

querygrain(jtree, nodes=c("A","B","C"), type="joint")
, , B = true
        A
C           true    false
  true  0.044420 0.047630
  false 0.264852 0.121662

, , B = false
        A
C           true    false
  true  0.050580 0.047370
  false 0.302148 0.121338

Now, we want to observe a variable and compute the posterior distribution. Let's say F=true, and we want to propagate this information down to the rest of the network:

jtree2 = setEvidence(jtree, evidence=list(F="true"))

We can ask for any joint or marginal distribution now:

querygrain(jtree, nodes=c("A"), type="marginal")
$A
A
 true false
0.662 0.338

querygrain(jtree2, nodes=c("A"), type="marginal")
$A
A
 true false
 0.68  0.32

Here, we see that knowing that F=true changed the marginal distribution on A from its previous marginal (the second query uses jtree2, the tree with evidence). And we can query any other variable:

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

querygrain(jtree2, nodes=c("B"), type="marginal")
$B
B
  true  false
0.4696 0.5304

Learning

Building a probabilistic graphical model generally requires three steps: defining the random variables, which are also the nodes of the graph; defining the structure of the graph; and, finally, defining the numerical parameters of each local distribution. So far, the last step has been done manually, and we gave numerical values to each local probability distribution by hand. In many cases, we have access to a wealth of data, and we can find the numerical values of those parameters with a method called parameter learning. In other fields, it is also called parameter fitting or model calibration.

Learning parameters can be done with several approaches, and there is no ultimate solution to the problem because it depends on the goal the model's user wants to reach. Nevertheless, it is common to use the notion of the maximum likelihood of a model and also the maximum a posteriori. As you are now used to the notions of the prior and posterior of a distribution, you can already guess what a maximum a posteriori can do. Many algorithms are used, among which we can cite the Expectation Maximization (EM) algorithm, which computes the maximum likelihood of a model even when data is missing or variables are not observed at all. It is a very important algorithm, especially for mixture models.

A graphical model of a linear model

PGMs can be used to represent standard statistical models and then extend them. One famous example is the linear regression model. We can visualize the structure of a linear model and better understand the relationships between the variables. The linear model captures the relationships between observable variables x and a target variable y. This relation is modeled by a set of parameters, θ. But remember the distribution of y for each data point indexed by i:

yi ~ N(Xiβ, σ²)

Here, Xi is a row vector whose first element is always one, to capture the intercept of the linear model.
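To make this generative view concrete, here is a minimal R sketch of the process the graph encodes; the parameter values are made up for illustration and are not from the book, but lm() recovers them approximately:

set.seed(42)
N <- 200
x <- rnorm(N)                         # observable variable
beta0 <- 1.5; beta1 <- 2.0            # assumed intercept and slope
sigma <- 0.5                          # assumed noise standard deviation
y <- beta0 + beta1 * x + rnorm(N, sd = sigma)   # yi ~ N(Xi beta, sigma^2)
coef(lm(y ~ x))                       # estimates close to 1.5 and 2.0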
The parameter θ in the following graph is itself composed of the intercept, the coefficient β for each component of X, and the variance σ² in the distribution of yi. The PGM for an observation of a linear model can be represented as follows:

So, this decomposition leads us to a second version of the graphical model, in which we explicitly separate the components of θ:

In a PGM, when a rectangle is drawn around a set of nodes with a number of variables in a corner (N, for example), it means that the same graph is repeated that many times. The likelihood function of a linear model is p(y | X, θ) = ∏i=1..N N(yi | Xiβ, σ²), and it can be represented as a PGM. And the vector β can also be decomposed into its univariate components:

In this last iteration of the graphical model, we see that the parameter β could have a prior probability on it instead of being fixed. In fact, the parameter β can also be considered as a random variable. For the time being, we will keep it fixed.

Latent Dirichlet Allocation

The last model we want to show in this article is called Latent Dirichlet Allocation (LDA). It is a generative model that can be represented as a graphical model. It is based on the same idea as the mixture model, with one notable exception: in this model, we assume that the data points might be generated by a combination of clusters and not just one cluster at a time, as was the case before.

The LDA model is primarily used in text analysis and classification. Let's consider that a text document is composed of words making sentences and paragraphs. To simplify the problem, we can say that each sentence or paragraph is about one specific topic, such as science, animals, sports, and so on. Topics can also be more specific, such as a cat topic or a European soccer topic. Therefore, there are words that are more likely to come from specific topics. For example, the word cat is likely to come from the cat topic. The word stadium is likely to come from the European soccer topic. However, while the word ball should come with a higher probability from the European soccer topic, it is not unlikely to come from the cat topic, because cats like to play with balls too. So, it seems the word ball might belong to two topics at the same time, with different degrees of certainty. Other words, such as table, will certainly belong equally to both topics, and presumably to others. They are very generic; except, of course, if we introduce another topic such as furniture.

A document is a collection of words, so a document can have complex relationships with a set of topics. But in the end, it is more likely to see words coming from the same topic or the same topics within a paragraph and, to some extent, within the document. In general, we model a document with a bag-of-words model, that is, we consider a document to be a randomly generated set of words, using a specific distribution over the words. If this distribution is uniform over all the words, then the document will be purely random, without a specific meaning. However, if this distribution has a specific form, with more probability mass on related words, then the collection of words generated by this model will have a meaning. Of course, generating documents is not really the application we have in mind for such a model. What we are interested in is the analysis of documents, their classification, and automatic understanding. Let's say we have a categorical variable (in other words, a histogram) representing the probability of appearance of all the words from a dictionary.
Usually, in this kind of model, we restrict ourselves to long words only and remove the small words, like and, to, but, the, a, and so on. These words are usually called stop words. Let wj be the jth word in a document. The following three graphs show the progression from representing a document (the left-most graph) to representing a collection of documents (the third graph):

Let θ be a distribution over topics; then, in the second graph from the left, we extend this model by choosing the kind of topic that will be selected at any time and then generating a word out of it. Therefore, the variable zi now becomes the variable zij, that is, the topic i is selected for the word j. We can go even further and decide that we want to model a collection of documents, which seems natural if we consider that we have a big data set. Assuming that documents are i.i.d., the next step (the third graph) is a PGM that represents the generative model for M documents.

And, because the distribution on topics is categorical, we want to be Bayesian about it, mainly because it will help the model not to overfit and because we consider the selection of topics for a document to be a random process. Moreover, we want to apply the same treatment to the word variable by having a Dirichlet prior. This prior is used to avoid assigning zero probability to non-observed words. It smooths the distribution of words per topic. A uniform Dirichlet prior will induce a uniform prior distribution on all the words. Therefore, the final graph on the right represents the complete model. This is quite a complex graphical model, but techniques have been developed to fit its parameters and use this model.

If we follow this graphical model carefully, we have a process that generates documents based on a certain set of topics (see the list below):

From α, we choose the set of topics for a document
From θ, we generate a topic zij
From this topic, we generate a word wj

In this model, only the words are observable. All the other variables will have to be determined without observation, exactly like in the other mixture models. So, documents are represented as random mixtures over latent topics, in which each topic is represented as a distribution over words. The distribution of a topic mixture based on this graphical model can be written as follows (here, α is the Dirichlet parameter on topic mixtures and β denotes the per-topic distributions over words):

p(θ, z, w | α, β) = p(θ | α) ∏n=1..N p(zn | θ) p(wn | zn, β)

You can see in this formula that, for each word, we select a topic, hence the product from 1 to N. Integrating over θ and summing over z, the marginal distribution of a document is as follows:

p(w | α, β) = ∫ p(θ | α) ( ∏n=1..N Σzn p(zn | θ) p(wn | zn, β) ) dθ

The final distribution can be obtained by taking the product of the marginal distributions of single documents, so as to get the distribution over a collection of documents (assuming that documents are independently and identically distributed). Here, D is the collection of documents:

p(D | α, β) = ∏d=1..M p(wd | α, β)

The main problem to be solved now is how to compute the posterior distribution over θ and z, given a document. By applying the Bayes formula, we know the following:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

Unfortunately, this is intractable because of the normalization factor in the denominator. The original paper on LDA therefore refers to a technique called variational inference, which aims at transforming a complex Bayesian inference problem into a simpler approximation that can be solved as a (convex) optimization problem. This technique is the third approach to Bayesian inference and has been used on many other problems.

Summary

The probabilistic graphical model framework offers a powerful and versatile way to develop and extend many probabilistic models using an elegant graph-based formalism.
It has many applications, for example, in biology, genomics, medicine, finance, robotics, computer vision, automation, engineering, law, and games. Many packages in R exist to deal with all sorts of models and data, among which gRain and rstan are very popular.

Resources for Article:

Further resources on this subject:
Extending ElasticSearch with Scripting [article]
Exception Handling in MySQL for Python [article]
Breaking the Bank [article]
Detecting fraud on e-commerce orders with Benford's law

Packt
14 Apr 2016
7 min read
In this article, Andrea Cirillo, author of the book RStudio for R Statistical Computing Cookbook, explains how to detect fraud on e-commerce orders. Benford's law is a popular empirical law that states that the first digits of a population of data will follow a specific logarithmic distribution. This law was observed by Frank Benford around 1938 and has since gained increasing popularity as a way to detect anomalous alterations of populations of data. Basically, testing a population against Benford's law means verifying that the given population respects this law. If deviations are discovered, further analysis is performed on the items related to those deviations. In this recipe, we will test a population of e-commerce orders against the law, focusing on items deviating from the expected distribution. (For more resources related to this topic, see here.)

Getting ready

This recipe will use functions from the well-documented benford.analysis package by Carlos Cinelli. We therefore need to install and load this package:

install.packages("benford.analysis")
library(benford.analysis)

In our example, we will use a data frame that stores e-commerce orders, provided within the book as an .Rdata file. In order to make it available within your environment, we need to load this file by running the following command (assuming the file is within your current working directory):

load("ecommerce_orders_list.Rdata")

How to do it...

Perform the Benford test on the order amounts:

benford_test <- benford(ecommerce_orders_list$order_amount, 1)

Plot the test analysis:

plot(benford_test)

This will result in the following plot:

Highlight the suspected digits:

suspectsTable(benford_test)

This will produce a table showing, for each digit, the absolute differences between expected and observed frequencies. The digits at the top of the table are therefore the most anomalous ones:

> suspectsTable(benford_test)
   digits absolute.diff
1:      5     4860.8974
2:      9     3764.0664
3:      1     2876.4653
4:      2     2870.4985
5:      3     2856.0362
6:      4     2706.3959
7:      7     1567.3235
8:      6     1300.7127
9:      8      200.4623

Define a function to extrapolate the first digit from each amount:

left = function(string, char){
  substr(string, 1, char)
}

Extrapolate the first digit from each amount:

ecommerce_orders_list$first_digit <- left(ecommerce_orders_list$order_amount, 1)

Filter amounts starting with the suspected digit:

suspects_orders <- subset(ecommerce_orders_list, first_digit == 5)

How it works

Step 1 performs the Benford test on the order amounts. In this step, we applied the benford() function to the amounts. Applying this function means evaluating the distribution of the first digits of the amounts against the expected Benford distribution.
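For reference, the expected Benford distribution that the observed digits are compared against is P(d) = log10(1 + 1/d) for d = 1, ..., 9; a one-line R check (not part of the recipe itself) reproduces the familiar frequencies:

round(log10(1 + 1/(1:9)), 3)
# [1] 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046

So roughly 30% of first digits are expected to be 1, falling monotonically to about 4.6% for 9; benford() measures how far the order amounts deviate from exactly this curve.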
The function will result in the production of the following objects:

Info: This object covers the following general information:
- data.name: This shows the name of the data used
- n: This shows the number of observations used
- n.second.order: This shows the number of observations used for second-order analysis
- number.of.digits: This shows the number of first digits analyzed

Data: This is a data frame with the following subobjects:
- lines.used: This shows the original lines of the dataset
- data.used: This shows the data used
- data.mantissa: This shows the log data's mantissa
- data.digits: This shows the first digits of the data

s.o.data: This is a data frame with the following subobjects:
- data.second.order: This shows the differences of the ordered data
- data.second.order.digits: This shows the first digits of the second-order analysis

Bfd: This is a data frame with the following subobjects:
- digits: This highlights the groups of digits analyzed
- data.dist: This highlights the distribution of the first digits of the data
- data.second.order.dist: This highlights the distribution of the first digits of the second-order analysis
- benford.dist: This shows the theoretical Benford distribution
- data.second.order.dist.freq: This shows the frequency distribution of the first digits of the second-order analysis
- data.dist.freq: This shows the frequency distribution of the first digits of the data
- benford.dist.freq: This shows the theoretical Benford frequency distribution
- benford.so.dist.freq: This shows the theoretical Benford frequency distribution of the second-order analysis
- data.summation: This shows the summation of the data values grouped by first digits
- abs.excess.summation: This shows the absolute excess summation of the data values grouped by first digits
- difference: This highlights the difference between the data and Benford frequencies
- squared.diff: This shows the chi-squared difference between the data and Benford frequencies
- absolute.diff: This highlights the absolute difference between the data and Benford frequencies

Mantissa: This is a data frame with the following subobjects:
- mean.mantissa: This shows the mean of the mantissa
- var.mantissa: This shows the variance of the mantissa
- ek.mantissa: This shows the excess kurtosis of the mantissa
- sk.mantissa: This highlights the skewness of the mantissa

MAD: This object depicts the mean absolute deviation.

distortion.factor: This object describes the distortion factor.

Stats: This object lists htest-class statistics, as follows:
- chisq: This lists Pearson's chi-squared test
- mantissa.arc.test: This lists the Mantissa Arc test

Step 2 plots the test results. Running plot on the object resulting from the benford() function will produce a plot showing the following (from the upper-left corner to the bottom-right corner):

- First digit distribution
- Results of the second-order test
- Summation distribution for each digit
- Results of the chi-squared test
- Summation differences

If you look carefully at these plots, you will understand which digits show a distribution significantly different from the one expected under Benford's law. Nevertheless, in order to have a sounder basis for our considerations, we need to look at the suspects table, showing absolute differences between expected and observed frequencies. This is what we will do in the next step.

Step 3 highlights the suspected digits. Using suspectsTable(), we can easily discover which digits present the greatest deviation from the expected distribution.
Looking at the so-called suspects table, we can see that the number 5 shows up first within our table. In the next step, we will focus our attention on the orders with amounts having this digit as the first digit.

Step 4 defines a function to extrapolate the first digit from each amount. This function leverages R's substr() function and extracts the first digit from the number passed to it as an argument.

Step 5 adds a new column to the investigated dataset in which the first digit is extrapolated.

Step 6 filters amounts starting with the suspected digit. After applying the left function to our sequence of amounts, we can now filter the dataset, retaining only the rows whose amounts have 5 as the first digit. We will now be able to perform analytical testing procedures on those items.

Summary

In this article, you learned how to apply the R language to an e-commerce fraud detection system.

Resources for Article:

Further resources on this subject:
Recommending Movies at Scale (Python) [article]
Visualization of Big Data [article]
Big Data Analysis (R and Hadoop) [article]
Understanding Proxmox VE and Advanced Installation

Packt
13 Apr 2016
12 min read
In this article by Wasim Ahmed, the author of the book Mastering Proxmox - Second Edition, we will see that virtualization, as we know it today, is a decades-old technology that was first implemented in the mainframes of the 1960s. Virtualization was a way to logically divide the mainframe's resources for different application processing. With the rise in energy costs, running under-utilized server hardware is no longer a luxury. Virtualization enables us to do more with less, thus saving energy and money while creating a virtual green data center without geographical boundaries. (For more resources related to this topic, see here.)

A hypervisor is a piece of software, hardware, or firmware that creates and manages virtual machines. It is the underlying platform or foundation that allows a virtual world to be built upon. In a way, it is the very building block of all virtualization. A bare metal hypervisor acts as a bridge between the physical hardware and the virtual machines by creating an abstraction layer. Because of this unique feature, an entire virtual machine can be moved over a vast distance over the Internet and still function exactly the same. A virtual machine does not see the hardware directly; instead, it sees the layer of the hypervisor, which is the same no matter what hardware the hypervisor has been installed on.

The Proxmox Virtual Environment (VE) is a cluster-based hypervisor and one of the best-kept secrets in the virtualization world. The reason is simple: it allows you to build an enterprise business-class virtual infrastructure at a small business-class price tag without sacrificing stability, performance, or ease of use. Whether it is a massive data center serving millions of people, a small educational institution, or a home serving important family members, Proxmox can be configured to suit any situation. If you have picked up this article, no doubt you are familiar with virtualization and perhaps well versed in other hypervisors, such as VMware, Xen, Hyper-V, and so on. In this article and the upcoming articles, we will see the mighty power of Proxmox from the inside out. We will examine scenarios and create a complex virtual environment. We will tackle some heavy day-to-day issues and show resolutions, which might just save the day in a production environment. So, strap in and let's dive into the virtual world with the mighty hypervisor, Proxmox VE.

Understanding Proxmox features

Before we dive in, it is necessary to understand why one should choose Proxmox over the other mainstream hypervisors. Proxmox is not perfect but stands out among other contenders with some hard-to-beat features. The following are some of the features that make Proxmox a real game changer.

It is free!

Yes, Proxmox is free! To be more accurate, Proxmox has several subscription levels, among which the community edition is completely free. One can simply download the Proxmox ISO at no cost and raise a fully functional cluster without missing a single feature and without paying anything. The main difference between the paid and community subscription levels is that the paid subscriptions receive updates that go through additional testing and refinement. If you are running a production cluster with a real workload, it is highly recommended that you purchase support and licensing from Proxmox or Proxmox resellers.

Built-in firewall

Proxmox VE comes with a robust firewall ready to be configured out of the box.
This firewall can be configured to protect anything from the entire Proxmox cluster down to a single virtual machine. The per-VM firewall option gives you the ability to configure each VM individually by creating individualized firewall rules, a prominent feature in a multi-tenant virtual environment.

Open vSwitch

Licensed under the Apache 2.0 license, Open vSwitch is a virtual switch designed to work in a multi-server virtual environment. All hypervisors need a bridge between VMs and the outside network. Open vSwitch enhances the features of the standard Linux bridge in an ever-changing virtual environment. Proxmox fully supports Open vSwitch, which allows you to create an intricate virtual environment while reducing virtual network management overhead. For details on Open vSwitch, refer to http://openvswitch.org/.

The graphical user interface

Proxmox comes with a fully functional graphical user interface, or GUI, out of the box. The GUI allows an administrator to manage and configure almost all aspects of a Proxmox cluster. The GUI has been designed keeping simplicity in mind, with functions and features separated into menus for easier navigation. The following screenshot shows an example of the Proxmox GUI dashboard:

KVM virtual machines

KVM, or Kernel-based Virtual Machine, is a kernel module that is added to Linux for full virtualization, creating isolated, fully independent virtual machines. KVM VMs are not dependent on the host operating system in any way, but they do require the virtualization feature in the BIOS to be enabled. KVM allows a wide variety of operating systems for virtual machines, such as Linux and Windows. Proxmox provides a very stable environment for KVM-based VMs.

Linux containers or LXC

Introduced recently in Proxmox VE 4.0, Linux containers allow multiple Linux instances on the same Linux host. All the containers are dependent on the host Linux operating system, and only Linux flavors can be virtualized as containers. There are no containers for the Windows operating system. LXC replaces the prior OpenVZ containers, which were the primary container virtualization method in previous Proxmox versions. If you are not familiar with LXC, refer to https://linuxcontainers.org/ for details.

Storage plugins

Out of the box, Proxmox VE supports a variety of storage systems to store virtual disk images, ISO templates, backups, and so on. All the plugins are quite stable and work great with Proxmox. Being able to choose different storage systems gives an administrator the flexibility to leverage the existing storage in the network. As of Proxmox VE 4.0, the following storage plugins are supported:

Local directory mount points
iSCSI
LVM Group
NFS Share
GlusterFS
Ceph RBD
ZFS

Vibrant culture

Proxmox has a growing community of users who are always helping others to learn Proxmox and troubleshoot various issues. With so many active users around the world, and through the active participation of Proxmox developers, the community has now become a culture of its own. Feature requests are continuously being worked on, and the existing features are being strengthened on a regular basis. With so many users supporting Proxmox, it is surely here to stay.

The basic installation of Proxmox

The installation of a Proxmox node is very straightforward. Simply accept the default options, select localization, and enter the network information to install Proxmox VE.
We can summarize the installation process in the following steps:

Download the ISO from the official Proxmox site and prepare a disc with the image (http://proxmox.com/en/downloads).
Boot the node with the disc and hit Enter to start the installation from the installation GUI. We can also install Proxmox from a USB drive.
Progress through the prompts to select options or type in information.
After the installation is complete, access the Proxmox GUI dashboard using the IP address, as follows:
https://<proxmox_node_ip>:8006

In some cases, it may be necessary to open the firewall port to allow access to the GUI over port 8006.

The advanced installation option

Although the basic installation works in all scenarios, there may be times when the advanced installation option is necessary. Only the advanced installation option provides you with the ability to customize the main OS drive. A common practice for the operating system drive is to use a mirrored RAID array using a controller interface. This provides drive redundancy if one of the drives fails. The same level of redundancy can also be achieved using a software-based RAID array, such as ZFS. Proxmox now offers options to select ZFS-based arrays for the operating system drive right at the beginning of the installation. If you are not familiar with ZFS, refer to https://en.wikipedia.org/wiki/ZFS for details.

It is a common question to ask why one should choose ZFS software RAID over tried-and-tested hardware-based RAID. The simple answer is flexibility. A hardware RAID is locked to, or fully dependent on, the hardware RAID controller interface that created the array, whereas a ZFS software-based array is not dependent on any hardware, and the array can easily be ported to different hardware nodes. Should a RAID controller failure occur, the entire array created by that controller is lost, unless an identical controller interface is available for replacement. A ZFS array is only lost when all the drives, or more than the maximum tolerable number of drives, are lost in the array.

Besides ZFS, we can also select other filesystem types, such as ext3, ext4, or xfs, from the same advanced option. We can also set custom disk or partition sizes through the advanced option. The following screenshot shows the installation interface with the Target Hard disk selection page:

Click on Options, as shown in the preceding screenshot, to open the advanced options for the hard disk. The following screenshot shows the option window after clicking on the Options button:

In the preceding screenshot, we selected ZFS RAID1 for mirroring and the two drives, Harddisk 0 and Harddisk 1, respectively, to install Proxmox. If we pick one of the filesystems, such as ext3, ext4, or xfs, instead of ZFS, the Hard disk Option dialog box will look like the following screenshot, with a different set of options:

Selecting a filesystem gives us the following advanced options:

hdsize: This is the total drive size to be used by the Proxmox installation.
swapsize: This defines the swap partition size.
maxroot: This defines the maximum size to be used by the root partition.
minfree: This defines the minimum free space that should remain after the Proxmox installation.
maxvz: This defines the maximum size for the data partition. This is usually /var/lib/vz.

Debugging the Proxmox installation

Debugging features are part of any good operating system. Proxmox has debugging features that will help you during a failed installation.
Some common reasons for failure are unsupported hardware, conflicts between devices, ISO image errors, and so on. Debugging mode logs and displays installation activities in real time. When the standard installation fails, we can start the Proxmox installation in debug mode from the main installation interface, as shown in the following screenshot:

The debug installation mode will drop us to the following prompt. To start the installation, we need to press Ctrl + D. When there is an error during the installation, we can simply press Ctrl + C to get back to this console to continue with our investigation:

From the console, we can check the installation log using the following command:

# cat /tmp/install.log

From the main installation menu, we can also press e to enter edit mode to change the loader information, as shown in the following screenshot:

At times, it may be necessary to edit the loader information when normal booting does not function. This is a common case when Proxmox is unable to show the video output due to UEFI or an unsupported resolution. In such cases, the booting process may hang. One way to continue with booting is to add the nomodeset argument by editing the loader. The loader will look as follows after editing:

linux /boot/linux26 ro ramdisk_size=16777216 rw quiet nomodeset

Customizing the Proxmox splash screen

When building a custom Proxmox solution, it may be necessary to change the default blue splash screen to something more appealing in order to identify the company or department the server belongs to. In this section, we will see how easily we can integrate any image as the splash screen background.

The splash screen image must be in the .tga format and must have one of the fixed standard sizes, such as 640 x 480, 800 x 600, or 1024 x 768. If you do not have any image software that supports the .tga format, you can easily convert a jpg, gif, or png image to the .tga format using a free online image converter (http://image.online-convert.com/convert-to-tga).

Once the desired image is ready in the .tga format, the following steps will integrate the image as the Proxmox splash screen:

Copy the .tga image to the /boot/grub directory on the Proxmox node.
Edit the grub file in /etc/default/grub to add the following line, and save it:
GRUB_BACKGROUND=/boot/grub/<image_name>.tga
Run the following command to update the grub configuration:
# update-grub
Reboot.

The following screenshot shows an example of how the splash screen may look after we add a custom image to it:

Picture courtesy of www.techcitynews.com

We can also change the font color to make it properly visible, depending on the custom image used. To change the font color, edit the debian theme file in /etc/grub.d/05_debian_theme, and find the following line of code:

set_background_image "${GRUB_BACKGROUND}" || set_default_theme

Edit the line to add the font colors, as shown in the following format. In our example, we have changed the font color to black and the highlight color to light blue:

set_background_image "${GRUB_BACKGROUND}" "black/black" "light-blue/black" || set_default_theme

After making the necessary changes, update grub and reboot to see the changes.

Summary

In this article, we looked at why Proxmox is a better option as a hypervisor, what advanced installation options are available during an installation, and why we might choose software RAID for the operating system drive. We also looked at the cost of Proxmox, the storage options, and network flexibility using Open vSwitch.
We also learned about the debugging features and the customization options for the Proxmox splash screen. In the next article, we will take a closer look at the Proxmox GUI and see how easy it is to centrally manage a Proxmox cluster from a web browser.

Resources for Article:

Further resources on this subject:
Proxmox VE Fundamentals [article]
Basic Concepts of Proxmox Virtual Environment [article]
Cluster Computing Using Scala

Packt
13 Apr 2016
18 min read
In this article, Vytautas Jančauskas, the author of the book Scientific Computing with Scala, explains how to write software to be run on distributed computing clusters. We will learn the MPJ Express library here. (For more resources related to this topic, see here.)

Very often, when dealing with intense data processing tasks and simulations of physical phenomena, there comes a time when, no matter how many CPU cores and how much memory your workstation has, it is not enough. At times like these, you will want to turn to supercomputing clusters for help. These distributed computing environments consist of many nodes (each node being a separate computer) connected into a computer network using specialized high-bandwidth and low-latency connections (or, if you are on a budget, standard Ethernet hardware is often enough). These computers usually utilize a network filesystem, allowing each node to see the same files. They communicate using messaging libraries, such as MPI. Your program will run on separate computers and utilize the message passing framework to exchange data via the computer network.

Using MPJ Express for distributed computing

MPJ Express is a message passing library for distributed computing. It works in programming languages running on the Java Virtual Machine (JVM), so we can use it from Scala. It is similar in functionality and programming interface to MPI. If you know MPI, you will be able to use MPJ Express in pretty much the same way. The differences specific to Scala are explained in this section. We will start with how to install it. For further reference, visit the MPJ Express website given here:

http://mpj-express.org/

Setting up and running MPJ Express

The steps to set up and run MPJ Express are as follows:

First, download MPJ Express from the following link. The version at the time of this writing is 0.44.
http://mpj-express.org/download.php

Unpack the archive and refer to the included README file for installation instructions. Currently, you have to set MPJ_HOME to the folder you unpacked the archive to and add the bin folder in that archive to your path. For example, if you are a Linux user using bash as your shell, you can add the following two lines to your .bashrc file (the file is in your home directory at /home/yourusername/.bashrc):

export MPJ_HOME=/home/yourusername/mpj
export PATH=$MPJ_HOME/bin:$PATH

Here, mpj is the folder you extracted the archive you downloaded from the MPJ Express website to. If you are using a different system, you will have to do the equivalent of the above for your system to use MPJ Express.

We will want to use MPJ Express with the Scala Build Tool (SBT), which we used previously to build and run all of our programs. Create the following directory structure:

scalacluster/
  lib/
  project/
    plugins.sbt
  build.sbt

I have chosen to name the project folder scalacluster here, but you can call it whatever you want. The .jar files in the lib folder will now be accessible to your program. Copy the contents of the lib folder from the mpj directory to this folder. Finally, create empty build.sbt and plugins.sbt files.

Let's now write and run a simple "Hello, World!" program to test our setup:

import mpi._

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank
    val size: Int = MPI.COMM_WORLD.Size
    println("Hello, World, I'm <" + me + ">")
    MPI.Finalize()
  }
}

This should be familiar to everyone who has ever used MPI. First, we import everything from the mpi package.
Then, we initialize MPJ Express by calling MPI.Init; the arguments to MPJ Express will be passed from the command-line arguments you enter when running the program. The MPI.COMM_WORLD.Rank() function returns the MPJ process's rank. A rank is a unique identifier used to distinguish processes from one another. Ranks are used when you want different processes to do different things. A common pattern is to use the process with rank 0 as the master process and the processes with other ranks as workers. Then, you can use the process's rank to decide what action to take in the program. We also determine how many MPJ processes were launched by checking MPI.COMM_WORLD.Size. Our program will simply print each process's rank for now.

We will want to run it. If you don't have a distributed computing cluster readily available, don't worry: you can test your programs locally on your desktop or laptop. The same program will work without changes on clusters as well. To run programs written using MPJ Express, you have to use the mpjrun.sh script. This script will be available to you if you have added the bin folder of the MPJ Express archive to your PATH, as described in the section on installing MPJ Express. The mpjrun.sh script will set up the environment for your MPJ Express processes and start said processes.

The mpjrun.sh script takes a .jar file, so we need to create one. Unfortunately for us, this cannot easily be done using the sbt package command in the directory containing our program. That worked previously because we used the Scala runtime to execute our programs; MPJ Express uses Java. The problem is that the .jar package created with sbt package does not include Scala's standard library. We need what is called a fat .jar, one that contains all the dependencies within itself. One way of generating it is to use a plugin for SBT called sbt-assembly. The website for this plugin is given here:

https://github.com/sbt/sbt-assembly

There is a simple way of adding the plugin for use in our project. Remember that project/plugins.sbt file we created? All you need to do is add the following line to it (the line may be different for different versions of the plugin; consult the website):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

Now, add the following to the build.sbt file you created:

lazy val root = (project in file(".")).
  settings(
    name := "mpjtest",
    version := "1.0",
    scalaVersion := "2.11.7"
  )

Then, execute the sbt assembly command from the shell to build the .jar file. The file will be put under the following directory if you are using the preceding build.sbt file. That is, if the folder you put the program and build.sbt in is /home/you/cluster:

/home/you/cluster/target/scala-2.11/mpjtest-assembly-1.0.jar

Now, you can run the mpjtest-assembly-1.0.jar file as follows:

$ mpjrun.sh -np 4 -jar target/scala-2.11/mpjtest-assembly-1.0.jar
MPJ Express (0.44) is started in the multicore configuration
Hello, World, I'm <0>
Hello, World, I'm <2>
Hello, World, I'm <3>
Hello, World, I'm <1>

The -np argument specifies how many processes to run. Since we specified -np 4, four processes will be started by the script. The order of the "Hello, World" messages can differ on your system, since the precise order of execution of different processes is undetermined. If you got output similar to the one shown here, then congratulations, you have done the majority of the work needed to write and deploy applications using MPJ Express.
Using Send and Recv

MPJ Express processes can communicate using Send and Recv. These methods constitute arguably the simplest and easiest-to-understand mode of operation, and also probably the most error-prone one. We will look at these two first. The following are the signatures of the Send and Recv methods:

public void Send(java.lang.Object buf, int offset, int count, Datatype datatype, int dest, int tag) throws MPIException

public Status Recv(java.lang.Object buf, int offset, int count, Datatype datatype, int source, int tag) throws MPIException

Both of these calls are blocking. This means that after calling Send, your process will block (will not execute the instructions following it) until a corresponding Recv is called by another process. Recv will also block the process until a corresponding Send happens. By corresponding, we mean that the dest and source arguments of the calls have values corresponding to the receiver's and sender's ranks, respectively. These two calls are enough to implement many complicated communication patterns. However, they are prone to problems such as deadlocks, and they are quite difficult to debug, since you have to make sure that each Send has the correct corresponding Recv and vice versa. The parameters of Send and Recv are basically the same; their meanings are summarized here:

buf (java.lang.Object): It has to be a one-dimensional Java array. When using it from Scala, use a Scala array, which maps one-to-one to a Java array.
offset (int): The start of the data you want to pass, measured from the start of the array.
count (int): The number of items of the array you want to pass.
datatype (Datatype): The type of the data in the array. It can be one of the following: MPI.BYTE, MPI.CHAR, MPI.SHORT, MPI.BOOLEAN, MPI.INT, MPI.LONG, MPI.FLOAT, MPI.DOUBLE, MPI.OBJECT, MPI.LB, MPI.UB, and MPI.PACKED.
dest/source (int): Either the destination to send the message to or the source to get the message from. You use the rank of a process to identify sources and destinations.
tag (int): Used to tag the message. Tags can be used to introduce different message types and can be ignored for most common applications.

Let's look at a simple program using these calls for communication. We will implement a simple master/worker communication pattern:

import mpi._
import scala.util.Random

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    if (me == 0) {

Here, we use an if statement to identify who we are, based on our rank. Since each process gets a unique rank, this allows us to determine what action should be taken. In our case, we assigned the role of the master to the process with rank 0 and the role of a worker to the processes with other ranks:

      for (i <- 1 until size) {
        val buf = Array(Random.nextInt(100))
        MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, i, 0)
        println("MASTER: Dear <" + i + "> please do work on " + buf(0))
      }

We iterate over the workers, who have ranks from 1 up to whatever number of processes you passed to the mpjrun.sh script. Let's say that number is four. This gives us one master process and three worker processes. So, each process with a rank from 1 to 3 will get a randomly generated number. We have to put that number in an array even though it is a single number. This is because both the Send and Recv methods expect an array as their first argument. We then use the Send method to send the data.
We specified the array as the buf argument, an offset of 0, a size of 1, the type MPI.INT, the destination as the for loop index, and the tag as 0. This means that each of our three worker processes will receive a (most probably) different number:

      for (i <- 1 until size) {
        val buf = Array(0)
        MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, i, 0)
        println("MASTER: Dear <" + i + "> thanks for the reply, which was " + buf(0))
      }

Finally, we collect the results from the workers. For this, we iterate over the worker ranks and use the Recv method on each one of them. We print the result we got from each worker, and this concludes the master's part. We now move on to the workers:

    } else {
      val buf = Array(0)
      MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, 0)
      println("<" + me + ">: " + "Understood, doing work on " + buf(0))
      buf(0) = buf(0) * buf(0)
      MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 0, 0)
      println("<" + me + ">: " + "Reporting back")
    }

The workers' code is identical for all of them. They receive a message from the master, calculate the square of it, and send it back:

    MPI.Finalize()
  }
}

After you run the program, the results should be akin to the following, which I got when running this program on my system:

MASTER: Dear <1> please do work on 71
MASTER: Dear <2> please do work on 12
MASTER: Dear <3> please do work on 55
<1>: Understood, doing work on 71
<1>: Reporting back
MASTER: Dear <1> thanks for the reply, which was 5041
<3>: Understood, doing work on 55
<2>: Understood, doing work on 12
<2>: Reporting back
MASTER: Dear <2> thanks for the reply, which was 144
MASTER: Dear <3> thanks for the reply, which was 3025
<3>: Reporting back

Sending Scala objects in MPJ Express messages

Sometimes, the types provided by MPJ Express for use in the Send and Recv methods are not enough. You may want to send your MPJ Express processes a Scala object. A very realistic example of this would be to send an instance of a Scala case class. Case classes can be used to construct more complicated data types consisting of several different basic types. A simple example is a two-dimensional vector consisting of x and y coordinates. This can be sent as a simple array, but more complicated classes can't. For example, you may want to use a case class such as the one shown here. It has two attributes of type String and one attribute of type Int. So what do we do with a data type like this? The simplest answer to that problem is to serialize it. Serializing converts an object to a stream of characters or a string that can be sent over the network (or stored to a file, among other things) and later deserialized to get the original object back:

scala> case class Person(name: String, surname: String, age: Int)
defined class Person

scala> val a = Person("Name", "Surname", 25)
a: Person = Person(Name,Surname,25)

A simple way of serializing is to use a format such as XML or JSON. This can be done automatically using a pickling library. Pickling is a term that comes from the Python programming language. It is the automatic conversion of an arbitrary object into a string representation that can later be de-converted to get the original object back. The reconstructed object will behave the same way as it did before conversion. This allows one, for example, to store arbitrary objects to files. There is a pickling library available for Scala as well. You can, of course, do serialization in several different ways (for example, using the powerful support for XML available in Scala), as sketched below.
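As a quick illustration of that XML route, here is a minimal, hypothetical sketch; the Point class and the helper names are made up for this example, and it assumes the scala-xml module is available on the classpath:

import scala.xml.Node

case class Point(x: Double, y: Double)

// hand-written mapping from the object to an XML node
def toXml(p: Point): Node = <point><x>{p.x}</x><y>{p.y}</y></point>

// and back from the XML node to the object
def fromXml(node: Node): Point =
  Point((node \ "x").text.toDouble, (node \ "y").text.toDouble)

// round trip: fromXml(toXml(Point(1.0, 2.0))) == Point(1.0, 2.0)

The pickling approach used next achieves the same round trip without any of this hand-written mapping code.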
We will use the pickling library that is available from the following website for this example:

https://github.com/scala/pickling

You can install it by adding the following line to your build.sbt file:

libraryDependencies += "org.scala-lang.modules" %% "scala-pickling" % "0.10.1"

After doing that, use the following import statements to enable easy pickling in your projects:

scala> import scala.pickling.Defaults._
import scala.pickling.Defaults._

scala> import scala.pickling.json._
import scala.pickling.json._

Here, you can see how easily you can then use this library to pickle and unpickle arbitrary objects without the use of annoying boilerplate code:

scala> val pklA = a.pickle
pklA: pickling.json.pickleFormat.PickleType = JSONPickle({
  "$type": "Person",
  "name": "Name",
  "surname": "Surname",
  "age": 25
})

scala> val unpklA = pklA.unpickle[Person]
unpklA: Person = Person(Name,Surname,25)

Let's see how this would work in an application using MPJ Express for message passing. A program using pickling to send a case class instance in a message is given here:

import mpi._
import scala.pickling.Defaults._
import scala.pickling.json._

case class ArbitraryObject(a: Array[Double], b: Array[Int], c: String)

Here, we have chosen to define a fairly complex case class, consisting of two arrays of different types and a string:

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    if (me == 0) {
      val obj = ArbitraryObject(Array(1.0, 2.0, 3.0), Array(1, 2, 3), "Hello")
      val pkl = obj.pickle.value.toCharArray
      MPI.COMM_WORLD.Send(pkl, 0, pkl.size, MPI.CHAR, 1, 0)

In the preceding bit of code, we create an instance of our case class. We then pickle it to JSON and get the string representation of said JSON with the value method. However, to send it in an MPJ message, we need to convert it to a one-dimensional array of one of the supported types. Since it is a string, we convert it to a char array. This is done using the toCharArray method:

    } else if (me == 1) {
      val buf = new Array[Char](1000)
      MPI.COMM_WORLD.Recv(buf, 0, 1000, MPI.CHAR, 0, 0)
      val msg = buf.mkString
      val obj = msg.unpickle[ArbitraryObject]

On the receiving end, we get the raw char array, convert it back to a string using the mkString method, and then unpickle it using unpickle[T]. This will return an instance of the case class that we can use like any other instance of a case class. It is, in its functionality, the same object that was sent to us:

      println(msg)
      println(obj.c)
    }
    MPI.Finalize()
  }
}

The following is the result of running the preceding program. It prints out the JSON representation of our object and also shows that we can access the attributes of said object by printing the c attribute:

MPJ Express (0.44) is started in the multicore configuration
{
  "$type": "ArbitraryObject",
  "a": [
    1.0,
    2.0,
    3.0
  ],
  "b": [
    1,
    2,
    3
  ],
  "c": "Hello"
}
Hello

You can use this method to send arbitrary objects in an MPJ Express message. However, this is just one of many ways of doing this. As mentioned previously, an example of another way is to use the XML representation. XML support is strong in Scala, and you can use it to serialize objects as well. This will usually require you to add some boilerplate code to your program to serialize to XML. The method discussed earlier has the advantage of requiring no boilerplate code.

Non-blocking communication

So far, we examined only blocking (or synchronous) communication between two processes.
This means that the process is blocked (its execution is halted) until the Send or Recv method has completed successfully. This is simple to understand and enough for most cases. The problem with synchronous communication is that you have to be very careful; otherwise, deadlocks may occur. Deadlocks are situations in which processes wait for each other to release a resource first; a Mexican standoff is the classic informal picture of this, and the dining philosophers problem is one of the famous examples of deadlock in operating systems. The point is that if you are unlucky, you may end up with a program that is seemingly stuck, and you won't know why. Using non-blocking communication allows you to avoid these problems most of the time. If you think you may be at risk of deadlocks, you will probably want to use it. The signatures of the primary methods used in asynchronous communication are given here:

Request Isend(java.lang.Object buf, int offset, int count, Datatype datatype, int dest, int tag)

Isend works similarly to its Send counterpart. The main differences are that it does not block (the program continues execution after the call rather than waiting for a corresponding receive), and it returns a Request object. This object is used to check the status of your send request, block until it is complete if required, and so on:

Request Irecv(java.lang.Object buf, int offset, int count, Datatype datatype, int src, int tag)

Irecv is again the same as Recv, only non-blocking, and it returns a Request object used to handle your receive request. The operation of these methods can be seen in action in the following example:

import mpi._

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    if (me == 0) {
      val requests = for (i <- 0 until 10) yield {
        val buf = Array(i * i)
        MPI.COMM_WORLD.Isend(buf, 0, 1, MPI.INT, 1, 0)
      }
    } else if (me == 1) {
      for (i <- 0 until 10) {
        Thread.sleep(1000)
        val buf = Array[Int](0)
        val request = MPI.COMM_WORLD.Irecv(buf, 0, 1, MPI.INT, 0, 0)
        request.Wait()
        println("RECEIVED: " + buf(0))
      }
    }
    MPI.Finalize()
  }
}

This is a very simplistic example, used simply to demonstrate the basics of the asynchronous message passing methods. First, the process with rank 0 will send 10 messages to the process with rank 1 using Isend. Since Isend does not block, the loop will finish quickly, and the messages it sent will be buffered until they are retrieved using Irecv. The second process (the one with rank 1) will wait for one second before retrieving each message. This is to demonstrate the asynchronous nature of these methods: the messages sit in the buffer, waiting to be retrieved, so Irecv can be used at your leisure, when convenient. The Wait() method of the Request object it returns has to be used to retrieve the results. The Wait() method blocks until the message is successfully received from the buffer.

Summary

Extremely computationally intensive programs are usually parallelized and run on supercomputing clusters. These clusters consist of multiple networked computers. Communication between these computers is usually done using messaging libraries, such as MPI, which allow you to pass data between processes running on different machines in an efficient manner. In this article, you have learned how to use MPJ Express, an MPI-like library for the JVM. We saw how to carry out process-to-process communication as well as collective communication. The most important MPJ Express primitives were covered, and example programs using them were given.
Resources for Article:

Further resources on this subject:
Differences in style between Java and Scala code [article]
Getting Started with JavaFX [article]
Integrating Scala, Groovy, and Flex Development with Apache Maven [article]
Nginx "expires" directive – Emitting Caching Headers

Packt
13 Apr 2016
7 min read
In this article, Alex Kapranoff, the author of the book Nginx Troubleshooting, explains how all browsers (and even many non-browser HTTP clients) support client-side caching. It is a part of the HTTP standard, albeit one of its most complex parts to understand. Web servers obviously do not control client-side caching to the full extent, but they may issue recommendations about what to cache and how, in the form of special HTTP response headers. This topic is thoroughly discussed in many great articles and guides, so we will cover it only briefly, with a lean towards the problems you may face and how to troubleshoot them.

Although browsers have supported caching on their side for at least 20 years, configuring cache headers has always been a little confusing, mostly because there are two sets of headers designed for the same purpose but with different scopes and totally different formats. There is the Expires: header, which was designed as a quick and dirty solution, and there is the relatively new, almost omnipotent Cache-Control: header, which tries to support all the different ways an HTTP cache could work.

This is an example of a modern HTTP request-response pair containing the caching headers. First are the request headers sent by the browser (here, Firefox 41, but it does not matter):

User-Agent: "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0"
Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
Accept-Encoding: "gzip, deflate"
Connection: "keep-alive"
Cache-Control: "max-age=0"

Then, these are the response headers:

Cache-Control: "max-age=1800"
Content-Encoding: "gzip"
Content-Type: "text/html; charset=UTF-8"
Date: "Sun, 10 Oct 2015 13:42:34 GMT"
Expires: "Sun, 10 Oct 2015 14:12:34 GMT"

The parts relevant to caching are the Cache-Control: and Expires: headers. Note that some directives may be sent by both sides of the conversation. First, the browser sent the Cache-Control: max-age=0 header because the user pressed the F5 key. This is an indication that the user wants to receive a response that is fresh. Normally, the request will not contain this header, which allows any intermediate cache to respond with a stale but still non-expired response. In this case, the server we talked to responded with a gzipped HTML page encoded in UTF-8 and indicated that the response is okay to use for half an hour. It used both mechanisms available: the modern Cache-Control: max-age=1800 header and the very old Expires: Sun, 10 Oct 2015 14:12:34 GMT header.

The X-Cache: "EXPIRED" header is not a standard HTTP header but was also probably (there is no way to know for sure from the outside) emitted by Nginx. It may be an indication that there are, indeed, intermediate caching proxies between the client and the server, and one of them added this header for debugging purposes. The header may also show that the backend software uses some internal caching. Another possible source of this header is a debugging technique used to find problems in the Nginx cache configuration. The idea is to use the cache hit-or-miss status, which is available in one of the handy internal Nginx variables, as the value of an extra header, and then to monitor that status from the client side. This is the code that will add such a header:

add_header X-Cache $upstream_cache_status;
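To sketch how this debugging header might fit into a larger proxying setup (the backend address and the cache zone name below are hypothetical, for illustration only):

# Assumes this fragment lives inside the http { } context
proxy_cache_path /var/cache/nginx keys_zone=appcache:10m;

server {
    listen 80;

    location / {
        proxy_pass http://127.0.0.1:8080;   # hypothetical backend
        proxy_cache appcache;

        # $upstream_cache_status expands to MISS, HIT, EXPIRED, and so on
        add_header X-Cache $upstream_cache_status;
    }
}

Requesting the same URL twice from the client side should then show X-Cache: MISS followed by X-Cache: HIT, which quickly confirms whether the cache is doing anything at all.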
Nginx has a special directive that transparently sets up both of the standard cache control headers, and it is named expires. This is a piece of the nginx.conf file using the expires directive:

location ~* \.(?:css|js)$ {
    expires 1y;
    add_header Cache-Control "public";
}

First, the pattern uses the so-called noncapturing parentheses, a feature that first appeared in Perl regular expressions. The effect of this regexp is the same as that of the simpler \.(css|js)$ pattern, but the regular expression engine is specifically instructed not to create a variable containing the actual string matched inside the parentheses. This is a simple optimization. Then, the expires directive declares that the content of the css and js files will expire a year after it is stored. The actual headers as received by the client will look like this:

Server: nginx/1.9.8 (Ubuntu)
Date: Fri, 11 Mar 2016 22:01:04 GMT
Content-Type: text/css
Last-Modified: Thu, 10 Mar 2016 05:45:39 GMT
Expires: Sat, 11 Mar 2017 22:01:04 GMT
Cache-Control: max-age=31536000

The last two lines contain the same information in wildly different forms. The Expires: header is exactly one year after the date in the Date: header, whereas Cache-Control: specifies the age in seconds so that the client does the date arithmetic itself.

The last directive in the provided configuration extract explicitly adds another Cache-Control: header with a value of public. This means that the content of the HTTP resource is not access-controlled and therefore may be cached not only for one particular user but anywhere else as well. A simple and effective strategy once used in offices to minimize consumed bandwidth was to run an office-wide caching proxy server. When one user requested a resource from a website on the Internet and that resource had the Cache-Control: public designation, the company cache server would store it and serve it to the other users on the office network. This may not be as popular today because bandwidth is cheap, but since history has a tendency to repeat itself, you need to know how and why Cache-Control: public works.

The Nginx expires directive is surprisingly expressive. It may take a number of different values, summarized in this table:

off - Turns off the Nginx cache headers logic. Nothing will be added and, more importantly, existing headers received from upstreams will not be modified.
epoch - An artificial value used to purge a stored resource from all caches by setting the Expires header to "1 January, 1970 00:00:01 GMT".
max - The opposite of "epoch". The Expires header will be equal to "31 December 2037 23:59:59 GMT", and the Cache-Control max-age will be set to 10 years. This basically means that the HTTP responses are guaranteed to never change, so clients are free to never request the same thing twice and may use their own stored values.
Specific time - An actual time value means an expiry deadline relative to the time of the respective request, for example, expires 10w;. A negative value emits the special header Cache-Control: no-cache.
"modified" specific time - If you add the keyword modified before the time value, the expiration moment will be computed relative to the modification time of the file being served.
"@" specific time - A time with an @ prefix specifies an absolute time-of-day expiry, which should be less than 24 hours, for example, expires @17h;.
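To make the table concrete, here is a sketch showing several of these forms side by side; the location paths are hypothetical:

location /assets/ {
    expires max;            # "never changes": Expires in 2037, max-age of ten years
}

location /reports/ {
    expires modified +1h;   # one hour after the served file's modification time
}

location /schedule/ {
    expires @17h;           # absolute time-of-day deadline: today at 17:00
}

location /volatile/ {
    expires epoch;          # tells all caches to consider their copies expired
}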
Many web applications choose to emit the caching headers themselves, and this is a good thing: they have more information about which resources change often and which never change. Tampering with the headers that you receive from the upstream may or may not be something you want to do. Sometimes, adding headers to a response while proxying it may produce a conflicting set of headers and therefore create unpredictable behavior. The static files that you serve with Nginx yourself should have the expires directive in place. However, the general advice about upstreams is to always examine the caching headers you get and to refrain from overoptimizing by setting up a more aggressive caching policy.

Resources for Article:

Further resources on this subject:
Nginx service [article]
Fine-tune the NGINX Configuration [article]
Nginx Web Services: Configuration and Implementation [article]