How-To Tutorials

article-image-json-pojo-using-gson-android-studio

01 Jul 2014

6 min read

How to Convert POJO to JSON Using Gson in Android Studio

01 Jul 2014

JSON has become the defacto standard of data exchange on the web. Compared to its cousin XML, it is smaller in size and faster to both create and parse. In fact, it seems so simple that many developers roll their own code to convert plain old Java objects or POJO to and from JSON. For simple objects, it is fairly easy to write the conversion code, but as your objects grow more complex, your code's complexity grows as well. Do you really want to maintain a bunch of code whose functionality is not truly intrinsic to your app? Luckily there is no reason for you to do so. There are quite a few alternatives to writing your own Java JSON serializer/deserializer; in fact, json.org lists 25 of them. One of them, Gson, was created by Google for use on internal projects and later was open sourced. Gson is hosted on Google Code and the source code is available in an SVN repo. Create an Android app The process of converting POJO to JSON is called serialization. The reversed process is deserialization. A big reason that GSON is such a popular library is how simple it makes both processes. For both, the only thing you need is the Gson class. Let's create a simple Android app and see how simple Gson is to use. Start Android Studio and select new project Change the Application name to GsonTest. Click Next Click Next again. Click Finish At this point we have a complete Android hello world app. In past Android IDEs, we would add the Gson library at this point, but we don't do that anymore. Instead we add a Gson dependency to our build.gradle script and that will take care of everything else for us. It is super important to edit the correct Gradle file. There is one at the root directory but the one we want is at the app directory. Double-click it to open. Locate the dependencies section near the bottom of the script. After the last entry add the following line: compile 'com.google.code.gson:gson:2.2.4' After you add it, save the script and then click the Sync Project with Gradle Files icon. It is the fifth icon from the right-hand side in the toolbar. At this point, the Gson library is visible to your app. So let's build some test code. Create test code with JSON For our test we are going to use the JSON Test web service at https://www.jsontest.com/. It is a testing platform for JSON. Basically it gives us a place to send data to in order to test if we are properly serializing and deserializing data. JSON Test has a lot of services but we will use the validate service. You pass it a JSON string URL encoded as a query string and it will reply with a JSON object that indicates whether or not the JSON was encoded correctly, as well as some statistical information. The first thing we need to do is create two classes. The first class, TestPojo, is the Java class that we are going to serialize and send to JSON Test. TestPojo doesn't do anything important. It is just for our test; however, it contains several different types of objects: ints, strings, and arrays of ints. Classes that you create can easily be much more complicated, but don't worry, Gson can handle it, for example: 1 package com.tekadept.gsontest.app; 2 3 public class TestPojo { 4 private intvalue1 = 1; 5 private String value2 = "abc"; 6 private intvalues[] = {1, 2, 3, 4}; 7 private transient intvalue3 = 3; 8 9 // no argsctor 10 TestPojo() { 11 } 12 } 13 Gson will also respect the Java transient modifier, which specifies that a field should not be serialized. Any field with it will not appear in the JSON. The second class, JsonValidate, will hold the results of our call to JSON Test. In order to make it easy to parse, I've kept the field names exactly the same as those returned by the service, except for one. Gson has an annotation, @SerializedName, if you place it before a field name, you can have name the class version of a field be different than the JSON name. For example, if we wanted to name the validate field isValid all we would have to do is: 1 package com.tekadept.gsontest.app; 2 3 import com.google.gson.annotations.SerializedName; 4 5 public class JsonValidate { 6 7 public String object_or_array; 8 public booleanempty; 9 public long parse_time_nanoseconds; 10 @SerializedName("validate") 11 public booleanisValid; 12 public intsize; 13 } By using the @SerializedName annotation, our name for the JSON validate becomes isValid. Just remember that you only need to use the annotation when you change the field's name. In order to call JSON Test's validate service, we follow the best practice of not doing it on the UI thread by using an async task. An async task has four steps: onPreExecute, doInBackground, onProgressUpdate, and onPostExecute. The doInBackground method happens on another thread. It allows us to wait for the JSON Test service to respond to us without triggering the dreaded application not responding error. You can see this in action in the following code: 60 @Override 61 protected String doInBackground(String... notUsed) { 62 TestPojotp = new TestPojo(); 63 Gsongson = new Gson(); 64 String result = null; 65 66 try { 67 String json = URLEncoder.encode(gson.toJson(tp), "UTF-8"); 68 String url = String.format("%s%s", Constants.JsonTestUrl, json); 69 result = getStream(url); 70 } catch (Exception ex){ 71 Log.v(Constants.LOG_TAG, "Error: " + ex.getMessage()); 72 } 73 return result; 74 } To encode our Java object, all we need to do is create an instance of the Gson class, then call its toJson method, passing an instance of the class we wish to serialize. Deserialization is nearly as simple. In the onPostExecute method, we get the string of JSON from the web service. We then call the convertFromJson method that does the conversion. First it makes sure that it got a valid string, then it does the conversion by calling Gson'sfromJson method, passing the string and the name of its the class, as follows: 81 @Override 82 protected void onPostExecute(String result) { 83 84 // convert JSON string to a POJO 85 JsonValidatejv = convertFromJson(result); 86 if (jv != null) { 87 Log.v(Constants.LOG_TAG, "Conversion Succeed: " + result); 88 } else { 89 Log.v(Constants.LOG_TAG, "Conversion Failed"); 90 } 91 } 92 93 private JsonValidateconvertFromJson(String result) { 94 JsonValidatejv = null; 95 if (result != null &&result.length() >0) { 96 try { 97 Gsongson = new Gson(); 98 jv = gson.fromJson(result, JsonValidate.class); 99 } catch (Exception ex) { 100 Log.v(Constants.LOG_TAG, "Error: " + ex.getMessage()); 101 } 102 } 103 return jv; 104 } Conclusion For most developers this is all you need to know. There is a complete guide to Gson at https://sites.google.com/site/gson/gson-user-guide. The complete source code for the test app is at https://github.com/Rockncoder/GsonTest. Discover more Android tutorials and extra content on our Android page - find it here.

0
0
18721

How-To Tutorials

article-image-part-1-deploying-multiple-applications-capistrano-single-project

Rodrigo Rosenfeld

01 Jul 2014

9 min read

Part 1: Deploying Multiple Applications with Capistrano from a Single Project

Rodrigo Rosenfeld

01 Jul 2014

9 min read

Capistrano is a deployment tool written in Ruby that is able to deploy projects using any language or framework, through a set of recipes, which are also written in Ruby. Capistrano expects an application to have a single repository and it is able to run arbitrary commands on the server through an SSH non-interactive session. Capistrano was designed assuming that an application is completely described by a single repository with all code belonging to it. For example, your web application is written with Ruby on Rails and simply serving that application would be enough. But what if you decide to use a separate application for managing your users, in a separate language and framework? Or maybe some issue tracker application? You could setup a proxy server to properly deliver each request to the right application based upon the request path for example. But the problem remains: how do you use Capistrano to manage more complex scenarios like this if it supports a single repository? The typical approach is to integrate Capistrano on each of the component applications and then switching between those projects before deploying those components. Not only this is a lot of work to deploy all of these components, but it may also lead to a duplication of settings. For example, if your main application and the user management application both use the same database for a given environment, you’d have to duplicate this setting in each of the components. For the Market Tracker product, used byLexisNexis clients (which we develop at e-Core for Matterhorn Transactions Inc.), we were looking for a better way to manage many component applications, in lots of environments and servers. We wanted to manage all of them from a single repository, instead of adding Capistrano integration to each of our component’s repositories and having to worry about keeping the recipes in sync between each of the maintained repository branches. Motivation The Market Tracker application we maintain consists of three different applications: the main one, another to export search results to Excel files, and an administrative interface to manage users and other entities. We host the application in three servers: two for the real thing and another back-up server. The first two are identical ones and allow us to have redundancy and zero downtime deployments except for a few cases where we change our database schema in incompatible ways with previous versions. To add to the complexity of deploying our three composing applications to each of those servers, we also need to deploy them multiple times for different environments like production, certification, staging, and experimental. All of them run on the same server, in separate ports, and they are running separate databases:Solr and Redis instances. This is already complex enough to manage when you integrate Capistrano to each of your projects, but it gets worse. Sometimes you find bugs in production and have to release quick fixes, but you can't deploy the version in the master branch that has several other changes. At other times you find bugs on your Capistrano recipes themselves and fix them on the master. Or maybe you are changing your deploy settings rather than the application’s code. When you have to deploy to production, depending on how your Capistrano recipes work, you may have to change to the production branch, backport any changes for the Capistrano recipes from the master and finally deploy the latest fixes. This happens if your recipe will use any project files as a template and they moved to another place in the master branch, for example. We decided to try another approach, similar to what we do with our database migrations. Instead of integrating the database migrations into the main application (the default on Rails, Django, Grails, and similar web frameworks) we prefer to handle it as a separate project. In our case we use theactive_record_migrations gem, which brings standalone support for ActiveRecord migrations (the same that is bundled with Rails apps by default). Our database is shared between the administrative interface project and the main web application and we feel it's better to be able to manage our database schema independently from the projects using the database. We add the migrations project to the other application as submodules so that we know what database schema is expected to work for a particular commit of the application, but that's all. We wanted to apply the same principles to our Capistrano recipes. We wanted to manage all of our applications on different servers and environments from a single project containing the Capistrano recipes. We also wanted to store the common settings in a single place to avoid code duplication, which makes it hard to add new environments or update existing ones. Grouping all applications' Capistrano recipes in a single project It seems we were not the first to want all Capistrano recipes for all of our applications in a single project. We first tried a project called caphub. It worked fine initially and its inheritance model would allow us to avoid our code duplication. Well, not entirely. The problem is that we needed some kind of multiple inheritances or mixins. We have some settings, like token private key, that are unique across environments, like Certification and Production. But we also have other settings that are common in within a server. For example, the database host name will be the same for all applications and environments inside our collocation facility, but it will be different in our backup server at Amazon EC2. CapHub didn't help us to get rid of the duplication in such cases, but it certainly helped us to find a simple solution to get what we wanted. Let's explore how Capistrano 3 allows us to easily manage such complex scenarios that are more common than you might think. Capistrano stages Since Capistrano 3, multistage support is built-in (there was a multistage extension for Capistrano 2). That means you can writecap stage_nametask_name, for examplecap production deploy. By default,cap install will generate two stages: production and staging. You can generate as many as you want, for example: cap install STAGES=production,cert,staging,experimental,integrator But how do we deploy each of those stages to our multiple servers, since the settings for each stage may be different across the servers? Also, how can we manage separate applications? Even though those settings are called "stages" by Capistrano, you can use it as you want. For example, suppose our servers are named m1,m2, and ec2 and the applications are named web, exporter and admin. We can create settings likem1_staging_web, ec2_production_admin, and so on. This will result in lots of files (specifically 45 = 5 x 3 x 3 to support five environments, three applications, and three servers) but it's not a big deal if you consider the settings files can be really small, as the examples will demonstrate later on in this article by using mixins. Usually people will start with staging and production only, and then gradually add other environments. Also, they usually start with one or two servers and keep growing as they feel the need. So supporting 45 combinations is not such a pain since you don’t write all of them at once. On the other hand, if you have enough resources to have a separate server for each of your environments, Capistrano will allow you to add multiple "server" declarations and assign roles to them, which can be quite useful if you're running a cluster of servers. In our case, to avoid downtime we don't upgrade all servers in our cluster at once. We also don't have the budget to host 45 virtual machines or even 15. So the little effort to generate 45 small settings files compensates the savings with hosting expenses. Using mixins My next post will create an example deployment project from scratch providing detail for everything that has been discussed in this post. But first, let me introduce the concept of what we call a mixin in our project. Capistrano 3 is simply a wrapper on top of Rake. Rake is a build tool written in Ruby, similar to “make.” It has targets and targets have prerequisites. This fits nicely in the way Capistrano works, where some deployment tasks will depend on other tasks. Instead of a Rakefile (Rake’s Makefile) Capistrano will use a Capfile, but other than that it works almost the same way. The Domain Specific Language (DSL) in a Capfile is enhanced as you include Capistrano extensions to the Rake DSL. Here’s a sample Capfile, generated by cap install, when you install Capistrano: # Load DSL and Setup Up Stages require'capistrano/setup' # Includes default deployment tasks require'capistrano/deploy' # Includes tasks from other gems included in your Gemfile # # For documentation on these, see for example: # # https://github.com/capistrano/rvm # https://github.com/capistrano/rbenv # https://github.com/capistrano/chruby # https://github.com/capistrano/bundler # https://github.com/capistrano/rails # # require 'capistrano/rvm' # require 'capistrano/rbenv' # require 'capistrano/chruby' # require 'capistrano/bundler' # require 'capistrano/rails/assets' # require 'capistrano/rails/migrations' # Loads custom tasks from `lib/capistrano/tasks' if you have any defined. Dir.glob('lib/capistrano/tasks/*.rake').each { |r| import r } Just like a Rakefile, a Capfile is valid Ruby code, which you can easily extend using regular Ruby code. So, to support a mixin DSL, we simply need to extend the DSL, like this: defmixin (path) loadFile.join('config', 'mixins', path +'.rb') end Pretty simple, right? We prefer to add this to a separate file, like lib/mixin.rb and add this to the Capfile: $:.unshiftFile.dirname(__FILE__) require 'lib/mixin' After that, calling mixin 'environments/staging' should load settings that are common for the staging environment from a file called config/mixins/environments/staging.rb in the root of the Capistrano-enabled project. This is the base to set up our deployment project that we will create in the next post. About the author Rodrigo Rosenfeld Rosas lives in Vitória-ES, Brazil, with his lovely wife and daughter. He graduated in Electrical Engineering with a Master’s degree in Robotics and Real-time Systems.For the past five years Rodrigo has focused on building and maintaining single page web applications. He is the author of some gems includingactive_record_migrations, rails-web-console, the JS specs runner oojspec, sequel-devise and the Linux X11 utility ktrayshortcut.Rodrigo was hired by e-Core (Porto Alegre - RS, Brazil) to work from home, building and maintaining software forMatterhorn Transactions Inc. with a team of great developers. Matterhorn'smain product, the Market Tracker, is used by LexisNexis clients.

0
0
4496

Richard Feldman

30 Jun 2014

6 min read

Using React.js without JSX

Richard Feldman

30 Jun 2014

6 min read

React.js was clearly designed with JSX in mind, however, there are plenty of good reasons to use React without it. Using React as a standalone library lets you evaluate the technology without having to spend time learning a new syntax. Some teams—including my own—prefer to have their entire frontend code base in one compile-to-JavaScript language, such as CoffeeScript or TypeScript. Others might find that adding another JavaScript library to their dependencies is no big deal, but adding a compilation step to the build chain is a deal-breaker. There are two primary drawbacks to eschewing JSX. One is that it makes using React significantly more verbose. The other is that the React docs use JSX everywhere; examples demonstrating vanilla JavaScript are few and far between. Fortunately, both drawbacks are easy to work around. Translating documentation The first code sample you see in the React Documentation includes this JSX snippet: /** @jsx React.DOM */ React.renderComponent( <h1>Hello, world!</h1>, document.getElementById('example') ); Suppose we want to see the vanilla JS equivalent. Although the code samples on the React homepage include a helpful Compiled JS tab, the samples in the docs—not to mention React examples you find elsewhere on the Web—will not. Fortunately, React’s Live JSX Compiler can help. To translate the above JSX into vanilla JS, simply copy and paste it into the left side of the Live JSX Compiler. The output on the right should look like this: /** @jsx React.DOM */ React.renderComponent( React.DOM.h1(null, "Hello, world!"), document.getElementById('example') ); Pretty similar, right? We can discard the comment, as it only represents a necessary directive in JSX. When writing React in vanilla JS, it’s just another comment that will be disregarded as usual. Take a look at the call to React.renderComponent. Here we have a plain old two-argument function, which takes a React DOM element (in this case, the one returned by React.DOM.h1) as its first argument, and a regular DOM element (in this case, the one returned by document.getElementById('example')) as its second. jQuery users should note that the second argument will not accept jQuery objects, so you will have to extract the underlying DOM element with $("#example")[0] or something similar. The React.DOM object has a method for every supported tag. In this case we’re using h1, but we could just as easily have used h2, div, span, input, a, p, or any other supported tag. The first argument to these methods is optional; it can either be null (as in this case), or an object specifying the element’s attributes. This argument is how you specify things like class, ID, and so on. The second argument is either a string, in which case it specifies the object’s text content, or a list of child React DOM elements. Let’s put this together with a more advanced example, starting with the vanilla JS: React.DOM.form({className:"commentForm"}, React.DOM.input({type:"text", placeholder:"Your name"}), React.DOM.input({type:"text", placeholder:"Say something..."}), React.DOM.input({type:"submit", value:"Post"}) ) For the most part, the attributes translate as you would expect: type, value, and placeholder do exactly what they would do if used in HTML. The one exception is className, which you use in place of the usual class. The above is equivalent to the following JSX: /** @jsx React.DOM */ <form className="commentForm"> <input type="text" placeholder="Your name" /> <input type="text" placeholder="Say something..." /> <input type="submit" value="Post" /> </form> This JSX is a snippet found elsewhere in the React docs, and again you can view its vanilla JS equivalent by pasting it into the Live JSX Compiler. Note that you can include pure JSX here without any surrounding JavaScript code (unlike the JSX playground), but you do need the /** @jsx React.DOM */ comment at the top of the JSX side. Without the comment, the compiler will simply output the JSX you put in. Simple DSLs to make things concise Although these two implementations are functionally identical, clearly the JSX version is more concise. How can we make the vanilla JS version less verbose? A very quick improvement is to alias the React.DOM object: var R = React.DOM; R.form({className:"commentForm"}, R.input({type:"text", placeholder:"Your name"}), R.input({type:"text", placeholder:"Say something..."}), R.input({type:"submit", value:"Post"})) You can take it even further with a tiny bit of DSL: var R = React.DOM; var form = R.form; var input = R.input; form({className:"commentForm"}, input({type:"text", placeholder:"Your name"}), input({type:"text", placeholder:"Say something..."}), input({type:"submit", value:"Post"}) ) This is more verbose in terms of lines of code, but if you have a large DOM to set up, the extra up-front declarations can make the rest of the file much nicer to read. In CoffeeScript, a DSL like this can tidy things up even further: {form, input} = React.DOM form {className:"commentForm"}, [ input type: "text", placeholder:"Your name" input type:"text", placeholder:"Say something..." input type:"submit", value:"Post" ] Note that in this example, the form’s children are passed as an array rather than as a list of extra arguments (which, in CoffeeScript, allows you to omit commas after each line). React DOM element constructors support either approach. (Also note that CoffeeScript coders who don’t mind mixing languages can use the coffee-react compiler or set up a custom build chain that allows for inline JSX in CoffeeScript sources instead.) Takeaways No matter your particular use case, there are plenty of ways to effectively use React without JSX. Thanks to the Live JSX Compiler ’s ability to quickly translate documentation code samples, and the ease with which you can set up a simple DSL to reduce verbosity, there really is very little overhead to using React as a JavaScript library like any other. About the author Richard Feldman is a functional programmer who specializes in pushing the limits of browser-based UIs. He’s built a framework that performantly renders hundreds of thousands of shapes in the HTML5 canvas, a writing web app that functions like a desktop app in the absence of an Internet connection, and much more in between

0
0
12853

How-To Tutorials

article-image-making-most-your-hadoop-data-lake-part-2-optimized-file-formats

Kristen Hardwick

30 Jun 2014

5 min read

Making the Most of Your Hadoop Data Lake, Part 2: Optimized File Formats

Kristen Hardwick

30 Jun 2014

5 min read

One major factor of making the conversion to Hadoop is the concept of the Data Lake. That idea suggests that users keep as much data as possible in HDFS in order to prepare for future use cases and as-yet-unknown data integration points. As your data grows, it is important to make sure that the data is being stored in a way that prolongs that behavior. Data compression is not the only technique that can be used to speed up job performance and improve cluster organization. In addition to the Text and Sequence File options that are typically used by default, Hadoop offers a few more optimized file formats that are specifically designed to improve the process of interacting with the data. In the second part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address another important factor in improving manageability—optimized file formats. Using a smarter file for your data: RCFile RCFile stands for Record Columnar File, and it serves as an ideal format for storing relational data that will be accessed through Hive. This format offers performance improvements by storing the data in an optimized way. First, the data is partitioned horizontally, into groups of rows. Then each row group is partitioned vertically, into collections of columns. Finally, the data in each column collection is compressed and stored in column-row format, as if it were a column-oriented database. The first benefit of this altered storage mechanism is apparent at the row level. All HDFS blocks used to form RCFiles will be made up of the horizontally partitioned collections of rows. This is significant because it ensures that no row of data will be split across multiple blocks, and will therefore always be on the same machine. This is not the case for traditional HDFS file formats, which typically use data size to split the file. This optimized data storage will reduce the amount of network bandwidth that is required to serve queries. The second benefit comes from optimizations at the column level, in the form of disk I/O reduction. Since the columns are stored vertically within each row group, the system will be able to seek directly to the required column position in the file, rather than being required to scan across all columns and filter out data that is not necessary. This is extremely useful in queries that only require access to a small subset of the existing columns. RCFiles can be used natively in both Hive and Pig with very little configuration. In Hive CREATE TABLE … STORED AS RCFILE; ALTER TABLE … SET FILEFORMAT RCFILE; SET hive.default.fileformat=RCFile; In Pig: register …/piggybank.jar; a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.hiverc.HiveRCInputFormat(…); The Pig jar file referenced here is just one option for enabling the RCFile. At the time of writing, there was also an RCFilePigStorageclass available through Twitter’s Elephant Bird open source library. Hortonworks’ ORCFile and Cloudera’s Parquet formats RCFiles provide optimization for relational files primarily by implementing modifications at the storage level. New innovations have provided improvements on the RCFile format, namely the ORCFile format from Hortonworks and the Parquet format from Cloudera. When storing data using the Optimized Row Columnar file or Parquet formats, several pieces of metadata are automatically written at the column level within each row group; for example, minimum and maximum values for numeric data types and dictionary-style metadata for text data types. The specific metadata is also configurable. One such use case would be for a user to configure the dataset to be sorted on a given set of columns for efficient access. This excess metadata allows for queries to take advantage of an improvement on the original RCFiles–predicate pushdown. That technique allows Hive to evaluate the where clause during the record gathering process, instead of filtering data after all records have been collected. The predicate pushdown technique will evaluate the conditions of the query against the metadata associated with a particular row group, allowing it to skip over entire file blocks if possible, or to seek directly to the correct row. One major benefit of this process is that the more complex a particular where clause is, the more potential there is for row groups and columns to be filtered as irrelevant to the final result. Cloudera’s Parquet format is typically used in conjunction with Impala, but just like with RCFiles, ORCFiles can be incorporated into both Hive and Pig. HCatalog can be used as the primary method to read and write ORCFiles using Pig. The commands for Hive are provided below: In Hive: CREATE TABLE … STORED AS ORC; ALTER TABLE … SET FILEFORMAT ORC SET hive.default.fileformat=Orc Conclusion This post has detailed the alternatives to the default file formats that can be used in Hadoop in order to optimize data access and storage. This information combined with the compression techniques described in the previous post (part 1) will provide some guidelines that can be used to ensure that users can make the most of the Hadoop Data Lake. About the author Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.

0
0
2089

article-image-making-most-your-hadoop-data-lake-part-1-data-compression

Kristen Hardwick

30 Jun 2014

6 min read

Making the Most of Your Hadoop Data Lake, Part 1: Data Compression

Kristen Hardwick

30 Jun 2014

6 min read

In the world of big data, the Data Lake concept reigns supreme. Hadoop users are encouraged to keep all data in order to prepare for future use cases and as-yet-unknown data integration points. This concept is part of what makes Hadoop and HDFS so appealing, so it is important to make sure that the data is being stored in a way that prolongs that behavior. In the first part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address one important factor in improving manageability—data compression. Data compression is an area that is often overlooked in the context of Hadoop. In many cluster environments, compression is disabled by default, putting the burden on the user. In this post, we will discuss the tradeoffs involved in deciding how to take advantage of compression techniques and the advantages and disadvantages of specific compression codec options with respect to Hadoop. To compress or not to compress Whenever data is converted to something other than its raw data format, that naturally implies some overhead involved in completing the conversion process. When data compression is being discussed, it is important to take that overhead into account with respect to the benefits of reducing the data footprint. One obvious benefit is that compressed data will reduce the amount of disk space that is required for storage of a particular dataset. In the big data environment, this benefit is especially significant—either the Hadoop cluster will be able to keep data for a larger time range, or storing data for the same time range will require fewer nodes, or the disk usage ratios will remain lower for longer. In addition, the smaller file sizes will mean lower data transfer times—either internally for MapReduce jobs or when performing exports of data results. The cost of these benefits, however, is that the data must be decompressed at every point where the data needs to be read, and compressed before being inserted into HDFS. With respect to MapReduce jobs, this processing overhead at both the map phase and the reduce phase will increase the CPU processing time. Fortunately, by making informed choices about the specific compression codecs used at any given phase in the data transformation process, the cluster administrator or user can ensure that the advantages of compression outweigh the disadvantages. Choosing the right codec for each phase Hadoop provides the user with some flexibility on which compression codec is used at each step of the data transformation process. It is important to realize that certain codecs are optimal for some stages, and non-optimal for others. In the next sections, we will cover some important notes for each choice. zlib The major benefit of using this codec is that it is the easiest way to get the benefits of data compression from a cluster and job configuration standpoint—the zlib codec is the default compression option. From the data transformation perspective, this codec will decrease the data footprint on disk, but will not provide much of a benefit in terms of job performance. gzip The gzip codec available in Hadoop is the same one that is used outside of the Hadoop ecosystem. It is common practice to use this as the codec for compressing the final output from a job, simply for the benefit of being able to share the compressed result with others (possibly outside of Hadoop) using a standard file format. bzip2 There are two important benefits for the bzip2 codec. First, if reducing the data footprint is a high priority, this algorithm will compress the data more than the default zlib option. Second, this is the only supported codec that produces “splittable” compressed data. A major characteristic of Hadoop is the idea of splitting the data so that they can be handled on each node independently. With the other compression codecs, there is an initial requirement to gather all parts of the compressed file in order to have all information necessary to decompress the data. With this format, the data can be decompressed in parallel. This splittable quality makes this format ideal for compressing data that will be used as input to a map function, either in a single step or as part of a series of chained jobs. LZO, LZ4, Snappy These three codecs are ideal for compressing intermediate data—the data output from the mappers that will be immediately read in by the reducers. All three codecs heavily favor compression speed over file size ratio, but the detailed specifications for each algorithm should be examined based on the specific licensing, cluster, and job requirements. Enabling compression Once the appropriate compression codec for any given transformation phase has been selected, there are a few configuration properties that need to be adjusted in order to have the changes take effect in the cluster. Intermediate data to reducer mapreduce.map.output.compress = true (Optional) mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec Final output from a job mapreduce.output.fileoutputformat.compress = true (Optional) mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.BZip2Codec These compression codecs are also available within some of the ecosystem tools like Hive and Pig. In most cases, the tools will default to the Hadoop-configured values for particular codecs, but the tools also provide the option to compress the data generated between steps. Pig pig.tmpfilecompression = true (Optional) pig.tmpfilecompression.codec = snappy Hive hive.exec.compress.intermediate = true hive.exec.compress.output = true Conclusion This post detailed the benefits and disadvantages of data compression, along with some helpful guidelines on how to choose a codec and enable it at various stages in the data transformation workflow. In the next post, we will go through some additional techniques that can be used to ensure that users can make the most of the Hadoop Data Lake. For more Big Data and Hadoop tutorials and insight, visit our dedicated Hadoop page. About the author Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.

0
0
6671

article-image-part-1-managing-multiple-apps-and-environments-capistrano-3-and-chef-solo

Rodrigo Rosenfeld

30 Jun 2014

8 min read

Part 1: Managing Multiple Apps and Environments with Capistrano 3 and Chef Solo

Rodrigo Rosenfeld

30 Jun 2014

8 min read

In my previous two posts, I explored how to use Capistrano to deploy multiple applications in different environments and servers. This, however, is only one part of our deployment procedures. It just takes care of the applications themselves, but we still rely on the server being properly set up so that our Capistrano recipes work. In these two posts I'll explain how to use Chef to manage servers, and how to integrate it with Capistrano and perform all of your deployment procedures from a single project. Introducing the sample deployment project After I wrote the previous two posts, I realized I was not fully happy with a few issues of our company's deployment strategy: Duplicate settings: This was the main issue that was puzzling me. I didn't like the fact that we had to duplicate some settings like the application's binding port in both Chef and Capistrano projects. Too many required files (45 to support 3 servers, 5 environments, and 3 applications): While the files were really small, I felt that this situation could be further improved by the use of some conventions. So, I decided to work in a proof-of-concept project that would integrate both Chef and Capistrano and fix these issues. After a weekend working (almost) full time on it, I came up with a sample project so that you can fork it and adapt it to your deployment scenario. The main goal of this project hasn't changed from my previous article. We want to be able to support new environments and servers very quickly by simply adding some settings to the project. Go ahead and clone it. Follow the instructions on the README and it should deploy the Rails Devise sample application into a VirtualBox Virtual Machine (VM) using Vagrant. The following sections will explain how it works and the reasons behind its design. The overall idea While it's possible to accomplish all of your deployment tasks with either Chef or Capistrano alone, I feel that they are more suitable for different tasks. There are many existing recipes that you can take advantage of for both projects, but they usually don't overlap much. There are Chef community cookbooks available to help you install nginx, apache2, java, databases, and much more. You probably want to use Chef to perform administrative tasks like managing services, server backup, installing software, and so on. Capistrano, on the other hand, will help you by deploying the applications itself after the server is ready to go, and after running your Chef recipes. This includes creating releases of your application, which allows you to easily rollback to a previous working version, for example. You'll find existing Capistrano recipes to help you with several application-related tasks like running Bundler, switching between Ruby versions with either rbenv, rvm or chruby, running Rails migrations and assets precompilation, and so on. Capistrano recipes are well integrated with the Capistrano deploy flow. For instance, the capistrano-puma recipe will automatically generate a settings file if it is missing and start puma after the remaining deployment tasks have finished by including this in its recipes: after 'deploy:check', 'puma:check' after 'deploy:finished', 'puma:smart_restart' Another difference between sysadmin and deployment tasks is that usually the former will require superuser privileges while the latter is recommended to be accomplished by a regular user. This way, you can feel safer when deploying Capistrano recipes, since you know it won't affect the server itself, except for the applications managed by that user account. And deploying an application is way more common than installing and configuring programs or changing the proxy's settings. Some of the settings required by Chef and Capistrano recipes overlap. One example is a Chef recipe that generates an nginx settings file that will proxy requests to a Rails application listening on a local port. In this scenario, the binding address used by the Capistrano puma recipe needs to coincide with the port declared in the proxy settings for the nginx configuration file. Managing deployment settings Capistrano and Chef provide different built-in ways of managing their settings. Capistrano will use a Domain Specific Language (DSL) like set/fetch, while Chef will read the attributes following a well described precedence. I strongly advise you to keep with those approaches for settings that are specific for each project. To enable you to remove any duplication by overlapping deployment settings, I introduced another configuration declaration framework for the shared settings using the configatron gem, by taking advantage of the fact that both Chef and Capistrano are written in Ruby. Take a look at the settings directory in the sample project: settings/ ├── applications │ └── rails-devise.rb ├── common.rb ├── environments │ ├── development.rb │ └── production.rb └── servers └── vagrant.rb The settings are split in common, along with those specific for each application, environment, and servers. As you would expect, the Rails Devise application deployed to the production environment in the vagrant server will read the settings from common.rb, servers/vagrant.rb, environments/production.rb, and applications/rails-devise.rb. If some of your settings apply to the Rails Devise running on a given server or environment (or both), it's possible to override the specific settings in other files like rails-devise_production.rb, vagrant_production.rb, or vagrant_production_rails-devise.rb. Here's the definition of load_app_settings in common_helpers/settings_loader.rb: def load_app_settings(app_name, app_server, app_env) cfg.app_name = app_name cfg.app_server = app_server cfg.app_env = app_env [ 'common', "servers/#{app_server}", "environments/#{app_env}", "applications/#{app_name}", "#{app_server}_#{app_env}", "#{app_server}_#{app_name}", "#{app_name}_#{app_env}", "#{app_server}_#{app_env}_#{app_name}", ].each{|s| load_settings s } cfg.lock! end Feel free to change the load path order. The latest settings take precedence over the first ones. So if the binding port is usually 3000 for production but 4000 for your ec2 server, you can add a cfg.my_app.binding_port = 3000 to environments/production.rb and override it on ec2_production.rb. Once those settings are loaded, they are locked and can't be changed by the deployment recipes. As a final note, the settings can also be set using a hash notation, which can be useful if you’re using a dynamic setting attribute. Here’s an example: cfg[:my_app][“binding_#{‘port’}”] = 3000. This is not really useful in this case, but it illustrates the setting capabilities. Calculated settings Two types of calculated settings are supported on this project: delayed and dynamic. Delayed are lazily evaluated the first time they are requested, while dynamic are always evaluated. They are useful for providing default values for some settings that could be overridden by other settings files. I prefer to use delayed attributes for those that are meant to be overridden and dynamic ones for those that are meant to be calculated, even though delayed ones would be suitable for both cases. Here's the common.rb from the sample project to illustrate the idea: require 'set' cfg.chef_runlist = Set.new cfg.deploy_user = 'deploy' cfg.deployment_repo_url = 'git@github.com:rosenfeld/capistrano-chef-deployment.git' cfg.deployment_repo_host = 'github.com' cfg.deployment_repo_symlink = false cfg.nginx.default = false # Delayed attributes: they are set to the block values unless explicitly set to other value cfg.database_name = delayed_attr{ "app1_#{cfg.app_env}" } cfg.nginx.subdomain = delayed_attr{ cfg.app_env } # Dynamic/calculated attributes: those are always evaluated by the block # Those attributes are not meant to be overrideable cfg.nginx.host = dyn_attr{ "#{cfg.nginx.subdomain}.mydomain.com" } cfg.nginx.host in this instance is not meant to be overridden by any other settings file and follows the company's policy. But it would be okay to override the production database name to app1 instead of using the default app1_production. This is just a guideline, but it should give you a good idea of some ways that Chef and Capistrano can be used together. Conclusion I hope you found this post as useful as I did. Being able to fully deploy the whole application stack from a single repository saves us a lot of time and simplifies our deployment a lot, and in the next post, Part 2, I will walk you through that deployment. About The Author Rodrigo Rosenfeld Rosas lives in Vitória-ES, Brazil, with his lovely wife and daughter. He graduated in Electrical Engineering with a Master’s degree in Robotics and Real-time Systems. For the past 5 years Rodrigo has focused on building and maintaining single page web applications. He is the author of some gems including active_record_migrations, rails-web-console, the JS specs runner oojspec, sequel-devise, and the Linux X11 utility ktrayshortcut. Rodrigo was hired by e-Core (Porto Alegre - RS, Brazil) to work from home, building and maintaining software for Matterhorn Transactions Inc. with a team of great developers. Matterhorn's main product, the Market Tracker, is used by LexisNexis clients.

0
0
3034

article-image-mapreduce-openstack-swift-and-zerovm

Lars Butler

30 Jun 2014

6 min read

MapReduce with OpenStack Swift and ZeroVM

Lars Butler

30 Jun 2014

6 min read

Originally coined in a 2004 research paper produced by Google, the term “MapReduce” was defined as a “programming model and an associated implementation for processing and generating large datasets”. While Google’s proprietary implementation of this model is known simply as “MapReduce”, the term has since become overloaded to refer to any software that follows the same general computation model. The philosophy behind the MapReduce computation model is based on a divide and conquer approach: the input dataset is divided into small pieces and distributed among a pool of computation nodes for processing. The advantage here is that many nodes can run in parallel to solve a problem. This can be much quicker than a single machine, provided that the cost of communication (loading input data, relaying intermediate results between compute nodes, and writing final results to persistent storage) does not outweigh the computation time. Indeed, the cost of moving data between storage and computation nodes can diminish the advantage of distributed parallel computation if task payloads are too granular. On the other hand, they must be granular enough in order to be distributed evenly (although achieving perfect distribution is nearly impossible if exact computational complexity of each task cannot be known in advance). A distributed MapReduce solution can be effective for processing large data sets, provided that each task is sufficiently long-running to offset communication costs. If the need for data transfer (between compute and storage nodes) could somehow be reduced or eliminated, the task size limitations would all but disappear, enabling a wider range of use cases. One way to achieve this is to perform the computation closer to the data, that is, on the same physical machine. Stored procedures in relational database systems, for example, are one way to achieve data-local computation. Simpler yet, computation could be run directly on the system where the data is stored in files. In both cases, the same problems are present: changes to the code require direct access to the system, and thus access needs to be carefully controlled (through group management, read/write permissions, and so on). Scaling in these scenarios is also problematic: vertical scaling (beefing up a single server) is the only feasible way to gain performance while maintaining the efficiency of data-local computation. Achieving horizontal scalability by adding more machines is not feasible unless the storage system inherently supports this. Enter Swift and ZeroVM OpenStack Swift is one such example of a horizontally scalable data store. Swift can store petabytes of data in millions of objects across thousands of machines. Swift’s storage operations can also be extended by writing custom middleware applications, which makes it a good foundation for building a “smart” storage system capable of running computations directly on the data. ZeroCloud is a middleware application for Swift which does just that. It embeds ZeroVM inside Swift and exposes a new API method for executing jobs. ZeroVM provides sufficient program isolation and a guarantee of security such that any arbitrary code can be run safely without compromising the storage system. Access to data is governed by Swift’s inherent access control, so there is no need to further lock down the storage system, even in a multi-tenant environment. The combination of Swift and ZeroVM results in something like stored procedures, but much more powerful. First and foremost, user application code can be updated at any time without the need to worry about malicious or otherwise destructive side effects. (The worst thing that can happen is that a user can crash their own machines and delete their own data—but not another user’s.) Second, ZeroVM instances can start very quickly: in about five milliseconds. This means that to process one thousand objects, ZeroCloud can instantaneously spawn one thousand ZeroVM instances (one for each file), run the job, and destroy the instances very quickly. This also means that any job, big or small, can feasibly run on the system. Finally, the networking component of ZeroVM allows intercommunication between instances, enabling the creation of arbitrary job pipelines and multi-stage computations. This converged compute and storage solution makes for a good alternative to popular MapReduce frameworks like Hadoop because little effort is required to set up and tear down computation nodes for a job. Also, once a job is complete, there is no need to extract result data, pipe it across the network, and then save it in a separate persistent data store (which can be an expensive operation if there is a large quantity of data to save). With ZeroCloud, the results can simply be saved back into Swift. The same principle applies to the setup of a job. There is no need to move or copy data to the computation cluster; the data is already where it needs to be. Limitations Currently, ZeroCloud uses a fairly naïve scheduling algorithm. To perform a data-local computation on an object, ZeroCloud determines which machines in the cluster contain a replicated copy of an object and then randomly chooses one to execute a ZeroVM process. While this functions well as a proof-of-concept for data-local computing, it is quite possible for jobs to be spread unevenly across a cluster, resulting in an inefficient use of resources. A research group at the University of Texas at San Antonio (UTSA) sponsored by Rackspace is currently working on developing better algorithms for scheduling workloads running on ZeroVM-based infrastructure. A short video of the research proposal can be found here:http://link.brightcove.com/services/player/bcpid3530100726001?bckey=AQ~~,AAACa24Pu2k~,Q186VLPcl3-oLBDP8npyqxjCNB5jgYcT. Further reading Rackspace has launched a ZeroCloud playground service called Zebra for developers to try out the platform. At the time this was written, Zebra is still in a private beta and invitations are limited. But it also possible to install and run your own copy of ZeroCloud for testing and development; basic installation instructions are here: https://github.com/zerovm/zerocloud/blob/icehouse/doc/Hacking.md. There are also some tutorials (including sample applications) for creating, packaging, deploying, and executing applications on ZeroCloud: http://zerovm.readthedocs.org/en/latest/zebra/tutorial.html. The tutorials are intended for use with the Zebra service, but can be run on any deployment of ZeroCloud. Big Data in the cloud? It's not the future, it's already here! Find more Hadoop tutorials and extra content here, and dive deeper into OpenStack by visiting this page, dedicated to one of the most exciting cloud platforms around today. About the author Lars Butler is a software developer at Rackspace, the open cloud company. He has worked as a software developer in avionics, seismic hazard research, and most recently, on ZeroVM. He can be reached @larsbutler.

0
0
2966

article-image-component-communication-reactjs

Richard Feldman

30 Jun 2014

5 min read

Component Communication in React.js

Richard Feldman

30 Jun 2014

5 min read

You can get a long way in React.js solely by having parent components create child components with varying props, and having each component deal only with its own state. But what happens when a child wants to affect its parent’s state or props? Or when a child wants to inspect that parent’s state or props? Or when a parent wants to inspect its child’s state? With the right techniques, you can handle communication between React components without introducing unnecessary coupling. Child Elements Altering Parents Suppose you have a list of buttons, and when you click one, a label elsewhere on the page updates to reflect which button was most recently clicked. Although any button’s click handler can alter that button’s state, the handler has no intrinsic knowledge of the label that we need to update. So how can we give it access to do what we need? The idiomatic approach is to pass a function through props. Like so: var ExampleParent = React.createClass({ getInitialState: function() { return {lastLabelClicked: "none"} }, render: function() { var me = this; var setLastLabel = function(label) { me.setState({lastLabelClicked: label}); }; return <div> <p>Last clicked: {this.state.lastLabelClicked}</p> <LabeledButton label="Alpha Button" setLastLabel={setLastLabel}/> <LabeledButton label="Beta Button" setLastLabel={setLastLabel}/> <LabeledButton label="Delta Button" setLastLabel={setLastLabel}/> </div>; } }); var LabeledButton = React.createClass({ handleClick: function() { this.props.setLastLabel(this.props.label); }, render: function() { return <button onClick={this.handleClick}>{this.props.label}</button>; } }); Note that this does not actually affect the label’s state directly; rather, it affects the parent component’s state, and doing so will cause the parent to re-render the label as appropriate. What if we wanted to avoid using state here, and instead modify the parent’s props? Since props are externally specified, this would be a lot of extra work. Rather than telling the parent to change, the child would necessarily have to tell its parent’s parent—its grandparent, in other words—to change that grandparent’s child. This is not a route worth pursuing; besides being less idiomatic, there is no real benefit to changing the parent’s props when you could change its state instead. Inspecting Props Once created, the only way for a child’s props to “change” is for the child to be recreated when the parent’s render method is called again. This helpfully guarantees that the parent’s render method has all the information needed to determine the child’s props—not only in the present, but for the indefinite future as well. Thus if another of the parent’s methods needs to know the child’s props, like for example a click handler, it’s simply a matter of making sure that data is available outside the parent’s render method. An easy way to do this is to record it in the parent’s state: var ExampleComponent = React.createClass({ handleClick: function() { var buttonStatus = this.state.buttonStatus; // ...do something based on buttonStatus }, render: function() { // Pretend it took some effort to determine this value var buttonStatus = "btn-disabled"; this.setState({buttonStatus: buttonStatus}); return <button className={buttonStatus} onClick={this.handleClick}> Click this button! </button>; } }); It’s even easier to let a child know about its parent’s props: simply have the parent pass along whatever information is necessary when it creates the child. It’s cleaner to pass along only what the child needs to know, but if all else fails you can go as far as to pass in the parent’s entire set of props: var ParentComponent = React.createClass({ render: function() { return <ChildComponent parentProps={this.props} />; } }); Inspecting State State is trickier to inspect, because it can change on the fly. But is it ever strictly necessary for components to inspect each other’s states, or might there be a universal workaround? Suppose you have a child whose click handler cares about its parent’s state. Is there any way we could refactor things such that the child could always know that value, without having to ask the parent directly? Absolutely! Simply have the parent pass the current value of its state to the child as a prop. Whenever the parent’s state changes, it will re-run its render method, so the child (including its click handler) will automatically be recreated with the new prop. Now the child’s click handler will always have an up-to-date knowledge of the parent’s state, just as we wanted. Suppose instead that we have a parent that cares about its child’s state. As we saw earlier with the buttons-and-labels example, children can affect their parent’s states, so we can use that technique again here to refactor our way into a solution. Simply include in the child’s props a function that updates the parent’s state, and have the child incorporate that function into its relevant state changes. With the child thus keeping the parent’s state up to speed on relevant changes to the child’s state, the parent can obtain whatever information it needed simply by inspecting its own state. Takeaways Idiomatic communication between parent and child components can be easily accomplished by passing state-altering functions through props. When it comes to inspecting props and state, a combination of passing props on a need-to-know basis and refactoring state changes can ensure the relevant parties have all the information they need, whenever they need it. About the Author Richard Feldman is a functional programmer who specializes in pushing the limits of browser-based UIs. He’s built a framework that performantly renders hundreds of thousands of shapes in HTML5 canvas, a writing web app that functions like a desktop app in the absence of an Internet connection, and much more in between.

0
0
6309

How-To Tutorials

article-image-sparrow-ios-game-framework-basics-our-game

Packt

25 Jun 2014

10 min read

Sparrow iOS Game Framework - The Basics of Our Game

Packt

25 Jun 2014

10 min read

(For more resources related to this topic, see here.) Taking care of cross-device compatibility When developing an iOS game, we need to know which device to target. Besides the obvious technical differences between all of the iOS devices, there are two factors we need to actively take care of: screen size and texture size limit. Let's take a closer look at how to deal with the texture size limit and screen sizes. Understanding the texture size limit Every graphics card has a limit for the maximum size texture it can display. If a texture is bigger than the texture size limit, it can't be loaded and will appear black on the screen. A texture size limit has power-of-two dimensions and is a square such as 1024 pixels in width and in height or 2048 x 2048 pixels. When loading a texture, they don't need to have power-of-two dimensions. In fact, the texture does not have to be a square. However, it is a best practice for a texture to have power-of-two dimensions. This limit holds for big images as well as a bunch of small images packed into a big image. The latter is commonly referred to as a sprite sheet. Take a look at the following sample sprite sheet to see how it's structured: How to deal with different screen sizes While the screen size is always measured in pixels, the iOS coordinate system is measured in points. The screen size of an iPhone 3GS is 320 x 480 pixels and also 320 x 480 points. On an iPhone 4, the screen size is 640 x 960 pixels, but is still 320 by 480 points. So, in this case, each point represents four pixels: two in width and two in height. A 100-point wide rectangle will be 200 pixels wide on an iPhone 4 and 100 pixels on an iPhone 3GS. It works similarly for the devices with large display screens, such as the iPhone 5. Instead of 480 points, it's 568 points. Scaling the viewport Let's explain the term viewport first: the viewport is the visible portion of the complete screen area. We need to be clear about which devices we want our game to run on. We take the biggest resolution that we want to support and scale it down to a smaller resolution. This is the easiest option, but it might not lead to the best results; touch areas and the user interface scale down as well. Apple recommends for touch areas to be at least a 40-point square; so, depending on the user interface, some elements might get scaled down so much that they get harder to touch. Take a look at the following screenshot, where we choose the iPad Retina resolution (2048 x 1536 pixels) as our biggest resolution and scale down all display objects on the screen for the iPad resolution (1024 x 768 pixels): Scaling is a popular option for non-iOS environments, especially for PC and Mac games that support resolutions from 1024 x 600 pixels to full HD. Sparrow and the iOS SDK provide some mechanisms that will facilitate handling Retina and non-Retina iPad devices without the need to scale the whole viewport. Black borders Some games in the past have been designed for a 4:3 resolution display but then made to run on a widescreen device that had more screen space. So, the option was to either scale a 4:3 resolution to widescreen, which will distort the whole screen, or put some black borders on either side of the screen to maintain the original scale factor. Showing black borders is something that is now considered as bad practice, especially when there are so many games out there which scale quite well across different screen sizes and platforms. Showing non-interactive screen space If our pirate game is a multiplayer, we may have a player on an iPad and another on an iPhone 5. So, the player with the iPad has a bigger screen and more screen space to maneuver their ship. The worst case will be if the player with the iPad is able to move their ship outside the visual range for the iPhone player to see, which will result in a serious advantage for the iPad player. Luckily for us, we don't require competitive multiplayer functionality. Still, we need to keep a consistent screen space for players to move their ship in for game balance purposes. We wouldn't want to tie the difficulty level to the device someone is playing on. Let's compare the previous screenshot to the black border example. Instead of the ugly black borders, we just show more of the background. In some cases, it's also possible to move some user interface elements to the areas which are not visible on other devices. However, we will need to consider whether we want to keep the same user experience across devices and whether moving these elements will result in a disadvantage for users who don't have this extra screen space on their devices. Rearranging screen elements Rearranging screen elements is probably the most time-intensive and sophisticated way of solving this issue. In this example, we have a big user interface at the top of the screen in the portrait mode. Now, if we were to leave it like this in the landscape mode, the top of the screen will be just the user interface, leaving very little room for the game itself. In this case, we have to be deliberate about what kind of elements we need to see on the screen and which elements are using up too much screen estate. Screen real estate (or screen estate) is the amount of space available on a display for an application or a game to provide output. We will then have to reposition them, cut them up in to smaller pieces, or both. The most prominent example of this technique is Candy Crush (a popular trending game) by King. While this concept applies particularly to device rotation, this does not mean that it can't be used for universal applications. Choosing the best option None of these options are mutually exclusive. For our purposes, we are going to show non-interactive screen space, and if things get complicated, we might also resort to rearranging screen elements depending on our needs. Differences between various devices Let's take a look at the differences in the screen size and the texture size limit between the different iOS devices: Device Screen size (in pixels) Texture size limit (in pixels) iPhone 3GS 480 x 360 2048 x 2048 iPhone 4 (including iPhone 4S) and iPod Touch 4th generation 960 x 640 2048 x 2048 iPhone 5 (including iPhone 5C and iPhone 5S) and iPod Touch 5th generation 1136 x 640 2048 x 2048 iPad 2 1024 x 768 2048 x 2048 iPad (3rd and 4th generations) and iPad Air 2048 x 1536 4096 x 4096 iPad Mini 1024 x 768 4096 x 4096 Utilizing the iOS SDK Both the iOS SDK and Sparrow can aid us in creating a universal application. Universal application is the term for apps that target more than one device, especially for an app that targets the iPhone and iPad device family. The iOS SDK provides a handy mechanism for loading files for specific devices. Let's say we are developing an iPhone application and we have an image that's called my_amazing_image.png. If we load this image on our devices, it will get loaded—no questions asked. However, if it's not a universal application, we can only scale the application using the regular scale button on iPad and iPhone Retina devices. This button appears on the bottom-right of the screen. If we want to target iPad, we have two options: The first option is to load the image as is. The device will scale the image. Depending on the image quality, the scaled image may look bad. In this case, we also need to consider that the device's CPU will do all the scaling work, which might result in some slowdown depending on the app's complexity. The second option is to add an extra image for iPad devices. This one will use the ~ipad suffix, for example, my_amazing_image~ipad.png. When loading the required image, we will still use the filename my_amazing_image.png. The iOS SDK will automatically detect the different sizes of the image supplied and use the correct size for the device. Beginning with Xcode 5 and iOS 7, it is possible to use asset catalogs. Asset catalogs can contain a variety of images grouped into image sets. Image sets contain all the images for the targeted devices. These asset catalogs don't require files with suffixes any more. These can only be used for splash images and application icons. We can't use asset catalogs for textures we load with Sparrow though. The following table shows which suffix is needed for which device: Device Retina File suffix iPhone 3GS No None iPhone 4 (including iPhone 4S) and iPod Touch (4th generation) Yes @2x @2x~iphone iPhone 5 (including iPhone 5C and iPhone 5S) and iPod Touch (5th generation) Yes -568h@2x iPad 2 No ~ipad iPad (3rd and 4th generations) and iPad Air Yes @2x~ipad iPad Mini No ~ipad How does this affect the graphics we wish to display? The non-Retina image will be 128 pixels in width and 128 pixels in height. The Retina image, the one with the @2x suffix, will be exactly double the size of the non-Retina image, that is, 256 pixels in width and 256 pixels in height. Retina and iPad support in Sparrow Sparrow supports all the filename suffixes shown in the previous table, and there is a special case for iPad devices, which we will take a closer look at now. When we take a look at AppDelegate.m in our game's source, note the following line: [_viewController startWithRoot:[Game class] supportHighResolutions:YES doubleOnPad:YES]; The first parameter, supportHighResolutions, tells the application to load Retina images (with the @2x suffix) if they are available. The doubleOnPad parameter is the interesting one. If this is set to true, Sparrow will use the @2x images for iPad devices. So, we don't need to create a separate set of images for iPad, but we can use the Retina iPhone images for the iPad application. In this case, the width and height are 512 and 384 points respectively. If we are targeting iPad Retina devices, Sparrow introduces the @4x suffix, which requires larger images and leaves the coordinate system at 512 x 384 points. App icons and splash images If we are talking about images of different sizes for the actual game content, app icons and splash images are also required to be in different sizes. Splash images (also referred to as launch images) are the images that show up while the application loads. The iOS naming scheme applies for these images as well, so for Retina iPhone devices such as iPhone 4, we will name an image as Default@2x.png, and for iPhone 5 devices, we will name an image as Default-568h@2x.png. For the correct size of app icons, take a look at the following table: Device Retina App icon size iPhone 3GS No 57 x 57 pixels iPhone 4 (including iPhone 4S) and iPod Touch 4th generation Yes 120 x 120 pixels iPhone 5 (including iPhone 5C and iPhone 5S) and iPod Touch 5th generation Yes 120 x 120 pixels iPad 2 No 76 x 76 pixels iPad (3rd and 4th generation) and iPad Air Yes 152 x 152 pixels iPad Mini No 76 x 76 pixels The bottom line The more devices we want to support, the more graphics we need, which directly increases the application file size, of course. Adding iPad support to our application is not a simple task, but Sparrow does some groundwork. One thing we should keep in mind though: if we are only targeting iOS 7.0 and higher, we don't need to include non-Retina iPhone images any more. Using @2x and @4x will be enough in this case, as support for non-Retina devices will soon end. Summary This article deals with setting up our game to work on iPhone, iPod Touch, and iPad in the same manner. Resources for Article: Further resources on this subject: Mobile Game Design [article] Bootstrap 3.0 is Mobile First [article] New iPad Features in iOS 6 [article]

0
0
5386

Packt

25 Jun 2014

10 min read

The Hunt for Data

Packt

25 Jun 2014

10 min read

(For more resources related to this topic, see here.) Examining a JSON file with the aeson package JavaScript Object Notation (JSON) is a way to represent key-value pairs in plain text. The format is described extensively in RFC 4627 (http://www.ietf.org/rfc/rfc4627). In this recipe, we will parse a JSON description about a person. We often encounter JSON in APIs from web applications. Getting ready Install the aeson library from hackage using Cabal. Prepare an input.json file representing data about a mathematician, such as the one in the following code snippet: $ cat input.json {"name":"Gauss", "nationality":"German", "born":1777, "died":1855} We will be parsing this JSON and representing it as a usable data type in Haskell. How to do it... Use the OverloadedStrings language extension to represent strings as ByteString, as shown in the following line of code: {-# LANGUAGE OverloadedStrings #-} Import aeson as well as some helper functions as follows: import Data.Aeson import Control.Applicative import qualified Data.ByteString.Lazy as B Create the data type corresponding to the JSON structure, as shown in the following code: data Mathematician = Mathematician { name :: String , nationality :: String , born :: Int , died :: Maybe Int } Provide an instance for the parseJSON function, as shown in the following code snippet: instance FromJSON Mathematician where parseJSON (Object v) = Mathematician <$> (v .: "name") <*> (v .: "nationality") <*> (v .: "born") <*> (v .:? "died") Define and implement main as follows: main :: IO () main = do Read the input and decode the JSON, as shown in the following code snippet: input <- B.readFile "input.json" let mm = decode input :: Maybe Mathematician case mm of Nothing -> print "error parsing JSON" Just m -> (putStrLn.greet) m Now we will do something interesting with the data as follows: greet m = (show.name) m ++ " was born in the year " ++ (show.born) m We can run the code to see the following output: $ runhaskell Main.hs "Gauss" was born in the year 1777 How it works... Aeson takes care of the complications in representing JSON. It creates native usable data out of a structured text. In this recipe, we use the .: and .:? functions provided by the Data.Aeson module. As the Aeson package uses ByteStrings instead of Strings, it is very helpful to tell the compiler that characters between quotation marks should be treated as the proper data type. This is done in the first line of the code which invokes the OverloadedStrings language extension. We use the decode function provided by Aeson to transform a string into a data type. It has the type FromJSON a => B.ByteString -> Maybe a. Our Mathematician data type must implement an instance of the FromJSON typeclass to properly use this function. Fortunately, the only required function for implementing FromJSON is parseJSON. The syntax used in this recipe for implementing parseJSON is a little strange, but this is because we're leveraging applicative functions and lenses, which are more advanced Haskell topics. The .: function has two arguments, Object and Text, and returns a Parser a data type. As per the documentation, it retrieves the value associated with the given key of an object. This function is used if the key and the value exist in the JSON document. The :? function also retrieves the associated value from the given key of an object, but the existence of the key and value are not mandatory. So, we use .:? for optional key value pairs in a JSON document. There's more… If the implementation of the FromJSON typeclass is too involved, we can easily let GHC automatically fill it out using the DeriveGeneric language extension. The following is a simpler rewrite of the code: {-# LANGUAGE OverloadedStrings #-} {-# LANGUAGE DeriveGeneric #-} import Data.Aeson import qualified Data.ByteString.Lazy as B import GHC.Generics data Mathematician = Mathematician { name :: String , nationality :: String , born :: Int , died :: Maybe Int } deriving Generic instance FromJSON Mathematician main = do input <- B.readFile "input.json" let mm = decode input :: Maybe Mathematician case mm of Nothing -> print "error parsing JSON" Just m -> (putStrLn.greet) m greet m = (show.name) m ++" was born in the year "++ (show.born) m Although Aeson is powerful and generalizable, it may be an overkill for some simple JSON interactions. Alternatively, if we wish to use a very minimal JSON parser and printer, we can use Yocto, which can be downloaded from http://hackage.haskell.org/package/yocto. Reading an XML file using the HXT package Extensible Markup Language (XML) is an encoding of plain text to provide machine-readable annotations on a document. The standard is specified by W3C (http://www.w3.org/TR/2008/REC-xml-20081126/). In this recipe, we will parse an XML document representing an e-mail conversation and extract all the dates. Getting ready We will first set up an XML file called input.xml with the following values, representing an e-mail thread between Databender and Princess on December 18, 2014 as follows: $ cat input.xml <thread> <email> <to>Databender</to> <from>Princess</from> <date>Thu Dec 18 15:03:23 EST 2014</date> <subject>Joke</subject> <body>Why did you divide sin by tan?</body> </email> <email> <to>Princess</to> <from>Databender</from> <date>Fri Dec 19 3:12:00 EST 2014</date> <subject>RE: Joke</subject> <body>Just cos.</body> </email> </thread> Using Cabal, install the HXT library which we use for manipulating XML documents: $ cabal install hxt How to do it... We only need one import, which will be for parsing XML, using the following line of code: import Text.XML.HXT.Core Define and implement main and specify the XML location. For this recipe, the file is retrieved from input.xml. Refer to the following code: main :: IO () main = do input <- readFile "input.xml" Apply the readString function to the input and extract all the date documents. We filter items with a specific name using the hasName :: String -> a XmlTree XmlTree function. Also, we extract the text using the getText :: a XmlTree String function, as shown in the following code snippet: dates <- runX $ readString [withValidate no] input //> hasName "date" //> getText We can now use the list of extracted dates as follows: print dates By running the code, we print the following output: $ runhaskell Main.hs ["Thu Dec 18 15:03:23 EST 2014", "Fri Dec 19 3:12:00 EST 2014"] How it works... The library function, runX, takes in an Arrow. Think of an Arrow as a more powerful version of a Monad. Arrows allow for stateful global XML processing. Specifically, the runX function in this recipe takes in IOSArrow XmlTree String and returns an IO action of the String type. We generate this IOSArrow object using the readString function, which performs a series of operations to the XML data. For a deep insight into the XML document, //> should be used whereas /> only looks at the current level. We use the //> function to look up the date attributes and display all the associated text. As defined in the documentation, the hasName function tests whether a node has a specific name, and the getText function selects the text of a text node. Some other functions include the following: isText: This is used to test for text nodes isAttr: This is used to test for an attribute tree hasAttr: This is used to test whether an element node has an attribute node with a specific name getElemName: This is used to select the name of an element node All the Arrow functions can be found on the Text.XML.HXT.Arrow.XmlArrow documentation at http://hackage.haskell.org/package/hxt/docs/Text-XML-HXT-Arrow-XmlArrow.html. Capturing table rows from an HTML page Mining Hypertext Markup Language (HTML) is often a feat of identifying and parsing only its structured segments. Not all text in an HTML file may be useful, so we find ourselves only focusing on a specific subset. For instance, HTML tables and lists provide a strong and commonly used structure to extract data whereas a paragraph in an article may be too unstructured and complicated to process. In this recipe, we will find a table on a web page and gather all rows to be used in the program. Getting ready We will be extracting the values from an HTML table, so start by creating an input.html file containing a table as shown in the following figure: The HTML behind this table is as follows: $ cat input.html <!DOCTYPE html> <html> <body> <h1>Course Listing</h1> <table> <tr> <th>Course</th> <th>Time</th> <th>Capacity</th> </tr> <tr> <td>CS 1501</td> <td>17:00</td> <td>60</td> </tr> <tr> <td>MATH 7600</td> <td>14:00</td> <td>25</td> </tr> <tr> <td>PHIL 1000</td> <td>9:30</td> <td>120</td> </tr> </table> </body> </html> If not already installed, use Cabal to set up the HXT library and the split library, as shown in the following command lines: $ cabal install hxt $ cabal install split How to do it... We will need the htx package for XML manipulations and the chunksOf function from the split package, as presented in the following code snippet: import Text.XML.HXT.Core import Data.List.Split (chunksOf) Define and implement main to read the input.html file. main :: IO () main = do input <- readFile "input.html" Feed the HTML data into readString, thereby setting withParseHTML to yes and optionally turning off warnings. Extract all the td tags and obtain the remaining text, as shown in the following code: texts <- runX $ readString [withParseHTML yes, withWarnings no] input //> hasName "td" //> getText The data is now usable as a list of strings. It can be converted into a list of lists similar to how CSV was presented in the previous CSV recipe, as shown in the following code: let rows = chunksOf 3 texts print $ findBiggest rows By folding through the data, identify the course with the largest capacity using the following code snippet: findBiggest :: [[String]] -> [String] findBiggest [] = [] findBiggest items = foldl1 (a x -> if capacity x > capacity a then x else a) items capacity [a,b,c] = toInt c capacity _ = -1 toInt :: String -> Int toInt = read Running the code will display the class with the largest capacity as follows: $ runhaskell Main.hs {"PHIL 1000", "9:30", "120"} How it works... This is very similar to XML parsing, except we adjust the options of readString to [withParseHTML yes, withWarnings no].

0
0
1848

Packt

25 Jun 2014

10 min read

Introduction to MapReduce

Packt

25 Jun 2014

10 min read

(For more resources related to this topic, see here.) The Hadoop platform Hadoop can be used for a lot of things. However, when you break it down to its core parts, the primary features of Hadoop are Hadoop Distributed File System (HDFS) and MapReduce. HDFS stores read-only files by splitting them into large blocks and distributing and replicating them across a Hadoop cluster. Two services are involved with the filesystem. The first service, the NameNode acts as a master and keeps the directory tree of all file blocks that exist in the filesystem and tracks where the file data is kept across the cluster. The actual data of the files is stored in multiple DataNode nodes, the second service. MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm in a cluster. The most prominent trait of Hadoop is that it brings processing to the data; so, MapReduce executes tasks closest to the data as opposed to the data travelling to where the processing is performed. Two services are involved in a job execution. A job is submitted to the service JobTracker, which first discovers the location of the data. It then orchestrates the execution of the map and reduce tasks. The actual tasks are executed in multiple TaskTracker nodes. Hadoop handles infrastructure failures such as network issues, node, or disk failures automatically. Overall, it provides a framework for distributed storage within its distributed file system and execution of jobs. Moreover, it provides the service ZooKeeper to maintain configuration and distributed synchronization. Many projects surround Hadoop and complete the ecosystem of available Big Data processing tools such as utilities to import and export data, NoSQL databases, and event/real-time processing systems. The technologies that move Hadoop beyond batch processing focus on in-memory execution models. Overall multiple projects, from batch to hybrid and real-time execution exist. MapReduce Massive parallel processing of large datasets is a complex process. MapReduce simplifies this by providing a design pattern that instructs algorithms to be expressed in map and reduce phases. Map can be used to perform simple transformations on data, and reduce is used to group data together and perform aggregations. By chaining together a number of map and reduce phases, sophisticated algorithms can be achieved. The shared nothing architecture of MapReduce prohibits communication between map tasks of the same phase or reduces tasks of the same phase. Communication that's required happens at the end of each phase. The simplicity of this model allows Hadoop to translate each phase, depending on the amount of data that needs to be processed into tens or even hundreds of tasks being executed in parallel, thus achieving scalable performance. Internally, the map and reduce tasks follow a simplistic data representation. Everything is a key or a value. A map task receives key-value pairs and applies basic transformations emitting new key-value pairs. Data is then partitioned and different partitions are transmitted to different reduce tasks. A reduce task also receives key-value pairs, groups them based on the key, and applies basic transformation to those groups. A MapReduce example To illustrate how MapReduce works, let's look at an example of a log file of total size 1 GB with the following format: INFO MyApp - Entering application. WARNING com.foo.Bar - Timeout accessing DB - Retrying ERROR com.foo.Bar - Did it again! INFO MyApp - Exiting application Once this file is stored in HDFS, it is split into eight 128 MB blocks and distributed in multiple Hadoop nodes. In order to build a MapReduce job to count the amount of INFO, WARNING, and ERROR log lines in the file, we need to think in terms of map and reduce phases. In one map phase, we can read local blocks of the file and map each line to a key and a value. We can use the log level as the key and the number 1 as the value. After it is completed, data is partitioned based on the key and transmitted to the reduce tasks. MapReduce guarantees that the input to every reducer is sorted by key. Shuffle is the process of sorting and copying the output of the map tasks to the reducers to be used as input. By setting the value to 1 on the map phase, we can easily calculate the total in the reduce phase. Reducers receive input sorted by key, aggregate counters, and store results. In the following diagram, every green block represents an INFO message, every yellow block a WARNING message, and every red block an ERROR message: Implementing the preceding MapReduce algorithm in Java requires the following three classes: A Map class to map lines into <key,value> pairs; for example, <"INFO",1> A Reduce class to aggregate counters A Job configuration class to define input and output types for all <key,value> pairs and the input and output files MapReduce abstractions This simple MapReduce example requires more than 50 lines of Java code (mostly because of infrastructure and boilerplate code). In SQL, a similar implementation would just require the following: SELECT level, count(*) FROM table GROUP BY level Hive is a technology originating from Facebook that translates SQL commands, such as the preceding one, into sets of map and reduce phases. SQL offers convenient ubiquity, and it is known by almost everyone. However, SQL is declarative and expresses the logic of a computation without describing its control flow. So, there are use cases that will be unusual to implement in SQL, and some problems are too complex to be expressed in relational algebra. For example, SQL handles joins naturally, but it has no built-in mechanism for splitting data into streams and applying different operations to each substream. Pig is a technology originating from Yahoo that offers a relational data-flow language. It is procedural, supports splits, and provides useful operators for joining and grouping data. Code can be inserted anywhere in the data flow and is appealing because it is easy to read and learn. However, Pig is a purpose-built language; it excels at simple data flows, but it is inefficient for implementing non-trivial algorithms. In Pig, the same example can be implemented as follows: LogLine = load 'file.logs' as (level, message); LevelGroup = group LogLine by level; Result = foreach LevelGroup generate group, COUNT(LogLine); store Result into 'Results.txt'; Both Pig and Hive support extra functionality through loadable user-defined functions (UDF) implemented in Java classes. Cascading is implemented in Java and designed to be expressive and extensible. It is based on the design pattern of pipelines that many other technologies follow. The pipeline is inspired from the original chain of responsibility design pattern and allows ordered lists of actions to be executed. It provides a Java-based API for data-processing flows. Developers with functional programming backgrounds quickly introduced new domain specific languages that leverage its capabilities. Scalding, Cascalog, and PyCascading are popular implementations on top of Cascading, which are implemented in programming languages such as Scala, Clojure, and Python. Introducing Cascading Cascading is an abstraction that empowers us to write efficient MapReduce applications. The API provides a framework for developers who want to think in higher levels and follow Behavior Driven Development (BDD) and Test Driven Development (TDD) to provide more value and quality to the business. Cascading is a mature library that was released as an open source project in early 2008. It is a paradigm shift and introduces new notions that are easier to understand and work with. In Cascading, we define reusable pipes where operations on data are performed. Pipes connect with other pipes to create a pipeline. At each end of a pipeline, a tap is used. Two types of taps exist: source, where input data comes from and sink, where the data gets stored. In the preceding image, three pipes are connected to a pipeline, and two input sources and one output sink complete the flow. A complete pipeline is called a flow, and multiple flows bind together to form a cascade. In the following diagram, three flows form a cascade: The Cascading framework translates the pipes, flows, and cascades into sets of map and reduce phases. The flow and cascade planner ensure that no flow or cascade is executed until all its dependencies are satisfied. The preceding abstraction makes it easy to use a whiteboard to design and discuss data processing logic. We can now work on a productive higher level abstraction and build complex applications for ad targeting, logfile analysis, bioinformatics, machine learning, predictive analytics, web content mining, and for extract, transform and load (ETL) jobs. By abstracting from the complexity of key-value pairs and map and reduce phases of MapReduce, Cascading provides an API that so many other technologies are built on. What happens inside a pipe Inside a pipe, data flows in small containers called tuples. A tuple is like a fixed size ordered list of elements and is a base element in Cascading. Unlike an array or list, a tuple can hold objects with different types. Tuples stream within pipes. Each specific stream is associated with a schema. The schema evolves over time, as at one point in a pipe, a tuple of size one can receive an operation and transform into a tuple of size three. To illustrate this concept, we will use a JSON transformation job. Each line is originally stored in tuples of size one with a schema: 'jsonLine. An operation transforms these tuples into new tuples of size three: 'time, 'user, and 'action. Finally, we extract the epoch, and then the pipe contains tuples of size four: 'epoch, 'time, 'user, and 'action. Pipe assemblies Transformation of tuple streams occurs by applying one of the five types of operations, also called pipe assemblies: Each: To apply a function or a filter to each tuple GroupBy: To create a group of tuples by defining which element to use and to merge pipes that contain tuples with similar schemas Every: To perform aggregations (count, sum) and buffer operations to every group of tuples CoGroup: To apply SQL type joins, for example, Inner, Outer, Left, or Right joins SubAssembly: To chain multiple pipe assemblies into a pipe To implement the pipe for the logfile example with the INFO, WARNING, and ERROR levels, three assemblies are required: The Each assembly generates a tuple with two elements (level/message), the GroupBy assembly is used in the level, and then the Every assembly is applied to perform the count aggregation. We also need a source tap to read from a file and a sink tap to store the results in another file. Implementing this in Cascading requires 20 lines of code; in Scala/Scalding, the boilerplate is reduced to just the following: TextLine(inputFile) .mapTo('line->'level,'message) { line:String => tokenize(line) } .groupBy('level) { _.size } .write(Tsv(outputFile)) Cascading is the framework that provides the notions and abstractions of tuple streams and pipe assemblies. Scalding is a domain-specific language (DSL) that specializes in the particular domain of pipeline execution and further minimizes the amount of code that needs to be typed. Cascading extensions Cascading offers multiple extensions that can be used as taps to either read from or write data to, such as SQL, NoSQL, and several other distributed technologies that fit nicely with the MapReduce paradigm. A data processing application, for example, can use taps to collect data from a SQL database and some more from the Hadoop file system. Then, process the data, use a NoSQL database, and complete a machine learning stage. Finally, it can store some resulting data into another SQL database and update a mem-cache application. Summary This article explains the core technologies used in the distributed model of Hadoop Resources for Article: Further resources on this subject: Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate) [article] Understanding MapReduce [article] Advanced Hadoop MapReduce Administration [article]

0
0
3429

How-To Tutorials

article-image-various-subsystem-configurations

Packt

25 Jun 2014

8 min read

Various subsystem configurations

Packt

25 Jun 2014

8 min read

0
0
3235

How-To Tutorials

Packt

25 Jun 2014

5 min read

Enterprise Geodatabase

Packt

25 Jun 2014

5 min read

0
0
2723

article-image-exact-inference-using-graphical-models

Packt

25 Jun 2014

7 min read

Exact Inference Using Graphical Models

Packt

25 Jun 2014

7 min read

(For more resources related to this topic, see here.) Complexity of inference A graphical model can be used to answer both probability queries and MAP queries. The most straightforward way to use this model is to generate the joint distribution and sum out all the variables, except the ones we are interested in. However, we need to determine and specify the joint distribution where an exponential blowup happens. In worst-case scenarios, we need to determine the exact inference in NP-hard. By the word exact, we mean specifying the probability values with a certain precision (say, five digits after the decimals). Suppose we tone down our precision requirements (for example, only up to two digits after the decimals). Now, is the (approximate) inference task any easier? Unfortunately not—even approximate inference is NP-hard, that is, getting values is far better than random guessing (50 percent or a probability of 0.5), which takes exponential time. It might seem like inference is a hopeless task, but that is only in the worst case. In general cases, we can use exact inference to solve certain classes of real-world problems (such as Bayesian networks that have a small number of discrete random variables). Of course, for larger problems, we have to resort to approximate inference. Real-world issues Since inference is a task that is NP-hard, inference engines are written in languages that are as close to bare metal as possible; usually in C or C++. Use Python implementations of inference algorithms. Complete and mature packages for these are uncommon. Use inference engines that have a Python interface, such as Stan (mc-stan.org). This choice serves a good balance between running the Python code and a fast inference implementation. Use inference engines that do not have a Python interface, which is true for majority of the inference engines out there. A fairly comprehensive list can be found at http://en.wikipedia.org/wiki/Bayesian_network#Software. The use of Python here is limited to creating a file that describes the model in a format that the inference engine can consume. In the article on inference, we will stick to the first two choices in the list. We will use native Python implementations (of inference algorithms) to peek into the interiors of the inference algorithms while running toy-sized problems, and then use an external inference engine with Python interfaces to try out a more real-world problem. The tree algorithm We will now look at another class of exact inference algorithms based on message passing. Message passing is a general mechanism, and there exist many variations of message passing algorithms. We shall look at a short snippet of the clique tree-message passing algorithm (which is sometimes called the junction tree algorithm too). Other versions of the message passing algorithm are used in approximate inference as well. We initiate the discussion by clarifying some of the terms used. A cluster graph is an arrangement of a network where groups of variables are placed in the cluster. It is similar to a factor where each cluster has a set of variables in its scope. The message passing algorithm is all about passing messages between clusters. As an analogy, consider the gossip going on at a party, where Shelly and Clair are in a conversation. If Shelly knows B, C, and D, and she is chatting with Clair who knows D, E, and F (note that the only person they know in common is D), they can share information (or pass messages) about their common friend D. In the message passing algorithm, two clusters are connected by a Separation Set (sepset), which contains variables common to both clusters. Using the preceding example, the two clusters and are connected by the sepset , which contains the only variable common to both clusters. In the next section, we shall learn about the implementation details of the junction tree algorithm. We will first understand the four stages of the algorithm and then use code snippets to learn about it from an implementation perspective. The four stages of the junction tree algorithm In this section, we will discuss the four stages of the junction tree algorithm. In the first stage, the Bayes network is converted into a secondary structure called a join tree (alternate names for this structure in the literature are junction tree, cluster tree, or a clique tree). The transformation from the Bayes network to junction tree proceeds as per the following steps: We will construct a moral graph by changing all the directed edges to undirected edges. All nodes that have V-structures that enter the said node have their parents connected with an edge. We have seen an example of this process (in the VE algorithm) called moralization, which is a possible reference to connect (apparently unmarried) parents that have a child (node). Then, we will selectively add edges to the moral graph to create a triangulated graph. A triangulated graph is an undirected graph where the maximum cycle length between the nodes is 3. From the triangulated graph, we will identify the subsets of nodes (called cliques). Starting with the cliques as clusters, we will arrange the clusters to form an undirected tree called the join tree, which satisfies the running intersection property. This property states that if a node appears in two cliques, it should also appear in all the nodes on the path that connect the two cliques. In the second stage, the potentials at each cluster are initialized. The potentials are similar to a CPD or a table. They have a list of values against each assignment to a variable in their scope. Both clusters and sepsets contain a set of potentials. The term potential is used as opposed to probabilities because in Markov networks, unlike probabilities, the values of the potentials are not obliged to sum to 1. This stage consists of message passing or belief propagation between neighboring clusters. Each message consists of a belief the cluster has about a particular variable. Each message can be passed asynchronously, but it has to wait for information from other clusters before it collates that information and passes it to the next cluster. It can be useful to think of a tree-structured cluster graph, where the message passing happens in two stages: an upward pass stage and a downward pass stage. Only after a node receives messages from the leaf nodes, will it send the message to its parent (in the "upward pass"), and only after the node receives a message from its parents will it send a message to its children (in the "downward pass"). The message passing stage completes when each cluster sepset has consistent beliefs. Recall that a cluster connected to a sepset has common variables. For example, cluster C and sepset S have and variables in its scope. Then, the potential against obtained from either the cluster or the sepset has the same value, which is why it is said that the cluster graph has consistent beliefs or that the cliques are calibrated. Once the whole cluster graph has consistent beliefs, the fourth stage is marginalization, where we can query the marginal distribution for any variable in the graph. Summary We first explored the inference problem where we studied the types of inference. We then learned that inference is NP-hard and understood that, for large networks, exact inference is infeasible. Resources for Article: Further resources on this subject: Getting Started with Spring Python [article] Python Testing: Installing the Robot Framework [article] Discovering Python's parallel programming tools [article]

0
0
3946

Packt

24 Jun 2014

6 min read

Introducing variables

Packt

24 Jun 2014

6 min read

(For more resources related to this topic, see here.) In order to store data, you have to store data in the right kind of variables. We can think of variables as boxes, and what you put in these boxes depends on what type of box it is. In most native programming languages, you have to declare a variable and its type. Number variables Let's go over some of the major types of variables. The first type is number variables. These variables store numbers and not letters. That means, if you tried to put a name in, let's say "John Bura", then the app simply won't work. Integer variables There are numerous different types of number variables. Integer variables, called Int variables, can be positive or negative whole numbers—you cannot have a decimal at all. So, you could put -1 as an integer variable but not 1.2. Real variables Real variables can be positive or negative, and they can be decimal numbers. A real variable can be 1.0, -40.4, or 100.1, for instance. There are other kinds of number variables as well. They are used in more specific situations. For the most part, integer and real variables are the ones you need to know—make sure you don't get them mixed up. If you were to run an app with this kind of mismatch, chances are it won't work. String variables There is another kind of variable that is really important. This type of variable is called a string variable. String variables are variables that comprise letters or words. This means that if you want to record a character's name, then you will have to use a string variable. In most programming languages, string variables have to be in quotes, for example, "John Bura". The quote marks tell the computer that the characters within are actually strings that the computer can use. When you put a number 1 into a string, is it a real number 1 or is it just a fake number? It's a fake number because strings are not numbers—they are strings. Even though the string shows the number 1, it isn't actually the number 1. Strings are meant to display characters, and numbers are meant to do math. Strings are not meant to do math—they just hold characters. If you tried to do math with a string, it wouldn't work (except in JavaScript, which we will talk about shortly). Strings shouldn't be used for calculations—they are meant to hold and display characters. If we have a string "1", it will be recorded as a character rather than an integer that can be used for calculations. Boolean variables The last main type of variable that we need to talk about is Boolean variables. Boolean variables are either true or false, and they are very important when it comes to games. They are used where there can only be two options. The following are some examples of Boolean variables: isAlive isShooting isInAppPurchaseCompleted isConnectedToInternet Most of these variables start off with the word is. This is usually done to signify that the variable that we are using is a Boolean. When you make games, you tend to use a lot of Boolean variables because there are so many states that game objects can be in. Often, these states have only two options, and the best thing to do is use a Boolean. Sometimes, you need to use an integer instead of a Boolean. Usually, 0 equals false and 1 equals true. Other variables When it comes to game production, there are a lot of specific variables that differ from environment to environment. Sometimes, there are GameObject variables, and there can also be a whole bunch of more specific variables. Declaring variables If you want to store any kind of data in variables, you have to declare them first. In the backend of Construct 2, there are a lot of variables that are already declared for you. This means that Construct 2 takes out the work of declaring variables. The variables that are taken care of for you include the following: Keyboard Mouse position Mouse angle Type of web browser Writing variables in code When we use Construct 2, a lot of the backend busywork has already been done for us. So, how do we declare variables in code? Usually, variables are declared at the top of the coding document, as shown in the following code: Int score; Real timescale = 1.2; Bool isDead; Bool isShooting = false; String name = "John Bura"; Let's take a look at all of them. The type of variable is listed first. In this case, we have the Int, Real, Bool (Boolean), and String variables. Next, we have the name of the variable. If you look carefully, you can see that certain variables have an = (equals sign) and some do not. When we have a variable with an equals sign, we initialize it. This means that we set the information in the variable right away. Sometimes, you need to do this and at other times, you do not. For example, a score does not need to be initialized because we are going to change the score as the game progresses. As you already know, you can initialize a Boolean variable to either true or false—these are the only two states a Boolean variable can be in. You will also notice that there are quotes around the string variable. Let's take a look at some examples that won't work: Int score = -1.2; Bool isDead = "false"; String name = John Bura; There is something wrong with all these examples. First of all, the Int variable cannot be a decimal. Second, the Bool variable has quotes around it. Lastly, the String variable has no quotes. In most environments, this will cause the program to not work. However, in HTML5 or JavaScript, the variable is changed to fit the situation. Summary In this article, we learned about the different types of variables and even looked at a few correct and incorrect variable declarations. If you are making a game, get used to making and setting lots of variables. The best part is that Construct 2 makes handling variables really easy. Resources for Article: Further resources on this subject: 2D game development with Monkey [article] Microsoft XNA 4.0 Game Development: Receiving Player Input [article] Flash Game Development: Making of Astro-PANIC! [article]

0
0
11113

How to Convert POJO to JSON Using Gson in Android Studio

Part 1: Deploying Multiple Applications with Capistrano from a Single Project

Using React.js without JSX

Making the Most of Your Hadoop Data Lake, Part 2: Optimized File Formats

Making the Most of Your Hadoop Data Lake, Part 1: Data Compression

Part 1: Managing Multiple Apps and Environments with Capistrano 3 and Chef Solo

MapReduce with OpenStack Swift and ZeroVM

Component Communication in React.js

Sparrow iOS Game Framework - The Basics of Our Game

The Hunt for Data

Trending Topics

Introduction to MapReduce

Various subsystem configurations

Enterprise Geodatabase

Exact Inference Using Graphical Models

Introducing variables

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access