
How-To Tutorials - Data


More Line Charts, Area Charts, and Scatter Plots

Packt
26 Aug 2014
13 min read
In this article by Scott Gottreu, the author of Learning jqPlot, we'll learn how to import data from remote sources. We will discuss what area charts, stacked area charts, and scatter plots are. Then we will learn how to implement these newly learned charts. We will also learn about trend lines. (For more resources related to this topic, see here.) Working with remote data sources We return from lunch and decide to start on our line chart showing social media conversions. With this chart, we want to pull the data in from other sources. You start to look for some internal data sources, coming across one that returns the data as an object. We can see an excerpt of data returned by the data source. We will need to parse the object and create data arrays for jqPlot: { "twitter":[ ["2012-11-01",289],...["2012-11-30",225] ], "facebook":[ ["2012-11-01",27],...["2012-11-30",48] ] } We solve this issue using a data renderer to pull our data and then format it properly for jqPlot. We can pass a function as a variable to jqPlot and when it is time to render the data, it will call this new function. We start by creating the function to receive our data and then format it. We name it remoteDataSource. jqPlot will pass the following three parameters to our function: url: This is the URL of our data source. plot: The jqPlot object we create is passed by reference, which means we could modify the object from within remoteDataSource. However, it is best to treat it as a read-only object. options: We can pass any type of option in the dataRendererOptions option when we create our jqPlot object. For now, we will not be passing in any options: <script src="../js/jqplot.dateAxisRenderer.min.js"></script> <script> $(document).ready(function(){ var remoteDataSource = function(url, plot, options) { Next we create a new array to hold our formatted data. Then, we use the $.ajax method in jQuery to pull in our data. We set the async option to false. If we don't, the function will continue to run before getting the data and we'll have an empty chart: var data = new Array; $.ajax({ async: false, We set the url option to the url variable that jqPlot passed in. We also set the data type to json: url: url, dataType:"json", success: function(remoteData) { Then we will take the twitter object in our JSON and make that the first element of our data array and make facebook the second element. We then return the whole array back to jqPlot to finish rendering our chart: data.push(remoteData.twitter); data.push(remoteData.facebook); } }); return data; }; With our previous charts, after the id attribute, we would have passed in a data array. This time, instead of passing in a data array, we pass in a URL. Then, within the options, we declare the dataRenderer option and set remoteDataSource as the value. 
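Pulled together, the data renderer described above looks roughly like this (a consolidated sketch of the snippets in the text, assuming the same JSON layout with twitter and facebook arrays):

// Consolidated data renderer; jqPlot passes in url, plot, and options.
var remoteDataSource = function(url, plot, options) {
  var data = [];
  $.ajax({
    async: false,          // jqPlot needs the data before it can render
    url: url,              // the URL we handed to $.jqplot
    dataType: "json",
    success: function(remoteData) {
      data.push(remoteData.twitter);   // first series
      data.push(remoteData.facebook);  // second series
    }
  });
  return data;             // handed back to jqPlot for rendering
};

The original snippet declares the array with var data = new Array; an array literal does the same job.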
Now when our chart is created, it will call our renderer and pass in all the three parameters we discussed earlier: var socialPlot = $.jqplot ('socialMedia', "./data/social_shares.json", { title:'Social Media Shares', dataRenderer: remoteDataSource, We create labels for both our data series and enable the legend: series:[ { label: 'Twitter' }, { label: 'Facebook' } ], legend: { show: true, placement: 'outsideGrid' }, We enable DateAxisRenderer for the x axis and set min to 0 on the y axis, so jqPlot will not extend the axis below zero: axes:{ xaxis:{ renderer:$.jqplot.DateAxisRenderer, label: 'Days in November' }, yaxis: { min:0, label: 'Number of Shares' } } }); }); </script> <div id="socialMedia" style="width:600px;"></div> If you are running the code samples from your filesystem in Chrome, you will get an error message similar to this: No 'Access-Control-Allow-Origin' header is present on the requested resource. The security settings do not allow AJAX requests to be run against files on the filesystem. It is better to use a local web server such as MAMP, WAMP, or XAMPP. This way, we avoid the access control issues. Further information about cross-site HTTP requests can be found at the Mozilla Developer Network at https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS. We load this new chart in our browser and can see the result. We are likely to run into cross-domain issues when trying to access remote sources that do not allow cross-domain requests. The common practice to overcome this hurdle would be to use the JSONP data type in our AJAX call. jQuery will only run JSONP calls asynchronously. This keeps your web page from hanging if a remote source stops responding. However, because jqPlot requires all the data from the remote source before continuing, we can't use cross-domain sources with our data renderers. We start to think of ways we can use external APIs to pull in data from all kinds of sources. We make a note to contact the server guys to write some scripts to pull from the external APIs we want and pass along the data to our charts. By doing it in this way, we won't have to implement OAuth (OAuth is a standard framework used for authentication), http://oauth.net/2, in our web app or worry about which sources allow cross-domain access. Adding to the project's scope As we continue thinking up new ways to work with this data, Calvin stops by. "Hey guys, I've shown your work to a few of the regional vice-presidents and they love it." Your reply is that all of this is simply an experiment and was not designed for public consumption. Calvin holds up his hands as if to hold our concerns at bay. "Don't worry, they know it's all in beta. They did have a couple of ideas. Can you insert in the expenses with the revenue and profit reports? They also want to see those same charts but formatted differently." He continues, "One VP mentioned that maybe we could have one of those charts where everything under the line is filled in. Oh, and they would like to see these by Wednesday ahead of the meeting." With that, Calvin turns around and makes his customary abrupt exit. Adding a fill between two lines We talk through Calvin's comments. Adding in expenses won't be too much of an issue. We could simply add the expense line to one of our existing reports but that will likely not be what they want. Visually, the gap on our chart between profit and revenue should be the total amount of expenses. You mention that we could fill in the gap between the two lines. 
We decide to give this a try: We leave the plugins and the data arrays alone. We pass an empty array into our data array as a placeholder for our expenses. Next, we update our title. After this, we add a new series object and label it Expenses: ... var rev_profit = $.jqplot ('revPrfChart', [revenue, profit, [] ], { title:'Monthly Revenue & Profit with Highlighted Expenses', series:[ { label: 'Revenue' }, { label: 'Profit' }, { label: 'Expenses' } ], legend: { show: true, placement: 'outsideGrid' }, To fill in the gap between the two lines, we use the fillBetween option. The only two required options are series1 and series2. These require the positions of the two data series in the data array. So in our chart, series1 would be 0 and series2 would be 1. The other three optional settings are: baseSeries, color, and fill. The baseSeries option tells jqPlot to place the fill on a layer beneath the given series. It will default to 0. If you pick a series above zero, then the fill will hide any series below the fill layer: fillBetween: { series1: 0, series2: 1, We want to assign a different value to color because it will default to the color of the first data series option. The color option will accept either a hexadecimal value or the rgba option, which allows us to change the opacity of the fill. Even though the fill option defaults to true, we explicitly set it. This option also gives us the ability to turn off the fill after the chart is rendered: color: "rgba(232, 44, 12, 0.5)", fill: true }, The settings for the rest of the chart remain unchanged: axes:{ xaxis:{ renderer:$.jqplot.DateAxisRenderer, label: 'Months' }, yaxis:{ label: 'Totals Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id="revPrfChart" style="width:600px;"></div> We switch back to our web browser and load the new page. We see the result of our efforts in the following screenshot. This chart layout works but we think Calvin and the others will want something else. We decide we need to make an area chart. Understanding area and stacked area charts Area charts come in two varieties. The default type of area chart is simply a modification of a line chart. Everything from the data point on the y axis all the way to zero is shaded. In the event your numbers are negative, then the data above the line up to zero is shaded in. Each data series you have is laid upon the others. Area charts are best to use when we want to compare similar elements, for example, sales by each division in our company or revenue among product categories. The other variation of an area chart is the stacked area chart. The chart starts off being built in the same way as a normal area chart. The first line is plotted and shaded below the line to zero. The difference occurs with the remaining lines. We simply stack them. To understand what happens, consider this analogy. Each shaded line represents a wall built to the height given in the data series. Instead of building one wall behind another, we stack them on top of each other. What can be hard to understand is the y axis. It now denotes a cumulative total, not the individual data points. For example, if the first y value of a line is 4 and the first y value on the second line is 5, then the second point will be plotted at 9 on our y axis. Consider this more complicated example: if the y value in our first line is 2, 7 for our second line, and 4 for the third line, then the y value for our third line will be plotted at 13. That's why we need to compare similar elements. 
Creating an area chart We grab the quarterly report with the divisional profits we created this morning. We will extend the data to a year and plot the divisional profits as an area chart: We remove the data arrays for revenue and the overall profit array. We also add data to the three arrays containing the divisional profits: <script src="../js/jqplot.dateAxisRenderer.min.js"></script> <script> $(document).ready(function(){ var electronics = [["2011-11-20", 123487.87], ...]; var media = [["2011-11-20", 66449.15], ...]; var nerd_corral = [["2011-11-20", 2112.55], ...]; var div_profit = $.jqplot ('division_profit', [ media, nerd_corral, electronics ], { title:'12 Month Divisional Profits', Under seriesDefaults, we assign true to fill and fillToZero. Without setting fillToZero to true, the fill would continue to the bottom of the chart. With the option set, the fill will extend downward to zero on the y axis for positive values and stop. For negative data points, the fill will extend upward to zero: seriesDefaults: { fill: true, fillToZero: true }, series:[ { label: 'Media & Software' }, { label: 'Nerd Corral' }, { label: 'Electronics' } ], legend: { show: true, placement: 'outsideGrid' }, For our x axis, we set numberTicks to 6. The rest of our options we leave unchanged: axes:{ xaxis:{ label: 'Months', renderer:$.jqplot.DateAxisRenderer, numberTicks: 6, tickOptions: { formatString: "%B" } }, yaxis: { label: 'Total Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id="division_profit" style="width:600px;"></div> We review the results of our changes in our browser. We notice something is wrong: only the Electronics series, shown in brown, is showing. This goes back to how area charts are built. Revisiting our wall analogy, we have built a taller wall in front of our other two walls. We need to order our data series from largest to smallest: We move the Electronics series to be the first one in our data array: var div_profit = $.jqplot ('division_profit', [ electronics, media, nerd_corral ], It's also hard to see where some of the lines go when they move underneath another layer. Thankfully, jqPlot has a fillAlpha option. We pass in a percentage in the form of a decimal and jqPlot will change the opacity of our fill area: ... seriesDefaults: { fill: true, fillToZero: true, fillAlpha: .6 }, ... We reload our chart in our web browser and can see the updated changes. Creating a stacked area chart with revenue Calvin stops by while we're taking a break. "Hey guys, I had a VP call and they want to see revenue broken down by division. Can we do that?" We tell him we can. "Great" he says, before turning away and leaving. We discuss this new request and realize this would be a great chance to use a stacked area chart. We dig around and find the divisional revenue numbers Calvin wanted. We can reuse the chart we just created and just change out the data and some options. We use the same variable names for our divisional data and plug in revenue numbers instead of profit. We use a new variable name for our chart object and a new id attribute for our div. 
We update our title and add the stackSeries option and set it to true: var div_revenue = $.jqplot ( 'division_revenue' , [electronics, media, nerd_corral], { title: '12 Month Divisional Revenue', stackSeries: true, We leave our series' options alone and the only option we change on our x axis is set numberTicks back to 3: seriesDefaults: { fill: true, fillToZero: true }, series:[ { label: 'Electronics' }, { label: 'Media & Software' }, { label: 'Nerd Corral' } ], legend: { show: true, placement: 'outsideGrid' }, axes:{ xaxis:{ label: 'Months', renderer:$.jqplot.DateAxisRenderer, numberTicks: 3, tickOptions: { formatString: "%B" } }, We finish our changes by updating the ID of our div container: yaxis: { label: 'Total Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id=" division_revenue " style="width:600px;"></div> With our changes complete, we load this new chart in our browser. As we can see in the following screenshot, we have a chart with each of the data series stacked on top of each other. Because of the nature of a stacked chart, the individual data points are no longer decipherable; however, with the visualization, this is less of an issue. We decide that this is a good place to stop for the day. We'll start on scatterplots and trend lines tomorrow morning. As we begin gathering our things, Calvin stops by on his way out and we show him our recent work. "This is amazing. You guys are making great progress." We tell him we're going to move on to trend lines tomorrow. "Oh, good," Calvin says. "I've had requests to show trending data for our revenue and profit. Someone else mentioned they would love to see trending data of shares on Twitter for our daily deals site. But, like you said, that can wait till tomorrow. Come on, I'll walk with you two."
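For reference, the stacked chart we just walked through condenses to the options below. One caution: the markup in the excerpt has stray spaces inside the id attribute (" division_revenue "), so this sketch assumes the intended id is division_revenue and that the container matches it exactly:

// Condensed sketch of the stacked area chart described above.
var div_revenue = $.jqplot('division_revenue',
  [electronics, media, nerd_corral], {
    title: '12 Month Divisional Revenue',
    stackSeries: true,                                // stack the series instead of overlaying them
    seriesDefaults: { fill: true, fillToZero: true },
    legend: { show: true, placement: 'outsideGrid' }
    // series labels and axes options as shown in the text
});
// <div id="division_revenue" style="width:600px;"></div>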


What is a content provider?

Packt
26 Aug 2014
5 min read
This article is written by Sunny Kumar Aditya and Vikash Kumar Karn, the authors of Android SQLite Essentials. There are four essential components in an Android application: activity, service, broadcast receiver, and content provider. A content provider is used to manage access to a structured set of data. It encapsulates the data and provides abstraction, as well as a mechanism for defining data security. However, content providers are primarily intended to be used by other applications, which access the provider through a provider client object. Together, providers and provider clients offer a consistent, standard interface to data that also handles inter-process communication and secure data access. (For more resources related to this topic, see here.) A content provider allows one app to share data with other applications. By design, an Android SQLite database created by an application is private to that application; this is excellent from a security point of view but troublesome when you want to share data across different applications. This is where a content provider comes to the rescue: you can easily share data by building your own content provider. It is important to note that although our discussion focuses on a database, a content provider is not limited to one. It can also be used to serve data that normally goes into files, such as photos, audio, or video. The interaction that takes place between Application A and Application B while exchanging data can be seen in the following diagram: Here we have Application A, whose activity needs to access the database of Application B. As we have already read, the database of Application B is stored in internal memory and cannot be accessed directly by Application A. This is where the content provider comes into the picture; it allows us to share data with other applications and control how they access it. The content provider implements methods for querying, inserting, updating, and deleting data in databases. Application A now requests that the content provider perform the desired operations on its behalf. We will explore the use of a content provider to fetch contacts from a phone's contact database. Using existing content providers Android lists a lot of standard content providers that we can use. Some of them are Browser, CalendarContract, CallLog, Contacts, ContactsContract, MediaStore, UserDictionary, and so on. We will fetch contacts from a phone's contact list with the help of the system's existing ContentProvider and ContentResolver. We will be using the ContactsContract provider for this purpose. What is a content resolver? The ContentResolver object in the application's context is used to communicate with the provider as a client. The ContentResolver object communicates with the provider object—an instance of a class that implements ContentProvider. The provider object receives data requests from clients, performs the requested action, and returns the results. ContentResolver is the single, global instance in our application that provides access to other applications' content providers; we do not need to worry about handling inter-process communication. The ContentResolver methods provide the basic CRUD (create, retrieve, update, and delete) functions of persistent storage; it has methods that call identically named methods in the provider object without knowing their implementation.
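As a small illustration of the resolver-to-provider round trip just described (our own snippet, not the article's; the projection and sort order are arbitrary, and it assumes the app holds the READ_CONTACTS permission), an activity can list contact names like this:

// Runs inside an Activity, so getContentResolver() is available directly.
Cursor cursor = getContentResolver().query(
        ContactsContract.Contacts.CONTENT_URI,                     // provider URI
        new String[] { ContactsContract.Contacts._ID,
                       ContactsContract.Contacts.DISPLAY_NAME },   // projection
        null, null,                                                // no selection
        ContactsContract.Contacts.DISPLAY_NAME + " ASC");          // sort order
if (cursor != null) {
    try {
        while (cursor.moveToNext()) {
            String name = cursor.getString(
                    cursor.getColumnIndexOrThrow(ContactsContract.Contacts.DISPLAY_NAME));
            Log.i(TAG, "Contact: " + name);                        // TAG as used elsewhere in the article
        }
    } finally {
        cursor.close();                                            // always release the cursor
    }
}

The article's own approach, shown next, instead lets the user pick a contact through an intent.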
In the code provided in the corresponding AddNewContactActivity class, we will initiate picking of contact by building an intent object, Intent.ACTION_PICK, that allows us to pick an item from a data source; in addition, all we need to know is the URI of the provider, which in our case is ContactsContract.Contacts.CONTENT_URI: public void pickContact() { try { Intent cIntent = new Intent(Intent.ACTION_PICK, ContactsContract.Contacts.CONTENT_URI); startActivityForResult(cIntent, PICK_CONTACT); } catch (Exception e) { e.printStackTrace(); Log.i(TAG, "Exception while picking contact"); } } The code used in this article is placed at GitHub: https://github.com/sunwicked/Ch3-PersonalContactManager/tree/master This functionality is also provided by Messaging, Gallery, and Contacts. The Contacts screen will pop up allowing us to browse or search for contacts we require to migrate to our new application. In onActivityResult, that is our next stop, we will use the method to handle our corresponding request to pick and use contacts. Let's look at the code we have to add to pick contacts from an Android's contact provider: protected void onActivityResult(int requestCode, int resultCode, Intent data) { . . else if (requestCode == PICK_CONTACT) { if (resultCode == Activity.RESULT_OK) { Uri contactData = data.getData(); Cursor c = getContentResolver().query(contactData, null, null, null, null); if (c.moveToFirst()) { String id = c .getString(c.getColumnIndexOrThrow(ContactsContract.Contacts._ID)); String hasPhone = c .getString(c.getColumnIndex(ContactsContract.Contacts.HAS_PHONE_NUMBER)); if (hasPhone.equalsIgnoreCase("1")) { Cursor phones = getContentResolver().query(ContactsContract. CommonDataKinds.Phone.CONTENT_URI, null, ContactsContract.CommonDataKinds. Phone.CONTACT_ID + " = " + id, null, null); phones.moveToFirst(); contactPhone.setText(phones.getString(phones.getColumnIndex("data1"))); contactName .setText(phones.getString(phones.getColumnIndex(ContactsContract.Contacts. DISPLAY_NAME))); } ….. We start by checking whether the request code is matching ours, and then we cross check resultcode. We get the content resolver object by making a call to getcontentresolver on the Context object; it is a method of the android.content.Context class. As we are in an activity that inherits from Context, we do not need to be explicit in making a call to it; same goes for services. We will now check whether the contact we picked has a phone number or not. After verifying the necessary details, we pull data that we require, such as contact name and phone number, and set them in relevant fields. Summary This article reflects on how to access and share data in Android via content providers and how to construct a content provider. We also talk about content resolvers and how they are used to communicate with the providers as a client. Resources for Article: Further resources on this subject: Reversing Android Applications [article] Saying Hello to Unity and Android [article] Android Fragmentation Management [article]
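Returning to the onActivityResult() excerpt above: a tidied-up sketch of the same lookup (ours, not the book's) that also closes both cursors, checks for null, and uses a parameterized selection might look like this:

@Override
protected void onActivityResult(int requestCode, int resultCode, Intent data) {
    super.onActivityResult(requestCode, resultCode, data);
    if (requestCode == PICK_CONTACT && resultCode == Activity.RESULT_OK) {
        Uri contactData = data.getData();
        Cursor c = getContentResolver().query(contactData, null, null, null, null);
        if (c != null && c.moveToFirst()) {
            String id = c.getString(c.getColumnIndexOrThrow(ContactsContract.Contacts._ID));
            String hasPhone = c.getString(c.getColumnIndex(ContactsContract.Contacts.HAS_PHONE_NUMBER));
            if ("1".equals(hasPhone)) {
                Cursor phones = getContentResolver().query(
                        ContactsContract.CommonDataKinds.Phone.CONTENT_URI, null,
                        ContactsContract.CommonDataKinds.Phone.CONTACT_ID + " = ?",
                        new String[] { id }, null);                 // selection argument instead of string concatenation
                if (phones != null && phones.moveToFirst()) {
                    contactPhone.setText(phones.getString(
                            phones.getColumnIndex(ContactsContract.CommonDataKinds.Phone.NUMBER)));
                    contactName.setText(phones.getString(
                            phones.getColumnIndex(ContactsContract.Contacts.DISPLAY_NAME)));
                }
                if (phones != null) {
                    phones.close();                                 // release the phone cursor
                }
            }
        }
        if (c != null) {
            c.close();                                              // release the contact cursor
        }
    }
}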


Stream Grouping

Packt
26 Aug 2014
7 min read
In this article, by Ankit Jain and Anand Nalya, the authors of the book Learning Storm, we will cover different types of stream groupings. (For more resources related to this topic, see here.) When defining a topology, we create a graph of computation with a number of bolt-processing streams. At a more granular level, each bolt executes as multiple tasks in the topology. A stream will be partitioned into a number of partitions and divided among the bolts' tasks. Thus, each task of a particular bolt will only get a subset of the tuples from the subscribed streams. Stream grouping in Storm provides complete control over how this partitioning of tuples happens among many tasks of a bolt subscribed to a stream. Grouping for a bolt can be defined on the instance of the backtype.storm.topology.InputDeclarer class returned when defining bolts using the backtype.storm.topology.TopologyBuilder.setBolt method. Storm supports the following types of stream groupings: Shuffle grouping Fields grouping All grouping Global grouping Direct grouping Local or shuffle grouping Custom grouping Now, we will look at each of these groupings in detail. Shuffle grouping Shuffle grouping distributes tuples in a uniform, random way across the tasks. An equal number of tuples will be processed by each task. This grouping is ideal when you want to distribute your processing load uniformly across the tasks and where there is no requirement of any data-driven partitioning. Fields grouping Fields grouping enables you to partition a stream on the basis of some of the fields in the tuples. For example, if you want that all the tweets from a particular user should go to a single task, then you can partition the tweet stream using fields grouping on the username field in the following manner: builder.setSpout("1", new TweetSpout()); builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username")) Fields grouping is calculated with the following function: hash (fields) % (no. of tasks) Here, hash is a hashing function. It does not guarantee that each task will get tuples to process. For example, if you have applied fields grouping on a field, say X, with only two possible values, A and B, and created two tasks for the bolt, then it might be possible that both hash (A) % 2 and hash (B) % 2 are equal, which will result in all the tuples being routed to a single task and other tasks being completely idle. Another common usage of fields grouping is to join streams. Since partitioning happens solely on the basis of field values and not the stream type, we can join two streams with any common join fields. The name of the fields do not need to be the same. For example, in order to process domains, we can join the Order and ItemScanned streams when an order is completed: builder.setSpout("1", new OrderSpout()); builder.setSpout("2", new ItemScannedSpout()); builder.setBolt("joiner", new OrderJoiner()) .fieldsGrouping("1", new Fields("orderId")) .fieldsGrouping("2", new Fields("orderRefId")); All grouping All grouping is a special grouping that does not partition the tuples but replicates them to all the tasks, that is, each tuple will be sent to each of the bolt's tasks for processing. One common use case of all grouping is for sending signals to bolts. For example, if you are doing some kind of filtering on the streams, then you have to pass the filter parameters to all the bolts. This can be achieved by sending those parameters over a stream that is subscribed by all bolts' tasks with all grouping. 
Another example is to send a reset message to all the tasks in an aggregation bolt. The following is an example of all grouping: builder.setSpout("1", new TweetSpout()); builder.setSpout("signals", new SignalSpout()); builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username")).allGrouping("signals"); Here, we are subscribing signals for all the TweetCounter bolt's tasks. Now, we can send different signals to the TweetCounter bolt using SignalSpout. Global grouping Global grouping does not partition the stream but sends the complete stream to the bolt's task with the smallest ID. A general use case of this is when there needs to be a reduce phase in your topology where you want to combine results from previous steps in the topology in a single bolt. Global grouping might seem redundant at first, as you can achieve the same results with defining the parallelism for the bolt as one and setting the number of input streams to one. Though, when you have multiple streams of data coming through different paths, you might want only one of the streams to be reduced and others to be processed in parallel. For example, consider the following topology. In this topology, you might want to route all the tuples coming from Bolt C to a single Bolt D task, while you might still want parallelism for tuples coming from Bolt E to Bolt D. Global grouping This can be achieved with the following code snippet: builder.setSpout("a", new SpoutA()); builder.setSpout("b", new SpoutB()); builder.setBolt("c", new BoltC()); builder.setBolt("e", new BoltE()); builder.setBolt("d", new BoltD()) .globalGrouping("c") .shuffleGrouping("e"); Direct grouping In direct grouping, the emitter decides where each tuple will go for processing. For example, say we have a log stream and we want to process each log entry using a specific bolt task on the basis of the type of resource. In this case, we can use direct grouping. Direct grouping can only be used with direct streams. To declare a stream as a direct stream, use the backtype.storm.topology.OutputFieldsDeclarer.declareStream method that takes a Boolean parameter directly in the following way in your spout: @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declareStream("directStream", true, new Fields("field1")); } Now, we need the number of tasks for the component so that we can specify the taskId parameter while emitting the tuple. This can be done using the backtype.storm.task.TopologyContext.getComponentTasks method in the prepare method of the bolt. The following snippet stores the number of tasks in a bolt field: public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) { this.numOfTasks = context.getComponentTasks("my-stream"); this.collector = collector; } Once you have a direct stream to emit to, use the backtype.storm.task.OutputCollector.emitDirect method instead of the emit method to emit it. The emitDirect method takes a taskId parameter to specify the task. In the following snippet, we are emitting to one of the tasks randomly: public void execute(Tuple input) { collector.emitDirect(new Random().nextInt(this.numOfTasks), process(input)); } Local or shuffle grouping If the tuple source and target bolt tasks are running in the same worker, using this grouping will act as a shuffle grouping only between the target tasks running on the same worker, thus minimizing any network hops resulting in increased performance. 
In case there are no target bolt tasks running on the source worker process, this grouping will act similar to the shuffle grouping mentioned earlier. Custom grouping If none of the preceding groupings fit your use case, you can define your own custom grouping by implementing the backtype.storm.grouping.CustomStreamGrouping interface. The following is a sample custom grouping that partitions a stream on the basis of the category in the tuples: public class CategoryGrouping implements CustomStreamGrouping, Serializable { // Mapping of category to integer values for grouping private static final Map<String, Integer> categories = ImmutableMap.of ( "Financial", 0, "Medical", 1, "FMCG", 2, "Electronics", 3 ); // number of tasks, this is initialized in prepare method private int tasks = 0; public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) { // initialize the number of tasks tasks = targetTasks.size(); } public List<Integer> chooseTasks(int taskId, List<Object> values) { // return the taskId for a given category String category = (String) values.get(0); return ImmutableList.of(categories.get(category) % tasks); } } Now, we can use this grouping in our topologies with the following code snippet: builder.setSpout("a", new SpoutA()); builder.setBolt("b", (IRichBolt)new BoltB()) .customGrouping("a", new CategoryGrouping()); The following diagram represents the Storm groupings graphically: Summary In this article, we discussed stream grouping in Storm and its types. Resources for Article: Further resources on this subject: Integrating Storm and Hadoop [article] Deploying Storm on Hadoop for Advertising Analysis [article] Photo Stream with iCloud [article]
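Returning to the direct-grouping snippets above: backtype.storm.task.TopologyContext.getComponentTasks() returns the list of task ids rather than a count, so it cannot be assigned to an int field as shown. A sketch of ours that keeps the list, using the same placeholder names as the text, looks like this:

// Inside the bolt that emits on the direct stream declared earlier.
private List<Integer> targetTasks;
private OutputCollector collector;
private Random random = new Random();

public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    // task ids of the consuming component ("my-stream" is the placeholder used in the text)
    this.targetTasks = context.getComponentTasks("my-stream");
    this.collector = collector;
}

public void execute(Tuple input) {
    // pick one of the target tasks at random, as in the original example,
    // and emit on the declared direct stream
    int taskId = targetTasks.get(random.nextInt(targetTasks.size()));
    collector.emitDirect(taskId, "directStream", process(input));
}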


Classifying Text

Packt
26 Aug 2014
23 min read
In this article by Jacob Perkins, author of Python 3 Text Processing with NLTK 3 Cookbook, we will learn how to transform text into feature dictionaries, and how to train a text classifier for sentiment analysis. (For more resources related to this topic, see here.) Bag of words feature extraction Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect dict style feature sets, so we must therefore transform our text into a dict. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method doesn't care about the order of the words, or how many times a word occurs, all that matters is whether the word is present in a list of words. How to do it... The idea is to convert a list of words into a dict, where each word becomes a key with the value True. The bag_of_words() function in featx.py looks like this: def bag_of_words(words): return dict([(word, True) for word in words]) We can use it with a list of words; in this case, the tokenized sentence the quick brown fox: >>> from featx import bag_of_words >>> bag_of_words(['the', 'quick', 'brown', 'fox']) {'quick': True, 'brown': True, 'the': True, 'fox': True} The resulting dict is known as a bag of words because the words are not in order, and it doesn't matter where in the list of words they occurred, or how many times they occurred. All that matters is that the word is found at least once. You can use different values than True, but it is important to keep in mind that the NLTK classifiers learn from the unique combination of (key, value). That means that ('fox', 1) is treated as a different feature than ('fox', 2). How it works... The bag_of_words() function is a very simple list comprehension that constructs a dict from the given words, where every word gets the value True. Since we have to assign a value to each word in order to create a dict, True is a logical choice for the value to indicate word presence. If we knew the universe of all possible words, we could assign the value False to all the words that are not in the given list of words. But most of the time, we don't know all the possible words beforehand. Plus, the dict that would result from assigning False to every possible word would be very large (assuming all words in the English language are possible). So instead, to keep feature extraction simple and use less memory, we stick to assigning the value True to all words that occur at least once. We don't assign the value False to any word since we don't know what the set of possible words are; we only know about the words we are given. There's more... In the default bag of words model, all words are treated equally. But that's not always a good idea. As we already know, some words are so common that they are practically meaningless. If you have a set of words that you want to exclude, you can use the bag_of_words_not_in_set() function in featx.py: def bag_of_words_not_in_set(words, badwords): return bag_of_words(set(words) - set(badwords)) This function can be used, among other things, to filter stopwords. Here's an example where we filter the word the from the quick brown fox: >>> from featx import bag_of_words_not_in_set >>> bag_of_words_not_in_set(['the', 'quick', 'brown', 'fox'], ['the']) {'quick': True, 'brown': True, 'fox': True} As expected, the resulting dict has quick, brown, and fox, but not the. 
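As the recipe notes, values other than True are allowed, and the classifier learns from each unique (key, value) pair. A small variant of ours that records counts instead of simple presence:

# Not from the book: a bag of words that keeps counts rather than True.
# Remember that ('fox', 1) and ('fox', 2) are treated as different features.
from collections import Counter

def bag_of_word_counts(words):
    return dict(Counter(words))

bag_of_word_counts(['the', 'quick', 'brown', 'fox', 'the'])
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1}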
Filtering stopwords Stopwords are words that are often useless in NLP, in that they don't convey much meaning, such as the word the. Here's an example of using the bag_of_words_not_in_set() function to filter all English stopwords: from nltk.corpus import stopwords def bag_of_non_stopwords(words, stopfile='english'): badwords = stopwords.words(stopfile) return bag_of_words_not_in_set(words, badwords) You can pass a different language filename as the stopfile keyword argument if you are using a language other than English. Using this function produces the same result as the previous example: >>> from featx import bag_of_non_stopwords >>> bag_of_non_stopwords(['the', 'quick', 'brown', 'fox']) {'quick': True, 'brown': True, 'fox': True} Here, the is a stopword, so it is not present in the returned dict. Including significant bigrams In addition to single words, it often helps to include significant bigrams. As significant bigrams are less common than most individual words, including them in the bag of words model can help the classifier make better decisions. We can use the BigramCollocationFinder class to find significant bigrams. The bag_of_bigrams_words() function found in featx.py will return a dict of all words along with the 200 most significant bigrams: from nltk.collocations import BigramCollocationFinder from nltk.metrics import BigramAssocMeasures def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200): bigram_finder = BigramCollocationFinder.from_words(words) bigrams = bigram_finder.nbest(score_fn, n) return bag_of_words(words + bigrams) The bigrams will be present in the returned dict as (word1, word2) and will have the value as True. Using the same example words as we did earlier, we get all words plus every bigram: >>> from featx import bag_of_bigrams_words >>> bag_of_bigrams_words(['the', 'quick', 'brown', 'fox']) {'brown': True, ('brown', 'fox'): True, ('the', 'quick'): True, 'fox': True, ('quick', 'brown'): True, 'quick': True, 'the': True} You can change the maximum number of bigrams found by altering the keyword argument n. See also In the next recipe, we will train a NaiveBayesClassifier class using feature sets created with the bag of words model. Training a Naive Bayes classifier Now that we can extract features from text, we can train a classifier. The easiest classifier to get started with is the NaiveBayesClassifier class. It uses the Bayes theorem to predict the probability that a given feature set belongs to a particular label. The formula is: P(label | features) = P(label) * P(features | label) / P(features) The following list describes the various parameters from the previous formula: P(label): This is the prior probability of the label occurring, which is the likelihood that a random feature set will have the label. This is based on the number of training instances with the label compared to the total number of training instances. For example, if 60/100 training instances have the label, the prior probability of the label is 60%. P(features | label): This is the prior probability of a given feature set being classified as that label. This is based on which features have occurred with each label in the training data. P(features): This is the prior probability of a given feature set occurring. This is the likelihood of a random feature set being the same as the given feature set, and is based on the observed feature sets in the training data. 
For example, if the given feature set occurs twice in 100 training instances, the prior probability is 2%. P(label | features): This tells us the probability that the given features should have that label. If this value is high, then we can be reasonably confident that the label is correct for the given features. Getting ready We are going to be using the movie_reviews corpus for our initial classification examples. This corpus contains two categories of text: pos and neg. These categories are exclusive, which makes a classifier trained on them a binary classifier. Binary classifiers have only two classification labels, and will always choose one or the other. Each file in the movie_reviews corpus is composed of either positive or negative movie reviews. We will be using each file as a single instance for both training and testing the classifier. Because of the nature of the text and its categories, the classification we will be doing is a form of sentiment analysis. If the classifier returns pos, then the text expresses a positive sentiment, whereas if we get neg, then the text expresses a negative sentiment. How to do it... For training, we need to first create a list of labeled feature sets. This list should be of the form [(featureset, label)], where the featureset variable is a dict and label is the known class label for the featureset. The label_feats_from_corpus() function in featx.py takes a corpus, such as movie_reviews, and a feature_detector function, which defaults to bag_of_words. It then constructs and returns a mapping of the form {label: [featureset]}. We can use this mapping to create a list of labeled training instances and testing instances. The reason to do it this way is to get a fair sample from each label. It is important to get a fair sample, because parts of the corpus may be (unintentionally) biased towards one label or the other. Getting a fair sample should eliminate this possible bias: import collections def label_feats_from_corpus(corp, feature_detector=bag_of_words): label_feats = collections.defaultdict(list) for label in corp.categories(): for fileid in corp.fileids(categories=[label]): feats = feature_detector(corp.words(fileids=[fileid])) label_feats[label].append(feats) return label_feats Once we can get a mapping of label | feature sets, we want to construct a list of labeled training instances and testing instances. 
The split_label_feats() function in featx.py takes a mapping returned from label_feats_from_corpus() and splits each list of feature sets into labeled training and testing instances: def split_label_feats(lfeats, split=0.75): train_feats = [] test_feats = [] for label, feats in lfeats.items(): cutoff = int(len(feats) * split) train_feats.extend([(feat, label) for feat in feats[:cutoff]]) test_feats.extend([(feat, label) for feat in feats[cutoff:]]) return train_feats, test_feats Using these functions with the movie_reviews corpus gives us the lists of labeled feature sets we need to train and test a classifier: >>> from nltk.corpus import movie_reviews >>> from featx import label_feats_from_corpus, split_label_feats >>> movie_reviews.categories() ['neg', 'pos'] >>> lfeats = label_feats_from_corpus(movie_reviews) >>> lfeats.keys() dict_keys(['neg', 'pos']) >>> train_feats, test_feats = split_label_feats(lfeats, split=0.75) >>> len(train_feats) 1500 >>> len(test_feats) 500 So there are 1000 pos files, 1000 neg files, and we end up with 1500 labeled training instances and 500 labeled testing instances, each composed of equal parts of pos and neg. If we were using a different dataset, where the classes were not balanced, our training and testing data would have the same imbalance. Now we can train a NaiveBayesClassifier class using its train() class method: >>> from nltk.classify import NaiveBayesClassifier >>> nb_classifier = NaiveBayesClassifier.train(train_feats) >>> nb_classifier.labels() ['neg', 'pos'] Let's test the classifier on a couple of made up reviews. The classify() method takes a single argument, which should be a feature set. We can use the same bag_of_words() feature detector on a list of words to get our feature set: >>> from featx import bag_of_words >>> negfeat = bag_of_words(['the', 'plot', 'was', 'ludicrous']) >>> nb_classifier.classify(negfeat) 'neg' >>> posfeat = bag_of_words(['kate', 'winslet', 'is', 'accessible']) >>> nb_classifier.classify(posfeat) 'pos' How it works... The label_feats_from_corpus() function assumes that the corpus is categorized, and that a single file represents a single instance for feature extraction. It iterates over each category label, and extracts features from each file in that category using the feature_detector() function, which defaults to bag_of_words(). It returns a dict whose keys are the category labels, and the values are lists of instances for that category. If we had label_feats_from_corpus() return a list of labeled feature sets instead of a dict, it would be much harder to get balanced training data. The list would be ordered by label, and if you took a slice of it, you would almost certainly be getting far more of one label than another. By returning a dict, you can take slices from the feature sets of each label, in the same proportion that exists in the data. Now we need to split the labeled feature sets into training and testing instances using split_label_feats(). This function allows us to take a fair sample of labeled feature sets from each label, using the split keyword argument to determine the size of the sample. The split argument defaults to 0.75, which means the first 75% of the labeled feature sets for each label will be used for training, and the remaining 25% will be used for testing. Once we have gotten our training and testing feats split up, we train a classifier using the NaiveBayesClassifier.train() method. This class method builds two probability distributions for calculating prior probabilities. 
These are passed into the NaiveBayesClassifier constructor. The label_probdist constructor contains the prior probability for each label, or P(label). The feature_probdist constructor contains P(feature name = feature value | label). In our case, it will store P(word=True | label). Both are calculated based on the frequency of occurrence of each label and each feature name and value in the training data. The NaiveBayesClassifier class inherits from ClassifierI, which requires subclasses to provide a labels() method, and at least one of the classify() or prob_classify() methods. The following diagram shows other methods, which will be covered shortly: There's more... We can test the accuracy of the classifier using nltk.classify.util.accuracy() and the test_feats variable created previously: >>> from nltk.classify.util import accuracy >>> accuracy(nb_classifier, test_feats) 0.728 This tells us that the classifier correctly guessed the label of nearly 73% of the test feature sets. The code in this article is run with the PYTHONHASHSEED=0 environment variable so that accuracy calculations are consistent. If you run the code with a different value for PYTHONHASHSEED, or without setting this environment variable, your accuracy values may differ. Classification probability While the classify() method returns only a single label, you can use the prob_classify() method to get the classification probability of each label. This can be useful if you want to use probability thresholds for classification: >>> probs = nb_classifier.prob_classify(test_feats[0][0]) >>> probs.samples() dict_keys(['neg', 'pos']) >>> probs.max() 'pos' >>> probs.prob('pos') 0.9999999646430913 >>> probs.prob('neg') 3.535688969240647e-08 In this case, the classifier says that the first test instance is nearly 100% likely to be pos. Other instances may have more mixed probabilities. For example, if the classifier says an instance is 60% pos and 40% neg, that means the classifier is 60% sure the instance is pos, but there is a 40% chance that it is neg. It can be useful to know this for situations where you only want to use strongly classified instances, with a threshold of 80% or greater. Most informative features The NaiveBayesClassifier class has two methods that are quite useful for learning about your data. Both methods take a keyword argument n to control how many results to show. The most_informative_features() method returns a list of the form [(feature name, feature value)] ordered by most informative to least informative. In our case, the feature value will always be True: >>> nb_classifier.most_informative_features(n=5)[('magnificent', True), ('outstanding', True), ('insulting', True),('vulnerable', True), ('ludicrous', True)] The show_most_informative_features() method will print out the results from most_informative_features() and will also include the probability of a feature pair belonging to each label: >>> nb_classifier.show_most_informative_features(n=5) Most Informative Features magnificent = True pos : neg = 15.0 : 1.0 outstanding = True pos : neg = 13.6 : 1.0 insulting = True neg : pos = 13.0 : 1.0 vulnerable = True pos : neg = 12.3 : 1.0 ludicrous = True neg : pos = 11.8 : 1.0 The informativeness, or information gain, of each feature pair is based on the prior probability of the feature pair occurring for each label. More informative features are those that occur primarily in one label and not on the other. The less informative features are those that occur frequently with both labels. 
Another way to state this is that the entropy of the classifier decreases more when using a more informative feature. See https://en.wikipedia.org/wiki/Information_gain_in_decision_trees for more on information gain and entropy (while it specifically mentions decision trees, the same concepts are applicable to all classifiers). Training estimator During training, the NaiveBayesClassifier class constructs probability distributions for each feature using an estimator parameter, which defaults to nltk.probability.ELEProbDist. The estimator is used to calculate the probability of a label parameter given a specific feature. In ELEProbDist, ELE stands for Expected Likelihood Estimate, and the formula for calculating the label probabilities for a given feature is (c+0.5)/(N+B/2). Here, c is the count of times a single feature occurs, N is the total number of feature outcomes observed, and B is the number of bins or unique features in the feature set. In cases where the feature values are all True, N == B. In other cases, where the number of times a feature occurs is recorded, then N >= B. You can use any estimator parameter you want, and there are quite a few to choose from. The only constraints are that it must inherit from nltk.probability.ProbDistI and its constructor must take a bins keyword argument. Here's an example using the LaplaceProdDist class, which uses the formula (c+1)/(N+B): >>> from nltk.probability import LaplaceProbDist >>> nb_classifier = NaiveBayesClassifier.train(train_feats, estimator=LaplaceProbDist) >>> accuracy(nb_classifier, test_feats) 0.716 As you can see, accuracy is slightly lower, so choose your estimator parameter carefully. You cannot use nltk.probability.MLEProbDist as the estimator, or any ProbDistI subclass that does not take the bins keyword argument. Training will fail with TypeError: __init__() got an unexpected keyword argument 'bins'. Manual training You don't have to use the train() class method to construct a NaiveBayesClassifier. You can instead create the label_probdist and feature_probdist variables manually. The label_probdist variable should be an instance of ProbDistI, and should contain the prior probabilities for each label. The feature_probdist variable should be a dict whose keys are tuples of the form (label, feature name) and whose values are instances of ProbDistI that have the probabilities for each feature value. In our case, each ProbDistI should have only one value, True=1. Here's a very simple example using a manually constructed DictionaryProbDist class: >>> from nltk.probability import DictionaryProbDist >>> label_probdist = DictionaryProbDist({'pos': 0.5, 'neg': 0.5}) >>> true_probdist = DictionaryProbDist({True: 1}) >>> feature_probdist = {('pos', 'yes'): true_probdist, ('neg', 'no'): true_probdist} >>> classifier = NaiveBayesClassifier(label_probdist, feature_probdist) >>> classifier.classify({'yes': True}) 'pos' >>> classifier.classify({'no': True}) 'neg' See also In the next recipe, we will train the DecisionTreeClassifier classifier. Training a decision tree classifier The DecisionTreeClassifier class works by creating a tree structure, where each node corresponds to a feature name and the branches correspond to the feature values. Tracing down the branches, you get to the leaves of the tree, which are the classification labels. How to do it... 
Using the same train_feats and test_feats variables we created from the movie_reviews corpus in the previous recipe, we can call the DecisionTreeClassifier.train() class method to get a trained classifier. We pass binary=True because all of our features are binary: either the word is present or it's not. For other classification use cases where you have multivalued features, you will want to stick to the default binary=False. In this context, binary refers to feature values, and is not to be confused with a binary classifier. Our word features are binary because the value is either True or the word is not present. If our features could take more than two values, we would have to use binary=False. A binary classifier, on the other hand, is a classifier that only chooses between two labels. In our case, we are training a binary DecisionTreeClassifier on binary features. But it's also possible to have a binary classifier with non-binary features, or a non-binary classifier with binary features. The following is the code for training and evaluating the accuracy of a DecisionTreeClassifier class: >>> dt_classifier = DecisionTreeClassifier.train(train_feats,binary=True, entropy_cutoff=0.8, depth_cutoff=5, support_cutoff=30)>>> accuracy(dt_classifier, test_feats)0.688 The DecisionTreeClassifier class can take much longer to train than the NaiveBayesClassifier class. For that reason, I have overridden the default parameters so it trains faster. These parameters will be explained later. How it works... The DecisionTreeClassifier class, like the NaiveBayesClassifier class, is also an instance of ClassifierI, as shown in the following diagram: During training, the DecisionTreeClassifier class creates a tree where the child nodes are also instances of DecisionTreeClassifier. The leaf nodes contain only a single label, while the intermediate child nodes contain decision mappings for each feature. These decisions map each feature value to another DecisionTreeClassifier, which itself may contain decisions for another feature, or it may be a final leaf node with a classification label. The train() class method builds this tree from the ground up, starting with the leaf nodes. It then refines itself to minimize the number of decisions needed to get to a label by putting the most informative features at the top. To classify, the DecisionTreeClassifier class looks at the given feature set and traces down the tree, using known feature names and values to make decisions. Because we are creating a binary tree, each DecisionTreeClassifier instance also has a default decision tree, which it uses when a known feature is not present in the feature set being classified. This is a common occurrence in text-based feature sets, and indicates that a known word was not in the text being classified. This also contributes information towards a classification decision. There's more... The parameters passed into DecisionTreeClassifier.train() can be tweaked to improve accuracy or decrease training time. Generally, if you want to improve accuracy, you must accept a longer training time and if you want to decrease the training time, the accuracy will most likely decrease as well. But be careful not to optimize for accuracy too much. A really high accuracy may indicate overfitting, which means the classifier will be excellent at classifying the training data, but not so good on data it has never seen. See https://en.wikipedia.org/wiki/Over_fitting for more on this concept. 
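One quick way to check for the overfitting just mentioned (our suggestion, not the book's) is to compare accuracy on the training feats with accuracy on the held-out test feats; dt_classifier, train_feats, and test_feats are the objects built above:

# A large gap between the two numbers is the usual sign of overfitting.
from nltk.classify.util import accuracy

train_acc = accuracy(dt_classifier, train_feats)   # data the tree was trained on
test_acc = accuracy(dt_classifier, test_feats)     # held-out data
print(train_acc, test_acc)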
Controlling uncertainty with entropy_cutoff Entropy is the uncertainty of the outcome. As entropy approaches 1.0, uncertainty increases. Conversely, as entropy approaches 0.0, uncertainty decreases. In other words, when you have similar probabilities, the entropy will be high as each probability has a similar likelihood (or uncertainty of occurrence). But the more the probabilities differ, the lower the entropy will be. The entropy_cutoff value is used during the tree refinement process. The tree refinement process is how the decision tree decides to create new branches. If the entropy of the probability distribution of label choices in the tree is greater than the entropy_cutoff value, then the tree is refined further by creating more branches. But if the entropy is lower than the entropy_cutoff value, then tree refinement is halted. Entropy is calculated by giving nltk.probability.entropy() a MLEProbDist value created from a FreqDist of label counts. Here's an example showing the entropy of various FreqDist values. The value of 'pos' is kept at 30, while the value of 'neg' is manipulated to show that when 'neg' is close to 'pos', entropy increases, but when it is closer to 1, entropy decreases: >>> from nltk.probability import FreqDist, MLEProbDist, entropy >>> fd = FreqDist({'pos': 30, 'neg': 10}) >>> entropy(MLEProbDist(fd)) 0.8112781244591328 >>> fd['neg'] = 25 >>> entropy(MLEProbDist(fd)) 0.9940302114769565 >>> fd['neg'] = 30 >>> entropy(MLEProbDist(fd)) 1.0 >>> fd['neg'] = 1 >>> entropy(MLEProbDist(fd)) 0.20559250818508304 What this all means is that if the label occurrence is very skewed one way or the other, the tree doesn't need to be refined because entropy/uncertainty is low. But when the entropy is greater than entropy_cutoff, then the tree must be refined with further decisions to reduce the uncertainty. Higher values of entropy_cutoff will decrease both accuracy and training time. Controlling tree depth with depth_cutoff The depth_cutoff value is also used during refinement to control the depth of the tree. The final decision tree will never be deeper than the depth_cutoff value. The default value is 100, which means that classification may require up to 100 decisions before reaching a leaf node. Decreasing the depth_cutoff value will decrease the training time and most likely decrease the accuracy as well. Controlling decisions with support_cutoff The support_cutoff value controls how many labeled feature sets are required to refine the tree. As the DecisionTreeClassifier class refines itself, labeled feature sets are eliminated once they no longer provide value to the training process. When the number of labeled feature sets is less than or equal to support_cutoff, refinement stops, at least for that section of the tree. Another way to look at it is that support_cutoff specifies the minimum number of instances that are required to make a decision about a feature. If support_cutoff is 20, and you have less than 20 labeled feature sets with a given feature, then you don't have enough instances to make a good decision, and refinement around that feature must come to a stop. See also The previous recipe covered the creation of training and test feature sets from the movie_reviews corpus. Summary In this article, we learned how to transform text into feature dictionaries, and how to train a text classifier for sentiment analysis. 
Resources for Article: Further resources on this subject: Python Libraries for Geospatial Development [article] Python Testing: Installing the Robot Framework [article] Ten IPython essentials [article]

Camera Calibration

Packt
25 Aug 2014
18 min read
This article by Robert Laganière, author of OpenCV Computer Vision Application Programming Cookbook Second Edition, includes that images are generally produced using a digital camera, which captures a scene by projecting light going through its lens onto an image sensor. The fact that an image is formed by the projection of a 3D scene onto a 2D plane implies the existence of important relationships between a scene and its image and between different images of the same scene. Projective geometry is the tool that is used to describe and characterize, in mathematical terms, the process of image formation. In this article, we will introduce you to some of the fundamental projective relations that exist in multiview imagery and explain how these can be used in computer vision programming. You will learn how matching can be made more accurate through the use of projective constraints and how a mosaic from multiple images can be composited using two-view relations. Before we start the recipe, let's explore the basic concepts related to scene projection and image formation. (For more resources related to this topic, see here.) Image formation Fundamentally, the process used to produce images has not changed since the beginning of photography. The light coming from an observed scene is captured by a camera through a frontal aperture; the captured light rays hit an image plane (or an image sensor) located at the back of the camera. Additionally, a lens is used to concentrate the rays coming from the different scene elements. This process is illustrated by the following figure: Here, do is the distance from the lens to the observed object, di is the distance from the lens to the image plane, and f is the focal length of the lens. These quantities are related by the so-called thin lens equation: In computer vision, this camera model can be simplified in a number of ways. First, we can neglect the effect of the lens by considering that we have a camera with an infinitesimal aperture since, in theory, this does not change the image appearance. (However, by doing so, we ignore the focusing effect by creating an image with an infinite depth of field.) In this case, therefore, only the central ray is considered. Second, since most of the time we have do>>di, we can assume that the image plane is located at the focal distance. Finally, we can note from the geometry of the system that the image on the plane is inverted. We can obtain an identical but upright image by simply positioning the image plane in front of the lens. Obviously, this is not physically feasible, but from a mathematical point of view, this is completely equivalent. This simplified model is often referred to as the pin-hole camera model, and it is represented as follows: From this model, and using the law of similar triangles, we can easily derive the basic projective equation that relates a pictured object with its image: The size (hi) of the image of an object (of height ho) is therefore inversely proportional to its distance (do) from the camera, which is naturally true. In general, this relation describes where a 3D scene point will be projected on the image plane given the geometry of the camera. Calibrating a camera From the introduction of this article, we learned that the essential parameters of a camera under the pin-hole model are its focal length and the size of the image plane (which defines the field of view of the camera). 
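For reference, the thin lens equation and the projective relation mentioned in the Image formation section are given here in their standard textbook form (standard notation rather than the article's original figures, with do the object distance, di the image distance, f the focal length, and ho and hi the object and image heights):

Thin lens equation:
\frac{1}{d_o} + \frac{1}{d_i} = \frac{1}{f}

Pin-hole projection (image plane at the focal distance, d_i \approx f):
h_i = \frac{f \, h_o}{d_o}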
Also, since we are dealing with digital images, the number of pixels on the image plane (its resolution) is another important characteristic of a camera. Finally, in order to be able to compute the position of an image's scene point in pixel coordinates, we need one additional piece of information. Considering the line coming from the focal point that is orthogonal to the image plane, we need to know at which pixel position this line pierces the image plane. This point is called the principal point. It might be logical to assume that this principal point is at the center of the image plane, but in practice, this point might be off by a few pixels depending on the precision at which the camera has been manufactured. Camera calibration is the process by which the different camera parameters are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. Camera calibration will proceed by showing known patterns to the camera and analyzing the obtained images. An optimization process will then determine the optimal parameter values that explain the observations. This is a complex process that has been made easy by the availability of OpenCV calibration functions. How to do it... To calibrate a camera, the idea is to show it a set of scene points for which their 3D positions are known. Then, you need to observe where these points project on the image. With the knowledge of a sufficient number of 3D points and associated 2D image points, the exact camera parameters can be inferred from the projective equation. Obviously, for accurate results, we need to observe as many points as possible. One way to achieve this would be to take one picture of a scene with many known 3D points, but in practice, this is rarely feasible. A more convenient way is to take several images of a set of some 3D points from different viewpoints. This approach is simpler but requires you to compute the position of each camera view in addition to the computation of the internal camera parameters, which fortunately is feasible. OpenCV proposes that you use a chessboard pattern to generate the set of 3D scene points required for calibration. This pattern creates points at the corners of each square, and since this pattern is flat, we can freely assume that the board is located at Z=0, with the X and Y axes well-aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. Here is one example of a 6x4 calibration pattern image: The good thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (the number of horizontal and vertical inner corner points). The function will return the position of these chessboard corners on the image. If the function fails to find the pattern, then it simply returns false: // output vectors of image points std::vector<cv::Point2f> imageCorners; // number of inner corners on the chessboard cv::Size boardSize(6,4); // Get the chessboard corners bool found = cv::findChessboardCorners(image, boardSize, imageCorners); The output parameter, imageCorners, will simply contain the pixel coordinates of the detected inner corners of the shown pattern. Note that this function accepts additional parameters if you needs to tune the algorithm, which are not discussed here. 
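Those extra parameters are passed as an optional flags argument. As a rough illustration (this line is not from the recipe, and the particular combination of flags is just an example), the corner search can be made more tolerant of uneven lighting and can give up early on images that contain no chessboard at all:

// Adaptive thresholding and image normalization before the corner search,
// plus a fast check that rejects images without a visible chessboard.
bool patternFound = cv::findChessboardCorners(image, boardSize, imageCorners,
                        CV_CALIB_CB_ADAPTIVE_THRESH |
                        CV_CALIB_CB_NORMALIZE_IMAGE |
                        CV_CALIB_CB_FAST_CHECK);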
There is also a special function that draws the detected corners on the chessboard image, with lines connecting them in a sequence: //Draw the corners cv::drawChessboardCorners(image, boardSize, imageCorners, found); // corners have been found The following image is obtained: The lines that connect the points show the order in which the points are listed in the vector of detected image points. To perform a calibration, we now need to specify the corresponding 3D points. You can specify these points in the units of your choice (for example, in centimeters or in inches); however, the simplest is to assume that each square represents one unit. In that case, the coordinates of the first point would be (0,0,0) (assuming that the board is located at a depth of Z=0), the coordinates of the second point would be (1,0,0), and so on, the last point being located at (5,3,0). There are a total of 24 points in this pattern, which is too small to obtain an accurate calibration. To get more points, you need to show more images of the same calibration pattern from various points of view. To do so, you can either move the pattern in front of the camera or move the camera around the board; from a mathematical point of view, this is completely equivalent. The OpenCV calibration function assumes that the reference frame is fixed on the calibration pattern and will calculate the rotation and translation of the camera with respect to the reference frame. Let's now encapsulate the calibration process in a CameraCalibrator class. The attributes of this class are as follows: class CameraCalibrator { // input points: // the points in world coordinates std::vector<std::vector<cv::Point3f>> objectPoints; // the point positions in pixels std::vector<std::vector<cv::Point2f>> imagePoints; // output Matrices cv::Mat cameraMatrix; cv::Mat distCoeffs; // flag to specify how calibration is done int flag; Note that the input vectors of the scene and image points are in fact made of std::vector of point instances; each vector element is a vector of the points from one view. 
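The addPoints helper that the following listing relies on is not shown in this excerpt. A minimal version consistent with the attributes above would look something like the following sketch (an assumption on our part, not the book's own listing):

// Add detected corners and the matching scene points for one view.
void CameraCalibrator::addPoints(const std::vector<cv::Point2f>& imageCorners,
                                 const std::vector<cv::Point3f>& objectCorners) {
    // One entry per processed view, matching the nested vector structure
    // described in the note above.
    imagePoints.push_back(imageCorners);
    objectPoints.push_back(objectCorners);
}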
Here, we decided to add the calibration points by specifying a vector of the chessboard image filename as input: // Open chessboard images and extract corner points int CameraCalibrator::addChessboardPoints( const std::vector<std::string>& filelist, cv::Size & boardSize) { // the points on the chessboard std::vector<cv::Point2f> imageCorners; std::vector<cv::Point3f> objectCorners; // 3D Scene Points: // Initialize the chessboard corners // in the chessboard reference frame // The corners are at 3D location (X,Y,Z)= (i,j,0) for (int i=0; i<boardSize.height; i++) { for (int j=0; j<boardSize.width; j++) { objectCorners.push_back(cv::Point3f(i, j, 0.0f)); } } // 2D Image points: cv::Mat image; // to contain chessboard image int successes = 0; // for all viewpoints for (int i=0; i<filelist.size(); i++) { // Open the image image = cv::imread(filelist[i],0); // Get the chessboard corners bool found = cv::findChessboardCorners( image, boardSize, imageCorners); // Get subpixel accuracy on the corners cv::cornerSubPix(image, imageCorners, cv::Size(5,5), cv::Size(-1,-1), cv::TermCriteria(cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS, 30, // max number of iterations 0.1)); // min accuracy //If we have a good board, add it to our data if (imageCorners.size() == boardSize.area()) { // Add image and scene points from one view addPoints(imageCorners, objectCorners); successes++; } } return successes; } The first loop inputs the 3D coordinates of the chessboard, and the corresponding image points are the ones provided by the cv::findChessboardCorners function. This is done for all the available viewpoints. Moreover, in order to obtain a more accurate image point location, the cv::cornerSubPix function can be used, and as the name suggests, the image points will then be localized at a subpixel accuracy. The termination criterion that is specified by the cv::TermCriteria object defines the maximum number of iterations and the minimum accuracy in subpixel coordinates. The first of these two conditions that is reached will stop the corner refinement process. When a set of chessboard corners have been successfully detected, these points are added to our vectors of the image and scene points using our addPoints method. Once a sufficient number of chessboard images have been processed (and consequently, a large number of 3D scene point / 2D image point correspondences are available), we can initiate the computation of the calibration parameters as follows: // Calibrate the camera // returns the re-projection error double CameraCalibrator::calibrate(cv::Size &imageSize) { //Output rotations and translations std::vector<cv::Mat> rvecs, tvecs; // start calibration return calibrateCamera(objectPoints, // the 3D points imagePoints, // the image points imageSize, // image size cameraMatrix, // output camera matrix distCoeffs, // output distortion matrix rvecs, tvecs, // Rs, Ts flag); // set options } In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. These will be described in the next section. How it works... In order to explain the result of the calibration, we need to go back to the figure in the introduction, which describes the pin-hole camera model. More specifically, we want to demonstrate the relationship between a point in 3D at the position (X,Y,Z) and its image (x,y) on a camera specified in pixel coordinates. 
Let's redraw this figure by adding a reference frame that we position at the center of the projection as seen here: Note that the y axis is pointing downward to get a coordinate system compatible with the usual convention that places the image origin at the upper-left corner. We learned previously that the point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z). Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by the pixel's width (px) and height (py), respectively. Note that by dividing the focal length given in world units (generally given in millimeters) by px, we obtain the focal length expressed in (horizontal) pixels. Let's then define this term as fx. Similarly, fy =f/py is defined as the focal length expressed in vertical pixel units. Therefore, the complete projective equation is as follows: Recall that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. These equations can be rewritten in the matrix form through the introduction of homogeneous coordinates, in which 2D points are represented by 3-vectors and 3D points are represented by 4-vectors (the extra coordinate is simply an arbitrary scale factor, S, that needs to be removed when a 2D coordinate needs to be extracted from a homogeneous 3-vector). Here is the rewritten projective equation: The second matrix is a simple projection matrix. The first matrix includes all of the camera parameters, which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that returns the value of the intrinsic parameters given by a calibration matrix. More generally, when the reference frame is not at the projection center of the camera, we will need to add a rotation vector (a 3x3 matrix) and a translation vector (a 3x1 matrix). These two matrices describe the rigid transformation that must be applied to the 3D points in order to bring them back to the camera reference frame. Therefore, we can rewrite the projection equation in its most general form: Remember that in our calibration example, the reference frame was placed on the chessboard. Therefore, there is a rigid transformation (made of a rotation component represented by the matrix entries r1 to r9 and a translation represented by t1, t2, and t3) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration, and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system. The intrinsic parameters of our test camera obtained from a calibration based on 20 chessboard images are fx=167, fy=178, u0=156, and v0=119. These results are obtained by cv::calibrateCamera through an optimization process aimed at finding the intrinsic and extrinsic parameters that will minimize the difference between the predicted image point position, as computed from the projection of the 3D scene points, and the actual image point position, as observed on the image. The sum of this difference for all the points specified during the calibration is called the re-projection error. Let's now turn our attention to the distortion parameters. So far, we have mentioned that under the pin-hole camera model, we can neglect the effect of the lens. 
However, this is only possible if the lens that is used to capture an image does not introduce important optical distortions. Unfortunately, this is not the case with lower quality lenses or with lenses that have a very short focal length. You may have already noted that the chessboard pattern shown in the image that we used for our example is clearly distorted—the edges of the rectangular board are curved in the image. Also, note that this distortion becomes more important as we move away from the center of the image. This is a typical distortion observed with a fish-eye lens, and it is called radial distortion. The lenses used in common digital cameras usually do not exhibit such a high degree of distortion, but in the case of the lens used here, these distortions certainly cannot be ignored. It is possible to compensate for these deformations by introducing an appropriate distortion model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be reverted in order to undo the distortions visible on the image. Fortunately, the exact parameters of the transformation that will correct the distortions can be obtained together with the other camera parameters during the calibration phase. Once this is done, any image from the newly calibrated camera will be undistorted. Therefore, we have added an additional method to our calibration class: // remove distortion in an image (after calibration) cv::Mat CameraCalibrator::remap(const cv::Mat &image) { cv::Mat undistorted; if (mustInitUndistort) { // called once per calibration cv::initUndistortRectifyMap( cameraMatrix, // computed camera matrix distCoeffs, // computed distortion matrix cv::Mat(), // optional rectification (none) cv::Mat(), // camera matrix to generate undistorted image.size(), // size of undistorted CV_32FC1, // type of output map map1, map2); // the x and y mapping functions mustInitUndistort= false; } // Apply mapping functions cv::remap(image, undistorted, map1, map2, cv::INTER_LINEAR); // interpolation type return undistorted; } Running this code results in the following image: As you can see, once the image is undistorted, we obtain a regular perspective image. To correct the distortion, OpenCV uses a polynomial function that is applied to the image points in order to move them at their undistorted position. By default, five coefficients are used; a model made of eight coefficients is also available. Once these coefficients are obtained, it is possible to compute two cv::Mat mapping functions (one for the x coordinate and one for the y coordinate) that will give the new undistorted position of an image point on a distorted image. This is computed by the cv::initUndistortRectifyMap function, and the cv::remap function remaps all the points of an input image to a new image. Note that because of the nonlinear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you will now obtain output pixels that have no values in the input image (they will then be displayed as black pixels). There's more... More options are available when it comes to camera calibration. Calibration with known intrinsic parameters When a good estimate of the camera's intrinsic parameters is known, it could be advantageous to input them in the cv::calibrateCamera function. They will then be used as initial values in the optimization process. 
To do so, you just need to add the CV_CALIB_USE_INTRINSIC_GUESS flag and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (CV_CALIB_FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (CV_CALIB_FIX_ASPECT_RATIO); in that case, you assume that the pixels are square.

Using a grid of circles for calibration

Instead of the usual chessboard pattern, OpenCV also offers the possibility to calibrate a camera by using a grid of circles. In this case, the centers of the circles are used as calibration points. The corresponding function is very similar to the function we used to locate the chessboard corners:

cv::Size boardSize(7,7);
std::vector<cv::Point2f> centers;
bool found = cv::findCirclesGrid(image, boardSize, centers);

See also

A flexible new technique for camera calibration by Z. Zhang, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000, is a classic paper on the problem of camera calibration.

Summary

In this article, we explored the projective relations that exist between two images of the same scene.

Resources for Article: Further resources on this subject: Creating an Application from Scratch [Article] Wrapping OpenCV [Article] New functionality in OpenCV 3.0 [Article]

Report Data Filtering

Packt
25 Aug 2014
13 min read
In this article, written by Yoav Yahav, author of the book SAP BusinessObjects Reporting Cookbook, we will cover the following recipes: Applying a simple filter Working with the filter bar Using input controls Working with an element link (For more resources related to this topic, see here.) Filtering data can be done in several ways. We can filter the results at the query level when there is a requirement to use a mandatory filter or set of filters that will fetch only specific types of rows that will correspond to the business question; otherwise, the report won't be accurate or useful. The other level of filtering is performed at the report level. This level of filtering interacts with the data that was retrieved by the user and enables us to eliminate irrelevant rows. The main question that arises when using a report-level filter is why shouldn't we implement filters in the query level? Well, the answer has various reasons, which are as follows: We need to compare and analyze just a part of the entire data that the query retrieved (for example, filtering the first quarter's data out of the current year's entire dataset) We need to view the data separately, for example, each tab can be filtered by a different value (for example, each report tab can display a different region's data) We need to filter measure objects that are different from the aggregative level of the query; for example, we have retrieved a well-detailed query displaying sales of various products at the customer level, but we also need to display only the products that had income of more than one million dollars in another report tab The business user requires interactive functionality from the filter: a drop-down box, a checklist, a spinner, or a slider—capabilities that can't be performed by a query filter We need to perform additional calculations on a variable in the report and apply a filter to it In this article, we will explore the different types of filters that can be applied in reports: simple ones, interactive ones, and filters that can combine interactivity and a custom look and feel adjusted by the business user. Applying a simple filter The first type of filter is a basic one that enables us to implement quick and simple filter logic, which is similar to the way we build it on the query panel. Getting ready We have created a query that retrieves a dataset displaying the Net Sales by Product, Line, and Year. Using a simple filter, we would like to filter only the year 2008 records as well as the Sports Line. How to do it... Perform the following steps to apply a simple filter: We will navigate to the Analysis toolbar, and in the Filters tab, click on the Filter icon and choose Add Filter, as shown in the following screenshot: In the Report Filter window, as shown in the following screenshot, we will be able to add filters, edit them, and apply them on a specific table or on the entire report tab: By clicking on the Add filter icon located in the top-left corner, we will be able to add a condition. 
Clicking on this button will open the list of existing objects in the report; by choosing the Year object, we will add our first filter, as shown in the following screenshot: After we choose the Year object, a filter condition structure will appear in the top-right corner of the window, enabling us to pick an operator and a value similar to the way we establish query filters, as shown in the following screenshot: We will add a second filter as well using the Add filter button and adding the Line object to the filter area. The AND operator will appear between the two filters, establishing an intersection relationship between them. This operator can be easily changed to the OR operator by clicking on it. The table will be affected accordingly and will display only the year 2008 and the Sports Line records, as shown: In order to edit the filter, we can either access it through the Analysis toolbar or mark one of the filtered columns, enabling us to get an easier edit using the toolbar or the right-click menu, as shown in the following screenshot: How it works... The report filter simply corresponds to the values defined in the filtered set of conditions that are established by simple and easy use of the filter panel. The filters can be applied on any type of data display, table, or chart. Like the query filters, the report filters use the logic of the operators AND/OR as well and can be used by clicking on the operator name. In order to view the filters that have been applied to the report tabs and tables, you can navigate to the document structure and filter's left-hand side panel and click on the Filter button. There's more... Filters can be applied on a specific table in the report or on the entire report. In order to switch between these options, when you create the filter, you need to mark the report area. To create a report-level filter or a specific column in a table, you need to filter a specific table in the report tab. Working with the filter bar Another great functionality that filters can provide us is interaction with the report data. There are cases when we are required to perform quick filtering as well as switch dynamically between values as we need to analyze different filtered datasets. Working with the filter bar can address these requirements simply and easily. Getting ready We want to perform a quick dynamic filtering on our existing table by adding the Country dimension object to the filter bar. How to do it.... Perform the following steps: By navigating to the Analysis toolbar and then to the Interact tab, we will click on the Filter Bar icon: By doing so, a gray filter pane area will appear under the formula bar with a guiding message saying Drop objects here to add report filters, as shown in the following screenshot: In order to create a new filter, we can either drag an object directly to the filter pane area from the Available Objects pane or use the quick add icon located in the filter bar on the left-hand side of the message. In our scenario, we will use the Available Objects pane and drag the Country dimension object directly to the filter bar: By adding the Country object to the filter bar, a quick drop-down list filter will be created, enabling us to filter the table data by choosing any country value: This filter will enable us to quickly create filtered sets of data using the drop-down list as well as using an interactive component that doesn't have to be in the table. How it works... 
The filter bar is an interactive component that enables us to create as many dynamic filters as we need and to locate them in a single filtering area for easy control of the data, and they aren't even required to appear in the table itself. The filter bar is restricted to filter only single values; in order to filter several values, we will need to either use a different type of filter, such as input control, or create a grouped values logic. There's more... When we drop several dimension objects onto the filter pane, they will be displayed accordingly; however, a cascading effect of filters (picking a specific country in the first filter and in the second filter seeing only that country) will be supported only if hierarchies have been defined in the universe. Using input controls Input controls are another type of filter that enable us to interact with the report data. An input control performs as an interactive left-hand side panel, which can be created in various types that have a different look and feel as well as a different functionality. We can use an input control to address the following tasks: Applying a different look and feel to the filter—making filters more intuitive and easy to operate (using radio buttons, comboboxes, and other filter types) Applying multiple values Applying dynamic filters to measure values using input control components, such as spinners and sliders Enabling a default value option and a custom list of values Getting ready In this example, we will filter several components in the report area, a chart and a table, using the Region dimension object. We will be using the multiple value option to enhance the filter functionality. First, we will navigate to the input control panel located in the left-hand side area as the third option from the top and click on the New option. How to do it... Perform the following steps: We will choose the object that we need to filter with the table and the chart, as shown in the following screenshot: After choosing the Region object, we will move on to the Define Input Control window. Input controls enable multiple-value functionality, and in the Choose Control Type window, we will choose the Check boxes input control type, as shown in the following screenshot: In the input control properties located at the right-hand side, we can also add a description to the input control, set a default value, and customize the list of values if we need specific values. After choosing the Check boxes component, we will advance to the next window, choosing the data element we want to apply the control on. We will tick both of the components, the chart and the table, in order to affect all the data components using a single control, as shown in the following screenshot: By clicking on the Finish button, the input control will appear at the left-hand side: We can easily change the selected values to all values (Select All), one, or several values, filtering both of the tables as shown: How it works... As we have seen, input controls act as special interactive filters that can be used by picking one of them from the input control templates, that is, the type that is the most suitable to filter the data in the report. Our main consideration when choosing an input control is to determine the type of list we need to pick: single or multiple. 
The second consideration should be the interactive functionality that we need from such a control: a simple value pick or perhaps an arithmetic operator, such as a greater or less than operator, which can be applied to a measure object. There's more... An input control can also be created using the Analysis toolbar and the Filter tab. In order to edit the existing input control, we can access the mini toolbar above the input control. Here, we will be able to edit the control, show its dependencies (the elements that are affected by it), or delete it, as shown in the following screenshot: We can also display a single informative cell describing which table and report a filter has been applied on. This useful option can be applied by navigating to the Report Element toolbar, choosing Report Filter Summary from the Cell subtoolbar, and dragging it to the report area, as shown in the following screenshot: By clicking on the Map button, we will switch to a graphical tree view of the input control, showing the values that were picked in the filter as well as its dependencies: If you need to display the values of the input control in the report area, simply drag the object that you used in the control to the report area, turn it into a horizontal table, and then edit the dependencies of the control so that it will be applied on the new table as well. Working with an element link An element link is a feature designed to pass a value from an existing table or a chart to another data component in the report area. Element links transform the values in a table or a chart into dynamic values that can filter other data elements. The main difference between element links and other types of filtering is that when using an element link, we are actually using and working within a table, keeping its structure and passing a parameter from it to another structure. This feature can be great to work with when we are using a detailed table and want to use its values to filter another chart that will visualize the summarized data and vice versa. How to do it... Perform the following steps: We will pass the Country value from the detailed table to the line quantity sales pie chart, enabling the business user to filter the pie dynamically while working with the detailed table. By clicking on the Country column, we will navigate in the speed menu to Linking | Add Element Link, as shown in the following screenshot: In the next window, we will choose the passing method and decide whether to pass the entire row values or a Single object value. In our example, we will use the Single object option, as shown in the following screenshot: In the next screen, we will be able to add a description of our choice to the element link. And finally, we will define the dependencies via the report elements that we want to pass the country value to, as shown in the following screenshot: By clicking on the Finish button, we will switch to the report view, and by marking the Country column or any other column, we will be able to pass the Country value to the pie chart, as shown in the following screenshot: By clicking on a different Country value, such as Colombia, we will be able to pass it to the pie and filter the results accordingly: Notice that the pie chart results have changed and that the country value is marked in bold inside the tooltip box, showing the column that was actually used to pass the value. How it works... The element link simply links the tables and other data components. 
It is actually a type of input control designed to work directly from a table rather than a filter component panel or a bar. By clicking on any country value, we simply pass it to the dependency component that uses the value as an input in order to present the relevant data. There's more... An element link can be edited and adjusted in a way similar to the way in which an input control is edited. By right-clicking on the Element Link icon located on the header of the rightmost column, we will be able to edit it, as shown in the following screenshot: Another good way to view the element link status and edit it is to switch to the Input Controls panel where you can view it as well, as shown in the following screenshot: Summary In this article, we came to know about the filtering techniques we can apply to the report tables and charts. Resources for Article: Further resources on this subject: Exporting SAP BusinessObjects Dashboards into Different Environments [Article] SAP BusinessObjects: Customizing the Dashboard [Article] SAP HANA Architecture [Article]

Recommender systems dissected

Packt
21 Aug 2014
4 min read
In this article by Rik Van Bruggen, the author of Learning Neo4j, we will take a look at so-called recommender systems. Such systems, broadly speaking, consist of two elementary parts: (For more resources related to this topic, see here.) A pattern discovery system: This is the system that somehow figures out what would be a useful recommendation for a particular target group. This discovery can be done in many different ways, but in general we see three ways to do so: A business expert who thoroughly understands the domain of the graph database application will use this understanding to determine useful recommendations. For example, the supervisor of a "do-it-yourself" retail outlet would understand a particular pattern. Suppose that if someone came in to buy multiple pots of paint, they would probably also benefit from getting a gentle recommendation for a promotion of high-end brushes. The store has that promotion going on right then, so the recommendation would be very timely. This process would be a discovery process where the pattern that the business expert has discovered would be applied and used in graph databases in real time as part of a sophisticated recommendation. A visual discovery of a specific pattern in the graph representation of the business domain. We have found that in many different projects, business users have stumbled upon these kinds of patterns while using graph visualizations to look at their data. Specific patterns emerge, unexpected relations jump out, or, in more advanced visualization solutions, specific clusters of activity all of a sudden become visible and require further investigation. Graph databases such as Neo4j, and the visualization solutions that complement it, can play a wonderfully powerful role in this process. An algorithmic discovery of a pattern in a dataset of the business domain using machine learning algorithms to uncover previously unknown patterns in the data. Typically, these processes require an iterative, raw number-crunching approach that has also been used on non-graph data formats in the past. It remains part-art part-science at this point, but can of course yield interesting insights if applied correctly. An overview of recommender systems A pattern application system: All recommender systems will be more or less successful based on not just the patterns that they are able to discover, but also based on the way that they are able to apply these patterns in business applications. We can look at different types of applications for these patterns: Batch-oriented applications: Some applications of these patterns are not as time-critical as one would expect. It does not really matter in any material kind of way if a bulk e-mail with recommendations, or worse, a printed voucher with recommended discounted products, gets delivered to the customer or prospect at 10 am or 11 am. Batch solutions can usually cope with these kinds of requests, even if they do so in the most inefficient way. Real-time oriented applications: Some pattern applications simply have to be delivered in real time, in between a web request and a web response, and cannot be precalculated. For these types of systems, which typically use more complex database queries in order to match for the appropriate recommendation to make, graph databases such as Neo4j are a fantastic tool to have. We will illustrate this going forward. With this classification behind us, we will look at an example dataset and some example queries to make this topic come alive. 
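To give a flavor of the real-time case before we build the dataset, the query such a system runs between a web request and its response looks roughly like the following Cypher sketch. The labels, property names, and relationship type used here (Person, name, BOUGHT) are illustrative placeholders rather than the dataset we are about to create:

// People who bought what this person bought also bought...
MATCH (me:Person {name: 'Alice'})-[:BOUGHT]->(:Product)
      <-[:BOUGHT]-(:Person)-[:BOUGHT]->(rec:Product)
WHERE NOT (me)-[:BOUGHT]->(rec)
RETURN rec.name AS recommendation, count(*) AS strength
ORDER BY strength DESC
LIMIT 5;

Everything hinges on traversing from one customer to similar customers and back out to products, which is exactly the kind of traversal a graph database is designed to answer quickly.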
Using a graph model for recommendations

We will be using a very specific data model for our recommender system. All we have changed is that we added a couple of products and brands to the model, and inserted the corresponding data into the database. In total, we added the following:

Ten products
Three product brands
Fifty relationships between existing person nodes and the mentioned products, indicating that these persons bought these products

These are the products and brands that we added:

Adding products and brands to the dataset

The following diagram shows the resulting model:

In Neo4j, that model will look something like the following:

A dataset like this one, while of course a broad simplification, offers us some interesting possibilities for a recommender system. Let's take a look at some queries that could really match this use case, and that would allow us to exploit the data in this dataset, either visually or in real time, in a product recommendation application.

Summary

This article dissected the so-called recommender systems.

Resources for Article: Further resources on this subject: Working with a Neo4j Embedded Database [article] Comparative Study of NoSQL Products [article] Getting Started with CouchDB and Futon [article]

Using Canvas and D3

Packt
21 Aug 2014
9 min read
This article by Pablo Navarro Castillo, author of Mastering D3.js, includes that most of the time, we use D3 to create SVG-based charts and visualizations. If the number of elements to render is huge, or if we need to render raster images, it can be more convenient to render our visualizations using the HTML5 canvas element. In this article, we will learn how to use the force layout of D3 and the canvas element to create an animated network visualization with random data. (For more resources related to this topic, see here.) Creating figures with canvas The HTML canvas element allows you to create raster graphics using JavaScript. It was first introduced in HTML5. It enjoys more widespread support than SVG, and can be used as a fallback option. Before diving deeper into integrating canvas and D3, we will construct a small example with canvas. The canvas element should have the width and height attributes. This alone will create an invisible figure of the specified size: <!— Canvas Element --> <canvas id="canvas-demo" width="650px" height="60px"></canvas> If the browser supports the canvas element, it will ignore any element inside the canvas tags. On the other hand, if the browser doesn't support the canvas, it will ignore the canvas tags, but it will interpret the content of the element. This behavior provides a quick way to handle the lack of canvas support: <!— Canvas Element --> <canvas id="canvas-demo" width="650px" height="60px"> <!-- Fallback image --> <img src="img/fallback-img.png" width="650" height="60"></img> </canvas> If the browser doesn't support canvas, the fallback image will be displayed. Note that unlike the <img> element, the canvas closing tag (</canvas>) is mandatory. To create figures with canvas, we don't need special libraries; we can create the shapes using the canvas API: <script> // Graphic Variables var barw = 65, barh = 60; // Append a canvas element, set its size and get the node. var canvas = document.getElementById('canvas-demo'); // Get the rendering context. var context = canvas.getContext('2d'); // Array with colors, to have one rectangle of each color. var color = ['#5c3566', '#6c475b', '#7c584f', '#8c6a44', '#9c7c39', '#ad8d2d', '#bd9f22', '#cdb117', '#ddc20b', '#edd400']; // Set the fill color and render ten rectangles. for (var k = 0; k < 10; k += 1) { // Set the fill color. context.fillStyle = color[k]; // Create a rectangle in incremental positions. context.fillRect(k * barw, 0, barw, barh); } </script> We use the DOM API to access the canvas element with the canvas-demo ID and to get the rendering context. Then we set the color using the fillStyle method, and use the fillRect canvas method to create a small rectangle. Note that we need to change fillStyle every time or all the following shapes will have the same color. The script will render a series of rectangles, each one filled with a different color, shown as follows: A graphic created with canvas Canvas uses the same coordinate system as SVG, with the origin in the top-left corner, the horizontal axis augmenting to the right, and the vertical axis augmenting to the bottom. Instead of using the DOM API to get the canvas node, we could have used D3 to create the node, set its attributes, and created scales for the color and position of the shapes. Note that the shapes drawn with canvas don't exists in the DOM tree; so, we can't use the usual D3 pattern of creating a selection, binding the data items, and appending the elements if we are using canvas. 
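The earlier remark about letting D3 create the node and compute scales, while canvas does the drawing, can be made concrete with a small sketch. It is not part of the article's example: it assumes D3 3.x and an empty div with the ID canvas-demo2, and it redraws the same ten bars using an ordinal color scale instead of the hardcoded color array:

// Create the canvas node with D3; node() returns the raw DOM element.
var canvas = d3.select('#canvas-demo2').append('canvas')
    .attr('width', 650)
    .attr('height', 60)
    .node();

var context = canvas.getContext('2d');

// Let D3 compute the colors and the horizontal positions.
var color = d3.scale.ordinal()
    .domain(d3.range(10))
    .range(['#5c3566', '#6c475b', '#7c584f', '#8c6a44', '#9c7c39',
            '#ad8d2d', '#bd9f22', '#cdb117', '#ddc20b', '#edd400']);

var xScale = d3.scale.linear()
    .domain([0, 10])
    .range([0, 650]);

d3.range(10).forEach(function(k) {
    context.fillStyle = color(k);
    context.fillRect(xScale(k), 0, xScale(k + 1) - xScale(k), 60);
});

The drawing itself is still plain canvas calls; D3 only supplies the arithmetic for positions and colors.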
Creating shapes Canvas has fewer primitives than SVG. In fact, almost all the shapes must be drawn with paths, and more steps are needed to create a path. To create a shape, we need to open the path, move the cursor to the desired location, create the shape, and close the path. Then, we can draw the path by filling the shape or rendering the outline. For instance, to draw a red semicircle centered in (325, 30) and with a radius of 20, write the following code: // Create a red semicircle. context.beginPath(); context.fillStyle = '#ff0000'; context.moveTo(325, 30); context.arc(325, 30, 20, Math.PI / 2, 3 * Math.PI / 2); context.fill(); The moveTo method is a bit redundant here, because the arc method moves the cursor implicitly. The arguments of the arc method are the x and y coordinates of the arc center, the radius, and the starting and ending angle of the arc. There is also an optional Boolean argument to indicate whether the arc should be drawn counterclockwise. A basic shape created with the canvas API is shown in the following screenshot: Integrating canvas and D3 We will create a small network chart using the force layout of D3 and canvas instead of SVG. To make the graph looks more interesting, we will randomly generate the data. We will generate 250 nodes sparsely connected. The nodes and links will be stored as the attributes of the data object: // Number of Nodes var nNodes = 250, createLink = false; // Dataset Structure var data = {nodes: [],links: []}; We will append nodes and links to our dataset. We will create nodes with a radius attribute randomly assigning it a value of either 2 or 4 as follows: // Iterate in the nodes for (var k = 0; k < nNodes; k += 1) { // Create a node with a random radius. data.nodes.push({radius: (Math.random() > 0.3) ? 2 : 4}); // Create random links between the nodes. } We will create a link with probability of 0.1 only if the difference between the source and target indexes are less than 8. The idea behind this way to create links is to have only a few connections between the nodes: // Create random links between the nodes. for (var j = k + 1; j < nNodes; j += 1) { // Create a link with probability 0.1 createLink = (Math.random() < 0.1) && (Math.abs(k - j) < 8); if (createLink) { // Append a link with variable distance between the nodes data.links.push({ source: k, target: j, dist: 2 * Math.abs(k - j) + 10 }); } } We will use the radius attribute to set the size of the nodes. The links will contain the distance between the nodes and the indexes of the source and target nodes. We will create variables to set the width and height of the figure: // Figure width and height var width = 650, height = 300; We can now create and configure the force layout. As we did in the previous section, we will set the charge strength to be proportional to the area of each node. This time, we will also set the distance between the links, using the linkDistance method of the layout: // Create and configure the force layout var force = d3.layout.force() .size([width, height]) .nodes(data.nodes) .links(data.links) .charge(function(d) { return -1.2 * d.radius * d.radius; }) .linkDistance(function(d) { return d.dist; }) .start(); We can create a canvas element now. Note that we should use the node method to get the canvas element, because the append and attr methods will both return a selection, which don't have the canvas API methods: // Create a canvas element and set its size. 
var canvas = d3.select('div#canvas-force').append('canvas') .attr('width', width + 'px') .attr('height', height + 'px') .node(); We get the rendering context. Each canvas element has its own rendering context. We will use the '2d' context, to draw two-dimensional figures. At the time of writing this, there are some browsers that support the webgl context; more details are available at https://developer.mozilla.org/en-US/docs/Web/WebGL/Getting_started_with_WebGL. Refer to the following '2d' context: // Get the canvas context. var context = canvas.getContext('2d'); We register an event listener for the force layout's tick event. As canvas doesn't remember previously created shapes, we need to clear the figure and redraw all the elements on each tick: force.on('tick', function() { // Clear the complete figure. context.clearRect(0, 0, width, height); // Draw the links ... // Draw the nodes ... }); The clearRect method cleans the figure under the specified rectangle. In this case, we clean the entire canvas. We can draw the links using the lineTo method. We iterate through the links, beginning a new path for each link, moving the cursor to the position of the source node, and by creating a line towards the target node. We draw the line with the stroke method: // Draw the links data.links.forEach(function(d) { // Draw a line from source to target. context.beginPath(); context.moveTo(d.source.x, d.source.y); context.lineTo(d.target.x, d.target.y); context.stroke(); }); We iterate through the nodes and draw each one. We use the arc method to represent each node with a black circle: // Draw the nodes data.nodes.forEach(function(d, i) { // Draws a complete arc for each node. context.beginPath(); context.arc(d.x, d.y, d.radius, 0, 2 * Math.PI, true); context.fill(); }); We obtain a constellation of disconnected network graphs, slowly gravitating towards the center of the figure. Using the force layout and canvas to create a network chart is shown in the following screenshot: We can think that to erase all the shapes and redraw each shape again and again could have a negative impact on the performance. In fact, sometimes it's faster to draw the figures using canvas, because this way the browser doesn't have to manage the DOM tree of the SVG elements (but it still have to redraw them if the SVG elements are changed). Summary In this article, we will learn how to use the force layout of D3 and the canvas element to create an animated network visualization with random data. Resources for Article: Further resources on this subject: Interacting with Data for Dashboards [Article] Kendo UI DataViz – Advance Charting [Article] Visualizing my Social Graph with d3.js [Article]

Importing Dynamic Data

Packt
20 Aug 2014
19 min read
In this article by Chad Adams, author of the book Learning Python Data Visualization, we will go over the finer points of pulling data from the Web using the Python language and its built-in libraries and cover parsing XML, JSON, and JSONP data. (For more resources related to this topic, see here.) Since we now have an understanding of how to work with the pygal library and building charts and graphics in general, this is the time to start looking at building an application using Python. In this article, we will take a look at the fundamentals of pulling data from the Web, parsing the data, and adding it to our code base and formatting the data into a useable format, and we will look at how to carry those fundamentals over to our Python code. We will also cover parsing XML and JSON data. Pulling data from the Web For many non-developers, it may seem like witchcraft that developers are magically able to pull data from an online resource and integrate that with an iPhone app, or a Windows Store app, or pull data to a cloud resource that is able to generate various versions of the data upon request. To be fair, they do have a general understanding; data is pulled from the Web and formatted to their app of choice. They just may not get the full background of how that process workflow happens. It's the same case with some developers as well—many developers mainly work on a technology that only works on a locked down environment, or generally, don't use the Internet for their applications. Again, they understand the logic behind it; somehow an RSS feed gets pulled into an application. In many languages, the same task is done in various ways, usually depending on which language is used. Let's take a look at a few examples using Packt's own news RSS feed, using an iOS app pulling in data via Objective-C. Now, if you're reading this and not familiar with Objective-C, that's OK, the important thing is that we have the inner XML contents of an XML file showing up in an iPhone application: #import "ViewController.h" @interfaceViewController () @property (weak, nonatomic) IBOutletUITextView *output; @end @implementationViewController - (void)viewDidLoad { [super viewDidLoad]; // Do any additional setup after loading the view, typically from a nib. NSURL *packtURL = [NSURLURLWithString:@"http://www.packtpub.com/rss.xml"]; NSURLRequest *request = [NSURLRequestrequestWithURL:packtURL]; NSURLConnection *connection = [[NSURLConnectionalloc] initWithRequest:requestdelegate:selfstartImmediately:YES]; [connection start]; } - (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data { NSString *downloadstring = [[NSStringalloc] initWithData:dataencoding:NSUTF8StringEncoding]; [self.outputsetText:downloadstring]; } - (void)didReceiveMemoryWarning { [superdidReceiveMemoryWarning]; // Dispose of any resources that can be recreated. } @end Here, we can see in iPhone Simulator that our XML output is pulled dynamically through HTTP from the Web to our iPhone simulator. This is what we'll want to get started with doing in Python: The XML refresher Extensible Markup Language (XML) is a data markup language that sets a series of rules and hierarchy to a data group, which is stored as a static file. Typically, servers update these XML files on the Web periodically to be reused as data sources. XML is really simple to pick up as it's similar to HTML. You can start with the document declaration in this case: <?xml version="1.0" encoding="utf-8"?> Next, a root node is set. 
A node is like an HTML tag (which is also called a node). You can tell it's a node by the brackets around the node's name. For example, here's a node named root: <root></root> Note that we close the node by creating a same-named node with a backslash. We can also add parameters to the node and assign a value, as shown in the following root node: <root parameter="value"></root> Data in XML is set through a hierarchy. To declare that hierarchy, we create another node and place that inside the parent node, as shown in the following code: <root parameter="value"> <subnode>Subnode's value</subnode> </root> In the preceding parent node, we created a subnode. Inside the subnode, we have an inner value called Subnode's value. Now, in programmatical terms, getting data from an XML data file is a process called parsing. With parsing, we specify where in the XML structure we would like to get a specific value; for instance, we can crawl the XML structure and get the inner contents like this: /root/subnode This is commonly referred to as XPath syntax, a cross-language way of going through an XML file. For more on XML and XPath, check out the full spec at: http://www.w3.org/TR/REC-xml/ and here http://www.w3.org/TR/xpath/ respectively. RSS and the ATOM Really simple syndication (RSS) is simply a variation of XML. RSS is a spec that defines specific nodes that are common for sending data. Typically, many blog feeds include an RSS option for users to pull down the latest information from those sites. Some of the nodes used in RSS include rss, channel, item, title, description, pubDate, link, and GUID. Looking at our iPhone example in this article from the Pulling data from the Web section, we can see what a typical RSS structure entails. RSS feeds are usually easy to spot since the spec requires the root node to be named rss for it to be a true RSS file. In some cases, some websites and services will use .rss rather than .xml; this is typically fine since most readers for RSS content will pull in the RSS data like an XML file, just like in the iPhone example. Another form of XML is called ATOM. ATOM was another spec similar to RSS, but developed much later than RSS. Because of this, ATOM has more features than RSS: XML namespacing, specified content formats (video, or audio-specific URLs), support for internationalization, and multilanguage support, just to name a few. ATOM does have a few different nodes compared to RSS; for instance, the root node to an RSS feed would be <rss>. ATOM's root starts with <feed>, so it's pretty easy to spot the difference. Another difference is that ATOM can also end in .atom or .xml. For more on the RSS and ATOM spec, check out the following sites: http://www.rssboard.org/rss-specification http://tools.ietf.org/html/rfc4287 Understanding HTTP All these samples from the RSS feed of the Packt Publishing website show one commonality that's used regardless of the technology coded in, and that is the method used to pull down these static files is through the Hypertext Transfer Protocol (HTTP). HTTP is the foundation of Internet communication. It's a protocol with two distinct types of requests: a request for data or GET and a push of data called a POST. Typically, when we download data using HTTP, we use the GET method of HTTP in order to pull down the data. The GET request will return a string or another data type if we mention a specific type. We can either use this value directly or save to a variable. 
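Since the root node is the quickest way to tell the two formats apart, here is a tiny standalone check. It is not part of the article's example (the feed string is made up), and it uses the ElementTree parser that we will meet again in a moment:

from xml.etree import ElementTree

sample = """<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>A made-up feed</title>
  </channel>
</rss>"""

root = ElementTree.fromstring(sample)

# RSS feeds use <rss> as the root node; ATOM feeds use <feed>
# (usually namespaced, so we only look at the end of the tag name).
if root.tag == 'rss':
    print('This is an RSS feed')
elif root.tag.endswith('feed'):
    print('This is an ATOM feed')
else:
    print('Unknown feed type: ' + root.tag)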
With a POST request, we are sending values to a service that handles any incoming values; say we created a new blog post title and needed to add it to a list of current titles, a common way of doing that is with URL parameters. A URL parameter is an existing URL with a key-value pair appended to it. The following is a mock example of a POST request with a URL parameter:

http://www.yourwebsite.com/blogtitles/?addtitle=Your%20New%20Title

If our service is set up correctly, it will scan the POST request for a key of addtitle and read the value, in this case: Your New Title. We may notice %20 in the title for our request. This is an escaped character that allows us to send a value containing spaces; in this case, %20 is a placeholder for a space in our value.

Using HTTP in Python

The RSS samples from the Packt Publishing website show a few commonalities we rely on when working with HTTP in our programs: one is that we always account for the possibility of something going wrong with the connection, and another is that we always close our request when finished. Here's an example of how the same RSS feed request is done in Python using a built-in library called urllib2:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2

try:
    # Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    # Read the file to a variable we named 'xml'.
    xml = response.read()
    # Print to the console.
    print(xml)
    # Finally, close our open network connection.
    response.close()
except:
    # If we have an issue, show a message and alert the user.
    print('Unable to connect to RSS...')

If we look at the following console output, we can see the XML contents, just as we saw in our iOS code example:

In the example, notice that we wrapped our HTTP request in a try-except block. For those coming from another language, except can be considered the same as a catch statement. This allows us to detect whether an error occurs (an incorrect URL or a lack of connectivity, for example) and programmatically set an alternate result for our Python script.

Parsing XML in Python with HTTP

With these examples, including our Python version of the script, we are still just returning a string, which by itself isn't of much use when we need to grab specific values out of it. In order to grab specific strings and values from XML pulled through HTTP, we need to parse it. Luckily, Python has a built-in object in its main library for this, called ElementTree, which is part of the XML library in Python. Let's incorporate ElementTree into our example and see how that works:

# -*- coding: utf-8 -*-
import urllib2
from xml.etree import ElementTree

try:
    # Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    tree = ElementTree.parse(response)
    root = tree.getroot()
    # Create a group of 'Element' objects from our XPath using findall.
    news_post_title = root.findall("channel//title")
    # Iterate over all the matched elements and print the inner text of each.
    for title in news_post_title:
        print title.text
    # Finally, close our open network connection.
    response.close()
except Exception as e:
    # If we have an issue, show a message and alert the user.
    print(e)

The following screenshot shows the results of our script:

As we can see, our output shows the title of each blog post. Notice how we used channel//title for our findall() method. This uses XPath, which gives us a shorthand way of describing how to walk through an XML structure. It works like this; let's use the http://www.packtpub.com feed as an example.
We have a root, followed by channel, then a series of title elements. The findall() method found each element and saved them as an Element type specific to the XML library ElementTree uses in Python, and saved each of those into an array. We can then use a for in loop to iterate each one and print the inner text using the text property specific to the Element type. You may notice in the recent example that I changed except with a bit of extra code and added Exception as e. This allows us to help debug issues and print them to a console or display a more in-depth feedback to the user. An Exception allows for generic alerts that the Python libraries have built-in warnings and errors to be printed back either through a console or handled with the code. They also have more specific exceptions we can use such as IOException, which is specific for working with file reading and writing. About JSON Now, another data type that's becoming more and more common when working with web data is JSON. JSON is an acronym for JavaScript Object Notation, and as the name implies, is indeed true JavaScript. It has become popular in recent years with the rise of mobile apps, and Rich Internet Applications (RIA). JSON is great for JavaScript developers; it's easier to work with when working in HTML content, compared to XML. Because of this, JSON is becoming a more common data type for web and mobile application development. Parsing JSON in Python with HTTP To parse JSON data in Python is a pretty similar process; however, in this case, our ElementTree library isn't needed, since that only works with XML. We need a library designed to parse JSON using Python. Luckily, the Python library creators already have a library for us, simply called json. Let's build an example similar to our XML code using the json import; of course, we need to use a different data source since we won't be working in XML. One thing we may note is that there aren't many public JSON feeds to use, many of which require using a code that gives a developer permission to generate a JSON feed through a developer API, such as Twitter's JSON API. To avoid this, we will use a sample URL from Google's Books API, which will show demo data of Pride and Prejudice, Jane Austen. We can view the JSON (or download the file), by typing in the following URL: https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ Notice the API uses HTTPS, many JSON APIs are moving to secure methods of transmitting data, so be sure to include the secure in HTTP, with HTTPS. Let's take a look at the JSON output: { "kind": "books#volume", "id": "s1gVAAAAYAAJ", "etag": "yMBMZ85ENrc", "selfLink": "https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ", "volumeInfo": { "title": "Pride and Prejudice", "authors": [ "Jane Austen" ], "publisher": "C. Scribner's Sons", "publishedDate": "1918", "description": "Austen's most celebrated novel tells the story of Elizabeth Bennet, a bright, lively young woman with four sisters, and a mother determined to marry them to wealthy men. At a party near the Bennets' home in the English countryside, Elizabeth meets the wealthy, proud Fitzwilliam Darcy. Elizabeth initially finds Darcy haughty and intolerable, but circumstances continue to unite the pair. Mr. Darcy finds himself captivated by Elizabeth's wit and candor, while her reservations about his character slowly vanish. 
The story is as much a social critique as it is a love story, and the prose crackles with Austen's wry wit.", "readingModes": { "text": true, "image": true }, "pageCount": 401, "printedPageCount": 448, "dimensions": { "height": "18.00 cm" }, "printType": "BOOK", "averageRating": 4.0, "ratingsCount": 433, "contentVersion": "1.1.5.0.full.3", "imageLinks": { "smallThumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec =frontcover&img=1&zoom=5&edge=curl&imgtk=AFLRE73F8btNqKpVjGX6q7V3XS77 QA2PftQUxcEbU3T3njKNxezDql_KgVkofGxCPD3zG1yq39u0XI8s4wjrqFahrWQ- 5Epbwfzfkoahl12bMQih5szbaOw&source=gbs_api", "thumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=1&edge=curl&imgtk=AFLRE70tVS8zpcFltWh_ 7K_5Nh8BYugm2RgBSLg4vr9tKRaZAYoAs64RK9aqfLRECSJq7ATs_j38JRI3D4P48-2g_ k4-EY8CRNVReZguZFMk1zaXlzhMNCw&source=gbs_api", "small": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec =frontcover&img=1&zoom=2&edge=curl&imgtk=AFLRE71qcidjIs37x0jN2dGPstn 6u2pgeXGWZpS1ajrGgkGCbed356114HPD5DNxcR5XfJtvU5DKy5odwGgkrwYl9gC9fo3y- GM74ZIR2Dc-BqxoDuUANHg&source=gbs_api", "medium": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=3&edge=curl&imgtk=AFLRE73hIRCiGRbfTb0uNIIXKW 4vjrqAnDBSks_ne7_wHx3STluyMa0fsPVptBRW4yNxNKOJWjA4Od5GIbEKytZAR3Nmw_ XTmaqjA9CazeaRofqFskVjZP0&source=gbs_api", "large": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=4&edge=curl&imgtk=AFLRE73mlnrDv-rFsL- n2AEKcOODZmtHDHH0QN56oG5wZsy9XdUgXNnJ_SmZ0sHGOxUv4sWK6GnMRjQm2eEwnxIV4dcF9eBhghMcsx -S2DdZoqgopJHk6Ts&source=gbs_api", "extraLarge": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=6&edge=curl&imgtk=AFLRE73KIXHChszn TbrXnXDGVs3SHtYpl8tGncDPX_7GH0gd7sq7SA03aoBR0mDC4-euzb4UCIDiDNLYZUBJwMJxVX_ cKG5OAraACPLa2QLDcfVkc1pcbC0&source=gbs_api" }, "language": "en", "previewLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api", "infoLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api", "canonicalVolumeLink": "http://books.google.com/books/about/ Pride_and_Prejudice.html?hl=&id=s1gVAAAAYAAJ" }, "layerInfo": { "layers": [ { "layerId": "geo", "volumeAnnotationsVersion": "6" } ] }, "saleInfo": { "country": "US", "saleability": "FREE", "isEbook": true, "buyLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&buy=&source=gbs_api" }, "accessInfo": { "country": "US", "viewability": "ALL_PAGES", "embeddable": true, "publicDomain": true, "textToSpeechPermission": "ALLOWED", "epub": { "isAvailable": true, "downloadLink": "http://books.google.com/books/download /Pride_and_Prejudice.epub?id=s1gVAAAAYAAJ&hl=&output=epub &source=gbs_api" }, "pdf": { "isAvailable": true, "downloadLink": "http://books.google.com/books/download/Pride_and_Prejudice.pdf ?id=s1gVAAAAYAAJ&hl=&output=pdf&sig=ACfU3U3dQw5JDWdbVgk2VRHyDjVMT4oIaA &source=gbs_api" }, "webReaderLink": "http://books.google.com/books/reader ?id=s1gVAAAAYAAJ&hl=&printsec=frontcover& output=reader&source=gbs_api", "accessViewStatus": "FULL_PUBLIC_DOMAIN", "quoteSharingAllowed": false } } One downside to JSON is that it can be hard to read complex data. So, if we run across a complex JSON feed, we can visualize it using a JSON Visualizer. Visual Studio includes one with all its editions, and a web search will also show multiple online sites where you can paste JSON and an easy-to-understand data structure will be displayed. 
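If we just need a quick look from the command line, Python itself can also pretty-print the same feed. The following is a small sketch (not from the original article) that reuses the urllib2 approach shown earlier and re-serializes the response with json.dumps, whose indent parameter adds the nesting that makes the structure readable:

# A quick, local alternative to a web-based JSON visualizer.
# -*- coding: utf-8 -*-
import json
import urllib2

response = urllib2.urlopen('https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ')
data = json.loads(response.read())
response.close()

# indent=2 re-serializes the object with two-space indentation per level,
# and sort_keys orders the keys so related fields are easy to find.
print(json.dumps(data, indent=2, sort_keys=True))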
Here's an example using http://jsonviewer.stack.hu/ loading our example JSON URL:

Next, let's reuse some of our existing Python code, using the urllib2 library to request our JSON feed, and then parse it with Python's json library. Let's parse the volumeInfo node of the book, starting from the JSON root node, which is followed by volumeInfo as a subnode. Here's our example from the XML section, reworked using JSON to parse all child elements:

# -*- coding: utf-8 -*-
import urllib2
import json

try:
    # Set a URL variable.
    url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ'
    # Open the file via HTTP.
    response = urllib2.urlopen(url)
    # Read the request as one string.
    bookdata = response.read()
    # Convert the string to a JSON object in Python.
    data = json.loads(bookdata)
    for r in data['volumeInfo']:
        print r
    # Close our response.
    response.close()
except:
    # If we have an issue, show a message and alert the user.
    print('Unable to connect to JSON API...')

Here's our output. It should match the child nodes of volumeInfo, which it does in the output screen, as shown in the following screenshot:

Well done! Now, let's grab the value for title. Look at the following example and notice we have two sets of brackets: one for volumeInfo and another for title. This is similar to navigating our XML hierarchy:

# -*- coding: utf-8 -*-
import urllib2
import json

try:
    # Set a URL variable.
    url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ'
    # Open the file via HTTP.
    response = urllib2.urlopen(url)
    # Read the request as one string.
    bookdata = response.read()
    # Convert the string to a JSON object in Python.
    data = json.loads(bookdata)
    print data['volumeInfo']['title']
    # Close our response.
    response.close()
except Exception as e:
    # If we have an issue, show a message and alert the user.
    # 'Unable to connect to JSON API...'
    print(e)

The following screenshot shows the results of our script:

As you can see in the preceding screenshot, we return one line with Pride and Prejudice parsed from our JSON data.

About JSONP

JSONP, or JSON with Padding, is actually JSON, but it is set up differently compared to traditional JSON files. JSONP is a workaround for cross-domain scripting restrictions in the browser. Some web services serve up JSONP rather than pure JSON files. The issue is that JSONP isn't compatible with many Python-based JSON parsers, including the one covered here, so you will want to avoid JSONP-style JSON whenever possible.

So how can we spot JSONP files? Do they have a different extension? No, it's simply a wrapper for JSON data; here's an example without JSONP:

/*
 * Regular JSON
 */
{
    authorname: 'Chad Adams'
}

The same example with JSONP:

/*
 * JSONP
 */
callback({
    authorname: 'Chad Adams'
});

Notice we wrapped our JSON data with a function wrapper, or a callback. Typically, this is what breaks our parsers and is a giveaway that this is a JSONP-formatted JSON file. In JavaScript, we can even call it in code like this:

/*
 * Using JSONP in JavaScript
 */
callback = function (data) {
    alert(data.authorname);
};

JSONP with Python

We can get around a JSONP data source if we need to; it just requires a bit of work. We can use the str.replace() method in Python to strip out the callback before running the string through our JSON parser. If we were parsing our example JSONP file in our JSON parser example, the string handling would look something like this:

# Convert the string to a JSON object in Python.
data = json.loads(bookdata.replace('callback(', '').replace(')', ''))
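The chained replace() calls work for this feed, but they would also strip any parentheses that happen to appear inside the JSON values themselves. A slightly more defensive variant (a sketch, not from the original article; the helper name is hypothetical) is to slice the payload between the first opening and the last closing parenthesis:

def strip_jsonp(payload):
    # Return the JSON body of a JSONP response by slicing between the
    # first opening and the last closing parenthesis.
    start = payload.find('(')
    end = payload.rfind(')')
    if start == -1 or end == -1:
        # No padding found; assume the payload is already plain JSON.
        return payload
    return payload[start + 1:end]

# Usage with the earlier example:
# data = json.loads(strip_jsonp(bookdata))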
Summary

In this article, we covered HTTP concepts and methodologies for pulling strings and data from the Web. We learned how to do that in Python using the built-in urllib2 library, and we parsed XML and JSON data. We also discussed the differences between JSON and JSONP, and how to work around JSONP if needed.

Resources for Article:

Further resources on this subject:

Introspecting Maya, Python, and PyMEL [Article]
Exact Inference Using Graphical Models [Article]
Automating Your System Administration and Deployment Tasks Over SSH [Article]
Amazon DynamoDB - Modelling relationships, Error handling

Packt
20 Aug 2014
7 min read
In this article by Tanmay Deshpande, author of Mastering DynamoDB, we are going to revise our concepts about DynamoDB and try to discover more about its features and implementation. (For more resources related to this topic, see here.)

Amazon DynamoDB is a fully managed, cloud-hosted NoSQL database. It provides fast and predictable performance with the ability to scale seamlessly. It allows you to store and retrieve any amount of data, serving any level of network traffic, without any operational burden. DynamoDB gives you numerous other advantages, such as consistent and predictable performance, flexible data modeling, and durability. With just a few clicks on the Amazon Web Services console, you can create your own DynamoDB table and scale the provisioned throughput up or down without taking your application down even for a millisecond. DynamoDB uses solid state disks (SSDs) to store the data, which helps ensure the durability of the work you are doing. It also automatically replicates the data across other AWS Availability Zones, which provides built-in high availability and reliability.

Before we discuss DynamoDB in detail, let's try to understand what NoSQL databases are and when to choose DynamoDB over an RDBMS. With the rise in data volume, variety, and velocity, RDBMSes were neither designed to cope with the scale and flexibility challenges developers face when building modern applications, nor were they able to take advantage of cheap commodity hardware. We also need to provide a schema before we start adding data, which restricts developers from keeping their applications flexible. On the other hand, NoSQL databases are fast, provide flexible schema operations, and make effective use of cheap storage. Considering all these things, NoSQL is quickly becoming popular amongst the developer community. However, one has to be very cautious about when to go for NoSQL and when to stick with an RDBMS. Sticking with a relational database makes sense when you know the schema is more or less static, strong consistency is a must, and the data is not going to be that big in volume. But when you want to build an application that is Internet scalable, where the schema is likely to evolve over time, the storage is going to be really big, and the operations involved can be eventually consistent, then NoSQL is the way to go.

There are various types of NoSQL databases. The following is a list of NoSQL database types and popular examples:

Document Store: MongoDB, CouchDB, MarkLogic, and so on
Column Store: HBase, Cassandra, and so on
Key Value Store: DynamoDB, Azure, Redis, and so on
Graph Databases: Neo4J, DEX, and so on

Most of these NoSQL solutions are open source, except a few like DynamoDB and Azure, which are available as services over the Internet. DynamoDB, being a key-value store, indexes data only on primary keys, and one needs to go through the primary key to access a certain attribute. Let's start learning more about DynamoDB by having a look at its history.

DynamoDB's History

Amazon's e-commerce platform had a huge set of decoupled services, developed and managed individually, and each service had an API to be used and consumed by the others. Earlier, each service had direct database access, which was a major bottleneck. In terms of scalability, Amazon's requirements were more than any third-party vendor could provide at that time. DynamoDB was built to address Amazon's high availability, extreme scalability, and durability needs.
Earlier, Amazon used to store its production data in relational databases, and services were provided for all required operations. Later, they realized that most services access data only through the primary key and don't need complex queries to fetch the required data; besides, maintaining these RDBMS systems required high-end hardware and skilled personnel. So, to overcome these issues, Amazon's engineering team built a NoSQL database that addresses all of the aforementioned problems. In 2007, Amazon released a research paper on Dynamo, which combined the best ideas from the database and key-value store worlds and was the inspiration for many open source projects at the time; Cassandra, Voldemort, and Riak were among them. You can find the paper at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Even though Dynamo had great features that took care of all engineering needs, it was not widely accepted at that time within Amazon itself, as it was not a fully managed service. When Amazon released S3 and SimpleDB, engineering teams were quite excited to adopt those instead of Dynamo, as Dynamo was a bit expensive at that time due to SSDs. Finally, after rounds of improvement, Amazon released Dynamo as a cloud-based service, and since then it has become one of the most widely used NoSQL databases. Before being released to the public cloud in 2012, DynamoDB had been the core storage service for Amazon's e-commerce platform, starting with the shopping cart and session management services. Any downtime or degradation in performance had a major impact on Amazon's business, and any financial impact was strictly unacceptable; DynamoDB proved itself to be the best choice in the end. Now let's try to understand DynamoDB in more detail.

What is DynamoDB?

DynamoDB is a fully managed, Internet scalable, easily administered, and cost-effective NoSQL database. It is a part of the database-as-a-service offerings of Amazon Web Services. The above diagram shows how Amazon offers its various cloud services and where DynamoDB is placed. AWS RDS is a relational database as a service over the Internet from Amazon, while SimpleDB and DynamoDB are NoSQL databases as services. Both SimpleDB and DynamoDB are fully managed, non-relational services. DynamoDB is built for fast, seamless scalability and high performance. It runs on SSDs to provide faster responses and has no limits on request capacity or storage. It automatically partitions your data throughout the cluster to meet your expectations, while SimpleDB has a storage limit of 10 GB and can only take a limited number of requests per second. Also, in SimpleDB we have to manage our own partitions. So, depending on your needs, you have to choose the correct solution.

To use DynamoDB, the first and foremost requirement is having an AWS account. Through the easy-to-use AWS Management Console, you can directly create new tables, provide the necessary information, and start loading data into the tables in a few minutes.

Modelling Relationships

Like any other database, modeling relationships is quite interesting, even though DynamoDB is a NoSQL database. Most of the time, people get confused about how to model the relationships between various tables; in this section, we will make an effort to simplify this problem. To understand the relationships better, let's use our example of a book store, where we have entities such as Book, Author, Publisher, and so on.
One to One

In this type of relationship, one entity record of a table is related to only one entity record of another table. In our book store application, we have the BookInfo and BookDetails tables. The BookInfo table holds short information about the book, which can be used to display book information on a web page, whereas the BookDetails table would be used when someone explicitly needs to see all the details of a book. This design helps keep our system healthy: even if there are high requests on one table, the other table will always be up and running to fulfil what it is supposed to do. The following diagram shows what the table structure would look like.

One to many

In this type of relationship, one record from an entity is related to more than one record in another entity. In the book store application, we can have a PublisherBook table, which would keep information about the relationship between publishers and books. Here, we can have Publisher Id as the hash key and Book Id as the range key. The following diagram shows what the table structure would look like.

Many to many

A many-to-many relationship means that many records from an entity are related to many records from another entity. In the case of the book store application, we can say that a book can be authored by multiple authors and an author can write multiple books. In this case, we should use two tables, each with both hash and range keys.
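To make the one-to-many example a little more concrete, the following is a minimal sketch of how the PublisherBook table could be created with the boto3 Python SDK. This is an assumption for illustration only; the article itself defines the tables through diagrams and the AWS Management Console, and the attribute names, region, and throughput values here are hypothetical.

# A minimal sketch (not from the original article) showing how the
# one-to-many PublisherBook table could be created with boto3.
import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')

dynamodb.create_table(
    TableName='PublisherBook',
    # Publisher Id is the hash (partition) key, Book Id is the range (sort) key.
    KeySchema=[
        {'AttributeName': 'PublisherId', 'KeyType': 'HASH'},
        {'AttributeName': 'BookId', 'KeyType': 'RANGE'},
    ],
    AttributeDefinitions=[
        {'AttributeName': 'PublisherId', 'AttributeType': 'S'},
        {'AttributeName': 'BookId', 'AttributeType': 'S'},
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
)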
Sharding in Action

Packt
21 Jul 2014
9 min read
(For more resources related to this topic, see here.) Preparing the environment Before jumping into configuring and setting up the cluster network, we have to check some parameters and prepare the environment. To enable sharding for a database or collection, we have to configure some configuration servers that hold the cluster network metadata and shards information. Other parts of the cluster network use these configuration servers to get information about other shards. In production, it's recommended to have exactly three configuration servers in different machines. The reason for establishing each shard on a different server is to improve the safety of data and nodes. If one of the machines crashes, the whole cluster won't be unavailable. For the testing and developing environment, you can host all the configuration servers on a single server.Besides, we have two more parts for our cluster network, shards and mongos, or query routers. Query routers are the interface for all clients. All read/write requests are routed to this module, and the query router or mongos instance, using configuration servers, route the request to the corresponding shard. The following diagram shows the cluster network, modules, and the relation between them: It's important that all modules and parts have network access and are able to connect to each other. If you have any firewall, you should configure it correctly and give proper access to all cluster modules. Each configuration server has an address that routes to the target machine. We have exactly three configuration servers in our example, and the following list shows the hostnames: cfg1.sharding.com cfg2.sharding.com cfg3.sharding.com In our example, because we are going to set up a demo of sharding feature, we deploy all configuration servers on a single machine with different ports. This means all configuration servers addresses point to the same server, but we use different ports to establish the configuration server. For production use, all things will be the same, except you need to host the configuration servers on separate machines. In the next section, we will implement all parts and finally connect all of them together to start the sharding server and run the cluster network. Implementing configuration servers Now it's time to start the first part of our sharding. Establishing a configuration server is as easy as running a mongod instance using the --configsvr parameter. The following scheme shows the structure of the command: mongod --configsvr --dbpath <path> --port <port> If you don't pass the dbpath or port parameters, the configuration server uses /data/configdb as the path to store data and port 27019 to execute the instance. However, you can override the default values using the preceding command. If this is the first time that you have run the configuration server, you might be faced with some issues due to the existence of dbpath. Before running the configuration server, make sure that you have created the path; otherwise, you will see an error as shown in the following screenshot: You can simply create the directory using the mkdir command as shown in the following line of command: mkdir /data/configdb Also, make sure that you are executing the instance with sufficient permission level; otherwise, you will get an error as shown in the following screenshot: The problem is that the mongod instance can't create the lock file because of the lack of permission. 
To address this issue, you should simply execute the command with root or administrator permissions. After executing the command with the proper permission level, you should see a result like the following screenshot:

As you can see, we now have a configuration server for the hostname cfg1.sharding.com with port 27019 and with dbpath set to /data/configdb. There is also a web console to watch and control the configuration server, running on port 28019. By pointing the web browser to the address http://localhost:28019/, you can see the console. The following screenshot shows a part of this web console:

Now we have the first configuration server up and running. With the same method, you can launch the other instances, that is, using /data/configdb2 with port 27020 for the second configuration server, and /data/configdb3 with port 27021 for the third configuration server.

Configuring mongos instance

After configuring the configuration servers, we should bind them to the core module of clustering. The mongos instance is responsible for binding all modules and parts together to make a complete sharding core. This module is simple and lightweight, and we can host it on the same machine that hosts other modules, such as the configuration servers. It doesn't need a separate directory to store data. The mongos process uses port 27017 by default, but you can change the port using the configuration parameters. To define the configuration servers, you can use a configuration file or command-line parameters. Create a new file named mongos.conf in the /etc/ directory using your text editor and add the following configuration setting:

configdb = cfg1.sharding.com:27019,cfg2.sharding.com:27020,cfg3.sharding.com:27021

To execute and run the mongos instance, you can simply use the following command:

mongos -f /etc/mongos.conf

After executing the command, you should see an output like the following screenshot:

Please note that if you have a configuration server that has already been used in a different sharding network, you can't reuse the existing data directory. You should create a new, empty data directory for the configuration server.

Currently, we have mongos and all the configuration servers working together pretty well. In the next part, we will add shards to the mongos instance to complete the whole network.

Managing mongos instance

Now it's time to add shards and split the whole dataset into smaller pieces. For production use, each shard should be a replica set network, but for a development and testing environment, you can simply add standalone mongod instances to the cluster. To control and manage the mongos instance, we can simply use the mongo shell to connect to the mongos and execute commands. To connect to the mongos, you use the following command:

mongo --host <mongos hostname> --port <mongos port>

For instance, our mongos address is mongos1.sharding.com and the port is 27017. This is depicted in the following screenshot:

After connecting to the mongos instance, we have a command environment, and we can use it to add, remove, or modify shards, or even get the status of the entire sharding network. Using the following command, you can get the status of the sharding network:

sh.status()

The following screenshot illustrates the output of this command:

Because we haven't added any shards yet, you will see an error that says there are no shards in the sharding network.
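If we prefer to script these checks instead of typing them into the mongo shell, the same administrative commands are available through a driver. The following is a minimal sketch using the pymongo library; this is an assumption for illustration, as the article itself only uses the mongo shell, and the hostname simply mirrors the mongos instance above.

from pymongo import MongoClient

# Connect to the mongos query router (the hostname from the example above).
client = MongoClient('mongos1.sharding.com', 27017)

# listShards is the admin command behind the shell helpers; with no shards
# added yet, it should report an empty list of shards.
print(client.admin.command('listShards'))

# Shards can be added programmatically as well, for example:
# client.admin.command('addShard', 'rs1/mongod1.replicaset.com:27017')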
Using the sh.help() command, you can see all commands as shown in the following screenshot: Using the sh.addShard() function, you can add shards to the network. Adding shards to mongos After connecting to the mongos, you can add shards to sharding. Basically, you can add two types of endpoints to the mongos as a shard; replica set or a standalone mongod instance. MongoDB has a sh namespace and a function called addShard(), which is used to add a new shard to an existing sharding network. Here is the example of a command to add a new shard. This is shown in the following screenshot: To add a replica set to mongos you should follow this scheme: setname/server:port For instance, if you have a replica set with the name of rs1, hostname mongod1.replicaset.com, and port number 27017, the command will be as follows: sh.addShard("rs1/mongod1.replicaset.com:27017") Using the same function, we can add standalone mongod instances. So, if we have a mongod instance with the hostname mongod1.sharding.com listening on port 27017, the command will be as follows: sh.addShard("mongod1.sharding.com:27017") You can use a secondary or primary hostname to add the replica set as a shard to the sharding network. MongoDB will detect the primary and use the primary node to interact with sharding. Now, we add the replica set network using the following command: sh.addShard("rs1/mongod1.replicaset.com:27017") If everything goes well, you won't see any output from the console, which means the adding process was successful. This is shown in the following screenshot: To see the status of sharding, you can use the sh.status() command. This is demonstrated in the following screenshot: Next, we will establish another standalone mongod instance and add it to sharding. The port number of mongod is 27016 and the hostname is mongod1.sharding.com. The following screenshot shows the output after starting the new mongod instance: Using the same approach, we will add the preceding node to sharding. This is shown in the following screenshot: It's time to see the sharding status using the sh.status() command: As you can see in the preceding screenshot, now we have two shards. The first one is a replica set with the name rs1, and the second shard is a standalone mongod instance on port 27016. If you create a new database on each shard, MongoDB syncs this new database with the mongos instance. Using the show dbs command, you can see all databases from all shards as shown in the following screenshot: The configuration database is an internal database that MongoDB uses to configure and manage the sharding network. Currently, we have all sharding modules working together. The last and final step is to enable sharding for a database and collection. Summary In this article, we prepared the environment for sharding of a database. We also learned about the implementation of a configuration server. Next, after configuring the configuration servers, we saw how to bind them to the core module of clustering. Resources for Article: Further resources on this subject: MongoDB data modeling [article] Ruby with MongoDB for Web Development [article] Dart Server with Dartling and MongoDB [article]
Securing the WAL Stream

Packt
15 Jul 2014
6 min read
(For more resources related to this topic, see here.) The primary mechanism that PostgreSQL uses to provide a data durability guarantee is through its Write Ahead Log (WAL). All transactional data is written to this location before ever being committed to database files. Once WAL files are no longer necessary for crash recovery, PostgreSQL will either delete or archive them. For the purposes of a highly available server, we recommend that you keep these important files as long as possible. There are several reasons for this; they are as follows: Archived WAL files can be used for Point In Time Recovery (PITR) If you are using streaming replication, interrupted streams can be re-established by applying WAL files until the replica has caught up WAL files can be reused to service multiple server copies In order to gain these benefits, we need to enable PostgreSQL WAL archiving and save these files until we no longer need them. This section will address our recommendations for long term storage of WAL files. Getting ready In order to properly archive WAL files, we recommend that you provision a server dedicated to backups or file storage. Depending on the transaction volume, an active PostgreSQL database might produce thousands of these on a daily basis. At 16 MB apiece, this is not an idle concern. For instance, for a 1 TB database, we recommend at least 3 TB of storage space. In addition, we will be using rsync as a daemon on this archive server. To install this on a Debian-based server, execute the following command as a root-level user: sudo apt-get install rsync Red-Hat-based systems will need this command instead: sudo yum install rsync xinetd How to do it... Our archive server has a 3 TB mount at the /db directory and is named arc_server on our network. The PostgreSQL source server resides at 192.168.56.10. Follow these steps for long-term storage of important WAL files on an archive server Enable rsync to run as a daemon on the archive server. On Debian based systems, edit the /etc/default/rsync file and change the RSYNC_ENABLE variable to true. On Red-Hat-based systems, edit the /etc/xinet.d/rsync file and change the disable parameter to no. Create a directory to store archived WAL files as the postgres user with these commands: sudo mkdir /db/pg_archived sudo chown postgres:postgres /db/pg_archived Create a file named /etc/rsyncd.conf and fill it with the following contents: [wal_store] path = /db/pg_archived comment = DB WAL Archives uid = postgres gid = postgres read only = false hosts allow = 192.168.56.10 hosts deny = * Start the rsync daemon. Debian-based systems should execute the following command: sudo service rsync start Red-Hat-based systems can start rsync with this command instead: sudo service xinetd start Change the archive_mode and archive_command parameters in postgresql.conf to read the following: archive_mode = on archive_command = 'rsync -aq %p arc_server::wal_store/%f' Restart the PostgreSQL server with a command similar to this: pg_ctl -D $PGDATA restart How it works The rsync utility is normally used to transfer files between two servers. However, we can take advantage of using it as a daemon to avoid connection overhead imposed by using SSH as an rsync protocol. Our first step is to ensure that the service is not disabled in some manner, which would make the rest of this guide moot. Next, we need a place to store archived WAL files on the archive server. 
Assuming that we have 3 TB of space in the /db directory, we simply claim /db/pg_archived as the desired storage location. There should be enough space to use /db for backups as well, but we won't discuss that here. Next, we create a file named /etc/rsyncd.conf, which will configure how rsync operates as a daemon. Here, we name the /db/pg_archived directory wal_store so that we can address the path by its name when sending files. We give it a human-readable name and ensure that files are owned by the postgres user, as this user also controls most of the PostgreSQL-related services. The next, and possibly the most important step, is to block all hosts but the primary PostgreSQL server from writing to this location. We set hosts deny to *, which blocks every server. Then, we set hosts allow to the primary database server's IP address so that only it has access. If everything goes well, we can start the rsync (or xinetd on Red Hat systems) service and we can see that in the following screenshot: Next, we enable archive_mode by setting it to on. With archive mode enabled, we can specify a command that will execute when PostgreSQL no longer needs a WAL file for crash recovery. In this case, we invoke the rsync command with the -a parameter to preserve elements such as file ownership, timestamps, and so on. In addition, we specify the -q setting to suppress output, as PostgreSQL only checks the command exit status to determine its success. In the archive_command setting, the %p value represents the full path to the WAL file, and %f resolves to the filename. In this context, we're sending the WAL file to the archive server at the wal_store module we defined in rsyncd.conf. Once we restart PostgreSQL, it will start storing all the old WAL files by sending them to the archive server. In case any rsync command fails because the archive server is unreachable, PostgreSQL will keep trying to send it until it is successful. If the archive server is unreachable for too long, we suggest that you change the archive_command setting to store files elsewhere. This prevents accidentally overfilling the PostgreSQL server storage. There's more... As we will likely want to use the WAL files on other servers, we suggest that you make a list of all the servers that could need WAL files. Then, modify the rsyncd.conf file on the archive server and add this section: [wal_fetch] path = /db/pg_archived comment = DB WAL Archive Retrieval uid = postgres gid = postgres read only = true hosts allow = host1, host2, host3 hosts deny = * Now, we can fetch WAL files from any of the hosts in hosts allow. As these are dedicated PostgreSQL replicas, recovery servers, or other defined roles, this makes the archive server a central location for all our WAL needs. See also We suggest that you read more about the archive_mode and archive_command settings on the PostgreSQL site. We've included a link here: http://www.postgresql.org/docs/9.3/static/runtime-config-wal.html The rsyncd.conf file should also have its own manual page. Read it with this command to learn more about the available settings: man rsyncd.conf Summary In this article, we've successfully learned how to secure the WAL stream by following the given steps. Resources for Article: Further resources on this subject: PostgreSQL 9: Reliable Controller and Disk Setup [article] Backup in PostgreSQL 9 [article] Recovery in PostgreSQL 9 [article]
Making the Most of Your Hadoop Data Lake, Part 2: Optimized File Formats

Kristen Hardwick
30 Jun 2014
5 min read
One major factor of making the conversion to Hadoop is the concept of the Data Lake. That idea suggests that users keep as much data as possible in HDFS in order to prepare for future use cases and as-yet-unknown data integration points. As your data grows, it is important to make sure that the data is being stored in a way that prolongs that behavior. Data compression is not the only technique that can be used to speed up job performance and improve cluster organization. In addition to the Text and Sequence File options that are typically used by default, Hadoop offers a few more optimized file formats that are specifically designed to improve the process of interacting with the data. In the second part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address another important factor in improving manageability—optimized file formats. Using a smarter file for your data: RCFile RCFile stands for Record Columnar File, and it serves as an ideal format for storing relational data that will be accessed through Hive. This format offers performance improvements by storing the data in an optimized way. First, the data is partitioned horizontally, into groups of rows. Then each row group is partitioned vertically, into collections of columns. Finally, the data in each column collection is compressed and stored in column-row format, as if it were a column-oriented database. The first benefit of this altered storage mechanism is apparent at the row level. All HDFS blocks used to form RCFiles will be made up of the horizontally partitioned collections of rows. This is significant because it ensures that no row of data will be split across multiple blocks, and will therefore always be on the same machine. This is not the case for traditional HDFS file formats, which typically use data size to split the file. This optimized data storage will reduce the amount of network bandwidth that is required to serve queries. The second benefit comes from optimizations at the column level, in the form of disk I/O reduction. Since the columns are stored vertically within each row group, the system will be able to seek directly to the required column position in the file, rather than being required to scan across all columns and filter out data that is not necessary. This is extremely useful in queries that only require access to a small subset of the existing columns. RCFiles can be used natively in both Hive and Pig with very little configuration. In Hive CREATE TABLE … STORED AS RCFILE; ALTER TABLE … SET FILEFORMAT RCFILE; SET hive.default.fileformat=RCFile; In Pig: register …/piggybank.jar; a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.hiverc.HiveRCInputFormat(…); The Pig jar file referenced here is just one option for enabling the RCFile. At the time of writing, there was also an RCFilePigStorageclass available through Twitter’s Elephant Bird open source library. Hortonworks’ ORCFile and Cloudera’s Parquet formats RCFiles provide optimization for relational files primarily by implementing modifications at the storage level. New innovations have provided improvements on the RCFile format, namely the ORCFile format from Hortonworks and the Parquet format from Cloudera. When storing data using the Optimized Row Columnar file or Parquet formats, several pieces of metadata are automatically written at the column level within each row group; for example, minimum and maximum values for numeric data types and dictionary-style metadata for text data types. 
The specific metadata is also configurable. One such use case would be for a user to configure the dataset to be sorted on a given set of columns for efficient access. This extra metadata allows queries to take advantage of an improvement over the original RCFiles: predicate pushdown. That technique allows Hive to evaluate the where clause during the record-gathering process, instead of filtering data after all records have been collected. The predicate pushdown technique will evaluate the conditions of the query against the metadata associated with a particular row group, allowing it to skip over entire file blocks if possible, or to seek directly to the correct row. One major benefit of this process is that the more complex a particular where clause is, the more potential there is for row groups and columns to be filtered out as irrelevant to the final result.

Cloudera's Parquet format is typically used in conjunction with Impala, but just like RCFiles, ORCFiles can be incorporated into both Hive and Pig. HCatalog can be used as the primary method to read and write ORCFiles using Pig. The commands for Hive are provided below.

In Hive:

CREATE TABLE … STORED AS ORC;
ALTER TABLE … SET FILEFORMAT ORC;
SET hive.default.fileformat=Orc;

Conclusion

This post has detailed the alternatives to the default file formats that can be used in Hadoop in order to optimize data access and storage. This information, combined with the compression techniques described in the previous post (part 1), provides some guidelines that can be used to ensure that users make the most of the Hadoop Data Lake.

About the author

Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.
Making the Most of Your Hadoop Data Lake, Part 1: Data Compression

Kristen Hardwick
30 Jun 2014
6 min read
In the world of big data, the Data Lake concept reigns supreme. Hadoop users are encouraged to keep all data in order to prepare for future use cases and as-yet-unknown data integration points. This concept is part of what makes Hadoop and HDFS so appealing, so it is important to make sure that the data is being stored in a way that prolongs that behavior. In the first part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address one important factor in improving manageability—data compression. Data compression is an area that is often overlooked in the context of Hadoop. In many cluster environments, compression is disabled by default, putting the burden on the user. In this post, we will discuss the tradeoffs involved in deciding how to take advantage of compression techniques and the advantages and disadvantages of specific compression codec options with respect to Hadoop. To compress or not to compress Whenever data is converted to something other than its raw data format, that naturally implies some overhead involved in completing the conversion process. When data compression is being discussed, it is important to take that overhead into account with respect to the benefits of reducing the data footprint. One obvious benefit is that compressed data will reduce the amount of disk space that is required for storage of a particular dataset. In the big data environment, this benefit is especially significant—either the Hadoop cluster will be able to keep data for a larger time range, or storing data for the same time range will require fewer nodes, or the disk usage ratios will remain lower for longer. In addition, the smaller file sizes will mean lower data transfer times—either internally for MapReduce jobs or when performing exports of data results. The cost of these benefits, however, is that the data must be decompressed at every point where the data needs to be read, and compressed before being inserted into HDFS. With respect to MapReduce jobs, this processing overhead at both the map phase and the reduce phase will increase the CPU processing time. Fortunately, by making informed choices about the specific compression codecs used at any given phase in the data transformation process, the cluster administrator or user can ensure that the advantages of compression outweigh the disadvantages. Choosing the right codec for each phase Hadoop provides the user with some flexibility on which compression codec is used at each step of the data transformation process. It is important to realize that certain codecs are optimal for some stages, and non-optimal for others. In the next sections, we will cover some important notes for each choice. zlib The major benefit of using this codec is that it is the easiest way to get the benefits of data compression from a cluster and job configuration standpoint—the zlib codec is the default compression option. From the data transformation perspective, this codec will decrease the data footprint on disk, but will not provide much of a benefit in terms of job performance. gzip The gzip codec available in Hadoop is the same one that is used outside of the Hadoop ecosystem. It is common practice to use this as the codec for compressing the final output from a job, simply for the benefit of being able to share the compressed result with others (possibly outside of Hadoop) using a standard file format. bzip2 There are two important benefits for the bzip2 codec. 
First, if reducing the data footprint is a high priority, this algorithm will compress the data more than the default zlib option. Second, this is the only supported codec that produces “splittable” compressed data. A major characteristic of Hadoop is the idea of splitting the data so that they can be handled on each node independently. With the other compression codecs, there is an initial requirement to gather all parts of the compressed file in order to have all information necessary to decompress the data. With this format, the data can be decompressed in parallel. This splittable quality makes this format ideal for compressing data that will be used as input to a map function, either in a single step or as part of a series of chained jobs. LZO, LZ4, Snappy These three codecs are ideal for compressing intermediate data—the data output from the mappers that will be immediately read in by the reducers. All three codecs heavily favor compression speed over file size ratio, but the detailed specifications for each algorithm should be examined based on the specific licensing, cluster, and job requirements. Enabling compression Once the appropriate compression codec for any given transformation phase has been selected, there are a few configuration properties that need to be adjusted in order to have the changes take effect in the cluster. Intermediate data to reducer mapreduce.map.output.compress = true (Optional) mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec Final output from a job mapreduce.output.fileoutputformat.compress = true (Optional) mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.BZip2Codec These compression codecs are also available within some of the ecosystem tools like Hive and Pig. In most cases, the tools will default to the Hadoop-configured values for particular codecs, but the tools also provide the option to compress the data generated between steps. Pig pig.tmpfilecompression = true (Optional) pig.tmpfilecompression.codec = snappy Hive hive.exec.compress.intermediate = true hive.exec.compress.output = true Conclusion This post detailed the benefits and disadvantages of data compression, along with some helpful guidelines on how to choose a codec and enable it at various stages in the data transformation workflow. In the next post, we will go through some additional techniques that can be used to ensure that users can make the most of the Hadoop Data Lake. For more Big Data and Hadoop tutorials and insight, visit our dedicated Hadoop page.  About the author Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.
MapReduce with OpenStack Swift and ZeroVM

Lars Butler
30 Jun 2014
6 min read
Originally coined in a 2004 research paper produced by Google, the term “MapReduce” was defined as a “programming model and an associated implementation for processing and generating large datasets”. While Google’s proprietary implementation of this model is known simply as “MapReduce”, the term has since become overloaded to refer to any software that follows the same general computation model. The philosophy behind the MapReduce computation model is based on a divide and conquer approach: the input dataset is divided into small pieces and distributed among a pool of computation nodes for processing. The advantage here is that many nodes can run in parallel to solve a problem. This can be much quicker than a single machine, provided that the cost of communication (loading input data, relaying intermediate results between compute nodes, and writing final results to persistent storage) does not outweigh the computation time. Indeed, the cost of moving data between storage and computation nodes can diminish the advantage of distributed parallel computation if task payloads are too granular. On the other hand, they must be granular enough in order to be distributed evenly (although achieving perfect distribution is nearly impossible if exact computational complexity of each task cannot be known in advance). A distributed MapReduce solution can be effective for processing large data sets, provided that each task is sufficiently long-running to offset communication costs. If the need for data transfer (between compute and storage nodes) could somehow be reduced or eliminated, the task size limitations would all but disappear, enabling a wider range of use cases. One way to achieve this is to perform the computation closer to the data, that is, on the same physical machine. Stored procedures in relational database systems, for example, are one way to achieve data-local computation. Simpler yet, computation could be run directly on the system where the data is stored in files. In both cases, the same problems are present: changes to the code require direct access to the system, and thus access needs to be carefully controlled (through group management, read/write permissions, and so on). Scaling in these scenarios is also problematic: vertical scaling (beefing up a single server) is the only feasible way to gain performance while maintaining the efficiency of data-local computation. Achieving horizontal scalability by adding more machines is not feasible unless the storage system inherently supports this. Enter Swift and ZeroVM OpenStack Swift is one such example of a horizontally scalable data store. Swift can store petabytes of data in millions of objects across thousands of machines. Swift’s storage operations can also be extended by writing custom middleware applications, which makes it a good foundation for building a “smart” storage system capable of running computations directly on the data. ZeroCloud is a middleware application for Swift which does just that. It embeds ZeroVM inside Swift and exposes a new API method for executing jobs. ZeroVM provides sufficient program isolation and a guarantee of security such that any arbitrary code can be run safely without compromising the storage system. Access to data is governed by Swift’s inherent access control, so there is no need to further lock down the storage system, even in a multi-tenant environment. The combination of Swift and ZeroVM results in something like stored procedures, but much more powerful. 
First and foremost, user application code can be updated at any time without the need to worry about malicious or otherwise destructive side effects. (The worst thing that can happen is that a user can crash their own machines and delete their own data—but not another user’s.) Second, ZeroVM instances can start very quickly: in about five milliseconds. This means that to process one thousand objects, ZeroCloud can instantaneously spawn one thousand ZeroVM instances (one for each file), run the job, and destroy the instances very quickly. This also means that any job, big or small, can feasibly run on the system. Finally, the networking component of ZeroVM allows intercommunication between instances, enabling the creation of arbitrary job pipelines and multi-stage computations. This converged compute and storage solution makes for a good alternative to popular MapReduce frameworks like Hadoop because little effort is required to set up and tear down computation nodes for a job. Also, once a job is complete, there is no need to extract result data, pipe it across the network, and then save it in a separate persistent data store (which can be an expensive operation if there is a large quantity of data to save). With ZeroCloud, the results can simply be saved back into Swift. The same principle applies to the setup of a job. There is no need to move or copy data to the computation cluster; the data is already where it needs to be. Limitations Currently, ZeroCloud uses a fairly naïve scheduling algorithm. To perform a data-local computation on an object, ZeroCloud determines which machines in the cluster contain a replicated copy of an object and then randomly chooses one to execute a ZeroVM process. While this functions well as a proof-of-concept for data-local computing, it is quite possible for jobs to be spread unevenly across a cluster, resulting in an inefficient use of resources. A research group at the University of Texas at San Antonio (UTSA) sponsored by Rackspace is currently working on developing better algorithms for scheduling workloads running on ZeroVM-based infrastructure. A short video of the research proposal can be found here:http://link.brightcove.com/services/player/bcpid3530100726001?bckey=AQ~~,AAACa24Pu2k~,Q186VLPcl3-oLBDP8npyqxjCNB5jgYcT. Further reading Rackspace has launched a ZeroCloud playground service called Zebra for developers to try out the platform. At the time this was written, Zebra is still in a private beta and invitations are limited. But it also possible to install and run your own copy of ZeroCloud for testing and development; basic installation instructions are here: https://github.com/zerovm/zerocloud/blob/icehouse/doc/Hacking.md. There are also some tutorials (including sample applications) for creating, packaging, deploying, and executing applications on ZeroCloud: http://zerovm.readthedocs.org/en/latest/zebra/tutorial.html. The tutorials are intended for use with the Zebra service, but can be run on any deployment of ZeroCloud. Big Data in the cloud? It's not the future, it's already here! Find more Hadoop tutorials and extra content here, and dive deeper into OpenStack by visiting this page, dedicated to one of the most exciting cloud platforms around today.  About the author Lars Butler is a software developer at Rackspace, the open cloud company. He has worked as a software developer in avionics, seismic hazard research, and most recently, on ZeroVM. He can be reached @larsbutler.