
How-To Tutorials - Data


More Line Charts, Area Charts, and Scatter Plots

Packt
26 Aug 2014
13 min read
In this article by Scott Gottreu, the author of Learning jqPlot, we'll learn how to import data from remote sources. We will discuss what area charts, stacked area charts, and scatter plots are. Then we will learn how to implement these newly learned charts. We will also learn about trend lines. (For more resources related to this topic, see here.) Working with remote data sources We return from lunch and decide to start on our line chart showing social media conversions. With this chart, we want to pull the data in from other sources. You start to look for some internal data sources, coming across one that returns the data as an object. We can see an excerpt of data returned by the data source. We will need to parse the object and create data arrays for jqPlot: { "twitter":[ ["2012-11-01",289],...["2012-11-30",225] ], "facebook":[ ["2012-11-01",27],...["2012-11-30",48] ] } We solve this issue using a data renderer to pull our data and then format it properly for jqPlot. We can pass a function as a variable to jqPlot and when it is time to render the data, it will call this new function. We start by creating the function to receive our data and then format it. We name it remoteDataSource. jqPlot will pass the following three parameters to our function: url: This is the URL of our data source. plot: The jqPlot object we create is passed by reference, which means we could modify the object from within remoteDataSource. However, it is best to treat it as a read-only object. options: We can pass any type of option in the dataRendererOptions option when we create our jqPlot object. For now, we will not be passing in any options: <script src="../js/jqplot.dateAxisRenderer.min.js"></script> <script> $(document).ready(function(){ var remoteDataSource = function(url, plot, options) { Next we create a new array to hold our formatted data. Then, we use the $.ajax method in jQuery to pull in our data. We set the async option to false. If we don't, the function will continue to run before getting the data and we'll have an empty chart: var data = new Array; $.ajax({ async: false, We set the url option to the url variable that jqPlot passed in. We also set the data type to json: url: url, dataType:"json", success: function(remoteData) { Then we will take the twitter object in our JSON and make that the first element of our data array and make facebook the second element. We then return the whole array back to jqPlot to finish rendering our chart: data.push(remoteData.twitter); data.push(remoteData.facebook); } }); return data; }; With our previous charts, after the id attribute, we would have passed in a data array. This time, instead of passing in a data array, we pass in a URL. Then, within the options, we declare the dataRenderer option and set remoteDataSource as the value. 
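Pulled together, the data renderer described above looks roughly like this (a consolidated sketch of the snippets in the text, assuming the same JSON layout with twitter and facebook arrays):

// Consolidated data renderer; jqPlot passes in url, plot, and options.
var remoteDataSource = function(url, plot, options) {
  var data = [];
  $.ajax({
    async: false,          // jqPlot needs the data before it can render
    url: url,              // the URL we handed to $.jqplot
    dataType: "json",
    success: function(remoteData) {
      data.push(remoteData.twitter);   // first series
      data.push(remoteData.facebook);  // second series
    }
  });
  return data;             // handed back to jqPlot for rendering
};

The original snippet declares the array with var data = new Array; an array literal does the same job.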
Now when our chart is created, it will call our renderer and pass in all the three parameters we discussed earlier: var socialPlot = $.jqplot ('socialMedia', "./data/social_shares.json", { title:'Social Media Shares', dataRenderer: remoteDataSource, We create labels for both our data series and enable the legend: series:[ { label: 'Twitter' }, { label: 'Facebook' } ], legend: { show: true, placement: 'outsideGrid' }, We enable DateAxisRenderer for the x axis and set min to 0 on the y axis, so jqPlot will not extend the axis below zero: axes:{ xaxis:{ renderer:$.jqplot.DateAxisRenderer, label: 'Days in November' }, yaxis: { min:0, label: 'Number of Shares' } } }); }); </script> <div id="socialMedia" style="width:600px;"></div> If you are running the code samples from your filesystem in Chrome, you will get an error message similar to this: No 'Access-Control-Allow-Origin' header is present on the requested resource. The security settings do not allow AJAX requests to be run against files on the filesystem. It is better to use a local web server such as MAMP, WAMP, or XAMPP. This way, we avoid the access control issues. Further information about cross-site HTTP requests can be found at the Mozilla Developer Network at https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS. We load this new chart in our browser and can see the result. We are likely to run into cross-domain issues when trying to access remote sources that do not allow cross-domain requests. The common practice to overcome this hurdle would be to use the JSONP data type in our AJAX call. jQuery will only run JSONP calls asynchronously. This keeps your web page from hanging if a remote source stops responding. However, because jqPlot requires all the data from the remote source before continuing, we can't use cross-domain sources with our data renderers. We start to think of ways we can use external APIs to pull in data from all kinds of sources. We make a note to contact the server guys to write some scripts to pull from the external APIs we want and pass along the data to our charts. By doing it in this way, we won't have to implement OAuth (OAuth is a standard framework used for authentication), http://oauth.net/2, in our web app or worry about which sources allow cross-domain access. Adding to the project's scope As we continue thinking up new ways to work with this data, Calvin stops by. "Hey guys, I've shown your work to a few of the regional vice-presidents and they love it." Your reply is that all of this is simply an experiment and was not designed for public consumption. Calvin holds up his hands as if to hold our concerns at bay. "Don't worry, they know it's all in beta. They did have a couple of ideas. Can you insert in the expenses with the revenue and profit reports? They also want to see those same charts but formatted differently." He continues, "One VP mentioned that maybe we could have one of those charts where everything under the line is filled in. Oh, and they would like to see these by Wednesday ahead of the meeting." With that, Calvin turns around and makes his customary abrupt exit. Adding a fill between two lines We talk through Calvin's comments. Adding in expenses won't be too much of an issue. We could simply add the expense line to one of our existing reports but that will likely not be what they want. Visually, the gap on our chart between profit and revenue should be the total amount of expenses. You mention that we could fill in the gap between the two lines. 
We decide to give this a try: We leave the plugins and the data arrays alone. We pass an empty array into our data array as a placeholder for our expenses. Next, we update our title. After this, we add a new series object and label it Expenses: ... var rev_profit = $.jqplot ('revPrfChart', [revenue, profit, [] ], { title:'Monthly Revenue & Profit with Highlighted Expenses', series:[ { label: 'Revenue' }, { label: 'Profit' }, { label: 'Expenses' } ], legend: { show: true, placement: 'outsideGrid' }, To fill in the gap between the two lines, we use the fillBetween option. The only two required options are series1 and series2. These require the positions of the two data series in the data array. So in our chart, series1 would be 0 and series2 would be 1. The other three optional settings are: baseSeries, color, and fill. The baseSeries option tells jqPlot to place the fill on a layer beneath the given series. It will default to 0. If you pick a series above zero, then the fill will hide any series below the fill layer: fillBetween: { series1: 0, series2: 1, We want to assign a different value to color because it will default to the color of the first data series option. The color option will accept either a hexadecimal value or the rgba option, which allows us to change the opacity of the fill. Even though the fill option defaults to true, we explicitly set it. This option also gives us the ability to turn off the fill after the chart is rendered: color: "rgba(232, 44, 12, 0.5)", fill: true }, The settings for the rest of the chart remain unchanged: axes:{ xaxis:{ renderer:$.jqplot.DateAxisRenderer, label: 'Months' }, yaxis:{ label: 'Totals Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id="revPrfChart" style="width:600px;"></div> We switch back to our web browser and load the new page. We see the result of our efforts in the following screenshot. This chart layout works but we think Calvin and the others will want something else. We decide we need to make an area chart. Understanding area and stacked area charts Area charts come in two varieties. The default type of area chart is simply a modification of a line chart. Everything from the data point on the y axis all the way to zero is shaded. In the event your numbers are negative, then the data above the line up to zero is shaded in. Each data series you have is laid upon the others. Area charts are best to use when we want to compare similar elements, for example, sales by each division in our company or revenue among product categories. The other variation of an area chart is the stacked area chart. The chart starts off being built in the same way as a normal area chart. The first line is plotted and shaded below the line to zero. The difference occurs with the remaining lines. We simply stack them. To understand what happens, consider this analogy. Each shaded line represents a wall built to the height given in the data series. Instead of building one wall behind another, we stack them on top of each other. What can be hard to understand is the y axis. It now denotes a cumulative total, not the individual data points. For example, if the first y value of a line is 4 and the first y value on the second line is 5, then the second point will be plotted at 9 on our y axis. Consider this more complicated example: if the y value in our first line is 2, 7 for our second line, and 4 for the third line, then the y value for our third line will be plotted at 13. That's why we need to compare similar elements. 
Creating an area chart We grab the quarterly report with the divisional profits we created this morning. We will extend the data to a year and plot the divisional profits as an area chart: We remove the data arrays for revenue and the overall profit array. We also add data to the three arrays containing the divisional profits: <script src="../js/jqplot.dateAxisRenderer.min.js"></script> <script> $(document).ready(function(){ var electronics = [["2011-11-20", 123487.87], ...]; var media = [["2011-11-20", 66449.15], ...]; var nerd_corral = [["2011-11-20", 2112.55], ...]; var div_profit = $.jqplot ('division_profit', [ media, nerd_corral, electronics ], { title:'12 Month Divisional Profits', Under seriesDefaults, we assign true to fill and fillToZero. Without setting fillToZero to true, the fill would continue to the bottom of the chart. With the option set, the fill will extend downward to zero on the y axis for positive values and stop. For negative data points, the fill will extend upward to zero: seriesDefaults: { fill: true, fillToZero: true }, series:[ { label: 'Media & Software' }, { label: 'Nerd Corral' }, { label: 'Electronics' } ], legend: { show: true, placement: 'outsideGrid' }, For our x axis, we set numberTicks to 6. The rest of our options we leave unchanged: axes:{ xaxis:{ label: 'Months', renderer:$.jqplot.DateAxisRenderer, numberTicks: 6, tickOptions: { formatString: "%B" } }, yaxis: { label: 'Total Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id="division_profit" style="width:600px;"></div> We review the results of our changes in our browser. We notice something is wrong: only the Electronics series, shown in brown, is showing. This goes back to how area charts are built. Revisiting our wall analogy, we have built a taller wall in front of our other two walls. We need to order our data series from largest to smallest: We move the Electronics series to be the first one in our data array: var div_profit = $.jqplot ('division_profit', [ electronics, media, nerd_corral ], It's also hard to see where some of the lines go when they move underneath another layer. Thankfully, jqPlot has a fillAlpha option. We pass in a percentage in the form of a decimal and jqPlot will change the opacity of our fill area: ... seriesDefaults: { fill: true, fillToZero: true, fillAlpha: .6 }, ... We reload our chart in our web browser and can see the updated changes. Creating a stacked area chart with revenue Calvin stops by while we're taking a break. "Hey guys, I had a VP call and they want to see revenue broken down by division. Can we do that?" We tell him we can. "Great" he says, before turning away and leaving. We discuss this new request and realize this would be a great chance to use a stacked area chart. We dig around and find the divisional revenue numbers Calvin wanted. We can reuse the chart we just created and just change out the data and some options. We use the same variable names for our divisional data and plug in revenue numbers instead of profit. We use a new variable name for our chart object and a new id attribute for our div. 
We update our title and add the stackSeries option and set it to true: var div_revenue = $.jqplot ( 'division_revenue' , [electronics, media, nerd_corral], { title: '12 Month Divisional Revenue', stackSeries: true, We leave our series' options alone and the only option we change on our x axis is set numberTicks back to 3: seriesDefaults: { fill: true, fillToZero: true }, series:[ { label: 'Electronics' }, { label: 'Media & Software' }, { label: 'Nerd Corral' } ], legend: { show: true, placement: 'outsideGrid' }, axes:{ xaxis:{ label: 'Months', renderer:$.jqplot.DateAxisRenderer, numberTicks: 3, tickOptions: { formatString: "%B" } }, We finish our changes by updating the ID of our div container: yaxis: { label: 'Total Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id=" division_revenue " style="width:600px;"></div> With our changes complete, we load this new chart in our browser. As we can see in the following screenshot, we have a chart with each of the data series stacked on top of each other. Because of the nature of a stacked chart, the individual data points are no longer decipherable; however, with the visualization, this is less of an issue. We decide that this is a good place to stop for the day. We'll start on scatterplots and trend lines tomorrow morning. As we begin gathering our things, Calvin stops by on his way out and we show him our recent work. "This is amazing. You guys are making great progress." We tell him we're going to move on to trend lines tomorrow. "Oh, good," Calvin says. "I've had requests to show trending data for our revenue and profit. Someone else mentioned they would love to see trending data of shares on Twitter for our daily deals site. But, like you said, that can wait till tomorrow. Come on, I'll walk with you two."
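For reference, the stacked chart we just walked through condenses to the options below. One caution: the markup in the excerpt has stray spaces inside the id attribute (" division_revenue "), so this sketch assumes the intended id is division_revenue and that the container matches it exactly:

// Condensed sketch of the stacked area chart described above.
var div_revenue = $.jqplot('division_revenue',
  [electronics, media, nerd_corral], {
    title: '12 Month Divisional Revenue',
    stackSeries: true,                                // stack the series instead of overlaying them
    seriesDefaults: { fill: true, fillToZero: true },
    legend: { show: true, placement: 'outsideGrid' }
    // series labels and axes options as shown in the text
});
// <div id="division_revenue" style="width:600px;"></div>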


What is a content provider?

Packt
26 Aug 2014
5 min read
This article is written by Sunny Kumar Aditya and Vikash Kumar Karn, the authors of Android SQLite Essentials. There are four essential components in an Android application: activity, service, broadcast receiver, and content provider. A content provider is used to manage access to a structured set of data. It encapsulates the data and provides abstraction, as well as a mechanism for defining data security. However, content providers are primarily intended to be used by other applications, which access the provider through a provider client object. Together, providers and provider clients offer a consistent, standard interface to data that also handles inter-process communication and secure data access. (For more resources related to this topic, see here.) A content provider allows one app to share data with other applications. By design, an Android SQLite database created by an application is private to that application; this is excellent from a security point of view but troublesome when you want to share data across different applications. This is where a content provider comes to the rescue: you can easily share data by building your own content provider. It is important to note that although our discussion focuses on a database, a content provider is not limited to one. It can also be used to serve data that normally goes into files, such as photos, audio, or video. The interaction that takes place between Application A and Application B while exchanging data can be seen in the following diagram: Here we have Application A, whose activity needs to access the database of Application B. As we have already read, the database of Application B is stored in internal memory and cannot be accessed directly by Application A. This is where the content provider comes into the picture; it allows us to share data with other applications and control how they access it. The content provider implements methods for querying, inserting, updating, and deleting data in databases. Application A now requests that the content provider perform the desired operations on its behalf. We will explore the use of a content provider to fetch contacts from a phone's contact database. Using existing content providers Android lists a lot of standard content providers that we can use. Some of them are Browser, CalendarContract, CallLog, Contacts, ContactsContract, MediaStore, UserDictionary, and so on. We will fetch contacts from a phone's contact list with the help of the system's existing ContentProvider and ContentResolver. We will be using the ContactsContract provider for this purpose. What is a content resolver? The ContentResolver object in the application's context is used to communicate with the provider as a client. The ContentResolver object communicates with the provider object—an instance of a class that implements ContentProvider. The provider object receives data requests from clients, performs the requested action, and returns the results. ContentResolver is the single, global instance in our application that provides access to other applications' content providers; we do not need to worry about handling inter-process communication. The ContentResolver methods provide the basic CRUD (create, retrieve, update, and delete) functions of persistent storage; it has methods that call identically named methods in the provider object without knowing their implementation.
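As a small illustration of the resolver-to-provider round trip just described (our own snippet, not the article's; the projection and sort order are arbitrary, and it assumes the app holds the READ_CONTACTS permission), an activity can list contact names like this:

// Runs inside an Activity, so getContentResolver() is available directly.
Cursor cursor = getContentResolver().query(
        ContactsContract.Contacts.CONTENT_URI,                     // provider URI
        new String[] { ContactsContract.Contacts._ID,
                       ContactsContract.Contacts.DISPLAY_NAME },   // projection
        null, null,                                                // no selection
        ContactsContract.Contacts.DISPLAY_NAME + " ASC");          // sort order
if (cursor != null) {
    try {
        while (cursor.moveToNext()) {
            String name = cursor.getString(
                    cursor.getColumnIndexOrThrow(ContactsContract.Contacts.DISPLAY_NAME));
            Log.i(TAG, "Contact: " + name);                        // TAG as used elsewhere in the article
        }
    } finally {
        cursor.close();                                            // always release the cursor
    }
}

The article's own approach, shown next, instead lets the user pick a contact through an intent.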
In the code provided in the corresponding AddNewContactActivity class, we will initiate picking of contact by building an intent object, Intent.ACTION_PICK, that allows us to pick an item from a data source; in addition, all we need to know is the URI of the provider, which in our case is ContactsContract.Contacts.CONTENT_URI: public void pickContact() { try { Intent cIntent = new Intent(Intent.ACTION_PICK, ContactsContract.Contacts.CONTENT_URI); startActivityForResult(cIntent, PICK_CONTACT); } catch (Exception e) { e.printStackTrace(); Log.i(TAG, "Exception while picking contact"); } } The code used in this article is placed at GitHub: https://github.com/sunwicked/Ch3-PersonalContactManager/tree/master This functionality is also provided by Messaging, Gallery, and Contacts. The Contacts screen will pop up allowing us to browse or search for contacts we require to migrate to our new application. In onActivityResult, that is our next stop, we will use the method to handle our corresponding request to pick and use contacts. Let's look at the code we have to add to pick contacts from an Android's contact provider: protected void onActivityResult(int requestCode, int resultCode, Intent data) { . . else if (requestCode == PICK_CONTACT) { if (resultCode == Activity.RESULT_OK) { Uri contactData = data.getData(); Cursor c = getContentResolver().query(contactData, null, null, null, null); if (c.moveToFirst()) { String id = c .getString(c.getColumnIndexOrThrow(ContactsContract.Contacts._ID)); String hasPhone = c .getString(c.getColumnIndex(ContactsContract.Contacts.HAS_PHONE_NUMBER)); if (hasPhone.equalsIgnoreCase("1")) { Cursor phones = getContentResolver().query(ContactsContract. CommonDataKinds.Phone.CONTENT_URI, null, ContactsContract.CommonDataKinds. Phone.CONTACT_ID + " = " + id, null, null); phones.moveToFirst(); contactPhone.setText(phones.getString(phones.getColumnIndex("data1"))); contactName .setText(phones.getString(phones.getColumnIndex(ContactsContract.Contacts. DISPLAY_NAME))); } ….. We start by checking whether the request code is matching ours, and then we cross check resultcode. We get the content resolver object by making a call to getcontentresolver on the Context object; it is a method of the android.content.Context class. As we are in an activity that inherits from Context, we do not need to be explicit in making a call to it; same goes for services. We will now check whether the contact we picked has a phone number or not. After verifying the necessary details, we pull data that we require, such as contact name and phone number, and set them in relevant fields. Summary This article reflects on how to access and share data in Android via content providers and how to construct a content provider. We also talk about content resolvers and how they are used to communicate with the providers as a client. Resources for Article: Further resources on this subject: Reversing Android Applications [article] Saying Hello to Unity and Android [article] Android Fragmentation Management [article]
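Returning to the onActivityResult() excerpt above: a tidied-up sketch of the same lookup (ours, not the book's) that also closes both cursors, checks for null, and uses a parameterized selection might look like this:

@Override
protected void onActivityResult(int requestCode, int resultCode, Intent data) {
    super.onActivityResult(requestCode, resultCode, data);
    if (requestCode == PICK_CONTACT && resultCode == Activity.RESULT_OK) {
        Uri contactData = data.getData();
        Cursor c = getContentResolver().query(contactData, null, null, null, null);
        if (c != null && c.moveToFirst()) {
            String id = c.getString(c.getColumnIndexOrThrow(ContactsContract.Contacts._ID));
            String hasPhone = c.getString(c.getColumnIndex(ContactsContract.Contacts.HAS_PHONE_NUMBER));
            if ("1".equals(hasPhone)) {
                Cursor phones = getContentResolver().query(
                        ContactsContract.CommonDataKinds.Phone.CONTENT_URI, null,
                        ContactsContract.CommonDataKinds.Phone.CONTACT_ID + " = ?",
                        new String[] { id }, null);                 // selection argument instead of string concatenation
                if (phones != null && phones.moveToFirst()) {
                    contactPhone.setText(phones.getString(
                            phones.getColumnIndex(ContactsContract.CommonDataKinds.Phone.NUMBER)));
                    contactName.setText(phones.getString(
                            phones.getColumnIndex(ContactsContract.Contacts.DISPLAY_NAME)));
                }
                if (phones != null) {
                    phones.close();                                 // release the phone cursor
                }
            }
        }
        if (c != null) {
            c.close();                                              // release the contact cursor
        }
    }
}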


Stream Grouping

Packt
26 Aug 2014
7 min read
In this article, by Ankit Jain and Anand Nalya, the authors of the book Learning Storm, we will cover different types of stream groupings. (For more resources related to this topic, see here.) When defining a topology, we create a graph of computation with a number of bolt-processing streams. At a more granular level, each bolt executes as multiple tasks in the topology. A stream will be partitioned into a number of partitions and divided among the bolts' tasks. Thus, each task of a particular bolt will only get a subset of the tuples from the subscribed streams. Stream grouping in Storm provides complete control over how this partitioning of tuples happens among many tasks of a bolt subscribed to a stream. Grouping for a bolt can be defined on the instance of the backtype.storm.topology.InputDeclarer class returned when defining bolts using the backtype.storm.topology.TopologyBuilder.setBolt method. Storm supports the following types of stream groupings: Shuffle grouping Fields grouping All grouping Global grouping Direct grouping Local or shuffle grouping Custom grouping Now, we will look at each of these groupings in detail. Shuffle grouping Shuffle grouping distributes tuples in a uniform, random way across the tasks. An equal number of tuples will be processed by each task. This grouping is ideal when you want to distribute your processing load uniformly across the tasks and where there is no requirement of any data-driven partitioning. Fields grouping Fields grouping enables you to partition a stream on the basis of some of the fields in the tuples. For example, if you want that all the tweets from a particular user should go to a single task, then you can partition the tweet stream using fields grouping on the username field in the following manner: builder.setSpout("1", new TweetSpout()); builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username")) Fields grouping is calculated with the following function: hash (fields) % (no. of tasks) Here, hash is a hashing function. It does not guarantee that each task will get tuples to process. For example, if you have applied fields grouping on a field, say X, with only two possible values, A and B, and created two tasks for the bolt, then it might be possible that both hash (A) % 2 and hash (B) % 2 are equal, which will result in all the tuples being routed to a single task and other tasks being completely idle. Another common usage of fields grouping is to join streams. Since partitioning happens solely on the basis of field values and not the stream type, we can join two streams with any common join fields. The name of the fields do not need to be the same. For example, in order to process domains, we can join the Order and ItemScanned streams when an order is completed: builder.setSpout("1", new OrderSpout()); builder.setSpout("2", new ItemScannedSpout()); builder.setBolt("joiner", new OrderJoiner()) .fieldsGrouping("1", new Fields("orderId")) .fieldsGrouping("2", new Fields("orderRefId")); All grouping All grouping is a special grouping that does not partition the tuples but replicates them to all the tasks, that is, each tuple will be sent to each of the bolt's tasks for processing. One common use case of all grouping is for sending signals to bolts. For example, if you are doing some kind of filtering on the streams, then you have to pass the filter parameters to all the bolts. This can be achieved by sending those parameters over a stream that is subscribed by all bolts' tasks with all grouping. 
Another example is to send a reset message to all the tasks in an aggregation bolt. The following is an example of all grouping: builder.setSpout("1", new TweetSpout()); builder.setSpout("signals", new SignalSpout()); builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username")).allGrouping("signals"); Here, we are subscribing signals for all the TweetCounter bolt's tasks. Now, we can send different signals to the TweetCounter bolt using SignalSpout. Global grouping Global grouping does not partition the stream but sends the complete stream to the bolt's task with the smallest ID. A general use case of this is when there needs to be a reduce phase in your topology where you want to combine results from previous steps in the topology in a single bolt. Global grouping might seem redundant at first, as you can achieve the same results with defining the parallelism for the bolt as one and setting the number of input streams to one. Though, when you have multiple streams of data coming through different paths, you might want only one of the streams to be reduced and others to be processed in parallel. For example, consider the following topology. In this topology, you might want to route all the tuples coming from Bolt C to a single Bolt D task, while you might still want parallelism for tuples coming from Bolt E to Bolt D. Global grouping This can be achieved with the following code snippet: builder.setSpout("a", new SpoutA()); builder.setSpout("b", new SpoutB()); builder.setBolt("c", new BoltC()); builder.setBolt("e", new BoltE()); builder.setBolt("d", new BoltD()) .globalGrouping("c") .shuffleGrouping("e"); Direct grouping In direct grouping, the emitter decides where each tuple will go for processing. For example, say we have a log stream and we want to process each log entry using a specific bolt task on the basis of the type of resource. In this case, we can use direct grouping. Direct grouping can only be used with direct streams. To declare a stream as a direct stream, use the backtype.storm.topology.OutputFieldsDeclarer.declareStream method that takes a Boolean parameter directly in the following way in your spout: @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declareStream("directStream", true, new Fields("field1")); } Now, we need the number of tasks for the component so that we can specify the taskId parameter while emitting the tuple. This can be done using the backtype.storm.task.TopologyContext.getComponentTasks method in the prepare method of the bolt. The following snippet stores the number of tasks in a bolt field: public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) { this.numOfTasks = context.getComponentTasks("my-stream"); this.collector = collector; } Once you have a direct stream to emit to, use the backtype.storm.task.OutputCollector.emitDirect method instead of the emit method to emit it. The emitDirect method takes a taskId parameter to specify the task. In the following snippet, we are emitting to one of the tasks randomly: public void execute(Tuple input) { collector.emitDirect(new Random().nextInt(this.numOfTasks), process(input)); } Local or shuffle grouping If the tuple source and target bolt tasks are running in the same worker, using this grouping will act as a shuffle grouping only between the target tasks running on the same worker, thus minimizing any network hops resulting in increased performance. 
In case there are no target bolt tasks running on the source worker process, this grouping will act similar to the shuffle grouping mentioned earlier. Custom grouping If none of the preceding groupings fit your use case, you can define your own custom grouping by implementing the backtype.storm.grouping.CustomStreamGrouping interface. The following is a sample custom grouping that partitions a stream on the basis of the category in the tuples: public class CategoryGrouping implements CustomStreamGrouping, Serializable { // Mapping of category to integer values for grouping private static final Map<String, Integer> categories = ImmutableMap.of ( "Financial", 0, "Medical", 1, "FMCG", 2, "Electronics", 3 ); // number of tasks, this is initialized in prepare method private int tasks = 0; public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) { // initialize the number of tasks tasks = targetTasks.size(); } public List<Integer> chooseTasks(int taskId, List<Object> values) { // return the taskId for a given category String category = (String) values.get(0); return ImmutableList.of(categories.get(category) % tasks); } } Now, we can use this grouping in our topologies with the following code snippet: builder.setSpout("a", new SpoutA()); builder.setBolt("b", (IRichBolt)new BoltB()) .customGrouping("a", new CategoryGrouping()); The following diagram represents the Storm groupings graphically: Summary In this article, we discussed stream grouping in Storm and its types. Resources for Article: Further resources on this subject: Integrating Storm and Hadoop [article] Deploying Storm on Hadoop for Advertising Analysis [article] Photo Stream with iCloud [article]
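Returning to the direct-grouping snippets above: backtype.storm.task.TopologyContext.getComponentTasks() returns the list of task ids rather than a count, so it cannot be assigned to an int field as shown. A sketch of ours that keeps the list, using the same placeholder names as the text, looks like this:

// Inside the bolt that emits on the direct stream declared earlier.
private List<Integer> targetTasks;
private OutputCollector collector;
private Random random = new Random();

public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    // task ids of the consuming component ("my-stream" is the placeholder used in the text)
    this.targetTasks = context.getComponentTasks("my-stream");
    this.collector = collector;
}

public void execute(Tuple input) {
    // pick one of the target tasks at random, as in the original example,
    // and emit on the declared direct stream
    int taskId = targetTasks.get(random.nextInt(targetTasks.size()));
    collector.emitDirect(taskId, "directStream", process(input));
}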


Classifying Text

Packt
26 Aug 2014
23 min read
In this article by Jacob Perkins, author of Python 3 Text Processing with NLTK 3 Cookbook, we will learn how to transform text into feature dictionaries, and how to train a text classifier for sentiment analysis. (For more resources related to this topic, see here.) Bag of words feature extraction Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect dict style feature sets, so we must therefore transform our text into a dict. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method doesn't care about the order of the words, or how many times a word occurs, all that matters is whether the word is present in a list of words. How to do it... The idea is to convert a list of words into a dict, where each word becomes a key with the value True. The bag_of_words() function in featx.py looks like this: def bag_of_words(words): return dict([(word, True) for word in words]) We can use it with a list of words; in this case, the tokenized sentence the quick brown fox: >>> from featx import bag_of_words >>> bag_of_words(['the', 'quick', 'brown', 'fox']) {'quick': True, 'brown': True, 'the': True, 'fox': True} The resulting dict is known as a bag of words because the words are not in order, and it doesn't matter where in the list of words they occurred, or how many times they occurred. All that matters is that the word is found at least once. You can use different values than True, but it is important to keep in mind that the NLTK classifiers learn from the unique combination of (key, value). That means that ('fox', 1) is treated as a different feature than ('fox', 2). How it works... The bag_of_words() function is a very simple list comprehension that constructs a dict from the given words, where every word gets the value True. Since we have to assign a value to each word in order to create a dict, True is a logical choice for the value to indicate word presence. If we knew the universe of all possible words, we could assign the value False to all the words that are not in the given list of words. But most of the time, we don't know all the possible words beforehand. Plus, the dict that would result from assigning False to every possible word would be very large (assuming all words in the English language are possible). So instead, to keep feature extraction simple and use less memory, we stick to assigning the value True to all words that occur at least once. We don't assign the value False to any word since we don't know what the set of possible words are; we only know about the words we are given. There's more... In the default bag of words model, all words are treated equally. But that's not always a good idea. As we already know, some words are so common that they are practically meaningless. If you have a set of words that you want to exclude, you can use the bag_of_words_not_in_set() function in featx.py: def bag_of_words_not_in_set(words, badwords): return bag_of_words(set(words) - set(badwords)) This function can be used, among other things, to filter stopwords. Here's an example where we filter the word the from the quick brown fox: >>> from featx import bag_of_words_not_in_set >>> bag_of_words_not_in_set(['the', 'quick', 'brown', 'fox'], ['the']) {'quick': True, 'brown': True, 'fox': True} As expected, the resulting dict has quick, brown, and fox, but not the. 
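As the recipe notes, values other than True are allowed, and the classifier learns from each unique (key, value) pair. A small variant of ours that records counts instead of simple presence:

# Not from the book: a bag of words that keeps counts rather than True.
# Remember that ('fox', 1) and ('fox', 2) are treated as different features.
from collections import Counter

def bag_of_word_counts(words):
    return dict(Counter(words))

bag_of_word_counts(['the', 'quick', 'brown', 'fox', 'the'])
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1}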
Filtering stopwords Stopwords are words that are often useless in NLP, in that they don't convey much meaning, such as the word the. Here's an example of using the bag_of_words_not_in_set() function to filter all English stopwords: from nltk.corpus import stopwords def bag_of_non_stopwords(words, stopfile='english'): badwords = stopwords.words(stopfile) return bag_of_words_not_in_set(words, badwords) You can pass a different language filename as the stopfile keyword argument if you are using a language other than English. Using this function produces the same result as the previous example: >>> from featx import bag_of_non_stopwords >>> bag_of_non_stopwords(['the', 'quick', 'brown', 'fox']) {'quick': True, 'brown': True, 'fox': True} Here, the is a stopword, so it is not present in the returned dict. Including significant bigrams In addition to single words, it often helps to include significant bigrams. As significant bigrams are less common than most individual words, including them in the bag of words model can help the classifier make better decisions. We can use the BigramCollocationFinder class to find significant bigrams. The bag_of_bigrams_words() function found in featx.py will return a dict of all words along with the 200 most significant bigrams: from nltk.collocations import BigramCollocationFinder from nltk.metrics import BigramAssocMeasures def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200): bigram_finder = BigramCollocationFinder.from_words(words) bigrams = bigram_finder.nbest(score_fn, n) return bag_of_words(words + bigrams) The bigrams will be present in the returned dict as (word1, word2) and will have the value as True. Using the same example words as we did earlier, we get all words plus every bigram: >>> from featx import bag_of_bigrams_words >>> bag_of_bigrams_words(['the', 'quick', 'brown', 'fox']) {'brown': True, ('brown', 'fox'): True, ('the', 'quick'): True, 'fox': True, ('quick', 'brown'): True, 'quick': True, 'the': True} You can change the maximum number of bigrams found by altering the keyword argument n. See also In the next recipe, we will train a NaiveBayesClassifier class using feature sets created with the bag of words model. Training a Naive Bayes classifier Now that we can extract features from text, we can train a classifier. The easiest classifier to get started with is the NaiveBayesClassifier class. It uses the Bayes theorem to predict the probability that a given feature set belongs to a particular label. The formula is: P(label | features) = P(label) * P(features | label) / P(features) The following list describes the various parameters from the previous formula: P(label): This is the prior probability of the label occurring, which is the likelihood that a random feature set will have the label. This is based on the number of training instances with the label compared to the total number of training instances. For example, if 60/100 training instances have the label, the prior probability of the label is 60%. P(features | label): This is the prior probability of a given feature set being classified as that label. This is based on which features have occurred with each label in the training data. P(features): This is the prior probability of a given feature set occurring. This is the likelihood of a random feature set being the same as the given feature set, and is based on the observed feature sets in the training data. 
For example, if the given feature set occurs twice in 100 training instances, the prior probability is 2%. P(label | features): This tells us the probability that the given features should have that label. If this value is high, then we can be reasonably confident that the label is correct for the given features. Getting ready We are going to be using the movie_reviews corpus for our initial classification examples. This corpus contains two categories of text: pos and neg. These categories are exclusive, which makes a classifier trained on them a binary classifier. Binary classifiers have only two classification labels, and will always choose one or the other. Each file in the movie_reviews corpus is composed of either positive or negative movie reviews. We will be using each file as a single instance for both training and testing the classifier. Because of the nature of the text and its categories, the classification we will be doing is a form of sentiment analysis. If the classifier returns pos, then the text expresses a positive sentiment, whereas if we get neg, then the text expresses a negative sentiment. How to do it... For training, we need to first create a list of labeled feature sets. This list should be of the form [(featureset, label)], where the featureset variable is a dict and label is the known class label for the featureset. The label_feats_from_corpus() function in featx.py takes a corpus, such as movie_reviews, and a feature_detector function, which defaults to bag_of_words. It then constructs and returns a mapping of the form {label: [featureset]}. We can use this mapping to create a list of labeled training instances and testing instances. The reason to do it this way is to get a fair sample from each label. It is important to get a fair sample, because parts of the corpus may be (unintentionally) biased towards one label or the other. Getting a fair sample should eliminate this possible bias: import collections def label_feats_from_corpus(corp, feature_detector=bag_of_words): label_feats = collections.defaultdict(list) for label in corp.categories(): for fileid in corp.fileids(categories=[label]): feats = feature_detector(corp.words(fileids=[fileid])) label_feats[label].append(feats) return label_feats Once we can get a mapping of label | feature sets, we want to construct a list of labeled training instances and testing instances. 
The split_label_feats() function in featx.py takes a mapping returned from label_feats_from_corpus() and splits each list of feature sets into labeled training and testing instances: def split_label_feats(lfeats, split=0.75): train_feats = [] test_feats = [] for label, feats in lfeats.items(): cutoff = int(len(feats) * split) train_feats.extend([(feat, label) for feat in feats[:cutoff]]) test_feats.extend([(feat, label) for feat in feats[cutoff:]]) return train_feats, test_feats Using these functions with the movie_reviews corpus gives us the lists of labeled feature sets we need to train and test a classifier: >>> from nltk.corpus import movie_reviews >>> from featx import label_feats_from_corpus, split_label_feats >>> movie_reviews.categories() ['neg', 'pos'] >>> lfeats = label_feats_from_corpus(movie_reviews) >>> lfeats.keys() dict_keys(['neg', 'pos']) >>> train_feats, test_feats = split_label_feats(lfeats, split=0.75) >>> len(train_feats) 1500 >>> len(test_feats) 500 So there are 1000 pos files, 1000 neg files, and we end up with 1500 labeled training instances and 500 labeled testing instances, each composed of equal parts of pos and neg. If we were using a different dataset, where the classes were not balanced, our training and testing data would have the same imbalance. Now we can train a NaiveBayesClassifier class using its train() class method: >>> from nltk.classify import NaiveBayesClassifier >>> nb_classifier = NaiveBayesClassifier.train(train_feats) >>> nb_classifier.labels() ['neg', 'pos'] Let's test the classifier on a couple of made up reviews. The classify() method takes a single argument, which should be a feature set. We can use the same bag_of_words() feature detector on a list of words to get our feature set: >>> from featx import bag_of_words >>> negfeat = bag_of_words(['the', 'plot', 'was', 'ludicrous']) >>> nb_classifier.classify(negfeat) 'neg' >>> posfeat = bag_of_words(['kate', 'winslet', 'is', 'accessible']) >>> nb_classifier.classify(posfeat) 'pos' How it works... The label_feats_from_corpus() function assumes that the corpus is categorized, and that a single file represents a single instance for feature extraction. It iterates over each category label, and extracts features from each file in that category using the feature_detector() function, which defaults to bag_of_words(). It returns a dict whose keys are the category labels, and the values are lists of instances for that category. If we had label_feats_from_corpus() return a list of labeled feature sets instead of a dict, it would be much harder to get balanced training data. The list would be ordered by label, and if you took a slice of it, you would almost certainly be getting far more of one label than another. By returning a dict, you can take slices from the feature sets of each label, in the same proportion that exists in the data. Now we need to split the labeled feature sets into training and testing instances using split_label_feats(). This function allows us to take a fair sample of labeled feature sets from each label, using the split keyword argument to determine the size of the sample. The split argument defaults to 0.75, which means the first 75% of the labeled feature sets for each label will be used for training, and the remaining 25% will be used for testing. Once we have gotten our training and testing feats split up, we train a classifier using the NaiveBayesClassifier.train() method. This class method builds two probability distributions for calculating prior probabilities. 
These are passed into the NaiveBayesClassifier constructor. The label_probdist constructor contains the prior probability for each label, or P(label). The feature_probdist constructor contains P(feature name = feature value | label). In our case, it will store P(word=True | label). Both are calculated based on the frequency of occurrence of each label and each feature name and value in the training data. The NaiveBayesClassifier class inherits from ClassifierI, which requires subclasses to provide a labels() method, and at least one of the classify() or prob_classify() methods. The following diagram shows other methods, which will be covered shortly: There's more... We can test the accuracy of the classifier using nltk.classify.util.accuracy() and the test_feats variable created previously: >>> from nltk.classify.util import accuracy >>> accuracy(nb_classifier, test_feats) 0.728 This tells us that the classifier correctly guessed the label of nearly 73% of the test feature sets. The code in this article is run with the PYTHONHASHSEED=0 environment variable so that accuracy calculations are consistent. If you run the code with a different value for PYTHONHASHSEED, or without setting this environment variable, your accuracy values may differ. Classification probability While the classify() method returns only a single label, you can use the prob_classify() method to get the classification probability of each label. This can be useful if you want to use probability thresholds for classification: >>> probs = nb_classifier.prob_classify(test_feats[0][0]) >>> probs.samples() dict_keys(['neg', 'pos']) >>> probs.max() 'pos' >>> probs.prob('pos') 0.9999999646430913 >>> probs.prob('neg') 3.535688969240647e-08 In this case, the classifier says that the first test instance is nearly 100% likely to be pos. Other instances may have more mixed probabilities. For example, if the classifier says an instance is 60% pos and 40% neg, that means the classifier is 60% sure the instance is pos, but there is a 40% chance that it is neg. It can be useful to know this for situations where you only want to use strongly classified instances, with a threshold of 80% or greater. Most informative features The NaiveBayesClassifier class has two methods that are quite useful for learning about your data. Both methods take a keyword argument n to control how many results to show. The most_informative_features() method returns a list of the form [(feature name, feature value)] ordered by most informative to least informative. In our case, the feature value will always be True: >>> nb_classifier.most_informative_features(n=5)[('magnificent', True), ('outstanding', True), ('insulting', True),('vulnerable', True), ('ludicrous', True)] The show_most_informative_features() method will print out the results from most_informative_features() and will also include the probability of a feature pair belonging to each label: >>> nb_classifier.show_most_informative_features(n=5) Most Informative Features magnificent = True pos : neg = 15.0 : 1.0 outstanding = True pos : neg = 13.6 : 1.0 insulting = True neg : pos = 13.0 : 1.0 vulnerable = True pos : neg = 12.3 : 1.0 ludicrous = True neg : pos = 11.8 : 1.0 The informativeness, or information gain, of each feature pair is based on the prior probability of the feature pair occurring for each label. More informative features are those that occur primarily in one label and not on the other. The less informative features are those that occur frequently with both labels. 
Another way to state this is that the entropy of the classifier decreases more when using a more informative feature. See https://en.wikipedia.org/wiki/Information_gain_in_decision_trees for more on information gain and entropy (while it specifically mentions decision trees, the same concepts are applicable to all classifiers). Training estimator During training, the NaiveBayesClassifier class constructs probability distributions for each feature using an estimator parameter, which defaults to nltk.probability.ELEProbDist. The estimator is used to calculate the probability of a label parameter given a specific feature. In ELEProbDist, ELE stands for Expected Likelihood Estimate, and the formula for calculating the label probabilities for a given feature is (c+0.5)/(N+B/2). Here, c is the count of times a single feature occurs, N is the total number of feature outcomes observed, and B is the number of bins or unique features in the feature set. In cases where the feature values are all True, N == B. In other cases, where the number of times a feature occurs is recorded, then N >= B. You can use any estimator parameter you want, and there are quite a few to choose from. The only constraints are that it must inherit from nltk.probability.ProbDistI and its constructor must take a bins keyword argument. Here's an example using the LaplaceProdDist class, which uses the formula (c+1)/(N+B): >>> from nltk.probability import LaplaceProbDist >>> nb_classifier = NaiveBayesClassifier.train(train_feats, estimator=LaplaceProbDist) >>> accuracy(nb_classifier, test_feats) 0.716 As you can see, accuracy is slightly lower, so choose your estimator parameter carefully. You cannot use nltk.probability.MLEProbDist as the estimator, or any ProbDistI subclass that does not take the bins keyword argument. Training will fail with TypeError: __init__() got an unexpected keyword argument 'bins'. Manual training You don't have to use the train() class method to construct a NaiveBayesClassifier. You can instead create the label_probdist and feature_probdist variables manually. The label_probdist variable should be an instance of ProbDistI, and should contain the prior probabilities for each label. The feature_probdist variable should be a dict whose keys are tuples of the form (label, feature name) and whose values are instances of ProbDistI that have the probabilities for each feature value. In our case, each ProbDistI should have only one value, True=1. Here's a very simple example using a manually constructed DictionaryProbDist class: >>> from nltk.probability import DictionaryProbDist >>> label_probdist = DictionaryProbDist({'pos': 0.5, 'neg': 0.5}) >>> true_probdist = DictionaryProbDist({True: 1}) >>> feature_probdist = {('pos', 'yes'): true_probdist, ('neg', 'no'): true_probdist} >>> classifier = NaiveBayesClassifier(label_probdist, feature_probdist) >>> classifier.classify({'yes': True}) 'pos' >>> classifier.classify({'no': True}) 'neg' See also In the next recipe, we will train the DecisionTreeClassifier classifier. Training a decision tree classifier The DecisionTreeClassifier class works by creating a tree structure, where each node corresponds to a feature name and the branches correspond to the feature values. Tracing down the branches, you get to the leaves of the tree, which are the classification labels. How to do it... 
Using the same train_feats and test_feats variables we created from the movie_reviews corpus in the previous recipe, we can call the DecisionTreeClassifier.train() class method to get a trained classifier. We pass binary=True because all of our features are binary: either the word is present or it's not. For other classification use cases where you have multivalued features, you will want to stick to the default binary=False. In this context, binary refers to feature values, and is not to be confused with a binary classifier. Our word features are binary because the value is either True or the word is not present. If our features could take more than two values, we would have to use binary=False. A binary classifier, on the other hand, is a classifier that only chooses between two labels. In our case, we are training a binary DecisionTreeClassifier on binary features. But it's also possible to have a binary classifier with non-binary features, or a non-binary classifier with binary features. The following is the code for training and evaluating the accuracy of a DecisionTreeClassifier class: >>> dt_classifier = DecisionTreeClassifier.train(train_feats,binary=True, entropy_cutoff=0.8, depth_cutoff=5, support_cutoff=30)>>> accuracy(dt_classifier, test_feats)0.688 The DecisionTreeClassifier class can take much longer to train than the NaiveBayesClassifier class. For that reason, I have overridden the default parameters so it trains faster. These parameters will be explained later. How it works... The DecisionTreeClassifier class, like the NaiveBayesClassifier class, is also an instance of ClassifierI, as shown in the following diagram: During training, the DecisionTreeClassifier class creates a tree where the child nodes are also instances of DecisionTreeClassifier. The leaf nodes contain only a single label, while the intermediate child nodes contain decision mappings for each feature. These decisions map each feature value to another DecisionTreeClassifier, which itself may contain decisions for another feature, or it may be a final leaf node with a classification label. The train() class method builds this tree from the ground up, starting with the leaf nodes. It then refines itself to minimize the number of decisions needed to get to a label by putting the most informative features at the top. To classify, the DecisionTreeClassifier class looks at the given feature set and traces down the tree, using known feature names and values to make decisions. Because we are creating a binary tree, each DecisionTreeClassifier instance also has a default decision tree, which it uses when a known feature is not present in the feature set being classified. This is a common occurrence in text-based feature sets, and indicates that a known word was not in the text being classified. This also contributes information towards a classification decision. There's more... The parameters passed into DecisionTreeClassifier.train() can be tweaked to improve accuracy or decrease training time. Generally, if you want to improve accuracy, you must accept a longer training time and if you want to decrease the training time, the accuracy will most likely decrease as well. But be careful not to optimize for accuracy too much. A really high accuracy may indicate overfitting, which means the classifier will be excellent at classifying the training data, but not so good on data it has never seen. See https://en.wikipedia.org/wiki/Over_fitting for more on this concept. 
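One quick way to check for the overfitting just mentioned (our suggestion, not the book's) is to compare accuracy on the training feats with accuracy on the held-out test feats; dt_classifier, train_feats, and test_feats are the objects built above:

# A large gap between the two numbers is the usual sign of overfitting.
from nltk.classify.util import accuracy

train_acc = accuracy(dt_classifier, train_feats)   # data the tree was trained on
test_acc = accuracy(dt_classifier, test_feats)     # held-out data
print(train_acc, test_acc)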
Controlling uncertainty with entropy_cutoff Entropy is the uncertainty of the outcome. As entropy approaches 1.0, uncertainty increases. Conversely, as entropy approaches 0.0, uncertainty decreases. In other words, when you have similar probabilities, the entropy will be high as each probability has a similar likelihood (or uncertainty of occurrence). But the more the probabilities differ, the lower the entropy will be. The entropy_cutoff value is used during the tree refinement process. The tree refinement process is how the decision tree decides to create new branches. If the entropy of the probability distribution of label choices in the tree is greater than the entropy_cutoff value, then the tree is refined further by creating more branches. But if the entropy is lower than the entropy_cutoff value, then tree refinement is halted. Entropy is calculated by giving nltk.probability.entropy() a MLEProbDist value created from a FreqDist of label counts. Here's an example showing the entropy of various FreqDist values. The value of 'pos' is kept at 30, while the value of 'neg' is manipulated to show that when 'neg' is close to 'pos', entropy increases, but when it is closer to 1, entropy decreases: >>> from nltk.probability import FreqDist, MLEProbDist, entropy >>> fd = FreqDist({'pos': 30, 'neg': 10}) >>> entropy(MLEProbDist(fd)) 0.8112781244591328 >>> fd['neg'] = 25 >>> entropy(MLEProbDist(fd)) 0.9940302114769565 >>> fd['neg'] = 30 >>> entropy(MLEProbDist(fd)) 1.0 >>> fd['neg'] = 1 >>> entropy(MLEProbDist(fd)) 0.20559250818508304 What this all means is that if the label occurrence is very skewed one way or the other, the tree doesn't need to be refined because entropy/uncertainty is low. But when the entropy is greater than entropy_cutoff, then the tree must be refined with further decisions to reduce the uncertainty. Higher values of entropy_cutoff will decrease both accuracy and training time. Controlling tree depth with depth_cutoff The depth_cutoff value is also used during refinement to control the depth of the tree. The final decision tree will never be deeper than the depth_cutoff value. The default value is 100, which means that classification may require up to 100 decisions before reaching a leaf node. Decreasing the depth_cutoff value will decrease the training time and most likely decrease the accuracy as well. Controlling decisions with support_cutoff The support_cutoff value controls how many labeled feature sets are required to refine the tree. As the DecisionTreeClassifier class refines itself, labeled feature sets are eliminated once they no longer provide value to the training process. When the number of labeled feature sets is less than or equal to support_cutoff, refinement stops, at least for that section of the tree. Another way to look at it is that support_cutoff specifies the minimum number of instances that are required to make a decision about a feature. If support_cutoff is 20, and you have less than 20 labeled feature sets with a given feature, then you don't have enough instances to make a good decision, and refinement around that feature must come to a stop. See also The previous recipe covered the creation of training and test feature sets from the movie_reviews corpus. Summary In this article, we learned how to transform text into feature dictionaries, and how to train a text classifier for sentiment analysis. 
Resources for Article: Further resources on this subject: Python Libraries for Geospatial Development [article] Python Testing: Installing the Robot Framework [article] Ten IPython essentials [article]

Camera Calibration

Packt
25 Aug 2014
18 min read
This article by Robert Laganière, author of OpenCV Computer Vision Application Programming Cookbook Second Edition, includes that images are generally produced using a digital camera, which captures a scene by projecting light going through its lens onto an image sensor. The fact that an image is formed by the projection of a 3D scene onto a 2D plane implies the existence of important relationships between a scene and its image and between different images of the same scene. Projective geometry is the tool that is used to describe and characterize, in mathematical terms, the process of image formation. In this article, we will introduce you to some of the fundamental projective relations that exist in multiview imagery and explain how these can be used in computer vision programming. You will learn how matching can be made more accurate through the use of projective constraints and how a mosaic from multiple images can be composited using two-view relations. Before we start the recipe, let's explore the basic concepts related to scene projection and image formation. (For more resources related to this topic, see here.) Image formation Fundamentally, the process used to produce images has not changed since the beginning of photography. The light coming from an observed scene is captured by a camera through a frontal aperture; the captured light rays hit an image plane (or an image sensor) located at the back of the camera. Additionally, a lens is used to concentrate the rays coming from the different scene elements. This process is illustrated by the following figure: Here, do is the distance from the lens to the observed object, di is the distance from the lens to the image plane, and f is the focal length of the lens. These quantities are related by the so-called thin lens equation: In computer vision, this camera model can be simplified in a number of ways. First, we can neglect the effect of the lens by considering that we have a camera with an infinitesimal aperture since, in theory, this does not change the image appearance. (However, by doing so, we ignore the focusing effect by creating an image with an infinite depth of field.) In this case, therefore, only the central ray is considered. Second, since most of the time we have do>>di, we can assume that the image plane is located at the focal distance. Finally, we can note from the geometry of the system that the image on the plane is inverted. We can obtain an identical but upright image by simply positioning the image plane in front of the lens. Obviously, this is not physically feasible, but from a mathematical point of view, this is completely equivalent. This simplified model is often referred to as the pin-hole camera model, and it is represented as follows: From this model, and using the law of similar triangles, we can easily derive the basic projective equation that relates a pictured object with its image: The size (hi) of the image of an object (of height ho) is therefore inversely proportional to its distance (do) from the camera, which is naturally true. In general, this relation describes where a 3D scene point will be projected on the image plane given the geometry of the camera. Calibrating a camera From the introduction of this article, we learned that the essential parameters of a camera under the pin-hole model are its focal length and the size of the image plane (which defines the field of view of the camera). 
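For reference, the thin lens equation and the projective relation mentioned in the Image formation section are given here in their standard textbook form (standard notation rather than the article's original figures, with do the object distance, di the image distance, f the focal length, and ho and hi the object and image heights):

Thin lens equation:
\frac{1}{d_o} + \frac{1}{d_i} = \frac{1}{f}

Pin-hole projection (image plane at the focal distance, d_i \approx f):
h_i = \frac{f \, h_o}{d_o}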
Also, since we are dealing with digital images, the number of pixels on the image plane (its resolution) is another important characteristic of a camera. Finally, in order to be able to compute the position of an image's scene point in pixel coordinates, we need one additional piece of information. Considering the line coming from the focal point that is orthogonal to the image plane, we need to know at which pixel position this line pierces the image plane. This point is called the principal point. It might be logical to assume that this principal point is at the center of the image plane, but in practice, this point might be off by a few pixels depending on the precision at which the camera has been manufactured. Camera calibration is the process by which the different camera parameters are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. Camera calibration will proceed by showing known patterns to the camera and analyzing the obtained images. An optimization process will then determine the optimal parameter values that explain the observations. This is a complex process that has been made easy by the availability of OpenCV calibration functions. How to do it... To calibrate a camera, the idea is to show it a set of scene points for which their 3D positions are known. Then, you need to observe where these points project on the image. With the knowledge of a sufficient number of 3D points and associated 2D image points, the exact camera parameters can be inferred from the projective equation. Obviously, for accurate results, we need to observe as many points as possible. One way to achieve this would be to take one picture of a scene with many known 3D points, but in practice, this is rarely feasible. A more convenient way is to take several images of a set of some 3D points from different viewpoints. This approach is simpler but requires you to compute the position of each camera view in addition to the computation of the internal camera parameters, which fortunately is feasible. OpenCV proposes that you use a chessboard pattern to generate the set of 3D scene points required for calibration. This pattern creates points at the corners of each square, and since this pattern is flat, we can freely assume that the board is located at Z=0, with the X and Y axes well-aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. Here is one example of a 6x4 calibration pattern image: The good thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (the number of horizontal and vertical inner corner points). The function will return the position of these chessboard corners on the image. If the function fails to find the pattern, then it simply returns false: // output vectors of image points std::vector<cv::Point2f> imageCorners; // number of inner corners on the chessboard cv::Size boardSize(6,4); // Get the chessboard corners bool found = cv::findChessboardCorners(image, boardSize, imageCorners); The output parameter, imageCorners, will simply contain the pixel coordinates of the detected inner corners of the shown pattern. Note that this function accepts additional parameters if you needs to tune the algorithm, which are not discussed here. 
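Those extra parameters are passed as an optional flags argument. As a rough illustration (this line is not from the recipe, and the particular combination of flags is just an example), the corner search can be made more tolerant of uneven lighting and can give up early on images that contain no chessboard at all:

// Adaptive thresholding and image normalization before the corner search,
// plus a fast check that rejects images without a visible chessboard.
bool patternFound = cv::findChessboardCorners(image, boardSize, imageCorners,
                        CV_CALIB_CB_ADAPTIVE_THRESH |
                        CV_CALIB_CB_NORMALIZE_IMAGE |
                        CV_CALIB_CB_FAST_CHECK);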
There is also a special function that draws the detected corners on the chessboard image, with lines connecting them in a sequence: //Draw the corners cv::drawChessboardCorners(image, boardSize, imageCorners, found); // corners have been found The following image is obtained: The lines that connect the points show the order in which the points are listed in the vector of detected image points. To perform a calibration, we now need to specify the corresponding 3D points. You can specify these points in the units of your choice (for example, in centimeters or in inches); however, the simplest is to assume that each square represents one unit. In that case, the coordinates of the first point would be (0,0,0) (assuming that the board is located at a depth of Z=0), the coordinates of the second point would be (1,0,0), and so on, the last point being located at (5,3,0). There are a total of 24 points in this pattern, which is too small to obtain an accurate calibration. To get more points, you need to show more images of the same calibration pattern from various points of view. To do so, you can either move the pattern in front of the camera or move the camera around the board; from a mathematical point of view, this is completely equivalent. The OpenCV calibration function assumes that the reference frame is fixed on the calibration pattern and will calculate the rotation and translation of the camera with respect to the reference frame. Let's now encapsulate the calibration process in a CameraCalibrator class. The attributes of this class are as follows: class CameraCalibrator { // input points: // the points in world coordinates std::vector<std::vector<cv::Point3f>> objectPoints; // the point positions in pixels std::vector<std::vector<cv::Point2f>> imagePoints; // output Matrices cv::Mat cameraMatrix; cv::Mat distCoeffs; // flag to specify how calibration is done int flag; Note that the input vectors of the scene and image points are in fact made of std::vector of point instances; each vector element is a vector of the points from one view. 
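The addPoints helper that the following listing relies on is not shown in this excerpt. A minimal version consistent with the attributes above would look something like the following sketch (an assumption on our part, not the book's own listing):

// Add detected corners and the matching scene points for one view.
void CameraCalibrator::addPoints(const std::vector<cv::Point2f>& imageCorners,
                                 const std::vector<cv::Point3f>& objectCorners) {
    // One entry per processed view, matching the nested vector structure
    // described in the note above.
    imagePoints.push_back(imageCorners);
    objectPoints.push_back(objectCorners);
}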
Here, we decided to add the calibration points by specifying a vector of the chessboard image filename as input: // Open chessboard images and extract corner points int CameraCalibrator::addChessboardPoints( const std::vector<std::string>& filelist, cv::Size & boardSize) { // the points on the chessboard std::vector<cv::Point2f> imageCorners; std::vector<cv::Point3f> objectCorners; // 3D Scene Points: // Initialize the chessboard corners // in the chessboard reference frame // The corners are at 3D location (X,Y,Z)= (i,j,0) for (int i=0; i<boardSize.height; i++) { for (int j=0; j<boardSize.width; j++) { objectCorners.push_back(cv::Point3f(i, j, 0.0f)); } } // 2D Image points: cv::Mat image; // to contain chessboard image int successes = 0; // for all viewpoints for (int i=0; i<filelist.size(); i++) { // Open the image image = cv::imread(filelist[i],0); // Get the chessboard corners bool found = cv::findChessboardCorners( image, boardSize, imageCorners); // Get subpixel accuracy on the corners cv::cornerSubPix(image, imageCorners, cv::Size(5,5), cv::Size(-1,-1), cv::TermCriteria(cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS, 30, // max number of iterations 0.1)); // min accuracy //If we have a good board, add it to our data if (imageCorners.size() == boardSize.area()) { // Add image and scene points from one view addPoints(imageCorners, objectCorners); successes++; } } return successes; } The first loop inputs the 3D coordinates of the chessboard, and the corresponding image points are the ones provided by the cv::findChessboardCorners function. This is done for all the available viewpoints. Moreover, in order to obtain a more accurate image point location, the cv::cornerSubPix function can be used, and as the name suggests, the image points will then be localized at a subpixel accuracy. The termination criterion that is specified by the cv::TermCriteria object defines the maximum number of iterations and the minimum accuracy in subpixel coordinates. The first of these two conditions that is reached will stop the corner refinement process. When a set of chessboard corners have been successfully detected, these points are added to our vectors of the image and scene points using our addPoints method. Once a sufficient number of chessboard images have been processed (and consequently, a large number of 3D scene point / 2D image point correspondences are available), we can initiate the computation of the calibration parameters as follows: // Calibrate the camera // returns the re-projection error double CameraCalibrator::calibrate(cv::Size &imageSize) { //Output rotations and translations std::vector<cv::Mat> rvecs, tvecs; // start calibration return calibrateCamera(objectPoints, // the 3D points imagePoints, // the image points imageSize, // image size cameraMatrix, // output camera matrix distCoeffs, // output distortion matrix rvecs, tvecs, // Rs, Ts flag); // set options } In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. These will be described in the next section. How it works... In order to explain the result of the calibration, we need to go back to the figure in the introduction, which describes the pin-hole camera model. More specifically, we want to demonstrate the relationship between a point in 3D at the position (X,Y,Z) and its image (x,y) on a camera specified in pixel coordinates. 
Let's redraw this figure by adding a reference frame that we position at the center of the projection as seen here: Note that the y axis is pointing downward to get a coordinate system compatible with the usual convention that places the image origin at the upper-left corner. We learned previously that the point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z). Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by the pixel's width (px) and height (py), respectively. Note that by dividing the focal length given in world units (generally given in millimeters) by px, we obtain the focal length expressed in (horizontal) pixels. Let's then define this term as fx. Similarly, fy =f/py is defined as the focal length expressed in vertical pixel units. Therefore, the complete projective equation is as follows: Recall that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. These equations can be rewritten in the matrix form through the introduction of homogeneous coordinates, in which 2D points are represented by 3-vectors and 3D points are represented by 4-vectors (the extra coordinate is simply an arbitrary scale factor, S, that needs to be removed when a 2D coordinate needs to be extracted from a homogeneous 3-vector). Here is the rewritten projective equation: The second matrix is a simple projection matrix. The first matrix includes all of the camera parameters, which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that returns the value of the intrinsic parameters given by a calibration matrix. More generally, when the reference frame is not at the projection center of the camera, we will need to add a rotation vector (a 3x3 matrix) and a translation vector (a 3x1 matrix). These two matrices describe the rigid transformation that must be applied to the 3D points in order to bring them back to the camera reference frame. Therefore, we can rewrite the projection equation in its most general form: Remember that in our calibration example, the reference frame was placed on the chessboard. Therefore, there is a rigid transformation (made of a rotation component represented by the matrix entries r1 to r9 and a translation represented by t1, t2, and t3) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration, and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system. The intrinsic parameters of our test camera obtained from a calibration based on 20 chessboard images are fx=167, fy=178, u0=156, and v0=119. These results are obtained by cv::calibrateCamera through an optimization process aimed at finding the intrinsic and extrinsic parameters that will minimize the difference between the predicted image point position, as computed from the projection of the 3D scene points, and the actual image point position, as observed on the image. The sum of this difference for all the points specified during the calibration is called the re-projection error. Let's now turn our attention to the distortion parameters. So far, we have mentioned that under the pin-hole camera model, we can neglect the effect of the lens. 
However, this is only possible if the lens that is used to capture an image does not introduce important optical distortions. Unfortunately, this is not the case with lower quality lenses or with lenses that have a very short focal length. You may have already noted that the chessboard pattern shown in the image that we used for our example is clearly distorted—the edges of the rectangular board are curved in the image. Also, note that this distortion becomes more important as we move away from the center of the image. This is a typical distortion observed with a fish-eye lens, and it is called radial distortion. The lenses used in common digital cameras usually do not exhibit such a high degree of distortion, but in the case of the lens used here, these distortions certainly cannot be ignored. It is possible to compensate for these deformations by introducing an appropriate distortion model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be reverted in order to undo the distortions visible on the image. Fortunately, the exact parameters of the transformation that will correct the distortions can be obtained together with the other camera parameters during the calibration phase. Once this is done, any image from the newly calibrated camera will be undistorted. Therefore, we have added an additional method to our calibration class: // remove distortion in an image (after calibration) cv::Mat CameraCalibrator::remap(const cv::Mat &image) { cv::Mat undistorted; if (mustInitUndistort) { // called once per calibration cv::initUndistortRectifyMap( cameraMatrix, // computed camera matrix distCoeffs, // computed distortion matrix cv::Mat(), // optional rectification (none) cv::Mat(), // camera matrix to generate undistorted image.size(), // size of undistorted CV_32FC1, // type of output map map1, map2); // the x and y mapping functions mustInitUndistort= false; } // Apply mapping functions cv::remap(image, undistorted, map1, map2, cv::INTER_LINEAR); // interpolation type return undistorted; } Running this code results in the following image: As you can see, once the image is undistorted, we obtain a regular perspective image. To correct the distortion, OpenCV uses a polynomial function that is applied to the image points in order to move them at their undistorted position. By default, five coefficients are used; a model made of eight coefficients is also available. Once these coefficients are obtained, it is possible to compute two cv::Mat mapping functions (one for the x coordinate and one for the y coordinate) that will give the new undistorted position of an image point on a distorted image. This is computed by the cv::initUndistortRectifyMap function, and the cv::remap function remaps all the points of an input image to a new image. Note that because of the nonlinear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you will now obtain output pixels that have no values in the input image (they will then be displayed as black pixels). There's more... More options are available when it comes to camera calibration. Calibration with known intrinsic parameters When a good estimate of the camera's intrinsic parameters is known, it could be advantageous to input them in the cv::calibrateCamera function. They will then be used as initial values in the optimization process. 
To do so, you just need to add the CV_CALIB_USE_INTRINSIC_GUESS flag and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (CV_CALIB_FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (CV_CALIB_FIX_ASPECT_RATIO); in that case, you assume that the pixels are square.

Using a grid of circles for calibration

Instead of the usual chessboard pattern, OpenCV also offers the possibility to calibrate a camera by using a grid of circles. In this case, the centers of the circles are used as calibration points. The corresponding function is very similar to the function we used to locate the chessboard corners:

cv::Size boardSize(7,7);
std::vector<cv::Point2f> centers;
bool found = cv::findCirclesGrid(image, boardSize, centers);

See also

A flexible new technique for camera calibration by Z. Zhang, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000, is a classic paper on the problem of camera calibration.

Summary

In this article, we explored the projective relations that exist between two images of the same scene.

Resources for Article: Further resources on this subject: Creating an Application from Scratch [Article] Wrapping OpenCV [Article] New functionality in OpenCV 3.0 [Article]

Report Data Filtering

Packt
25 Aug 2014
13 min read
In this article, written by Yoav Yahav, author of the book SAP BusinessObjects Reporting Cookbook, we will cover the following recipes: Applying a simple filter Working with the filter bar Using input controls Working with an element link (For more resources related to this topic, see here.) Filtering data can be done in several ways. We can filter the results at the query level when there is a requirement to use a mandatory filter or set of filters that will fetch only specific types of rows that will correspond to the business question; otherwise, the report won't be accurate or useful. The other level of filtering is performed at the report level. This level of filtering interacts with the data that was retrieved by the user and enables us to eliminate irrelevant rows. The main question that arises when using a report-level filter is why shouldn't we implement filters in the query level? Well, the answer has various reasons, which are as follows: We need to compare and analyze just a part of the entire data that the query retrieved (for example, filtering the first quarter's data out of the current year's entire dataset) We need to view the data separately, for example, each tab can be filtered by a different value (for example, each report tab can display a different region's data) We need to filter measure objects that are different from the aggregative level of the query; for example, we have retrieved a well-detailed query displaying sales of various products at the customer level, but we also need to display only the products that had income of more than one million dollars in another report tab The business user requires interactive functionality from the filter: a drop-down box, a checklist, a spinner, or a slider—capabilities that can't be performed by a query filter We need to perform additional calculations on a variable in the report and apply a filter to it In this article, we will explore the different types of filters that can be applied in reports: simple ones, interactive ones, and filters that can combine interactivity and a custom look and feel adjusted by the business user. Applying a simple filter The first type of filter is a basic one that enables us to implement quick and simple filter logic, which is similar to the way we build it on the query panel. Getting ready We have created a query that retrieves a dataset displaying the Net Sales by Product, Line, and Year. Using a simple filter, we would like to filter only the year 2008 records as well as the Sports Line. How to do it... Perform the following steps to apply a simple filter: We will navigate to the Analysis toolbar, and in the Filters tab, click on the Filter icon and choose Add Filter, as shown in the following screenshot: In the Report Filter window, as shown in the following screenshot, we will be able to add filters, edit them, and apply them on a specific table or on the entire report tab: By clicking on the Add filter icon located in the top-left corner, we will be able to add a condition. 
Clicking on this button will open the list of existing objects in the report; by choosing the Year object, we will add our first filter, as shown in the following screenshot: After we choose the Year object, a filter condition structure will appear in the top-right corner of the window, enabling us to pick an operator and a value similar to the way we establish query filters, as shown in the following screenshot: We will add a second filter as well using the Add filter button and adding the Line object to the filter area. The AND operator will appear between the two filters, establishing an intersection relationship between them. This operator can be easily changed to the OR operator by clicking on it. The table will be affected accordingly and will display only the year 2008 and the Sports Line records, as shown: In order to edit the filter, we can either access it through the Analysis toolbar or mark one of the filtered columns, enabling us to get an easier edit using the toolbar or the right-click menu, as shown in the following screenshot: How it works... The report filter simply corresponds to the values defined in the filtered set of conditions that are established by simple and easy use of the filter panel. The filters can be applied on any type of data display, table, or chart. Like the query filters, the report filters use the logic of the operators AND/OR as well and can be used by clicking on the operator name. In order to view the filters that have been applied to the report tabs and tables, you can navigate to the document structure and filter's left-hand side panel and click on the Filter button. There's more... Filters can be applied on a specific table in the report or on the entire report. In order to switch between these options, when you create the filter, you need to mark the report area. To create a report-level filter or a specific column in a table, you need to filter a specific table in the report tab. Working with the filter bar Another great functionality that filters can provide us is interaction with the report data. There are cases when we are required to perform quick filtering as well as switch dynamically between values as we need to analyze different filtered datasets. Working with the filter bar can address these requirements simply and easily. Getting ready We want to perform a quick dynamic filtering on our existing table by adding the Country dimension object to the filter bar. How to do it.... Perform the following steps: By navigating to the Analysis toolbar and then to the Interact tab, we will click on the Filter Bar icon: By doing so, a gray filter pane area will appear under the formula bar with a guiding message saying Drop objects here to add report filters, as shown in the following screenshot: In order to create a new filter, we can either drag an object directly to the filter pane area from the Available Objects pane or use the quick add icon located in the filter bar on the left-hand side of the message. In our scenario, we will use the Available Objects pane and drag the Country dimension object directly to the filter bar: By adding the Country object to the filter bar, a quick drop-down list filter will be created, enabling us to filter the table data by choosing any country value: This filter will enable us to quickly create filtered sets of data using the drop-down list as well as using an interactive component that doesn't have to be in the table. How it works... 
The filter bar is an interactive component that enables us to create as many dynamic filters as we need and to locate them in a single filtering area for easy control of the data, and they aren't even required to appear in the table itself. The filter bar is restricted to filter only single values; in order to filter several values, we will need to either use a different type of filter, such as input control, or create a grouped values logic. There's more... When we drop several dimension objects onto the filter pane, they will be displayed accordingly; however, a cascading effect of filters (picking a specific country in the first filter and in the second filter seeing only that country) will be supported only if hierarchies have been defined in the universe. Using input controls Input controls are another type of filter that enable us to interact with the report data. An input control performs as an interactive left-hand side panel, which can be created in various types that have a different look and feel as well as a different functionality. We can use an input control to address the following tasks: Applying a different look and feel to the filter—making filters more intuitive and easy to operate (using radio buttons, comboboxes, and other filter types) Applying multiple values Applying dynamic filters to measure values using input control components, such as spinners and sliders Enabling a default value option and a custom list of values Getting ready In this example, we will filter several components in the report area, a chart and a table, using the Region dimension object. We will be using the multiple value option to enhance the filter functionality. First, we will navigate to the input control panel located in the left-hand side area as the third option from the top and click on the New option. How to do it... Perform the following steps: We will choose the object that we need to filter with the table and the chart, as shown in the following screenshot: After choosing the Region object, we will move on to the Define Input Control window. Input controls enable multiple-value functionality, and in the Choose Control Type window, we will choose the Check boxes input control type, as shown in the following screenshot: In the input control properties located at the right-hand side, we can also add a description to the input control, set a default value, and customize the list of values if we need specific values. After choosing the Check boxes component, we will advance to the next window, choosing the data element we want to apply the control on. We will tick both of the components, the chart and the table, in order to affect all the data components using a single control, as shown in the following screenshot: By clicking on the Finish button, the input control will appear at the left-hand side: We can easily change the selected values to all values (Select All), one, or several values, filtering both of the tables as shown: How it works... As we have seen, input controls act as special interactive filters that can be used by picking one of them from the input control templates, that is, the type that is the most suitable to filter the data in the report. Our main consideration when choosing an input control is to determine the type of list we need to pick: single or multiple. 
The second consideration should be the interactive functionality that we need from such a control: a simple value pick or perhaps an arithmetic operator, such as a greater or less than operator, which can be applied to a measure object. There's more... An input control can also be created using the Analysis toolbar and the Filter tab. In order to edit the existing input control, we can access the mini toolbar above the input control. Here, we will be able to edit the control, show its dependencies (the elements that are affected by it), or delete it, as shown in the following screenshot: We can also display a single informative cell describing which table and report a filter has been applied on. This useful option can be applied by navigating to the Report Element toolbar, choosing Report Filter Summary from the Cell subtoolbar, and dragging it to the report area, as shown in the following screenshot: By clicking on the Map button, we will switch to a graphical tree view of the input control, showing the values that were picked in the filter as well as its dependencies: If you need to display the values of the input control in the report area, simply drag the object that you used in the control to the report area, turn it into a horizontal table, and then edit the dependencies of the control so that it will be applied on the new table as well. Working with an element link An element link is a feature designed to pass a value from an existing table or a chart to another data component in the report area. Element links transform the values in a table or a chart into dynamic values that can filter other data elements. The main difference between element links and other types of filtering is that when using an element link, we are actually using and working within a table, keeping its structure and passing a parameter from it to another structure. This feature can be great to work with when we are using a detailed table and want to use its values to filter another chart that will visualize the summarized data and vice versa. How to do it... Perform the following steps: We will pass the Country value from the detailed table to the line quantity sales pie chart, enabling the business user to filter the pie dynamically while working with the detailed table. By clicking on the Country column, we will navigate in the speed menu to Linking | Add Element Link, as shown in the following screenshot: In the next window, we will choose the passing method and decide whether to pass the entire row values or a Single object value. In our example, we will use the Single object option, as shown in the following screenshot: In the next screen, we will be able to add a description of our choice to the element link. And finally, we will define the dependencies via the report elements that we want to pass the country value to, as shown in the following screenshot: By clicking on the Finish button, we will switch to the report view, and by marking the Country column or any other column, we will be able to pass the Country value to the pie chart, as shown in the following screenshot: By clicking on a different Country value, such as Colombia, we will be able to pass it to the pie and filter the results accordingly: Notice that the pie chart results have changed and that the country value is marked in bold inside the tooltip box, showing the column that was actually used to pass the value. How it works... The element link simply links the tables and other data components. 
It is actually a type of input control designed to work directly from a table rather than a filter component panel or a bar. By clicking on any country value, we simply pass it to the dependency component that uses the value as an input in order to present the relevant data. There's more... An element link can be edited and adjusted in a way similar to the way in which an input control is edited. By right-clicking on the Element Link icon located on the header of the rightmost column, we will be able to edit it, as shown in the following screenshot: Another good way to view the element link status and edit it is to switch to the Input Controls panel where you can view it as well, as shown in the following screenshot: Summary In this article, we came to know about the filtering techniques we can apply to the report tables and charts. Resources for Article: Further resources on this subject: Exporting SAP BusinessObjects Dashboards into Different Environments [Article] SAP BusinessObjects: Customizing the Dashboard [Article] SAP HANA Architecture [Article]

Recommender systems dissected

Packt
21 Aug 2014
4 min read
In this article by Rik Van Bruggen, the author of Learning Neo4j, we will take a look at so-called recommender systems. Such systems, broadly speaking, consist of two elementary parts: (For more resources related to this topic, see here.) A pattern discovery system: This is the system that somehow figures out what would be a useful recommendation for a particular target group. This discovery can be done in many different ways, but in general we see three ways to do so: A business expert who thoroughly understands the domain of the graph database application will use this understanding to determine useful recommendations. For example, the supervisor of a "do-it-yourself" retail outlet would understand a particular pattern. Suppose that if someone came in to buy multiple pots of paint, they would probably also benefit from getting a gentle recommendation for a promotion of high-end brushes. The store has that promotion going on right then, so the recommendation would be very timely. This process would be a discovery process where the pattern that the business expert has discovered would be applied and used in graph databases in real time as part of a sophisticated recommendation. A visual discovery of a specific pattern in the graph representation of the business domain. We have found that in many different projects, business users have stumbled upon these kinds of patterns while using graph visualizations to look at their data. Specific patterns emerge, unexpected relations jump out, or, in more advanced visualization solutions, specific clusters of activity all of a sudden become visible and require further investigation. Graph databases such as Neo4j, and the visualization solutions that complement it, can play a wonderfully powerful role in this process. An algorithmic discovery of a pattern in a dataset of the business domain using machine learning algorithms to uncover previously unknown patterns in the data. Typically, these processes require an iterative, raw number-crunching approach that has also been used on non-graph data formats in the past. It remains part-art part-science at this point, but can of course yield interesting insights if applied correctly. An overview of recommender systems A pattern application system: All recommender systems will be more or less successful based on not just the patterns that they are able to discover, but also based on the way that they are able to apply these patterns in business applications. We can look at different types of applications for these patterns: Batch-oriented applications: Some applications of these patterns are not as time-critical as one would expect. It does not really matter in any material kind of way if a bulk e-mail with recommendations, or worse, a printed voucher with recommended discounted products, gets delivered to the customer or prospect at 10 am or 11 am. Batch solutions can usually cope with these kinds of requests, even if they do so in the most inefficient way. Real-time oriented applications: Some pattern applications simply have to be delivered in real time, in between a web request and a web response, and cannot be precalculated. For these types of systems, which typically use more complex database queries in order to match for the appropriate recommendation to make, graph databases such as Neo4j are a fantastic tool to have. We will illustrate this going forward. With this classification behind us, we will look at an example dataset and some example queries to make this topic come alive. 
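To give a flavor of the real-time case before we build the dataset, the query such a system runs between a web request and its response looks roughly like the following Cypher sketch. The labels, property names, and relationship type used here (Person, name, BOUGHT) are illustrative placeholders rather than the dataset we are about to create:

// People who bought what this person bought also bought...
MATCH (me:Person {name: 'Alice'})-[:BOUGHT]->(:Product)
      <-[:BOUGHT]-(:Person)-[:BOUGHT]->(rec:Product)
WHERE NOT (me)-[:BOUGHT]->(rec)
RETURN rec.name AS recommendation, count(*) AS strength
ORDER BY strength DESC
LIMIT 5;

Everything hinges on traversing from one customer to similar customers and back out to products, which is exactly the kind of traversal a graph database is designed to answer quickly.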
Using a graph model for recommendations

We will be using a very specific data model for our recommender system. All we have changed is that we added a couple of products and brands to the model, and inserted the corresponding data into the database. In total, we added the following:

Ten products
Three product brands
Fifty relationships between existing person nodes and the mentioned products, indicating that these persons bought these products

These are the products and brands that we added:

Adding products and brands to the dataset

The following diagram shows the resulting model:

In Neo4j, that model will look something like the following:

A dataset like this one, while of course a broad simplification, offers us some interesting possibilities for a recommender system. Let's take a look at some queries that could really match this use case, and that would allow us to exploit the data in this dataset, either visually or in real time, in a product recommendation application.

Summary

This article dissected the so-called recommender systems.

Resources for Article: Further resources on this subject: Working with a Neo4j Embedded Database [article] Comparative Study of NoSQL Products [article] Getting Started with CouchDB and Futon [article]

Using Canvas and D3

Packt
21 Aug 2014
9 min read
This article by Pablo Navarro Castillo, author of Mastering D3.js, includes that most of the time, we use D3 to create SVG-based charts and visualizations. If the number of elements to render is huge, or if we need to render raster images, it can be more convenient to render our visualizations using the HTML5 canvas element. In this article, we will learn how to use the force layout of D3 and the canvas element to create an animated network visualization with random data. (For more resources related to this topic, see here.) Creating figures with canvas The HTML canvas element allows you to create raster graphics using JavaScript. It was first introduced in HTML5. It enjoys more widespread support than SVG, and can be used as a fallback option. Before diving deeper into integrating canvas and D3, we will construct a small example with canvas. The canvas element should have the width and height attributes. This alone will create an invisible figure of the specified size: <!— Canvas Element --> <canvas id="canvas-demo" width="650px" height="60px"></canvas> If the browser supports the canvas element, it will ignore any element inside the canvas tags. On the other hand, if the browser doesn't support the canvas, it will ignore the canvas tags, but it will interpret the content of the element. This behavior provides a quick way to handle the lack of canvas support: <!— Canvas Element --> <canvas id="canvas-demo" width="650px" height="60px"> <!-- Fallback image --> <img src="img/fallback-img.png" width="650" height="60"></img> </canvas> If the browser doesn't support canvas, the fallback image will be displayed. Note that unlike the <img> element, the canvas closing tag (</canvas>) is mandatory. To create figures with canvas, we don't need special libraries; we can create the shapes using the canvas API: <script> // Graphic Variables var barw = 65, barh = 60; // Append a canvas element, set its size and get the node. var canvas = document.getElementById('canvas-demo'); // Get the rendering context. var context = canvas.getContext('2d'); // Array with colors, to have one rectangle of each color. var color = ['#5c3566', '#6c475b', '#7c584f', '#8c6a44', '#9c7c39', '#ad8d2d', '#bd9f22', '#cdb117', '#ddc20b', '#edd400']; // Set the fill color and render ten rectangles. for (var k = 0; k < 10; k += 1) { // Set the fill color. context.fillStyle = color[k]; // Create a rectangle in incremental positions. context.fillRect(k * barw, 0, barw, barh); } </script> We use the DOM API to access the canvas element with the canvas-demo ID and to get the rendering context. Then we set the color using the fillStyle method, and use the fillRect canvas method to create a small rectangle. Note that we need to change fillStyle every time or all the following shapes will have the same color. The script will render a series of rectangles, each one filled with a different color, shown as follows: A graphic created with canvas Canvas uses the same coordinate system as SVG, with the origin in the top-left corner, the horizontal axis augmenting to the right, and the vertical axis augmenting to the bottom. Instead of using the DOM API to get the canvas node, we could have used D3 to create the node, set its attributes, and created scales for the color and position of the shapes. Note that the shapes drawn with canvas don't exists in the DOM tree; so, we can't use the usual D3 pattern of creating a selection, binding the data items, and appending the elements if we are using canvas. 
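The earlier remark about letting D3 create the node and compute scales, while canvas does the drawing, can be made concrete with a small sketch. It is not part of the article's example: it assumes D3 3.x and an empty div with the ID canvas-demo2, and it redraws the same ten bars using an ordinal color scale instead of the hardcoded color array:

// Create the canvas node with D3; node() returns the raw DOM element.
var canvas = d3.select('#canvas-demo2').append('canvas')
    .attr('width', 650)
    .attr('height', 60)
    .node();

var context = canvas.getContext('2d');

// Let D3 compute the colors and the horizontal positions.
var color = d3.scale.ordinal()
    .domain(d3.range(10))
    .range(['#5c3566', '#6c475b', '#7c584f', '#8c6a44', '#9c7c39',
            '#ad8d2d', '#bd9f22', '#cdb117', '#ddc20b', '#edd400']);

var xScale = d3.scale.linear()
    .domain([0, 10])
    .range([0, 650]);

d3.range(10).forEach(function(k) {
    context.fillStyle = color(k);
    context.fillRect(xScale(k), 0, xScale(k + 1) - xScale(k), 60);
});

The drawing itself is still plain canvas calls; D3 only supplies the arithmetic for positions and colors.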
Creating shapes Canvas has fewer primitives than SVG. In fact, almost all the shapes must be drawn with paths, and more steps are needed to create a path. To create a shape, we need to open the path, move the cursor to the desired location, create the shape, and close the path. Then, we can draw the path by filling the shape or rendering the outline. For instance, to draw a red semicircle centered in (325, 30) and with a radius of 20, write the following code: // Create a red semicircle. context.beginPath(); context.fillStyle = '#ff0000'; context.moveTo(325, 30); context.arc(325, 30, 20, Math.PI / 2, 3 * Math.PI / 2); context.fill(); The moveTo method is a bit redundant here, because the arc method moves the cursor implicitly. The arguments of the arc method are the x and y coordinates of the arc center, the radius, and the starting and ending angle of the arc. There is also an optional Boolean argument to indicate whether the arc should be drawn counterclockwise. A basic shape created with the canvas API is shown in the following screenshot: Integrating canvas and D3 We will create a small network chart using the force layout of D3 and canvas instead of SVG. To make the graph looks more interesting, we will randomly generate the data. We will generate 250 nodes sparsely connected. The nodes and links will be stored as the attributes of the data object: // Number of Nodes var nNodes = 250, createLink = false; // Dataset Structure var data = {nodes: [],links: []}; We will append nodes and links to our dataset. We will create nodes with a radius attribute randomly assigning it a value of either 2 or 4 as follows: // Iterate in the nodes for (var k = 0; k < nNodes; k += 1) { // Create a node with a random radius. data.nodes.push({radius: (Math.random() > 0.3) ? 2 : 4}); // Create random links between the nodes. } We will create a link with probability of 0.1 only if the difference between the source and target indexes are less than 8. The idea behind this way to create links is to have only a few connections between the nodes: // Create random links between the nodes. for (var j = k + 1; j < nNodes; j += 1) { // Create a link with probability 0.1 createLink = (Math.random() < 0.1) && (Math.abs(k - j) < 8); if (createLink) { // Append a link with variable distance between the nodes data.links.push({ source: k, target: j, dist: 2 * Math.abs(k - j) + 10 }); } } We will use the radius attribute to set the size of the nodes. The links will contain the distance between the nodes and the indexes of the source and target nodes. We will create variables to set the width and height of the figure: // Figure width and height var width = 650, height = 300; We can now create and configure the force layout. As we did in the previous section, we will set the charge strength to be proportional to the area of each node. This time, we will also set the distance between the links, using the linkDistance method of the layout: // Create and configure the force layout var force = d3.layout.force() .size([width, height]) .nodes(data.nodes) .links(data.links) .charge(function(d) { return -1.2 * d.radius * d.radius; }) .linkDistance(function(d) { return d.dist; }) .start(); We can create a canvas element now. Note that we should use the node method to get the canvas element, because the append and attr methods will both return a selection, which don't have the canvas API methods: // Create a canvas element and set its size. 
var canvas = d3.select('div#canvas-force').append('canvas') .attr('width', width + 'px') .attr('height', height + 'px') .node(); We get the rendering context. Each canvas element has its own rendering context. We will use the '2d' context, to draw two-dimensional figures. At the time of writing this, there are some browsers that support the webgl context; more details are available at https://developer.mozilla.org/en-US/docs/Web/WebGL/Getting_started_with_WebGL. Refer to the following '2d' context: // Get the canvas context. var context = canvas.getContext('2d'); We register an event listener for the force layout's tick event. As canvas doesn't remember previously created shapes, we need to clear the figure and redraw all the elements on each tick: force.on('tick', function() { // Clear the complete figure. context.clearRect(0, 0, width, height); // Draw the links ... // Draw the nodes ... }); The clearRect method cleans the figure under the specified rectangle. In this case, we clean the entire canvas. We can draw the links using the lineTo method. We iterate through the links, beginning a new path for each link, moving the cursor to the position of the source node, and by creating a line towards the target node. We draw the line with the stroke method: // Draw the links data.links.forEach(function(d) { // Draw a line from source to target. context.beginPath(); context.moveTo(d.source.x, d.source.y); context.lineTo(d.target.x, d.target.y); context.stroke(); }); We iterate through the nodes and draw each one. We use the arc method to represent each node with a black circle: // Draw the nodes data.nodes.forEach(function(d, i) { // Draws a complete arc for each node. context.beginPath(); context.arc(d.x, d.y, d.radius, 0, 2 * Math.PI, true); context.fill(); }); We obtain a constellation of disconnected network graphs, slowly gravitating towards the center of the figure. Using the force layout and canvas to create a network chart is shown in the following screenshot: We can think that to erase all the shapes and redraw each shape again and again could have a negative impact on the performance. In fact, sometimes it's faster to draw the figures using canvas, because this way the browser doesn't have to manage the DOM tree of the SVG elements (but it still have to redraw them if the SVG elements are changed). Summary In this article, we will learn how to use the force layout of D3 and the canvas element to create an animated network visualization with random data. Resources for Article: Further resources on this subject: Interacting with Data for Dashboards [Article] Kendo UI DataViz – Advance Charting [Article] Visualizing my Social Graph with d3.js [Article]

Importing Dynamic Data

Packt
20 Aug 2014
19 min read
In this article by Chad Adams, author of the book Learning Python Data Visualization, we will go over the finer points of pulling data from the Web using the Python language and its built-in libraries and cover parsing XML, JSON, and JSONP data. (For more resources related to this topic, see here.) Since we now have an understanding of how to work with the pygal library and building charts and graphics in general, this is the time to start looking at building an application using Python. In this article, we will take a look at the fundamentals of pulling data from the Web, parsing the data, and adding it to our code base and formatting the data into a useable format, and we will look at how to carry those fundamentals over to our Python code. We will also cover parsing XML and JSON data. Pulling data from the Web For many non-developers, it may seem like witchcraft that developers are magically able to pull data from an online resource and integrate that with an iPhone app, or a Windows Store app, or pull data to a cloud resource that is able to generate various versions of the data upon request. To be fair, they do have a general understanding; data is pulled from the Web and formatted to their app of choice. They just may not get the full background of how that process workflow happens. It's the same case with some developers as well—many developers mainly work on a technology that only works on a locked down environment, or generally, don't use the Internet for their applications. Again, they understand the logic behind it; somehow an RSS feed gets pulled into an application. In many languages, the same task is done in various ways, usually depending on which language is used. Let's take a look at a few examples using Packt's own news RSS feed, using an iOS app pulling in data via Objective-C. Now, if you're reading this and not familiar with Objective-C, that's OK, the important thing is that we have the inner XML contents of an XML file showing up in an iPhone application: #import "ViewController.h" @interfaceViewController () @property (weak, nonatomic) IBOutletUITextView *output; @end @implementationViewController - (void)viewDidLoad { [super viewDidLoad]; // Do any additional setup after loading the view, typically from a nib. NSURL *packtURL = [NSURLURLWithString:@"http://www.packtpub.com/rss.xml"]; NSURLRequest *request = [NSURLRequestrequestWithURL:packtURL]; NSURLConnection *connection = [[NSURLConnectionalloc] initWithRequest:requestdelegate:selfstartImmediately:YES]; [connection start]; } - (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data { NSString *downloadstring = [[NSStringalloc] initWithData:dataencoding:NSUTF8StringEncoding]; [self.outputsetText:downloadstring]; } - (void)didReceiveMemoryWarning { [superdidReceiveMemoryWarning]; // Dispose of any resources that can be recreated. } @end Here, we can see in iPhone Simulator that our XML output is pulled dynamically through HTTP from the Web to our iPhone simulator. This is what we'll want to get started with doing in Python: The XML refresher Extensible Markup Language (XML) is a data markup language that sets a series of rules and hierarchy to a data group, which is stored as a static file. Typically, servers update these XML files on the Web periodically to be reused as data sources. XML is really simple to pick up as it's similar to HTML. You can start with the document declaration in this case: <?xml version="1.0" encoding="utf-8"?> Next, a root node is set. 
A node is like an HTML tag (which is also called a node). You can tell it's a node by the brackets around the node's name. For example, here's a node named root: <root></root> Note that we close the node by creating a same-named node with a backslash. We can also add parameters to the node and assign a value, as shown in the following root node: <root parameter="value"></root> Data in XML is set through a hierarchy. To declare that hierarchy, we create another node and place that inside the parent node, as shown in the following code: <root parameter="value"> <subnode>Subnode's value</subnode> </root> In the preceding parent node, we created a subnode. Inside the subnode, we have an inner value called Subnode's value. Now, in programmatical terms, getting data from an XML data file is a process called parsing. With parsing, we specify where in the XML structure we would like to get a specific value; for instance, we can crawl the XML structure and get the inner contents like this: /root/subnode This is commonly referred to as XPath syntax, a cross-language way of going through an XML file. For more on XML and XPath, check out the full spec at: http://www.w3.org/TR/REC-xml/ and here http://www.w3.org/TR/xpath/ respectively. RSS and the ATOM Really simple syndication (RSS) is simply a variation of XML. RSS is a spec that defines specific nodes that are common for sending data. Typically, many blog feeds include an RSS option for users to pull down the latest information from those sites. Some of the nodes used in RSS include rss, channel, item, title, description, pubDate, link, and GUID. Looking at our iPhone example in this article from the Pulling data from the Web section, we can see what a typical RSS structure entails. RSS feeds are usually easy to spot since the spec requires the root node to be named rss for it to be a true RSS file. In some cases, some websites and services will use .rss rather than .xml; this is typically fine since most readers for RSS content will pull in the RSS data like an XML file, just like in the iPhone example. Another form of XML is called ATOM. ATOM was another spec similar to RSS, but developed much later than RSS. Because of this, ATOM has more features than RSS: XML namespacing, specified content formats (video, or audio-specific URLs), support for internationalization, and multilanguage support, just to name a few. ATOM does have a few different nodes compared to RSS; for instance, the root node to an RSS feed would be <rss>. ATOM's root starts with <feed>, so it's pretty easy to spot the difference. Another difference is that ATOM can also end in .atom or .xml. For more on the RSS and ATOM spec, check out the following sites: http://www.rssboard.org/rss-specification http://tools.ietf.org/html/rfc4287 Understanding HTTP All these samples from the RSS feed of the Packt Publishing website show one commonality that's used regardless of the technology coded in, and that is the method used to pull down these static files is through the Hypertext Transfer Protocol (HTTP). HTTP is the foundation of Internet communication. It's a protocol with two distinct types of requests: a request for data or GET and a push of data called a POST. Typically, when we download data using HTTP, we use the GET method of HTTP in order to pull down the data. The GET request will return a string or another data type if we mention a specific type. We can either use this value directly or save to a variable. 
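Since the root node is the quickest way to tell the two formats apart, here is a tiny standalone check. It is not part of the article's example (the feed string is made up), and it uses the ElementTree parser that we will meet again in a moment:

from xml.etree import ElementTree

sample = """<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>A made-up feed</title>
  </channel>
</rss>"""

root = ElementTree.fromstring(sample)

# RSS feeds use <rss> as the root node; ATOM feeds use <feed>
# (usually namespaced, so we only look at the end of the tag name).
if root.tag == 'rss':
    print('This is an RSS feed')
elif root.tag.endswith('feed'):
    print('This is an ATOM feed')
else:
    print('Unknown feed type: ' + root.tag)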
With a POST request, we are sending values to a service that handles any incoming values; say we created a new blog post title and needed to add it to a list of current titles, a common way of doing that is with URL parameters. A URL parameter is an existing URL with a key-value pair appended to it. The following is a mock example of a POST request with a URL parameter:

http://www.yourwebsite.com/blogtitles/?addtitle=Your%20New%20Title

If our service is set up correctly, it will scan the POST request for a key of addtitle and read the value, in this case: Your New Title. We may notice %20 in the title for our request. This is an escaped character that allows us to send a value containing spaces; in this case, %20 is a placeholder for a space in our value.

Using HTTP in Python

The RSS samples from the Packt Publishing website show a few commonalities we rely on when working with HTTP in our programs: one is that we always account for the possibility of something going wrong with the connection, and another is that we always close our request when finished. Here's an example of how the same RSS feed request is done in Python using a built-in library called urllib2:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2

try:
    # Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    # Read the file to a variable we named 'xml'.
    xml = response.read()
    # Print to the console.
    print(xml)
    # Finally, close our open network connection.
    response.close()
except:
    # If we have an issue, show a message and alert the user.
    print('Unable to connect to RSS...')

If we look at the following console output, we can see the XML contents, just as we saw in our iOS code example:

In the example, notice that we wrapped our HTTP request in a try-except block. For those coming from another language, except can be considered the same as a catch statement. This allows us to detect whether an error occurs (an incorrect URL or a lack of connectivity, for example) and programmatically set an alternate result for our Python script.

Parsing XML in Python with HTTP

With these examples, including our Python version of the script, we are still just returning a string, which by itself isn't of much use when we need to grab specific values out of it. In order to grab specific strings and values from XML pulled through HTTP, we need to parse it. Luckily, Python has a built-in object in its main library for this, called ElementTree, which is part of the XML library in Python. Let's incorporate ElementTree into our example and see how that works:

# -*- coding: utf-8 -*-
import urllib2
from xml.etree import ElementTree

try:
    # Open the file via HTTP.
    response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
    tree = ElementTree.parse(response)
    root = tree.getroot()
    # Create a group of 'Element' objects from our XPath using findall.
    news_post_title = root.findall("channel//title")
    # Iterate over all the matched elements and print the inner text of each.
    for title in news_post_title:
        print title.text
    # Finally, close our open network connection.
    response.close()
except Exception as e:
    # If we have an issue, show a message and alert the user.
    print(e)

The following screenshot shows the results of our script:

As we can see, our output shows the title of each blog post. Notice how we used channel//title for our findall() method. This uses XPath, which gives us a shorthand way of describing how to walk through an XML structure. It works like this; let's use the http://www.packtpub.com feed as an example.
We have a root, followed by channel, then a series of title elements. The findall() method found each element and saved them as an Element type specific to the XML library ElementTree uses in Python, and saved each of those into an array. We can then use a for in loop to iterate each one and print the inner text using the text property specific to the Element type. You may notice in the recent example that I changed except with a bit of extra code and added Exception as e. This allows us to help debug issues and print them to a console or display a more in-depth feedback to the user. An Exception allows for generic alerts that the Python libraries have built-in warnings and errors to be printed back either through a console or handled with the code. They also have more specific exceptions we can use such as IOException, which is specific for working with file reading and writing. About JSON Now, another data type that's becoming more and more common when working with web data is JSON. JSON is an acronym for JavaScript Object Notation, and as the name implies, is indeed true JavaScript. It has become popular in recent years with the rise of mobile apps, and Rich Internet Applications (RIA). JSON is great for JavaScript developers; it's easier to work with when working in HTML content, compared to XML. Because of this, JSON is becoming a more common data type for web and mobile application development. Parsing JSON in Python with HTTP To parse JSON data in Python is a pretty similar process; however, in this case, our ElementTree library isn't needed, since that only works with XML. We need a library designed to parse JSON using Python. Luckily, the Python library creators already have a library for us, simply called json. Let's build an example similar to our XML code using the json import; of course, we need to use a different data source since we won't be working in XML. One thing we may note is that there aren't many public JSON feeds to use, many of which require using a code that gives a developer permission to generate a JSON feed through a developer API, such as Twitter's JSON API. To avoid this, we will use a sample URL from Google's Books API, which will show demo data of Pride and Prejudice, Jane Austen. We can view the JSON (or download the file), by typing in the following URL: https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ Notice the API uses HTTPS, many JSON APIs are moving to secure methods of transmitting data, so be sure to include the secure in HTTP, with HTTPS. Let's take a look at the JSON output: { "kind": "books#volume", "id": "s1gVAAAAYAAJ", "etag": "yMBMZ85ENrc", "selfLink": "https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ", "volumeInfo": { "title": "Pride and Prejudice", "authors": [ "Jane Austen" ], "publisher": "C. Scribner's Sons", "publishedDate": "1918", "description": "Austen's most celebrated novel tells the story of Elizabeth Bennet, a bright, lively young woman with four sisters, and a mother determined to marry them to wealthy men. At a party near the Bennets' home in the English countryside, Elizabeth meets the wealthy, proud Fitzwilliam Darcy. Elizabeth initially finds Darcy haughty and intolerable, but circumstances continue to unite the pair. Mr. Darcy finds himself captivated by Elizabeth's wit and candor, while her reservations about his character slowly vanish. 
The story is as much a social critique as it is a love story, and the prose crackles with Austen's wry wit.", "readingModes": { "text": true, "image": true }, "pageCount": 401, "printedPageCount": 448, "dimensions": { "height": "18.00 cm" }, "printType": "BOOK", "averageRating": 4.0, "ratingsCount": 433, "contentVersion": "1.1.5.0.full.3", "imageLinks": { "smallThumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec =frontcover&img=1&zoom=5&edge=curl&imgtk=AFLRE73F8btNqKpVjGX6q7V3XS77 QA2PftQUxcEbU3T3njKNxezDql_KgVkofGxCPD3zG1yq39u0XI8s4wjrqFahrWQ- 5Epbwfzfkoahl12bMQih5szbaOw&source=gbs_api", "thumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=1&edge=curl&imgtk=AFLRE70tVS8zpcFltWh_ 7K_5Nh8BYugm2RgBSLg4vr9tKRaZAYoAs64RK9aqfLRECSJq7ATs_j38JRI3D4P48-2g_ k4-EY8CRNVReZguZFMk1zaXlzhMNCw&source=gbs_api", "small": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec =frontcover&img=1&zoom=2&edge=curl&imgtk=AFLRE71qcidjIs37x0jN2dGPstn 6u2pgeXGWZpS1ajrGgkGCbed356114HPD5DNxcR5XfJtvU5DKy5odwGgkrwYl9gC9fo3y- GM74ZIR2Dc-BqxoDuUANHg&source=gbs_api", "medium": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=3&edge=curl&imgtk=AFLRE73hIRCiGRbfTb0uNIIXKW 4vjrqAnDBSks_ne7_wHx3STluyMa0fsPVptBRW4yNxNKOJWjA4Od5GIbEKytZAR3Nmw_ XTmaqjA9CazeaRofqFskVjZP0&source=gbs_api", "large": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=4&edge=curl&imgtk=AFLRE73mlnrDv-rFsL- n2AEKcOODZmtHDHH0QN56oG5wZsy9XdUgXNnJ_SmZ0sHGOxUv4sWK6GnMRjQm2eEwnxIV4dcF9eBhghMcsx -S2DdZoqgopJHk6Ts&source=gbs_api", "extraLarge": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=6&edge=curl&imgtk=AFLRE73KIXHChszn TbrXnXDGVs3SHtYpl8tGncDPX_7GH0gd7sq7SA03aoBR0mDC4-euzb4UCIDiDNLYZUBJwMJxVX_ cKG5OAraACPLa2QLDcfVkc1pcbC0&source=gbs_api" }, "language": "en", "previewLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api", "infoLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api", "canonicalVolumeLink": "http://books.google.com/books/about/ Pride_and_Prejudice.html?hl=&id=s1gVAAAAYAAJ" }, "layerInfo": { "layers": [ { "layerId": "geo", "volumeAnnotationsVersion": "6" } ] }, "saleInfo": { "country": "US", "saleability": "FREE", "isEbook": true, "buyLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&buy=&source=gbs_api" }, "accessInfo": { "country": "US", "viewability": "ALL_PAGES", "embeddable": true, "publicDomain": true, "textToSpeechPermission": "ALLOWED", "epub": { "isAvailable": true, "downloadLink": "http://books.google.com/books/download /Pride_and_Prejudice.epub?id=s1gVAAAAYAAJ&hl=&output=epub &source=gbs_api" }, "pdf": { "isAvailable": true, "downloadLink": "http://books.google.com/books/download/Pride_and_Prejudice.pdf ?id=s1gVAAAAYAAJ&hl=&output=pdf&sig=ACfU3U3dQw5JDWdbVgk2VRHyDjVMT4oIaA &source=gbs_api" }, "webReaderLink": "http://books.google.com/books/reader ?id=s1gVAAAAYAAJ&hl=&printsec=frontcover& output=reader&source=gbs_api", "accessViewStatus": "FULL_PUBLIC_DOMAIN", "quoteSharingAllowed": false } } One downside to JSON is that it can be hard to read complex data. So, if we run across a complex JSON feed, we can visualize it using a JSON Visualizer. Visual Studio includes one with all its editions, and a web search will also show multiple online sites where you can paste JSON and an easy-to-understand data structure will be displayed. 
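If we just need a quick look from the command line, Python itself can also pretty-print the same feed. The following is a small sketch (not from the original article) that reuses the urllib2 approach shown earlier and re-serializes the response with json.dumps, whose indent parameter adds the nesting that makes the structure readable:

# A quick, local alternative to a web-based JSON visualizer.
# -*- coding: utf-8 -*-
import json
import urllib2

response = urllib2.urlopen('https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ')
data = json.loads(response.read())
response.close()

# indent=2 re-serializes the object with two-space indentation per level,
# and sort_keys orders the keys so related fields are easy to find.
print(json.dumps(data, indent=2, sort_keys=True))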
Here's an example using http://jsonviewer.stack.hu/ loading our example JSON URL:

Next, let's reuse some of our existing Python code, using the urllib2 library to request our JSON feed, and then parse it with Python's json library. Let's parse the volumeInfo node of the book, starting from the JSON root node, which is followed by volumeInfo as a subnode. Here's our example from the XML section, reworked using JSON to parse all child elements:

# -*- coding: utf-8 -*-
import urllib2
import json

try:
    # Set a URL variable.
    url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ'
    # Open the file via HTTP.
    response = urllib2.urlopen(url)
    # Read the request as one string.
    bookdata = response.read()
    # Convert the string to a JSON object in Python.
    data = json.loads(bookdata)
    for r in data['volumeInfo']:
        print r
    # Close our response.
    response.close()
except:
    # If we have an issue, show a message and alert the user.
    print('Unable to connect to JSON API...')

Here's our output. It should match the child nodes of volumeInfo, which it does in the output screen, as shown in the following screenshot:

Well done! Now, let's grab the value for title. Look at the following example and notice we have two sets of brackets: one for volumeInfo and another for title. This is similar to navigating our XML hierarchy:

# -*- coding: utf-8 -*-
import urllib2
import json

try:
    # Set a URL variable.
    url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ'
    # Open the file via HTTP.
    response = urllib2.urlopen(url)
    # Read the request as one string.
    bookdata = response.read()
    # Convert the string to a JSON object in Python.
    data = json.loads(bookdata)
    print data['volumeInfo']['title']
    # Close our response.
    response.close()
except Exception as e:
    # If we have an issue, show a message and alert the user.
    # 'Unable to connect to JSON API...'
    print(e)

The following screenshot shows the results of our script:

As you can see in the preceding screenshot, we return one line with Pride and Prejudice parsed from our JSON data.

About JSONP

JSONP, or JSON with Padding, is actually JSON, but it is set up differently compared to traditional JSON files. JSONP is a workaround for cross-domain scripting restrictions in the browser. Some web services serve up JSONP rather than pure JSON files. The issue is that JSONP isn't compatible with many Python-based JSON parsers, including the one covered here, so you will want to avoid JSONP-style JSON whenever possible.

So how can we spot JSONP files? Do they have a different extension? No, it's simply a wrapper for JSON data; here's an example without JSONP:

/*
 * Regular JSON
 */
{
    authorname: 'Chad Adams'
}

The same example with JSONP:

/*
 * JSONP
 */
callback({
    authorname: 'Chad Adams'
});

Notice we wrapped our JSON data with a function wrapper, or a callback. Typically, this is what breaks our parsers and is a giveaway that this is a JSONP-formatted JSON file. In JavaScript, we can even call it in code like this:

/*
 * Using JSONP in JavaScript
 */
callback = function (data) {
    alert(data.authorname);
};

JSONP with Python

We can get around a JSONP data source if we need to; it just requires a bit of work. We can use the str.replace() method in Python to strip out the callback before running the string through our JSON parser. If we were parsing our example JSONP file in our JSON parser example, the string handling would look something like this:

# Convert the string to a JSON object in Python.
data = json.loads(bookdata.replace('callback(', '').replace(')', ''))
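The chained replace() calls work for this feed, but they would also strip any parentheses that happen to appear inside the JSON values themselves. A slightly more defensive variant (a sketch, not from the original article; the helper name is hypothetical) is to slice the payload between the first opening and the last closing parenthesis:

def strip_jsonp(payload):
    # Return the JSON body of a JSONP response by slicing between the
    # first opening and the last closing parenthesis.
    start = payload.find('(')
    end = payload.rfind(')')
    if start == -1 or end == -1:
        # No padding found; assume the payload is already plain JSON.
        return payload
    return payload[start + 1:end]

# Usage with the earlier example:
# data = json.loads(strip_jsonp(bookdata))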
Summary

In this article, we covered HTTP concepts and methodologies for pulling strings and data from the Web. We learned how to do that in Python using the built-in urllib2 library, and we parsed XML and JSON data. We also discussed the differences between JSON and JSONP, and how to work around JSONP if needed.

Resources for Article:

Further resources on this subject:

Introspecting Maya, Python, and PyMEL [Article]
Exact Inference Using Graphical Models [Article]
Automating Your System Administration and Deployment Tasks Over SSH [Article]
Amazon DynamoDB - Modelling relationships, Error handling

Packt
20 Aug 2014
7 min read
In this article by Tanmay Deshpande, author of Mastering DynamoDB, we are going to revise our concepts about DynamoDB and try to discover more about its features and implementation. (For more resources related to this topic, see here.)

Amazon DynamoDB is a fully managed, cloud-hosted NoSQL database. It provides fast and predictable performance with the ability to scale seamlessly. It allows you to store and retrieve any amount of data, serving any level of network traffic, without any operational burden. DynamoDB gives you numerous other advantages, such as consistent and predictable performance, flexible data modeling, and durability. With just a few clicks on the Amazon Web Services console, you can create your own DynamoDB table and scale the provisioned throughput up or down without taking your application down even for a millisecond. DynamoDB uses solid state disks (SSDs) to store the data, which helps ensure the durability of the work you are doing. It also automatically replicates the data across other AWS Availability Zones, which provides built-in high availability and reliability.

Before we discuss DynamoDB in detail, let's try to understand what NoSQL databases are and when to choose DynamoDB over an RDBMS. With the rise in data volume, variety, and velocity, RDBMSes were neither designed to cope with the scale and flexibility challenges developers face when building modern applications, nor were they able to take advantage of cheap commodity hardware. We also need to provide a schema before we start adding data, which restricts developers from keeping their applications flexible. On the other hand, NoSQL databases are fast, provide flexible schema operations, and make effective use of cheap storage. Considering all these things, NoSQL is quickly becoming popular amongst the developer community. However, one has to be very cautious about when to go for NoSQL and when to stick with an RDBMS. Sticking with a relational database makes sense when you know the schema is more or less static, strong consistency is a must, and the data is not going to be that big in volume. But when you want to build an application that is Internet scalable, where the schema is likely to evolve over time, the storage is going to be really big, and the operations involved can be eventually consistent, then NoSQL is the way to go.

There are various types of NoSQL databases. The following is a list of NoSQL database types and popular examples:

Document Store: MongoDB, CouchDB, MarkLogic, and so on
Column Store: HBase, Cassandra, and so on
Key Value Store: DynamoDB, Azure, Redis, and so on
Graph Databases: Neo4J, DEX, and so on

Most of these NoSQL solutions are open source, except a few like DynamoDB and Azure, which are available as services over the Internet. DynamoDB, being a key-value store, indexes data only on primary keys, and one needs to go through the primary key to access a certain attribute. Let's start learning more about DynamoDB by having a look at its history.

DynamoDB's History

Amazon's e-commerce platform had a huge set of decoupled services, developed and managed individually, and each service had an API to be used and consumed by the others. Earlier, each service had direct database access, which was a major bottleneck. In terms of scalability, Amazon's requirements were more than any third-party vendor could provide at that time. DynamoDB was built to address Amazon's high availability, extreme scalability, and durability needs.
Earlier, Amazon used to store its production data in relational databases, and services were provided for all required operations. Later, they realized that most services access data only through the primary key and don't need complex queries to fetch the required data; besides, maintaining these RDBMS systems required high-end hardware and skilled personnel. So, to overcome these issues, Amazon's engineering team built a NoSQL database that addresses all of the aforementioned problems. In 2007, Amazon released a research paper on Dynamo, which combined the best ideas from the database and key-value store worlds and was the inspiration for many open source projects at the time; Cassandra, Voldemort, and Riak were among them. You can find the paper at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Even though Dynamo had great features that took care of all engineering needs, it was not widely accepted at that time within Amazon itself, as it was not a fully managed service. When Amazon released S3 and SimpleDB, engineering teams were quite excited to adopt those instead of Dynamo, as Dynamo was a bit expensive at that time due to SSDs. Finally, after rounds of improvement, Amazon released Dynamo as a cloud-based service, and since then it has become one of the most widely used NoSQL databases. Before being released to the public cloud in 2012, DynamoDB had been the core storage service for Amazon's e-commerce platform, starting with the shopping cart and session management services. Any downtime or degradation in performance had a major impact on Amazon's business, and any financial impact was strictly unacceptable; DynamoDB proved itself to be the best choice in the end. Now let's try to understand DynamoDB in more detail.

What is DynamoDB?

DynamoDB is a fully managed, Internet scalable, easily administered, and cost-effective NoSQL database. It is a part of the database-as-a-service offerings of Amazon Web Services. The above diagram shows how Amazon offers its various cloud services and where DynamoDB is placed. AWS RDS is a relational database as a service over the Internet from Amazon, while SimpleDB and DynamoDB are NoSQL databases as services. Both SimpleDB and DynamoDB are fully managed, non-relational services. DynamoDB is built for fast, seamless scalability and high performance. It runs on SSDs to provide faster responses and has no limits on request capacity or storage. It automatically partitions your data throughout the cluster to meet your expectations, while SimpleDB has a storage limit of 10 GB and can only take a limited number of requests per second. Also, in SimpleDB we have to manage our own partitions. So, depending on your needs, you have to choose the correct solution.

To use DynamoDB, the first and foremost requirement is having an AWS account. Through the easy-to-use AWS Management Console, you can directly create new tables, provide the necessary information, and start loading data into the tables in a few minutes.

Modelling Relationships

Like any other database, modeling relationships is quite interesting, even though DynamoDB is a NoSQL database. Most of the time, people get confused about how to model the relationships between various tables; in this section, we will make an effort to simplify this problem. To understand the relationships better, let's use our example of a book store, where we have entities such as Book, Author, Publisher, and so on.
One to One

In this type of relationship, one entity record of a table is related to only one entity record of another table. In our book store application, we have the BookInfo and BookDetails tables. The BookInfo table holds short information about the book, which can be used to display book information on a web page, whereas the BookDetails table would be used when someone explicitly needs to see all the details of a book. This design helps keep our system healthy: even if there are high requests on one table, the other table will always be up and running to fulfil what it is supposed to do. The following diagram shows what the table structure would look like.

One to many

In this type of relationship, one record from an entity is related to more than one record in another entity. In the book store application, we can have a PublisherBook table, which would keep information about the relationship between publishers and books. Here, we can have Publisher Id as the hash key and Book Id as the range key. The following diagram shows what the table structure would look like.

Many to many

A many-to-many relationship means that many records from an entity are related to many records from another entity. In the case of the book store application, we can say that a book can be authored by multiple authors and an author can write multiple books. In this case, we should use two tables, each with both hash and range keys.
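To make the one-to-many example a little more concrete, the following is a minimal sketch of how the PublisherBook table could be created with the boto3 Python SDK. This is an assumption for illustration only; the article itself defines the tables through diagrams and the AWS Management Console, and the attribute names, region, and throughput values here are hypothetical.

# A minimal sketch (not from the original article) showing how the
# one-to-many PublisherBook table could be created with boto3.
import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')

dynamodb.create_table(
    TableName='PublisherBook',
    # Publisher Id is the hash (partition) key, Book Id is the range (sort) key.
    KeySchema=[
        {'AttributeName': 'PublisherId', 'KeyType': 'HASH'},
        {'AttributeName': 'BookId', 'KeyType': 'RANGE'},
    ],
    AttributeDefinitions=[
        {'AttributeName': 'PublisherId', 'AttributeType': 'S'},
        {'AttributeName': 'BookId', 'AttributeType': 'S'},
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
)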
Sharding in Action

Packt
21 Jul 2014
9 min read
(For more resources related to this topic, see here.) Preparing the environment Before jumping into configuring and setting up the cluster network, we have to check some parameters and prepare the environment. To enable sharding for a database or collection, we have to configure some configuration servers that hold the cluster network metadata and shards information. Other parts of the cluster network use these configuration servers to get information about other shards. In production, it's recommended to have exactly three configuration servers in different machines. The reason for establishing each shard on a different server is to improve the safety of data and nodes. If one of the machines crashes, the whole cluster won't be unavailable. For the testing and developing environment, you can host all the configuration servers on a single server.Besides, we have two more parts for our cluster network, shards and mongos, or query routers. Query routers are the interface for all clients. All read/write requests are routed to this module, and the query router or mongos instance, using configuration servers, route the request to the corresponding shard. The following diagram shows the cluster network, modules, and the relation between them: It's important that all modules and parts have network access and are able to connect to each other. If you have any firewall, you should configure it correctly and give proper access to all cluster modules. Each configuration server has an address that routes to the target machine. We have exactly three configuration servers in our example, and the following list shows the hostnames: cfg1.sharding.com cfg2.sharding.com cfg3.sharding.com In our example, because we are going to set up a demo of sharding feature, we deploy all configuration servers on a single machine with different ports. This means all configuration servers addresses point to the same server, but we use different ports to establish the configuration server. For production use, all things will be the same, except you need to host the configuration servers on separate machines. In the next section, we will implement all parts and finally connect all of them together to start the sharding server and run the cluster network. Implementing configuration servers Now it's time to start the first part of our sharding. Establishing a configuration server is as easy as running a mongod instance using the --configsvr parameter. The following scheme shows the structure of the command: mongod --configsvr --dbpath <path> --port <port> If you don't pass the dbpath or port parameters, the configuration server uses /data/configdb as the path to store data and port 27019 to execute the instance. However, you can override the default values using the preceding command. If this is the first time that you have run the configuration server, you might be faced with some issues due to the existence of dbpath. Before running the configuration server, make sure that you have created the path; otherwise, you will see an error as shown in the following screenshot: You can simply create the directory using the mkdir command as shown in the following line of command: mkdir /data/configdb Also, make sure that you are executing the instance with sufficient permission level; otherwise, you will get an error as shown in the following screenshot: The problem is that the mongod instance can't create the lock file because of the lack of permission. 
To address this issue, you should simply execute the command with root or administrator permissions. After executing the command with the proper permission level, you should see a result like the following screenshot:

As you can see, we now have a configuration server for the hostname cfg1.sharding.com with port 27019 and with dbpath set to /data/configdb. There is also a web console to watch and control the configuration server, running on port 28019. By pointing the web browser to the address http://localhost:28019/, you can see the console. The following screenshot shows a part of this web console:

Now we have the first configuration server up and running. With the same method, you can launch the other instances, that is, using /data/configdb2 with port 27020 for the second configuration server, and /data/configdb3 with port 27021 for the third configuration server.

Configuring mongos instance

After configuring the configuration servers, we should bind them to the core module of clustering. The mongos instance is responsible for binding all modules and parts together to make a complete sharding core. This module is simple and lightweight, and we can host it on the same machine that hosts other modules, such as the configuration servers. It doesn't need a separate directory to store data. The mongos process uses port 27017 by default, but you can change the port using the configuration parameters. To define the configuration servers, you can use a configuration file or command-line parameters. Create a new file named mongos.conf in the /etc/ directory using your text editor and add the following configuration setting:

configdb = cfg1.sharding.com:27019,cfg2.sharding.com:27020,cfg3.sharding.com:27021

To execute and run the mongos instance, you can simply use the following command:

mongos -f /etc/mongos.conf

After executing the command, you should see an output like the following screenshot:

Please note that if you have a configuration server that has already been used in a different sharding network, you can't reuse the existing data directory. You should create a new, empty data directory for the configuration server.

Currently, we have mongos and all the configuration servers working together pretty well. In the next part, we will add shards to the mongos instance to complete the whole network.

Managing mongos instance

Now it's time to add shards and split the whole dataset into smaller pieces. For production use, each shard should be a replica set network, but for a development and testing environment, you can simply add standalone mongod instances to the cluster. To control and manage the mongos instance, we can simply use the mongo shell to connect to the mongos and execute commands. To connect to the mongos, you use the following command:

mongo --host <mongos hostname> --port <mongos port>

For instance, our mongos address is mongos1.sharding.com and the port is 27017. This is depicted in the following screenshot:

After connecting to the mongos instance, we have a command environment, and we can use it to add, remove, or modify shards, or even get the status of the entire sharding network. Using the following command, you can get the status of the sharding network:

sh.status()

The following screenshot illustrates the output of this command:

Because we haven't added any shards yet, you will see an error that says there are no shards in the sharding network.
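If we prefer to script these checks instead of typing them into the mongo shell, the same administrative commands are available through a driver. The following is a minimal sketch using the pymongo library; this is an assumption for illustration, as the article itself only uses the mongo shell, and the hostname simply mirrors the mongos instance above.

from pymongo import MongoClient

# Connect to the mongos query router (the hostname from the example above).
client = MongoClient('mongos1.sharding.com', 27017)

# listShards is the admin command behind the shell helpers; with no shards
# added yet, it should report an empty list of shards.
print(client.admin.command('listShards'))

# Shards can be added programmatically as well, for example:
# client.admin.command('addShard', 'rs1/mongod1.replicaset.com:27017')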
Using the sh.help() command, you can see all commands as shown in the following screenshot: Using the sh.addShard() function, you can add shards to the network. Adding shards to mongos After connecting to the mongos, you can add shards to sharding. Basically, you can add two types of endpoints to the mongos as a shard; replica set or a standalone mongod instance. MongoDB has a sh namespace and a function called addShard(), which is used to add a new shard to an existing sharding network. Here is the example of a command to add a new shard. This is shown in the following screenshot: To add a replica set to mongos you should follow this scheme: setname/server:port For instance, if you have a replica set with the name of rs1, hostname mongod1.replicaset.com, and port number 27017, the command will be as follows: sh.addShard("rs1/mongod1.replicaset.com:27017") Using the same function, we can add standalone mongod instances. So, if we have a mongod instance with the hostname mongod1.sharding.com listening on port 27017, the command will be as follows: sh.addShard("mongod1.sharding.com:27017") You can use a secondary or primary hostname to add the replica set as a shard to the sharding network. MongoDB will detect the primary and use the primary node to interact with sharding. Now, we add the replica set network using the following command: sh.addShard("rs1/mongod1.replicaset.com:27017") If everything goes well, you won't see any output from the console, which means the adding process was successful. This is shown in the following screenshot: To see the status of sharding, you can use the sh.status() command. This is demonstrated in the following screenshot: Next, we will establish another standalone mongod instance and add it to sharding. The port number of mongod is 27016 and the hostname is mongod1.sharding.com. The following screenshot shows the output after starting the new mongod instance: Using the same approach, we will add the preceding node to sharding. This is shown in the following screenshot: It's time to see the sharding status using the sh.status() command: As you can see in the preceding screenshot, now we have two shards. The first one is a replica set with the name rs1, and the second shard is a standalone mongod instance on port 27016. If you create a new database on each shard, MongoDB syncs this new database with the mongos instance. Using the show dbs command, you can see all databases from all shards as shown in the following screenshot: The configuration database is an internal database that MongoDB uses to configure and manage the sharding network. Currently, we have all sharding modules working together. The last and final step is to enable sharding for a database and collection. Summary In this article, we prepared the environment for sharding of a database. We also learned about the implementation of a configuration server. Next, after configuring the configuration servers, we saw how to bind them to the core module of clustering. Resources for Article: Further resources on this subject: MongoDB data modeling [article] Ruby with MongoDB for Web Development [article] Dart Server with Dartling and MongoDB [article]
Securing the WAL Stream

Packt
15 Jul 2014
6 min read
(For more resources related to this topic, see here.) The primary mechanism that PostgreSQL uses to provide a data durability guarantee is through its Write Ahead Log (WAL). All transactional data is written to this location before ever being committed to database files. Once WAL files are no longer necessary for crash recovery, PostgreSQL will either delete or archive them. For the purposes of a highly available server, we recommend that you keep these important files as long as possible. There are several reasons for this; they are as follows: Archived WAL files can be used for Point In Time Recovery (PITR) If you are using streaming replication, interrupted streams can be re-established by applying WAL files until the replica has caught up WAL files can be reused to service multiple server copies In order to gain these benefits, we need to enable PostgreSQL WAL archiving and save these files until we no longer need them. This section will address our recommendations for long term storage of WAL files. Getting ready In order to properly archive WAL files, we recommend that you provision a server dedicated to backups or file storage. Depending on the transaction volume, an active PostgreSQL database might produce thousands of these on a daily basis. At 16 MB apiece, this is not an idle concern. For instance, for a 1 TB database, we recommend at least 3 TB of storage space. In addition, we will be using rsync as a daemon on this archive server. To install this on a Debian-based server, execute the following command as a root-level user: sudo apt-get install rsync Red-Hat-based systems will need this command instead: sudo yum install rsync xinetd How to do it... Our archive server has a 3 TB mount at the /db directory and is named arc_server on our network. The PostgreSQL source server resides at 192.168.56.10. Follow these steps for long-term storage of important WAL files on an archive server Enable rsync to run as a daemon on the archive server. On Debian based systems, edit the /etc/default/rsync file and change the RSYNC_ENABLE variable to true. On Red-Hat-based systems, edit the /etc/xinet.d/rsync file and change the disable parameter to no. Create a directory to store archived WAL files as the postgres user with these commands: sudo mkdir /db/pg_archived sudo chown postgres:postgres /db/pg_archived Create a file named /etc/rsyncd.conf and fill it with the following contents: [wal_store] path = /db/pg_archived comment = DB WAL Archives uid = postgres gid = postgres read only = false hosts allow = 192.168.56.10 hosts deny = * Start the rsync daemon. Debian-based systems should execute the following command: sudo service rsync start Red-Hat-based systems can start rsync with this command instead: sudo service xinetd start Change the archive_mode and archive_command parameters in postgresql.conf to read the following: archive_mode = on archive_command = 'rsync -aq %p arc_server::wal_store/%f' Restart the PostgreSQL server with a command similar to this: pg_ctl -D $PGDATA restart How it works The rsync utility is normally used to transfer files between two servers. However, we can take advantage of using it as a daemon to avoid connection overhead imposed by using SSH as an rsync protocol. Our first step is to ensure that the service is not disabled in some manner, which would make the rest of this guide moot. Next, we need a place to store archived WAL files on the archive server. 
Assuming that we have 3 TB of space in the /db directory, we simply claim /db/pg_archived as the desired storage location. There should be enough space to use /db for backups as well, but we won't discuss that here. Next, we create a file named /etc/rsyncd.conf, which will configure how rsync operates as a daemon. Here, we name the /db/pg_archived directory wal_store so that we can address the path by its name when sending files. We give it a human-readable name and ensure that files are owned by the postgres user, as this user also controls most of the PostgreSQL-related services. The next, and possibly the most important step, is to block all hosts but the primary PostgreSQL server from writing to this location. We set hosts deny to *, which blocks every server. Then, we set hosts allow to the primary database server's IP address so that only it has access. If everything goes well, we can start the rsync (or xinetd on Red Hat systems) service and we can see that in the following screenshot: Next, we enable archive_mode by setting it to on. With archive mode enabled, we can specify a command that will execute when PostgreSQL no longer needs a WAL file for crash recovery. In this case, we invoke the rsync command with the -a parameter to preserve elements such as file ownership, timestamps, and so on. In addition, we specify the -q setting to suppress output, as PostgreSQL only checks the command exit status to determine its success. In the archive_command setting, the %p value represents the full path to the WAL file, and %f resolves to the filename. In this context, we're sending the WAL file to the archive server at the wal_store module we defined in rsyncd.conf. Once we restart PostgreSQL, it will start storing all the old WAL files by sending them to the archive server. In case any rsync command fails because the archive server is unreachable, PostgreSQL will keep trying to send it until it is successful. If the archive server is unreachable for too long, we suggest that you change the archive_command setting to store files elsewhere. This prevents accidentally overfilling the PostgreSQL server storage. There's more... As we will likely want to use the WAL files on other servers, we suggest that you make a list of all the servers that could need WAL files. Then, modify the rsyncd.conf file on the archive server and add this section: [wal_fetch] path = /db/pg_archived comment = DB WAL Archive Retrieval uid = postgres gid = postgres read only = true hosts allow = host1, host2, host3 hosts deny = * Now, we can fetch WAL files from any of the hosts in hosts allow. As these are dedicated PostgreSQL replicas, recovery servers, or other defined roles, this makes the archive server a central location for all our WAL needs. See also We suggest that you read more about the archive_mode and archive_command settings on the PostgreSQL site. We've included a link here: http://www.postgresql.org/docs/9.3/static/runtime-config-wal.html The rsyncd.conf file should also have its own manual page. Read it with this command to learn more about the available settings: man rsyncd.conf Summary In this article, we've successfully learned how to secure the WAL stream by following the given steps. Resources for Article: Further resources on this subject: PostgreSQL 9: Reliable Controller and Disk Setup [article] Backup in PostgreSQL 9 [article] Recovery in PostgreSQL 9 [article]
Making the Most of Your Hadoop Data Lake, Part 2: Optimized File Formats

Kristen Hardwick
30 Jun 2014
5 min read
One major factor of making the conversion to Hadoop is the concept of the Data Lake. That idea suggests that users keep as much data as possible in HDFS in order to prepare for future use cases and as-yet-unknown data integration points. As your data grows, it is important to make sure that the data is being stored in a way that prolongs that behavior. Data compression is not the only technique that can be used to speed up job performance and improve cluster organization. In addition to the Text and Sequence File options that are typically used by default, Hadoop offers a few more optimized file formats that are specifically designed to improve the process of interacting with the data. In the second part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address another important factor in improving manageability—optimized file formats. Using a smarter file for your data: RCFile RCFile stands for Record Columnar File, and it serves as an ideal format for storing relational data that will be accessed through Hive. This format offers performance improvements by storing the data in an optimized way. First, the data is partitioned horizontally, into groups of rows. Then each row group is partitioned vertically, into collections of columns. Finally, the data in each column collection is compressed and stored in column-row format, as if it were a column-oriented database. The first benefit of this altered storage mechanism is apparent at the row level. All HDFS blocks used to form RCFiles will be made up of the horizontally partitioned collections of rows. This is significant because it ensures that no row of data will be split across multiple blocks, and will therefore always be on the same machine. This is not the case for traditional HDFS file formats, which typically use data size to split the file. This optimized data storage will reduce the amount of network bandwidth that is required to serve queries. The second benefit comes from optimizations at the column level, in the form of disk I/O reduction. Since the columns are stored vertically within each row group, the system will be able to seek directly to the required column position in the file, rather than being required to scan across all columns and filter out data that is not necessary. This is extremely useful in queries that only require access to a small subset of the existing columns. RCFiles can be used natively in both Hive and Pig with very little configuration. In Hive CREATE TABLE … STORED AS RCFILE; ALTER TABLE … SET FILEFORMAT RCFILE; SET hive.default.fileformat=RCFile; In Pig: register …/piggybank.jar; a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.hiverc.HiveRCInputFormat(…); The Pig jar file referenced here is just one option for enabling the RCFile. At the time of writing, there was also an RCFilePigStorageclass available through Twitter’s Elephant Bird open source library. Hortonworks’ ORCFile and Cloudera’s Parquet formats RCFiles provide optimization for relational files primarily by implementing modifications at the storage level. New innovations have provided improvements on the RCFile format, namely the ORCFile format from Hortonworks and the Parquet format from Cloudera. When storing data using the Optimized Row Columnar file or Parquet formats, several pieces of metadata are automatically written at the column level within each row group; for example, minimum and maximum values for numeric data types and dictionary-style metadata for text data types. 
The specific metadata is also configurable. One such use case would be for a user to configure the dataset to be sorted on a given set of columns for efficient access. This extra metadata allows queries to take advantage of an improvement over the original RCFiles: predicate pushdown. That technique allows Hive to evaluate the where clause during the record-gathering process, instead of filtering data after all records have been collected. The predicate pushdown technique will evaluate the conditions of the query against the metadata associated with a particular row group, allowing it to skip over entire file blocks if possible, or to seek directly to the correct row. One major benefit of this process is that the more complex a particular where clause is, the more potential there is for row groups and columns to be filtered out as irrelevant to the final result.

Cloudera's Parquet format is typically used in conjunction with Impala, but just like RCFiles, ORCFiles can be incorporated into both Hive and Pig. HCatalog can be used as the primary method to read and write ORCFiles using Pig. The commands for Hive are provided below.

In Hive:

CREATE TABLE … STORED AS ORC;
ALTER TABLE … SET FILEFORMAT ORC;
SET hive.default.fileformat=Orc;

Conclusion

This post has detailed the alternatives to the default file formats that can be used in Hadoop in order to optimize data access and storage. This information, combined with the compression techniques described in the previous post (part 1), provides some guidelines that can be used to ensure that users make the most of the Hadoop Data Lake.

About the author

Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.
Making the Most of Your Hadoop Data Lake, Part 1: Data Compression

Kristen Hardwick
30 Jun 2014
6 min read
In the world of big data, the Data Lake concept reigns supreme. Hadoop users are encouraged to keep all data in order to prepare for future use cases and as-yet-unknown data integration points. This concept is part of what makes Hadoop and HDFS so appealing, so it is important to make sure that the data is being stored in a way that prolongs that behavior. In the first part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address one important factor in improving manageability—data compression. Data compression is an area that is often overlooked in the context of Hadoop. In many cluster environments, compression is disabled by default, putting the burden on the user. In this post, we will discuss the tradeoffs involved in deciding how to take advantage of compression techniques and the advantages and disadvantages of specific compression codec options with respect to Hadoop. To compress or not to compress Whenever data is converted to something other than its raw data format, that naturally implies some overhead involved in completing the conversion process. When data compression is being discussed, it is important to take that overhead into account with respect to the benefits of reducing the data footprint. One obvious benefit is that compressed data will reduce the amount of disk space that is required for storage of a particular dataset. In the big data environment, this benefit is especially significant—either the Hadoop cluster will be able to keep data for a larger time range, or storing data for the same time range will require fewer nodes, or the disk usage ratios will remain lower for longer. In addition, the smaller file sizes will mean lower data transfer times—either internally for MapReduce jobs or when performing exports of data results. The cost of these benefits, however, is that the data must be decompressed at every point where the data needs to be read, and compressed before being inserted into HDFS. With respect to MapReduce jobs, this processing overhead at both the map phase and the reduce phase will increase the CPU processing time. Fortunately, by making informed choices about the specific compression codecs used at any given phase in the data transformation process, the cluster administrator or user can ensure that the advantages of compression outweigh the disadvantages. Choosing the right codec for each phase Hadoop provides the user with some flexibility on which compression codec is used at each step of the data transformation process. It is important to realize that certain codecs are optimal for some stages, and non-optimal for others. In the next sections, we will cover some important notes for each choice. zlib The major benefit of using this codec is that it is the easiest way to get the benefits of data compression from a cluster and job configuration standpoint—the zlib codec is the default compression option. From the data transformation perspective, this codec will decrease the data footprint on disk, but will not provide much of a benefit in terms of job performance. gzip The gzip codec available in Hadoop is the same one that is used outside of the Hadoop ecosystem. It is common practice to use this as the codec for compressing the final output from a job, simply for the benefit of being able to share the compressed result with others (possibly outside of Hadoop) using a standard file format. bzip2 There are two important benefits for the bzip2 codec. 
First, if reducing the data footprint is a high priority, this algorithm will compress the data more than the default zlib option. Second, this is the only supported codec that produces “splittable” compressed data. A major characteristic of Hadoop is the idea of splitting the data so that they can be handled on each node independently. With the other compression codecs, there is an initial requirement to gather all parts of the compressed file in order to have all information necessary to decompress the data. With this format, the data can be decompressed in parallel. This splittable quality makes this format ideal for compressing data that will be used as input to a map function, either in a single step or as part of a series of chained jobs. LZO, LZ4, Snappy These three codecs are ideal for compressing intermediate data—the data output from the mappers that will be immediately read in by the reducers. All three codecs heavily favor compression speed over file size ratio, but the detailed specifications for each algorithm should be examined based on the specific licensing, cluster, and job requirements. Enabling compression Once the appropriate compression codec for any given transformation phase has been selected, there are a few configuration properties that need to be adjusted in order to have the changes take effect in the cluster. Intermediate data to reducer mapreduce.map.output.compress = true (Optional) mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec Final output from a job mapreduce.output.fileoutputformat.compress = true (Optional) mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.BZip2Codec These compression codecs are also available within some of the ecosystem tools like Hive and Pig. In most cases, the tools will default to the Hadoop-configured values for particular codecs, but the tools also provide the option to compress the data generated between steps. Pig pig.tmpfilecompression = true (Optional) pig.tmpfilecompression.codec = snappy Hive hive.exec.compress.intermediate = true hive.exec.compress.output = true Conclusion This post detailed the benefits and disadvantages of data compression, along with some helpful guidelines on how to choose a codec and enable it at various stages in the data transformation workflow. In the next post, we will go through some additional techniques that can be used to ensure that users can make the most of the Hadoop Data Lake. For more Big Data and Hadoop tutorials and insight, visit our dedicated Hadoop page.  About the author Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.
MapReduce with OpenStack Swift and ZeroVM

Lars Butler
30 Jun 2014
6 min read
Originally coined in a 2004 research paper produced by Google, the term “MapReduce” was defined as a “programming model and an associated implementation for processing and generating large datasets”. While Google’s proprietary implementation of this model is known simply as “MapReduce”, the term has since become overloaded to refer to any software that follows the same general computation model. The philosophy behind the MapReduce computation model is based on a divide and conquer approach: the input dataset is divided into small pieces and distributed among a pool of computation nodes for processing. The advantage here is that many nodes can run in parallel to solve a problem. This can be much quicker than a single machine, provided that the cost of communication (loading input data, relaying intermediate results between compute nodes, and writing final results to persistent storage) does not outweigh the computation time. Indeed, the cost of moving data between storage and computation nodes can diminish the advantage of distributed parallel computation if task payloads are too granular. On the other hand, they must be granular enough in order to be distributed evenly (although achieving perfect distribution is nearly impossible if exact computational complexity of each task cannot be known in advance). A distributed MapReduce solution can be effective for processing large data sets, provided that each task is sufficiently long-running to offset communication costs. If the need for data transfer (between compute and storage nodes) could somehow be reduced or eliminated, the task size limitations would all but disappear, enabling a wider range of use cases. One way to achieve this is to perform the computation closer to the data, that is, on the same physical machine. Stored procedures in relational database systems, for example, are one way to achieve data-local computation. Simpler yet, computation could be run directly on the system where the data is stored in files. In both cases, the same problems are present: changes to the code require direct access to the system, and thus access needs to be carefully controlled (through group management, read/write permissions, and so on). Scaling in these scenarios is also problematic: vertical scaling (beefing up a single server) is the only feasible way to gain performance while maintaining the efficiency of data-local computation. Achieving horizontal scalability by adding more machines is not feasible unless the storage system inherently supports this. Enter Swift and ZeroVM OpenStack Swift is one such example of a horizontally scalable data store. Swift can store petabytes of data in millions of objects across thousands of machines. Swift’s storage operations can also be extended by writing custom middleware applications, which makes it a good foundation for building a “smart” storage system capable of running computations directly on the data. ZeroCloud is a middleware application for Swift which does just that. It embeds ZeroVM inside Swift and exposes a new API method for executing jobs. ZeroVM provides sufficient program isolation and a guarantee of security such that any arbitrary code can be run safely without compromising the storage system. Access to data is governed by Swift’s inherent access control, so there is no need to further lock down the storage system, even in a multi-tenant environment. The combination of Swift and ZeroVM results in something like stored procedures, but much more powerful. 
First and foremost, user application code can be updated at any time without the need to worry about malicious or otherwise destructive side effects. (The worst thing that can happen is that a user can crash their own machines and delete their own data—but not another user’s.) Second, ZeroVM instances can start very quickly: in about five milliseconds. This means that to process one thousand objects, ZeroCloud can instantaneously spawn one thousand ZeroVM instances (one for each file), run the job, and destroy the instances very quickly. This also means that any job, big or small, can feasibly run on the system. Finally, the networking component of ZeroVM allows intercommunication between instances, enabling the creation of arbitrary job pipelines and multi-stage computations. This converged compute and storage solution makes for a good alternative to popular MapReduce frameworks like Hadoop because little effort is required to set up and tear down computation nodes for a job. Also, once a job is complete, there is no need to extract result data, pipe it across the network, and then save it in a separate persistent data store (which can be an expensive operation if there is a large quantity of data to save). With ZeroCloud, the results can simply be saved back into Swift. The same principle applies to the setup of a job. There is no need to move or copy data to the computation cluster; the data is already where it needs to be. Limitations Currently, ZeroCloud uses a fairly naïve scheduling algorithm. To perform a data-local computation on an object, ZeroCloud determines which machines in the cluster contain a replicated copy of an object and then randomly chooses one to execute a ZeroVM process. While this functions well as a proof-of-concept for data-local computing, it is quite possible for jobs to be spread unevenly across a cluster, resulting in an inefficient use of resources. A research group at the University of Texas at San Antonio (UTSA) sponsored by Rackspace is currently working on developing better algorithms for scheduling workloads running on ZeroVM-based infrastructure. A short video of the research proposal can be found here:http://link.brightcove.com/services/player/bcpid3530100726001?bckey=AQ~~,AAACa24Pu2k~,Q186VLPcl3-oLBDP8npyqxjCNB5jgYcT. Further reading Rackspace has launched a ZeroCloud playground service called Zebra for developers to try out the platform. At the time this was written, Zebra is still in a private beta and invitations are limited. But it also possible to install and run your own copy of ZeroCloud for testing and development; basic installation instructions are here: https://github.com/zerovm/zerocloud/blob/icehouse/doc/Hacking.md. There are also some tutorials (including sample applications) for creating, packaging, deploying, and executing applications on ZeroCloud: http://zerovm.readthedocs.org/en/latest/zebra/tutorial.html. The tutorials are intended for use with the Zebra service, but can be run on any deployment of ZeroCloud. Big Data in the cloud? It's not the future, it's already here! Find more Hadoop tutorials and extra content here, and dive deeper into OpenStack by visiting this page, dedicated to one of the most exciting cloud platforms around today.  About the author Lars Butler is a software developer at Rackspace, the open cloud company. He has worked as a software developer in avionics, seismic hazard research, and most recently, on ZeroVM. He can be reached @larsbutler.