22 Sep 2014

23 min read

Visualization as a Tool to Understand Data

In this article by Nazmus Saquib, the author of Mathematica Data Visualization, we will look at a few simple examples that demonstrate the importance of data visualization. We will then discuss the types of datasets that we will encounter over the course of this book, and learn about the Mathematica interface to get ourselves warmed up for coding.

In the last few decades, the rapid growth in the volume of information we produce and in the capacity of digital information storage has opened a new door for data analytics. We have moved on from the age of terabytes to that of petabytes and exabytes. Traditional data analysis is now augmented with the term big data analysis, and computer scientists are pushing the bounds for analyzing this huge sea of data using statistical, computational, and algorithmic techniques.

Along with the size, the types and categories of data have also evolved. Beyond the typical and popular data domains in Computer Science (text, image, and video), graphs and various categorical data that arise from Internet interactions have become increasingly interesting to analyze. With the advances in computational methods and computing speed, scientists nowadays produce an enormous amount of numerical simulation data that has opened up new challenges in the field of Computer Science. Simulation data tends to be structured and clean, whereas data collected or scraped from websites can be quite unstructured and hard to make sense of.

For example, let's say we want to analyze some blog entries in order to find out which blogger gets more follows and referrals from other bloggers. This is not as straightforward as getting some friends' information from social networking sites. Blog entries consist of text and HTML tags; thus, a combination of text analytics and tag parsing, coupled with a careful observation of the results, would give us our desired outcome.
Regardless of whether the data is simulated or empirical, the key word here is observation. In order to make intelligent observations, data scientists tend to follow a certain pipeline. The data needs to be acquired and cleaned to make sure that it is ready to be analyzed using existing tools. Analysis may take the route of visualization, statistics, and algorithms, or a combination of any of the three. Inference and refining the analysis methods based on the inference is an iterative process that needs to be carried out several times until we think that a set of hypotheses is formed, or a clear question is asked for further analysis, or a question is answered with enough evidence. Visualization is a very effective and perceptive method to make sense of our data. While statistics and algorithmic techniques provide good insights about data, an effective visualization makes it easy for anyone with little training to gain beautiful insights about their datasets. The power of visualization resides not only in the ease of interpretation, but it also reveals visual trends and patterns in data, which are often hard to find using statistical or algorithmic techniques. It can be used during any step of the data analysis pipeline—validation, verification, analysis, and inference—to aid the data scientist. How have you visualized your data recently? If you still have not, it is okay, as this book will teach you exactly that. However, if you had the opportunity to play with any kind of data already, I want you to take a moment and think about the techniques you used to visualize your data so far. Make a list of them. Done? Do you have 2D and 3D plots, histograms, bar charts, and pie charts in the list? If yes, excellent! We will learn how to style your plots and make them more interactive using Mathematica. Do you have chord diagrams, graph layouts, word cloud, parallel coordinates, isosurfaces, and maps somewhere in that list? 
If yes, then you are already familiar with some modern visualization techniques, but if you have not had the chance to use Mathematica as a data visualization language before, we will explore how visualization prototypes can be built seamlessly in this software using very little code.

The aim of this book is to teach a Mathematica beginner the data-analysis and visualization powerhouse built into Mathematica, and at the same time, familiarize the reader with some of the modern visualization techniques that can be easily built with Mathematica. We will learn how to load, clean, and dissect different types of data, visualize the data using Mathematica's built-in tools, and then use the Mathematica graphics language and interactivity functions to build prototypes of a modern visualization.

The importance of visualization

Visualization has a broad definition, and so does data. The cave paintings drawn by our ancestors can be argued to be visualizations, as they convey historical data through a visual medium. Map visualizations have been commonly used in wars since ancient times to discuss the past, present, and future states of a war, and to come up with new strategies. Astronomers in the 17th century are believed to have built the first visualization of their statistical data. In the 18th century, William Playfair invented many of the popular graphs we use today (line, bar, circle, and pie charts). Therefore, it appears that many people, since ancient times, have recognized the importance of visualization in giving some meaning to data.

To demonstrate the importance of visualization in a simple mathematical setting, consider fitting a line to a given set of points. Without looking at the data points, it would be unwise to try to fit them with a model that seemingly lowers the error bound. It should also be noted that sometimes, the data needs to be changed or transformed to the correct form that allows us to use a particular tool.
Visualizing the data points ensures that we do not fall into any trap. The following screenshot shows the visualization of a polynomial as a circle:

Figure 1.1: Fitting a polynomial

In figure 1.1, the points are distributed around a circle. Imagine we are given these points in a Cartesian space (orthogonal x and y coordinates), and we are asked to fit a simple linear model. There is not much benefit if we try to fit these points to any polynomial in a Cartesian space; what we really need to do is change the parameter space to polar coordinates. A 1-degree polynomial in polar coordinate space (essentially a circle, because the fitted radius barely changes with the angle) would nicely fit these points when they are converted to polar coordinates, as shown in figure 1.1. Visualizing the data points in more complicated but similar situations can save us a lot of trouble. The following is a screenshot of Anscombe's quartet:

Figure 1.2: Anscombe's quartet, generated using Mathematica

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/2999OT_coloredimages.PDF.

Anscombe's quartet (figure 1.2), named after the statistician Francis Anscombe, is a classic example of how simple data visualization like plotting can save us from making wrong statistical inferences. The quartet consists of four datasets that have nearly identical statistical properties (such as mean, variance, and correlation), and that give rise to the same linear model when a regression routine is run on them. However, the second dataset does not really constitute a linear relationship; a spline would fit the points better. The third dataset (at the bottom-left corner of figure 1.2) actually has a different regression line, but the outlier exerts enough influence to force the same regression line on the data.
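The claim that the quartet's datasets share nearly identical summary statistics can be checked numerically. The following is a minimal sketch in Python rather than Mathematica, chosen here only for illustration; the function names `fit_line` and `summarize` are hypothetical, and the numbers are the values of the quartet as published by Anscombe in 1973.

```python
from statistics import mean

# Anscombe's quartet as published (datasets I-III share the same x values).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def summarize():
    """Return {name: (mean_x, mean_y, slope, intercept)} for each dataset."""
    return {
        name: (mean(xs), mean(ys), *fit_line(xs, ys))
        for name, (xs, ys) in QUARTET.items()
    }


if __name__ == "__main__":
    for name, (mx, my, slope, intercept) in summarize().items():
        print(f"{name:>3}: mean x={mx:.2f}  mean y={my:.2f}  "
              f"y = {intercept:.2f} + {slope:.3f} x")
```

Every dataset produces practically the same means and the same fitted line, which is exactly why the numbers alone cannot tell the four shapes apart and a plot is needed.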
The fourth dataset is not even a linear relationship, but the outlier enforces the same regression line again.

These two examples demonstrate the importance of "seeing" our data before we blindly run algorithms and statistics. Fortunately, for visualization scientists like us, the world of data types is quite vast. Every now and then, this gives us the opportunity to create new visual tools other than the traditional graphs, plots, and histograms. These visual signatures and tools serve the same purpose as the graph plotting examples we have just seen (to spy on and investigate data in order to infer valuable insights), but on types of datasets other than just point clouds.

Another important use of visualization is to enable the data scientist to interactively explore the data. Two features make today's visualization tools very attractive: the ability to view data from different perspectives (viewing angles) and at different resolutions. These features help the investigator understand both the micro- and macro-level behavior of their dataset.

Types of datasets

There are many different types of datasets that a visualization scientist encounters in their work. This book's aim is to prepare an enthusiastic beginner to delve into the world of data visualization. Certainly, we will not comprehensively cover each and every visualization technique out there. Our aim is to learn to use Mathematica as a tool to create interactive visualizations. To achieve that, we will focus on a general classification of datasets that will determine which Mathematica functions and programming constructs we should learn in order to visualize the broad class of data covered in this book.

Tables

The table is one of the most common data structures in Computer Science.
You might have already encountered this in a computer science, database, or even statistics course, but for the sake of completeness, we will describe the ways in which one could use this structure to represent different kinds of data. Consider the following table as an example:

          Attribute 1    Attribute 2    …
Item 1
Item 2
Item 3

When storing datasets in tables, each row in the table represents an instance of the dataset, and each column represents an attribute of that data point. For example, a set of two-dimensional Cartesian vectors can be represented as a table with two attributes, where each row represents a vector, and the attributes are the x and y coordinates relative to an origin. For vectors of three or more dimensions, we could just increase the number of attributes accordingly.

Tables can be used to store more advanced forms of scientific, time series, and graph data. We will cover some of these datasets over the course of this book, so it is a good idea for us to get introduced to them now. Here, we explain the general concepts.

Scalar fields

There are many kinds of scientific datasets out there. In order to aid their investigations, scientists have created their own data formats and mathematical tools to analyze the data. Engineers have also developed their own visualization language in order to convey ideas in their community. In this book, we will cover a few typical datasets that are widely used by scientists and engineers. We will eventually learn how to create molecular visualizations and biomedical dataset exploration tools when we feel comfortable manipulating these datasets.

In practice, multidimensional data (just like vectors in the previous example) is usually augmented with one or more characteristic variable values. As an example, let's think about how a physicist or an engineer would keep track of the temperature of a room.
In order to tackle the problem, they would begin by measuring the geometry and the shape of the room, and put temperature sensors at certain places to measure the temperature. They will note the exact positions of those sensors relative to the room's coordinate system, and then, they will be all set to start measuring the temperature. Thus, the temperature of a room can be represented, in a discrete sense, by using a set of points that represent the temperature sensor locations and the actual temperature at those points.

We immediately notice that the data is multidimensional in nature (the location of a sensor can be considered as a vector), and each data point has a scalar value associated with it (temperature). Such a discrete representation of multidimensional data is quite widely used in the scientific community. It is called a scalar field. The following screenshot shows the representation of a scalar field in 2D and 3D:

Figure 1.3: In practice, scalar fields are discrete and ordered

Figure 1.3 depicts how one would represent an ordered scalar field in 2D or 3D. Each point in the 2D field has a well-defined x and y location, and a single temperature value gets associated with it. To represent a 3D scalar field, we can think of it as a set of 2D scalar field slices placed at a regular interval along the third dimension. Each point in the 3D field is a point that has {x, y, z} values, along with a temperature value.

A scalar field can be represented using a table. We will denote each {x, y} point (for 2D) or {x, y, z} point (for 3D) as a row, but this time, an additional attribute for the scalar value will be created in the table. Thus, a row will have the attributes {x, y, z, T}, where T is the temperature associated with the point defined by the x, y, and z coordinates. This is the most common representation of scalar fields. A widely used visualization technique to analyze scalar fields is to find out the isocontours or isosurfaces of interest.
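The {x, y, z, T} table layout can be sketched in a few lines. This is an illustrative Python sketch, not Mathematica code, and it rests on stated assumptions: the `temperature` function is a made-up model standing in for real sensor readings, and the names `scalar_field_table` and `slice_at` are hypothetical.

```python
from itertools import product


def temperature(x, y, z):
    """Hypothetical temperature model: the room gets warmer with height z."""
    return 20.0 + 0.5 * z


def scalar_field_table(nx, ny, nz, spacing=1.0, field=temperature):
    """Return rows (x, y, z, T) sampled on a regular nx-by-ny-by-nz grid."""
    rows = []
    for k, j, i in product(range(nz), range(ny), range(nx)):
        x, y, z = i * spacing, j * spacing, k * spacing
        rows.append((x, y, z, field(x, y, z)))
    return rows


def slice_at(rows, z):
    """Extract one 2D slice as (x, y, T) rows for a given z coordinate."""
    return [(x, y, t) for x, y, zz, t in rows if zz == z]


if __name__ == "__main__":
    table = scalar_field_table(nx=3, ny=2, nz=4)
    print(len(table), "rows; first row:", table[0])
    print("slice at z = 2.0:", slice_at(table, 2.0))
```

Each row carries the point's coordinates plus one scalar attribute, and a 3D field is simply a stack of 2D slices, which `slice_at` pulls back out.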
However, for now, let's take a look at the kind of application areas such analysis will enable one to pursue. Instead of temperature, one could think of associating regularly spaced points with any relevant scalar value to form problem-specific scalar fields. In an electrochemical simulation, it is important to keep track of the charge density in the simulation space. Thus, the chemist would create a scalar field with charge values at specific points. For an aerospace engineer, it is quite important to understand how air pressure varies across airplane wings; they would keep track of the pressure by forming a scalar field of pressure values. Scalar field visualization is very important in many other significant areas, ranging from biomedical analysis to particle physics. In this book, we will cover how to visualize this type of data using Mathematica.

Time series

Another widely used data type is the time series. A time series is a sequence of data points that are usually measured at uniform intervals of time. Time series arise in many fields, but in today's world, they are mostly known for their applications in Economics and Finance. Other than these, they are frequently used in statistics, weather prediction, signal processing, astronomy, and so on. It is not the purpose of this book to describe the theory and mathematics of time series data. However, we will cover some of Mathematica's excellent capabilities for visualizing time series, and in the course of this book, we will construct our own visualization tool to view time series data.

Time series can be easily represented using tables. Each row of the time series table will represent one point in the series, with one attribute denoting the timestamp (the time at which the data point was recorded), and the other attribute storing the actual data value. If the starting time and the time interval are known, then we can get rid of the time attribute and simply store the data value in each row.
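As a minimal sketch of this compact representation (in Python rather than Mathematica, with the hypothetical helper names `timestamps` and `with_timestamps`, and made-up sample values), the time column can be rebuilt on demand from the start time and the sampling interval:

```python
from datetime import datetime, timedelta


def timestamps(start, interval, count):
    """Rebuild the time column: the i-th sample was taken at start + i * interval."""
    return [start + i * interval for i in range(count)]


def with_timestamps(values, start, interval):
    """Turn a bare list of values back into (timestamp, value) rows."""
    return list(zip(timestamps(start, interval, len(values)), values))


if __name__ == "__main__":
    readings = [1.5, 2.0, 2.5]  # illustrative values only
    rows = with_timestamps(readings, datetime(2014, 9, 22), timedelta(minutes=15))
    for stamp, value in rows:
        print(stamp.isoformat(), value)
```

Storing only the values roughly halves the table, at the cost of assuming that no sample is missing and that the interval really is uniform.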
The actual timestamp of each value can be calculated using the initial time and time interval.

Images and videos can be represented as tables too, with pixel-intensity values occupying each entry of the table. As we focus on visualization and not image processing, we will skip those types of data.

Graphs

Nowadays, graphs arise in all contexts of computer science and social science. This particular data structure provides a way to convert real-world problems into a set of entities and relationships. Once we have a graph, we can use a plethora of graph algorithms to find beautiful insights about the dataset. Technically, a graph can be stored as a table. However, Mathematica has its own graph data structure, so we will stick to its norm. Sometimes, visualizing the graph structure reveals quite a lot of hidden information. Graph visualization itself is a challenging problem, and is an active research area in computer science. A proper visualization layout, along with proper color maps and size distribution, can produce very useful outputs.

Text

The most common form of data that we encounter everywhere is text. Mathematica does not provide any specific visualization package for state-of-the-art text visualization methods.

Cartographic data

As mentioned before, map visualization is one of the ancient forms of visualization known to us. Nowadays, with the advent of GPS, smartphones, and publicly available country-based data repositories, maps are providing an excellent way to contrast and compare different countries, cities, or even communities. Cartographic data comes in various forms. A common form of a single data item is one that includes latitude, longitude, location name, and an attribute (usually numerical) that records a relevant quantity. However, instead of a latitude and longitude coordinate, we may be given a set of polygons that describe the geographical shape of the place.
The attributable quantity may not be numerical, but rather something qualitative, like text. Thus, there is really no standard form that one can expect when dealing with cartographic data. Fortunately, Mathematica provides us with excellent data-mining and dissecting capabilities to build custom formats out of the data available to us.

Mathematica as a tool for visualization

At this point, you might be wondering why Mathematica is suited for visualizing all the kinds of datasets that we have mentioned in the preceding examples. There are many excellent tools and packages out there to visualize data. Mathematica is quite different from other languages and packages because of the unique set of capabilities it presents to its user.

Mathematica has its own graphics language, with which graphics primitives can be interactively rendered inside the worksheet. This makes Mathematica's capability similar to many widely used visualization languages. Mathematica provides a plethora of functions to combine these primitives and make them interactive.

Speaking of interactivity, Mathematica provides a suite of functions to interactively display any of its processes. Not only visualization, but any function or code evaluation can be interactively visualized. This is particularly helpful when managing and visualizing big datasets.

Mathematica provides many packages and functions to visualize the kinds of datasets we have mentioned so far. We will learn to use the built-in functions to visualize structured and unstructured data. These functions include point, line, and surface plots; histograms; standard statistical charts; and so on. Other than these, we will learn to use the advanced functions that will let us build our own visualization tools.

Another interesting feature is the built-in datasets that this software provides to its users. This feature provides a nice playground for the user to experiment with different datasets and visualization functions.
From our discussion so far, we have learned that visualization tools are used to analyze very large datasets. While Mathematica is not really suited for dealing with petabytes or exabytes of data (and many other popularly used visualization tools are not suited for that either), often, one needs to build quick prototypes of such visualization tools using smaller sample datasets. Mathematica is very well suited to prototype such tools because of its efficient and fast data-handling capabilities, along with its loads of convenient functions and user-friendly interface. It also supports GPU and other high-performance computing platforms. Although it is not within the scope of this book, a user who knows how to harness the computing power of Mathematica can couple that knowledge with visualization techniques to build custom big data visualization solutions.

Another feature that Mathematica presents to a data scientist is the ability to keep the workflow within one worksheet. In practice, many data scientists tend to do their data analysis with one package, visualize their data with another, and export and present their findings using something else. Mathematica provides a complete suite of a core language, mathematical and statistical functions, a visualization platform, and versatile data import and export features inside a single worksheet. This helps the user focus on the data instead of irrelevant details.

By now, I hope you are convinced that Mathematica is worth learning for your data-visualization needs. If you still do not believe me, I hope I will be able to convince you again at the end of the book, when we will be done developing several visualization prototypes, each requiring only a few lines of code!

Getting started with Mathematica

We will need to know a few basic Mathematica notebook essentials. Assuming you already have Mathematica installed on your computer, let's open a new notebook by navigating to File|New|Notebook, and do the following experiments.
Creating and selecting cells

In Mathematica, a chunk of code or any number of mathematical expressions can be written within a cell. Each cell in the notebook can be evaluated to see the output immediately below it. To start a new cell, simply start typing at the position of the blinking cursor. Each cell can be selected by clicking on the respective rightmost bracket. To select multiple cells, press Ctrl + right-mouse button in Windows or Linux (or cmd + right-mouse button on a Mac) on each of the cells. The following screenshot shows several cells selected together, along with the output from each cell:

Figure 1.4: Selecting and evaluating cells in Mathematica

We can place a new cell in between any set of cells in order to change the sequence of instruction execution. Use the mouse to place the cursor in between two cells, and start typing your commands to create a new cell. We can also cut, copy, and paste cells by selecting them and applying the usual shortcuts (for example, Ctrl + C, Ctrl + X, and Ctrl + V in Windows/Linux, or cmd + C, cmd + X, and cmd + V on a Mac) or using the Edit menu bar. In order to delete cell(s), select the cell(s) and press the Delete key.

Evaluating a cell

A cell can be evaluated by pressing Shift + Enter. Multiple cells can be selected and evaluated in the same way. To evaluate the full notebook, press Ctrl + A (to select all the cells) and then press Shift + Enter. In this case, the cells will be evaluated one after the other in the sequence in which they appear in the notebook. To see examples of notebooks filled with commands, code, and mathematical expressions, you can open the notebooks supplied with this article, which are the polar coordinates fitting and Anscombe's quartet examples, and select each cell (or all of them) and evaluate them.

If we evaluate a cell that uses variables declared in a previous cell, and the previous cell was not already evaluated, then we may get errors.
It is possible that Mathematica will treat the unevaluated variables as a symbolic expression; in that case, no error will be displayed, but the results will not be numeric anymore.

Suppressing output from a cell

If we don't wish to see the intermediate output as we load data or assign values to variables, we can add a semicolon (;) at the end of each line whose output we want to suppress.

Cell formatting

Mathematica input cells treat everything inside them as mathematical and/or symbolic expressions. By default, every new cell you create by typing at the horizontal cursor will be an input expression cell. However, you can convert the cell to other formats for convenient typesetting. In order to change the format of cell(s), select the cell(s) and navigate to Format|Style from the menu bar, and choose a text format style from the available options. You can add mathematical symbols to your text by selecting Palettes|Basic Math Assistant. Note that evaluating a text cell will have no effect/output.

Commenting

We can write any comment in a text cell as it will be ignored during the evaluation of our code. However, if we would like to write a comment inside an input cell, we use the (* operator to open a comment and the *) operator to close it, as shown in the following code snippet:

(* This is a comment *)

The shortcut Ctrl + / (cmd + / on a Mac) can also be used to comment/uncomment a chunk of code. This operation is also available in the menu bar.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Aborting evaluation

We can abort the currently running evaluation of a cell by navigating to Evaluation|Abort Evaluation in the menu bar, or simply by pressing Alt + . (period).
This is useful when you want to end a time-consuming process that you suddenly realize will not give you the correct results at the end of the evaluation, or end a process that might use up the available memory and shut down the Mathematica kernel.

Further reading

The history of visualization deserves a separate book, as it is really fascinating how the field has matured over the centuries, and it is still growing very strongly. Michael Friendly, from York University, published a historical development paper that is freely available online, titled Milestones in the History of Data Visualization: A Case Study in Statistical Historiography. This is an entertaining compilation of the history of visualization methods.

The book The Visual Display of Quantitative Information by Edward R. Tufte, published by Graphics Press USA, is an excellent resource and a must-read for every data visualization practitioner. This is a classic book on the theory and practice of data graphics and visualization. Since we will not have the space to discuss the theory of visualization, the interested reader can consider reading this book for deeper insights.

Summary

In this article, we discussed the importance of data visualization in different contexts. We also introduced the types of datasets that will be visualized over the course of this book. The flexibility and power of Mathematica as a visualization package were discussed, and we will see these properties demonstrated throughout the book with beautiful visualizations. Finally, we have taken the first step toward writing code in Mathematica.

Packt

19 Sep 2014

19 min read

Driving Visual Analyses with Automobile Data (Python)


Packt

18 Sep 2014

18 min read

Caches


In this article by Federico Razzoli, author of the book Mastering MariaDB, we will see that, in order to avoid accessing disks, MariaDB and its storage engines have several caches that a DBA should know about.

InnoDB caches

Since InnoDB is the recommended engine for most use cases, configuring it is very important. The InnoDB buffer pool is a cache that should speed up most read and write operations. Thus, every DBA should know how it works. The doublewrite buffer is an important mechanism that guarantees that a row is never half-written to a file. For write-heavy workloads, we may want to disable it to obtain more speed.

InnoDB pages

Tables, data, and indexes are organized in pages, both in the caches and in the files. A page is a package of data that contains one or more rows and usually some empty space. The ratio between the used space and the total size of pages is called the fill factor.

By changing the page size, the fill factor changes inevitably. InnoDB tries to keep the pages 15/16 full. If a page's fill factor is lower than 1/2, InnoDB merges it with another page. If the rows are written sequentially, the fill factor should be about 15/16. If the rows are written randomly, the fill factor is between 1/2 and 15/16. A low fill factor wastes memory. With a very high fill factor, when pages are updated and their content grows, they often need to be reorganized, which negatively affects the performance.

Columns with a variable-length type (TEXT, BLOB, VARCHAR, or VARBINARY) can be written into separate data structures called overflow pages. Such columns are called off-page columns. They are better handled by the DYNAMIC row format, which can be used for most tables when backward compatibility is not a concern.

A page never changes its size, and the size is the same for all pages. The page size, however, is configurable: it can be 4 KB, 8 KB, or 16 KB.
The default size is 16 KB, which is appropriate for many workloads and optimizes full table scans. However, smaller sizes can improve the performance of some OLTP workloads involving many small insertions, because of lower memory allocation, or on storage devices with smaller blocks (such as old SSD devices). Another reason to change the page size is that this can greatly affect the InnoDB compression. The page size can be changed by setting the innodb_page_size variable in the configuration file and restarting the server.

The InnoDB buffer pool

On servers that mainly use InnoDB tables (the most common case), the buffer pool is the most important cache to consider. Ideally, it should contain all the InnoDB data and indexes to allow MariaDB to execute queries without accessing the disks. Changes to data are written into the buffer pool first. They are flushed to the disks later to reduce the number of I/O operations. Of course, if the data does not fit in the server's memory, only a subset of it can be in the buffer pool. In this case, that subset should be the so-called working set: the most frequently accessed data.

The default size of the buffer pool is 128 MB, and it should always be changed. On production servers, this value is too low. On a developer's computer, there is usually no need to dedicate so much memory to InnoDB. The minimum size, 5 MB, is usually more than enough when developing a simple application.

Old and new pages

We can think of the buffer pool as a list of data pages that are sorted with a variation of the classic Least Recently Used (LRU) algorithm. The list is split into two sublists: the new list contains the most used pages, and the old list contains the less used pages. The first page in each sublist is called the head. The head of the old list is called the midpoint. When a page that is not in the buffer pool is accessed, it is inserted at the midpoint. The other pages in the old list shift by one position, and the last one is evicted.
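The midpoint insertion idea can be sketched as a toy model. The following Python sketch is an illustration under stated assumptions, not InnoDB's actual implementation: the class name `MidpointLRU` and its fields are hypothetical, the promotion rules it applies to pages that are already cached are the ones the article describes right after this point, and the time-based delay controlled by innodb_old_blocks_time is ignored.

```python
class MidpointLRU:
    """Toy model of a buffer pool split into a 'new' and an 'old' sublist."""

    def __init__(self, capacity, old_pct=37):
        self.capacity = capacity
        # Share of the pool reserved for the old sublist (at least one page).
        old_target = max(1, capacity * old_pct // 100)
        self.new_max = capacity - old_target
        self.new = []  # index 0 is the head of the new sublist
        self.old = []  # index 0 is the midpoint (head of the old sublist)

    def access(self, page):
        """Touch a page; return the page evicted as a result, or None."""
        if page in self.new:
            # Already hot: move it to the head of the new sublist.
            self.new.remove(page)
            self.new.insert(0, page)
        elif page in self.old:
            # Accessed again while in the old sublist: promote it.
            self.old.remove(page)
            self.new.insert(0, page)
        else:
            # Not cached: insert at the midpoint, not at the very top.
            self.old.insert(0, page)
        # Keep the new sublist within its share by demoting its tail.
        while len(self.new) > self.new_max:
            self.old.insert(0, self.new.pop())
        # Evict the tail of the old sublist when the pool overflows.
        if len(self.new) + len(self.old) > self.capacity:
            return self.old.pop()
        return None


if __name__ == "__main__":
    pool = MidpointLRU(capacity=4, old_pct=50)
    for page in (1, 2, 3, 4, 5, 3, 6, 7, 8, 9):
        evicted = pool.access(page)
        print(f"access {page}: new={pool.new} old={pool.old} evicted={evicted}")
```

In this run, page 3 is touched twice and therefore sits in the new sublist, so the later one-off reads of pages 6 to 9 only push each other out of the old sublist. That is the practical benefit of inserting at the midpoint: a large scan cannot flush the frequently used pages.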
When a page from the old list is accessed, it is moved from the old list to the head of the new list. When a page in the new list is accessed, it goes to the head of the list.

The following variables affect the previously described algorithm:

innodb_old_blocks_pct: This variable defines the percentage of the buffer pool reserved for the old list. The allowed range is 5 to 95, and it is 37 (about 3/8) by default.

innodb_old_blocks_time: If this value is not 0, it represents the minimum age (in milliseconds) the old pages must reach before they can be moved into the new list. If an old page that has not reached this age is accessed, it goes to the head of the old list.

innodb_max_dirty_pages_pct: This variable defines the maximum percentage of pages that were modified in memory. This mechanism will be discussed in the Dirty pages section later in this article. This value is not a hard limit, but InnoDB tries not to exceed it. The allowed range is 0 to 100, and the default is 75. Increasing this value can reduce the rate of writes, but the shutdown will take longer (because dirty pages need to be written onto the disk before the server can be stopped in a clean way).

innodb_flush_neighbors: If set to 1, when a dirty page is flushed from memory to a disk, the contiguous dirty pages are flushed as well. If set to 2, all dirty pages from the same extent (the portion of memory whose size is 1 MB) are flushed. With 0, only dirty pages are flushed when their number exceeds innodb_max_dirty_pages_pct or when they are evicted from the buffer pool. The default is 1. This optimization is only useful for spinning disks.

Write-intensive workloads may need an aggressive flushing strategy; however, if the pages are written too often, they degrade the performance.

Buffer pool instances

On MariaDB versions older than 5.5, InnoDB creates only one instance of the buffer pool. However, concurrent threads are blocked by a mutex, and this may become a bottleneck.
This is particularly true if the concurrency level is high and the buffer pool is very big. Splitting the buffer pool into multiple instances can solve the problem.

Multiple instances represent an advantage only if the buffer pool size is at least 2 GB. Each instance should be at least 1 GB in size. InnoDB will ignore the configuration and will maintain only one instance if the buffer pool size is less than 1 GB. Furthermore, this feature is more useful on 64-bit systems.

The following variables control the instances and their size:

innodb_buffer_pool_size: This variable defines the total size of the buffer pool (not the size of a single instance). Note that the real size will be about 10 percent bigger than this value. A percentage of this amount of memory is dedicated to the change buffer.
innodb_buffer_pool_instances: This variable defines the number of instances. If the value is -1, InnoDB will automatically decide the number of instances. The maximum value is 64. The default value is 8 on Unix and depends on the innodb_buffer_pool_size variable on Windows.

Dirty pages

When a user executes a statement that modifies data in the buffer pool, InnoDB initially modifies only the data that is in memory. The pages that have been modified only in the buffer pool are called dirty pages. Pages that have not been modified, or whose changes have been written to disk, are called clean pages.

Note that changes to data are also written to the redo log. If a crash occurs before those changes are applied to the data files, InnoDB is usually able to recover the data, including the last modifications, by reading the redo log and the doublewrite buffer. The doublewrite buffer will be discussed later, in the Explaining the doublewrite buffer section.

At some point, the data needs to be flushed to the InnoDB data files (the .ibd files). In MariaDB 10.0, this is done by a dedicated thread called the page cleaner.
In older versions, this was done by the master thread, which executes several InnoDB maintenance operations. The flushing is not only concerned with the buffer pool, but also with the InnoDB redo and undo logs.

The list of dirty pages is frequently updated when transactions write data at the physical level. It has its own mutex that does not lock the whole buffer pool.

The maximum number of dirty pages is determined by innodb_max_dirty_pages_pct as a percentage. When this maximum limit is reached, dirty pages are flushed.

The innodb_flush_neighbor_pages value determines how InnoDB selects the pages to flush. If it is set to none, only selected pages are written. If it is set to area, even the neighboring dirty pages are written. If it is set to cont, all contiguous blocks of the dirty pages are flushed.

On shutdown, a complete page flushing is only done if innodb_fast_shutdown is 0. Normally, this method should be preferred, because it leaves data in a consistent state. However, if many changes have been requested but still not written to disk, this process could be very slow. It is possible to speed up the shutdown by specifying a higher value for innodb_fast_shutdown. In this case, a crash recovery will be performed on the next restart.

The read ahead optimization

The read ahead feature is designed to reduce the number of read operations from the disks. It tries to guess which data will be needed in the near future and reads it with one operation.

Two algorithms are available to choose the pages to read in advance:

linear read ahead
random read ahead

The linear read ahead is used by default. It counts the pages in the buffer pool that are read sequentially. If their number is greater than or equal to innodb_read_ahead_threshold, InnoDB will read all data from the same extent (a portion of data whose size is always 1 MB). The innodb_read_ahead_threshold value must be a number from 0 to 64.
The value 0 disables the linear read ahead but does not enable the random read ahead. The default value is 56.

The random read ahead is only used if the innodb_random_read_ahead server variable is set to ON. By default, it is set to OFF. This algorithm checks whether at least 13 pages in the buffer pool have been read from the same extent. In this case, it does not matter whether they were read sequentially. If the check succeeds, the full extent is read. The 13-page threshold is not configurable.

If innodb_read_ahead_threshold is set to 0 and innodb_random_read_ahead is set to OFF, the read ahead optimization is completely turned off.

Diagnosing the buffer pool performance

MariaDB provides some tools to monitor the activities of the buffer pool and the InnoDB main thread. By inspecting these activities, a DBA can tune the relevant server variables to improve the performance.

In this section, we will discuss the SHOW ENGINE INNODB STATUS SQL statement and the INNODB_BUFFER_POOL_STATS table in the information_schema database. While the latter provides more information about the buffer pool, the SHOW ENGINE INNODB STATUS output is easier to read.

The INNODB_BUFFER_POOL_STATS table contains the following columns:

Column name: Description
POOL_ID: Each InnoDB buffer pool instance has a different ID.
POOL_SIZE: Size (in pages) of the instance.
FREE_BUFFERS: Number of free pages.
DATABASE_PAGES: Total number of data pages.
OLD_DATABASE_PAGES: Pages in the old list.
MODIFIED_DATABASE_PAGES: Dirty pages.
PENDING_DECOMPRESS: Number of pages that need to be decompressed.
PENDING_READS: Pending read operations.
PENDING_FLUSH_LRU: Pages in the old or new lists that need to be flushed.
PENDING_FLUSH_LIST: Pages in the flush list that need to be flushed.
PAGES_MADE_YOUNG: Number of pages moved into the new list.
PAGES_NOT_MADE_YOUNG: Old pages that did not become young.
PAGES_MADE_YOUNG_RATE: Pages made young per second. This value is reset each time it is shown.
PAGES_MADE_NOT_YOUNG_RATE: Pages read but not made young per second (this happens because they do not reach the minimum age). This value is reset each time it is shown.
NUMBER_PAGES_READ: Number of pages read from disk.
NUMBER_PAGES_CREATED: Number of pages created in the buffer pool.
NUMBER_PAGES_WRITTEN: Number of pages written to disk.
PAGES_READ_RATE: Pages read from disk per second.
PAGES_CREATE_RATE: Pages created in the buffer pool per second.
PAGES_WRITTEN_RATE: Pages written to disk per second.
NUMBER_PAGES_GET: Requests of pages that are not in the buffer pool.
HIT_RATE: Rate of page hits.
YOUNG_MAKE_PER_THOUSAND_GETS: Pages made young per thousand physical reads.
NOT_YOUNG_MAKE_PER_THOUSAND_GETS: Pages that remain in the old list per thousand reads.
NUMBER_PAGES_READ_AHEAD: Number of pages read with a read ahead operation.
NUMBER_READ_AHEAD_EVICTED: Number of pages read with a read ahead operation that were never used and then were evicted.
READ_AHEAD_RATE: Similar to NUMBER_PAGES_READ_AHEAD, but this is a per-second rate.
READ_AHEAD_EVICTED_RATE: Similar to NUMBER_READ_AHEAD_EVICTED, but this is a per-second rate.
LRU_IO_TOTAL: Total number of pages read or written to disk.
LRU_IO_CURRENT: Pages read or written to disk within the last second.
UNCOMPRESS_TOTAL: Pages that have been uncompressed.
UNCOMPRESS_CURRENT: Pages that have been uncompressed within the last second.

The per-second values are reset after they are shown.

The PAGES_MADE_YOUNG_RATE and PAGES_MADE_NOT_YOUNG_RATE values show us, respectively, how often old pages become new and how many old pages are never accessed in a reasonable amount of time. If the former value is too high, the old list is probably not big enough, and vice versa.

Comparing READ_AHEAD_RATE and READ_AHEAD_EVICTED_RATE is useful to tune the read ahead feature. The READ_AHEAD_EVICTED_RATE value should be low, because it indicates how many pages read by read ahead operations were not useful.
If their ratio is good but READ_AHEAD_RATE is low, the read ahead should probably be used more often. In this case, if the linear read ahead is used, we can try to increase or decrease innodb_read_ahead_threshold. Alternatively, we can change the algorithm used (linear or random read ahead).

The columns whose names end with _RATE better describe the current server activities. They should be examined several times a day, and over a whole week or month, perhaps with the help of one or more monitoring tools. Good, free software monitoring tools include Cacti and Nagios. The Percona Monitoring Tools package includes MariaDB (and MySQL) plugins that provide an interface to these tools.

Dumping and loading the buffer pool

In some cases, one may want to save the current contents of the buffer pool and reload them later. The most common case is when the server is stopped. Normally, on startup, the buffer pool is empty, and InnoDB needs to fill it with useful data. This process is called warm-up. Until the warm-up is complete, the InnoDB performance is lower than usual.

Two variables help avoid the warm-up phase: innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup. If their value is ON, InnoDB automatically saves the buffer pool into a file at shutdown and restores it at startup. Their default value is OFF.

Turning them ON can be very useful, but remember the caveats:

The startup and shutdown time might be longer. In some cases, we might prefer MariaDB to start more quickly even if it is slower during warm-up.
We need the disk space necessary to store the buffer pool.

The user may also want to dump the buffer pool at any moment and restore it without restarting the server. This is advisable when the buffer pool is optimal and some statements are going to heavily change its contents. A common example is when a big InnoDB table is fully scanned. This happens, for example, during logical backups.
A full table scan will fill the old list with infrequently accessed data. A good way to solve the problem is to dump the buffer pool before the table scan and reload it later.

This operation can be performed by setting two special variables: innodb_buffer_pool_dump_now and innodb_buffer_pool_load_now. Reading the values of these variables always returns OFF. Setting the first variable to ON forces InnoDB to immediately dump the buffer pool into a file. Setting the latter variable to ON forces InnoDB to load the buffer pool from that file. In both cases, the progress of the dump or load operation is indicated by the Innodb_buffer_pool_dump_status and Innodb_buffer_pool_load_status status variables. If loading the buffer pool takes too long, it is possible to stop it by setting innodb_buffer_pool_load_abort to ON.

The name and path of the dump file are specified in the innodb_buffer_pool_filename server variable. Of course, we should be sure that the chosen directory has room for the file, although the file is much smaller than the memory used by the buffer pool.

InnoDB change buffer

The change buffer is a cache that is a part of the buffer pool. It contains dirty pages related to secondary indexes (not primary keys) that are not stored in the main part of the buffer pool. If the modified data is read later, it will be merged into the buffer pool. In older versions, this buffer was called the insert buffer, but it has been renamed, because it can now also handle deletions.

The change buffer speeds up the following write operations:

insertions: When new rows are written.
deletions: When existing rows are marked for deletion but not yet physically erased for performance reasons.
purges: The physical elimination of previously marked rows and obsolete index values. This is periodically done by a dedicated thread.

In some cases, we may want to disable the change buffer. For example, we may have a working set that only fits in memory if the change buffer is discarded.
In this case, even after disabling it, we will still have all the frequently accessed secondary indexes in the buffer pool. Also, DML statements may be rare for our database, or we may have just a few secondary indexes: in these cases, the change buffer does not help.

The change buffer can be configured using the following variables:

innodb_change_buffer_max_size: This is the maximum size of the change buffer, expressed as a percentage of the buffer pool. The allowed range is 0 to 50, and the default value is 25.
innodb_change_buffering: This determines which types of operations are cached by the change buffer. The allowed values are none (to disable the buffer), all, inserts, deletes, purges, and changes (to cache inserts and deletes, but not purges). The default value is all.

Explaining the doublewrite buffer

When InnoDB writes a page to disk, at least two events can interrupt the operation after it is started: a hardware failure or an OS failure. In the case of an OS failure, this should not be possible if the pages are not bigger than the blocks written by the system.

If a page is left half-written, the InnoDB redo and undo logs are not sufficient to recover it, because they only contain page IDs, not the page data (logging only the IDs improves performance).

To avoid half-written pages, InnoDB uses the doublewrite buffer. This mechanism involves writing every page twice. A page is valid after the second write is complete. When the server restarts, if a recovery occurs, half-written pages are discarded. The doublewrite buffer has only a small impact on performance, because the writes are sequential and are flushed to disk together.

However, it is still possible to disable the doublewrite buffer by setting the innodb_doublewrite variable to OFF in the configuration file or by starting the server with the --skip-innodb-doublewrite parameter. This can be done if data correctness is not important.
If performance is very important, and we use a fast storage device, we may notice the overhead caused by the additional disk writes. But if data correctness is important to us, we do not want to simply disable the doublewrite buffer. MariaDB provides an alternative mechanism called atomic writes. These writes are like a transaction: they completely succeed or they completely fail. Half-written data is not possible. However, MariaDB does not directly implement this mechanism, so it can only be used on FusionIO storage devices using the DirectFS filesystem. FusionIO devices are very fast flash memories that can be used as block storage or DRAM memory. To enable this alternative mechanism, we can set innodb_use_atomic_writes to ON. This automatically disables the doublewrite buffer.

Summary

In this article, we discussed the main MariaDB buffers. The most important ones are the caches used by the storage engine. We dedicated much space to the InnoDB buffer pool, because it is more complex and, usually, InnoDB is the most used storage engine.

Resources for Article:

Further resources on this subject:
Building a Web Application with PHP and MariaDB – Introduction to caching [article]
Installing MariaDB on Windows and Mac OS X [article]
Using SHOW EXPLAIN with running queries [article]



Packt

16 Sep 2014

12 min read

Using R for Statistics, Research, and Graphics


In this article by David Alexander Lillis, author of R Graph Essentials, we will talk about R. Developed by Professor Ross Ihaka and Dr. Robert Gentleman at the University of Auckland (New Zealand) during the early 1990s, the R statistics environment is a real success story. R is open source software, which you can download in a couple of minutes from the Comprehensive R Archive Network (CRAN) website (http://cran.r-project.org/), and combines a powerful programming language, outstanding graphics, and a comprehensive range of useful statistical functions. If you need a statistics environment that includes a programming language, R is ideal. It's true that the learning curve is longer than for spreadsheet-based packages, but once you master the R programming syntax, you can develop your own very powerful analytic tools. Many contributed packages are available on the web for use with R, and very often the analytic tools you need can be downloaded at no cost.

(For more resources related to this topic, see here.)

The main problem for those new to R is the time required to master the programming language, but several nice graphical user interfaces, such as John Fox's R Commander package, are available, which make it much easier for the newcomer to develop proficiency in R than it used to be. For many statisticians and researchers, R is the package of choice because of its powerful programming language, the easy availability of code, and because it can import Excel spreadsheets, comma-separated values (.csv) files, and text files, as well as SPSS files, STATA files, and files produced within other statistical packages. You may be looking for a tool for your own data analysis. If so, let's take a brief look at what R can do for you.

Some basic R syntax

Data can be created in R or else read in from .csv or other files as objects.
For example, you can read in the data contained within a .csv file called mydata.csv as follows:

A <- read.csv("mydata.csv", header=TRUE)
A

The object A now contains all the data of the original file. The syntax A[3,7] picks out the element in row 3 and column 7. The syntax A[14, ] selects the fourteenth row and A[,6] selects the sixth column. The calls colMeans(A) and sapply(A, sd) find the mean and standard deviation of each column (assuming that all the columns are numeric). The syntax B <- 3*A + 7 would triple each element of A, add 7 to each element, and store the new array as the object B.

You could then save this array as a .csv file called Outputfile.csv as follows:

write.csv(B, file="Outputfile.csv")

Statistical modeling

R provides a comprehensive range of basic statistical functions relating to the commonly used distributions (normal distribution, t-distribution, Poisson, gamma, and so on), and many less well-known distributions. It also provides a range of non-parametric tests that are appropriate when your data are not distributed normally. Linear and non-linear regressions are easy to perform, and finding the optimum model (that is, by eliminating non-significant independent variables and non-significant factor interactions) is particularly easy. Implementing Generalized Linear Models and other commonly used models such as Analysis of Variance, Multivariate Analysis of Variance, and Analysis of Covariance is also straightforward and, once you know the syntax, you may find that such tasks can be done more quickly in R than in other packages. The usual post-hoc tests for identifying factor levels that are significantly different from the other levels (for example, Tukey and Scheffé tests) are available, and testing for interactions between factors is easy. Factor Analysis, and the related Principal Components Analysis, are well-known data reduction techniques that enable you to explain your data in terms of smaller sets of independent variables (or factors).
Both methods are available in R, and code for complex designs, including One and Two Way Repeated Measures, and Four Way ANOVA (for example, two repeated measures and two between-subjects), can be written relatively easily or downloaded from various websites (for example, http://www.personality-project.org/r/). Other analytic tools include Cluster Analysis, Discriminant Analysis, Multidimensional Scaling, and Correspondence Analysis. R also provides various methods for fitting analytic models to data and smoothing (for example, lowess and spline-based methods).

Miscellaneous packages for specialist methods

You can find some very useful packages of R code for fields as diverse as biometry, epidemiology, astrophysics, econometrics, financial and actuarial modeling, the social sciences, and psychology. For example, if you are interested in Astrophysics, Penn State Astrophysics School offers a nice website that includes both tutorials and code (http://www.iiap.res.in/astrostat/RTutorials.html).

Here I'll mention just a few of the popular techniques:

Monte Carlo methods

A number of sources give excellent accounts of how to perform Monte Carlo simulations in R (that is, drawing samples from multidimensional distributions and estimating expected values). A valuable text is Christian Robert's book Introducing Monte Carlo Methods with R. Murali Haran gives another interesting Astrophysical example in the CAStR website (http://www.stat.psu.edu/~mharan/MCMCtut/MCMC.html).

Structural Equation Modeling

Structural Equation Modeling (SEM) is becoming increasingly popular in the social sciences and economics as an alternative to other modeling techniques such as multiple regression, factor analysis, and analysis of covariance. Essentially, SEM is a kind of multiple regression that takes account of factor interactions, nonlinearities, measurement error, multiple latent independent variables, and latent dependent variables.
Useful references for conducting SEM in R include those of Revelle, Farnsworth (2008), and Fox (2002 and 2006).

Data mining

A number of very useful resources are available for anyone contemplating data mining using R. For example, Luis Torgo has just published a book on data mining using R, and presents case studies, along with the datasets and code, which the interested student can work through. Torgo's book provides the usual analytic and graphical techniques used every day by data miners, including visualization techniques, dealing with missing values, developing prediction models, and methods for evaluating the performance of your models.

Also of interest to the data miner is the Rattle GUI (R Analytical Tool to Learn Easily). Rattle is a data mining facility for analyzing very large data sets. It provides many useful statistical and graphical data summaries, presents mechanisms for developing a variety of models, and summarizes the performance of your models.

Graphics in R

Quite simply, the quality and range of graphics available through R is superb and, in my view, vastly superior to those of any other package I have encountered. Of course, you have to write the necessary code, but once you have mastered this skill, you have access to wonderful graphics. You can write your own code from scratch, but many websites provide helpful examples, complete with code, which you can download and modify to suit your own needs.

R's base graphics (graphics created without the use of any additional contributed packages) are superb, but various graphics packages such as ggplot2 (and the associated qplot function) help you to create wonderful graphs.
R's graphics capabilities include, but are not limited to, the following:

Base graphics in R

Basic graphics techniques and syntax
Creating scatterplots and line plots
Customizing axes, colors, and symbols
Adding text: legends, titles, and axis labels
Adding lines: interpolation lines, regression lines, and curves
Increasing complexity: graphing three variables, multiple plots, or multiple axes
Saving your plots to multiple formats: PDF, PostScript, and JPG
Including mathematical expressions on your plots
Making graphs clear and pretty: including a grid, point labels, and shading
Shading and coloring your plot
Creating bar charts, histograms, boxplots, pie charts, and dotcharts
Adding loess smoothers
Scatterplot matrices
R's color palettes
Adding error bars

Creating graphs using qplot

Using basic qplot graphics techniques and syntax to customize in easy steps
Creating scatterplots and line plots in qplot
Mapping symbol size, symbol type, and symbol color to categorical data
Including regressions and confidence intervals on your graphs
Shading and coloring your graph
Creating bar charts, histograms, boxplots, pie charts, and dotcharts
Labelling points on your graph

Creating graphs using ggplot

Plotting options: backgrounds, sizes, transparencies, and colors
Superimposing points
Controlling symbol shapes and using pretty color schemes
Stacked, clustered, and paneled bar charts
Methods for detailed customization of lines, point labels, smoothers, confidence bands, and error bars

The following graph records information on the heights in centimeters and weights in kilograms of patients in a medical study. The curve in red gives a smoothed version of the data, created using locally weighted scatterplot smoothing. Both the graph and the modeling required to produce the smoothed curve were performed in R.

Here is another graph. It gives the heights and body masses of female patients receiving treatment in a hospital. Each patient is identified by name.
This graph was created very easily using ggplot, and shows the default background produced by ggplot (a grey plotting background and white grid lines).

Next, we see a histogram of patients' heights and body masses, partitioned by gender. The bars are given in an orange and an ivory color. The ggplot package provides a wide range of colors and hues, as well as a wide range of color palettes.

Finally, we see a line graph of height against age for a group of four children. The graph includes both points and lines and we have a unique color for each child. The ggplot package makes it possible to create attractive and effective graphs for research and data analysis.

Summary

For many scientists and data analysts, mastery of R could be an investment for the future, particularly for those who are beginning their careers. The technology for handling scientific computation is advancing very quickly, and is a major impetus for scientific advance. Some level of mastery of R has become, for many applications, essential for taking advantage of these developments. Spatial analysis, where R provides integrated access to abilities that are otherwise spread across many different computer programs, is a good example.

A few years ago, I would not have recommended R as a statistics environment for generalist data analysts or postgraduate students, except those working directly in areas involving statistical modeling. However, many tutorials are downloadable from the Internet and a number of organizations provide online tutorials and/or face-to-face workshops (for example, The Analysis Factor http://www.theanalysisfactor.com/). In addition, the appearance of GUIs, such as R Commander and the new iNZight GUI (designed for use in schools), makes it easier for non-specialists to learn and use R effectively. I am most happy to provide advice to anyone contemplating learning to use this outstanding statistical and research tool.
References

Some useful material on R is as follows:

L'analyse des données. Tome 1: La taxinomie, Tome 2: L'analyse des correspondances, Benzécri, J. P., Dunod, Paris, 1973.
Computation of Correspondence Analysis, Blasius, J. and Greenacre, M. J., 1994. In M. J. Greenacre and J. Blasius (eds.), Correspondence Analysis in the Social Sciences, pp. 53–75, Academic Press, London.
Statistics: An Introduction using R, Crawley, M. J. (m.crawley@imperial.ac.uk), Imperial College, Silwood Park, Ascot, Berks. Published in 2005 by John Wiley & Sons, Ltd. (ISBN 0-470-02297-3). http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470022973,subjectCd-ST05.html and http://www3.imperial.ac.uk/naturalsciences/research/statisticsusingr.
Structural Equation Models, Appendix to An R and S-PLUS Companion to Applied Regression, Fox, John, http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-sems.pdf.
Getting Started with the R Commander, Fox, John, 26 August 2006.
The R Commander: A Basic-Statistics Graphical User Interface to R, Fox, John, Journal of Statistical Software, September 2005, Volume 14, Issue 9. http://www.jstatsoft.org/.
Structural Equation Modeling With the sem Package in R, Fox, John, Structural Equation Modeling, 13(3), 465–486, Lawrence Erlbaum Associates, Inc., 2006.
Biplots in Biomedical Research, Gabriel, K. R. and Odoroff, C., Statistics in Medicine, 9, 469–485, 1990.
Theory and Applications of Correspondence Analysis, Greenacre, M. J., Academic Press, London, 1984.
Using R for Data Analysis and Graphics: Introduction, Code and Commentary, Maindonald, J. H., Centre for Mathematics and its Applications, Australian National University.
Introducing Monte Carlo Methods with R, Series Use R, Robert, Christian P. and Casella, George, 2010, XX, 284 p., Softcover, ISBN 978-1-4419-1575-7.

Useful tutorials available on the web are as follows:

An Introduction to R: examples for Actuaries, De Silva, N., 2006, http://toolkit.pbworks.com/f/R%20Examples%20for%20Actuaries%20v0.1-1.pdf.
Econometrics in R, Farnsworth, Grant V., October 26, 2008, http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf.
An Introduction to the R Language, Harte, David, Statistics Research Associates Limited, www.statsresearch.co.nz.
Quick R, Kabacoff, Rob, http://www.statmethods.net/index.html.
R for SAS and SPSS Users, Muenchen, Bob, http://RforSASandSPSSusers.com.
Statistical Analysis with R - a quick start, Nenadić and Zucchini, Walter.
R for Beginners, Paradis, Emmanuel (paradis@isem.univ-montp2.fr), Institut des Sciences de l'Evolution, Universite Montpellier II, F-34095 Montpellier cédex 05, France.
Data Mining with R: learning by case studies, Torgo, Luis, http://www.liaad.up.pt/~ltorgo/DataMiningWithR/.
SimpleR - Using R for Introductory Statistics, Verzani, John, http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf.
Time Series Analysis and Its Applications: With R Examples, http://www.stat.pitt.edu/stoffer/tsa2/textRcode.htm#ch2.
The irises of the Gaspé peninsula, Anderson, E., Bulletin of the American Iris Society, 59, 2-5, 1935.

Resources for Article:

Further resources on this subject:
Aspects of Data Manipulation in R [Article]
Learning Data Analytics with R and Hadoop [Article]
First steps with R [Article]


Packt

26 Aug 2014

7 min read

Stream Grouping


In this article by Ankit Jain and Anand Nalya, the authors of the book Learning Storm, we will cover the different types of stream groupings.

(For more resources related to this topic, see here.)

When defining a topology, we create a graph of computation with a number of bolts processing streams. At a more granular level, each bolt executes as multiple tasks in the topology. A stream will be partitioned into a number of partitions and divided among the bolt's tasks. Thus, each task of a particular bolt will only get a subset of the tuples from the subscribed streams. Stream grouping in Storm provides complete control over how this partitioning of tuples happens among the many tasks of a bolt subscribed to a stream. Grouping for a bolt can be defined on the instance of the backtype.storm.topology.InputDeclarer class returned when defining bolts using the backtype.storm.topology.TopologyBuilder.setBolt method.

Storm supports the following types of stream groupings:

Shuffle grouping
Fields grouping
All grouping
Global grouping
Direct grouping
Local or shuffle grouping
Custom grouping

Now, we will look at each of these groupings in detail.

Shuffle grouping

Shuffle grouping distributes tuples in a uniform, random way across the tasks, so each task processes approximately the same number of tuples. This grouping is ideal when you want to distribute your processing load uniformly across the tasks and there is no requirement for any data-driven partitioning.

Fields grouping

Fields grouping enables you to partition a stream on the basis of some of the fields in the tuples. For example, if you want all the tweets from a particular user to go to a single task, you can partition the tweet stream using fields grouping on the username field in the following manner:

builder.setSpout("1", new TweetSpout());
builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username"));

Fields grouping is calculated with the following function:

hash(fields) % (number of tasks)

Here, hash is a hashing function. Fields grouping does not guarantee that each task will get tuples to process. For example, if you have applied fields grouping on a field, say X, with only two possible values, A and B, and created two tasks for the bolt, then it is possible that hash(A) % 2 and hash(B) % 2 are equal, which will result in all the tuples being routed to a single task and the other task being completely idle.

Another common usage of fields grouping is to join streams. Since partitioning happens solely on the basis of field values and not the stream type, we can join two streams on any common join fields. The names of the fields do not need to be the same. For example, in an order-processing domain, we can join the Order and ItemScanned streams when an order is completed:

builder.setSpout("1", new OrderSpout());
builder.setSpout("2", new ItemScannedSpout());
builder.setBolt("joiner", new OrderJoiner())
    .fieldsGrouping("1", new Fields("orderId"))
    .fieldsGrouping("2", new Fields("orderRefId"));

All grouping

All grouping is a special grouping that does not partition the tuples but replicates them to all the tasks, that is, each tuple will be sent to each of the bolt's tasks for processing.

One common use case of all grouping is sending signals to bolts. For example, if you are doing some kind of filtering on the streams, you have to pass the filter parameters to all the bolts. This can be achieved by sending those parameters over a stream that all the bolt's tasks subscribe to with all grouping. Another example is sending a reset message to all the tasks in an aggregation bolt.

The following is an example of all grouping:

builder.setSpout("1", new TweetSpout());
builder.setSpout("signals", new SignalSpout());
builder.setBolt("2", new TweetCounter())
    .fieldsGrouping("1", new Fields("username"))
    .allGrouping("signals");

Here, we are subscribing all of the TweetCounter bolt's tasks to the signals stream.
Now, we can send different signals to the TweetCounter bolt using SignalSpout.

Global grouping

Global grouping does not partition the stream but sends the complete stream to the bolt's task with the smallest ID. A general use case of this is when there needs to be a reduce phase in your topology where you want to combine the results from previous steps of the topology in a single bolt.

Global grouping might seem redundant at first, as you can achieve the same result by setting the parallelism of the bolt to one and giving it a single input stream. However, when you have multiple streams of data coming through different paths, you might want only one of the streams to be reduced and the others to be processed in parallel.

For example, consider the following topology. In this topology, you might want to route all the tuples coming from Bolt C to a single Bolt D task, while you might still want parallelism for tuples coming from Bolt E to Bolt D.

[Figure: Global grouping]

This can be achieved with the following code snippet:

    builder.setSpout("a", new SpoutA());
    builder.setSpout("b", new SpoutB());
    builder.setBolt("c", new BoltC());
    builder.setBolt("e", new BoltE());
    builder.setBolt("d", new BoltD())
        .globalGrouping("c")
        .shuffleGrouping("e");

Direct grouping

In direct grouping, the emitter decides where each tuple will go for processing. For example, say we have a log stream and we want to process each log entry using a specific bolt task on the basis of the type of resource. In this case, we can use direct grouping. Direct grouping can only be used with direct streams.
To declare a stream as a direct stream, use the backtype.storm.topology.OutputFieldsDeclarer.declareStream method, which takes a Boolean parameter named direct, in the following way in your spout:

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("directStream", true, new Fields("field1"));
    }

Now, we need the task IDs of the target component so that we can specify the taskId parameter while emitting the tuple. This can be done using the backtype.storm.task.TopologyContext.getComponentTasks method in the prepare method of the bolt. This method returns the list of task IDs for a component, so the following snippet stores that list in a bolt field:

    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        // getComponentTasks returns the task IDs of the given component
        this.targetTasks = context.getComponentTasks("my-stream");
        this.collector = collector;
    }

Once you have a direct stream to emit to, use the backtype.storm.task.OutputCollector.emitDirect method instead of the emit method to emit tuples. The emitDirect method takes a taskId parameter to specify the task. In the following snippet, we are emitting to one of the tasks randomly:

    public void execute(Tuple input) {
        // pick one of the target task IDs at random
        int taskId = this.targetTasks.get(new Random().nextInt(this.targetTasks.size()));
        collector.emitDirect(taskId, process(input));
    }

Local or shuffle grouping

If the tuple source and the target bolt tasks are running in the same worker, this grouping acts as a shuffle grouping only among the target tasks running on that worker, which minimizes network hops and increases performance. If there are no target bolt tasks running on the source worker process, this grouping acts like the shuffle grouping mentioned earlier.

Custom grouping

If none of the preceding groupings fit your use case, you can define your own custom grouping by implementing the backtype.storm.grouping.CustomStreamGrouping interface.
The following is a sample custom grouping that partitions a stream on the basis of the category in the tuples:

    public class CategoryGrouping implements CustomStreamGrouping, Serializable {
        // Mapping of category to integer values for grouping
        private static final Map<String, Integer> categories = ImmutableMap.of(
            "Financial", 0,
            "Medical", 1,
            "FMCG", 2,
            "Electronics", 3
        );

        // task IDs of the subscribed bolt; initialized in the prepare method
        private List<Integer> targetTasks;

        public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
            // remember the target task IDs
            this.targetTasks = targetTasks;
        }

        public List<Integer> chooseTasks(int taskId, List<Object> values) {
            // map the category to one of the target task IDs
            String category = (String) values.get(0);
            int index = categories.get(category) % targetTasks.size();
            return ImmutableList.of(targetTasks.get(index));
        }
    }

Now, we can use this grouping in our topologies with the following code snippet:

    builder.setSpout("a", new SpoutA());
    builder.setBolt("b", (IRichBolt) new BoltB())
        .customGrouping("a", new CategoryGrouping());

The following diagram represents the Storm groupings graphically:

Summary

In this article, we discussed stream grouping in Storm and its types.

Resources for Article: Further resources on this subject: Integrating Storm and Hadoop [article] Deploying Storm on Hadoop for Advertising Analysis [article] Photo Stream with iCloud [article]
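The routing rules behind fields grouping and the custom CategoryGrouping can be tried out without a Storm cluster. The following is a minimal plain-Java sketch, not Storm code: the class GroupingSketch and its methods are hypothetical names, List.hashCode stands in for Storm's internal hash function (which may differ), and sending unknown categories to the first slot is an assumption made here for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; it does not use the Storm API.
final class GroupingSketch {

    // Category-to-slot mapping, mirroring the CategoryGrouping example.
    private static final Map<String, Integer> CATEGORIES = new HashMap<>();
    static {
        CATEGORIES.put("Financial", 0);
        CATEGORIES.put("Medical", 1);
        CATEGORIES.put("FMCG", 2);
        CATEGORIES.put("Electronics", 3);
    }

    // Fields grouping: hash(fields) % (number of tasks).
    // List.hashCode is an assumed stand-in for Storm's hash function.
    static int fieldsGroupingIndex(List<?> groupingValues, int numTasks) {
        // floorMod keeps the index in [0, numTasks) even for negative hash codes
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    // Custom grouping: map a category to one of the target task IDs.
    // Unknown categories fall back to slot 0 (an assumption of this sketch).
    static int categoryTask(String category, List<Integer> targetTasks) {
        int slot = CATEGORIES.getOrDefault(category, 0);
        return targetTasks.get(slot % targetTasks.size());
    }

    public static void main(String[] args) {
        // Distinct field values can land on the same task when there are few tasks.
        for (String value : Arrays.asList("A", "B", "C")) {
            System.out.println(value + " -> task index "
                    + fieldsGroupingIndex(Arrays.asList(value), 2));
        }
        // Task IDs are arbitrary integers, so the slot is used as an index into the list.
        List<Integer> targetTasks = Arrays.asList(7, 9, 11);
        System.out.println("Electronics -> task " + categoryTask("Electronics", targetTasks));
    }
}
```

With two tasks, the values "A" and "C" map to the same index in this sketch while "B" maps to the other one, which is the kind of collision described for fields grouping on a field with few distinct values.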


Packt

26 Aug 2014

5 min read

What is a content provider?


This article is written by Sunny Kumar Aditya and Vikash Kumar Karn, the authors of Android SQLite Essentials.

There are four essential components in an Android application: activity, service, broadcast receiver, and content provider. A content provider is used to manage access to a structured set of data. Content providers encapsulate the data and provide abstraction as well as a mechanism for defining data security. They are primarily intended to be used by other applications, which access the provider through a provider client object. Together, providers and provider clients offer a consistent, standard interface to data that also handles inter-process communication and secure data access. (For more resources related to this topic, see here.)

A content provider allows one app to share data with other applications. By design, an Android SQLite database created by an application is private to that application; this is excellent from a security point of view but troublesome when you want to share data across different applications. This is where a content provider comes to the rescue; you can easily share data by building your own content provider. It is important to note that although our discussion focuses on a database, a content provider is not limited to it. It can also be used to serve data that normally goes into files, such as photos, audio, or videos.

The interaction between Application A and Application B while exchanging data can be seen in the following diagram:

Here we have an Application A, whose activity needs to access the database of Application B. As we have already seen, the database of Application B is stored in internal memory and cannot be directly accessed by Application A. This is where the content provider comes into the picture; it allows us to share data and manage the access given to other applications. The content provider implements methods for querying, inserting, updating, and deleting data in databases.
Application A now requests the content provider to perform the desired operations on its behalf. We will explore the use of a content provider to fetch contacts from a phone's contact database.

Using existing content providers

Android lists a lot of standard content providers that we can use. Some of them are Browser, CalendarContract, CallLog, Contacts, ContactsContract, MediaStore, UserDictionary, and so on. We will fetch contacts from a phone's contact list with the help of the system's existing ContentProvider and ContentResolver. We will be using the ContactsContract provider for this purpose.

What is a content resolver?

The ContentResolver object in the application's context is used to communicate with the provider as a client. The ContentResolver object communicates with the provider object, which is an instance of a class that implements ContentProvider. The provider object receives data requests from clients, performs the requested action, and returns the results. ContentResolver is the single, global instance in our application that provides access to other applications' content providers; we do not need to worry about handling inter-process communication. The ContentResolver methods provide the basic CRUD (create, retrieve, update, and delete) functions of persistent storage; it has methods that call identically named methods in the provider object but does not know their implementation.
In the code provided in the corresponding AddNewContactActivity class, we initiate the picking of a contact by building an Intent object with the Intent.ACTION_PICK action, which allows us to pick an item from a data source; in addition, all we need to know is the URI of the provider, which in our case is ContactsContract.Contacts.CONTENT_URI:

    public void pickContact() {
        try {
            Intent cIntent = new Intent(Intent.ACTION_PICK, ContactsContract.Contacts.CONTENT_URI);
            startActivityForResult(cIntent, PICK_CONTACT);
        } catch (Exception e) {
            e.printStackTrace();
            Log.i(TAG, "Exception while picking contact");
        }
    }

The code used in this article is placed at GitHub: https://github.com/sunwicked/Ch3-PersonalContactManager/tree/master

This functionality is also provided by Messaging, Gallery, and Contacts. The Contacts screen will pop up, allowing us to browse or search for the contacts we want to migrate to our new application. Our next stop is onActivityResult, where we handle the result of our request to pick a contact. Let's look at the code we have to add to pick contacts from Android's contact provider:

    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        . .
        else if (requestCode == PICK_CONTACT) {
            if (resultCode == Activity.RESULT_OK) {
                Uri contactData = data.getData();
                Cursor c = getContentResolver().query(contactData, null, null, null, null);
                if (c.moveToFirst()) {
                    String id = c.getString(c.getColumnIndexOrThrow(ContactsContract.Contacts._ID));
                    String hasPhone = c.getString(c.getColumnIndex(ContactsContract.Contacts.HAS_PHONE_NUMBER));
                    if (hasPhone.equalsIgnoreCase("1")) {
                        Cursor phones = getContentResolver().query(
                            ContactsContract.CommonDataKinds.Phone.CONTENT_URI, null,
                            ContactsContract.CommonDataKinds.Phone.CONTACT_ID + " = " + id, null, null);
                        phones.moveToFirst();
                        contactPhone.setText(phones.getString(phones.getColumnIndex("data1")));
                        contactName.setText(phones.getString(phones.getColumnIndex(ContactsContract.Contacts.DISPLAY_NAME)));
                    }
                    …..

We start by checking whether the request code matches ours, and then we cross-check resultCode. We get the content resolver object by calling getContentResolver on the Context object; it is a method of the android.content.Context class. As we are in an activity, which inherits from Context, we do not need to call it on an explicit Context object; the same goes for services. We then check whether the contact we picked has a phone number. After verifying the necessary details, we pull the data that we require, such as the contact name and phone number, and set them in the relevant fields.

Summary

This article reflects on how to access and share data in Android via content providers and how to construct a content provider. We also talked about content resolvers and how they are used to communicate with the providers as a client.

Resources for Article: Further resources on this subject: Reversing Android Applications [article] Saying Hello to Unity and Android [article] Android Fragmentation Management [article]



Packt

26 Aug 2014

13 min read

More Line Charts, Area Charts, and Scatter Plots


In this article by Scott Gottreu, the author of Learning jqPlot, we'll learn how to import data from remote sources. We will discuss what area charts, stacked area charts, and scatter plots are. Then we will learn how to implement these newly learned charts. We will also learn about trend lines. (For more resources related to this topic, see here.) Working with remote data sources We return from lunch and decide to start on our line chart showing social media conversions. With this chart, we want to pull the data in from other sources. You start to look for some internal data sources, coming across one that returns the data as an object. We can see an excerpt of data returned by the data source. We will need to parse the object and create data arrays for jqPlot: { "twitter":[ ["2012-11-01",289],...["2012-11-30",225] ], "facebook":[ ["2012-11-01",27],...["2012-11-30",48] ] } We solve this issue using a data renderer to pull our data and then format it properly for jqPlot. We can pass a function as a variable to jqPlot and when it is time to render the data, it will call this new function. We start by creating the function to receive our data and then format it. We name it remoteDataSource. jqPlot will pass the following three parameters to our function: url: This is the URL of our data source. plot: The jqPlot object we create is passed by reference, which means we could modify the object from within remoteDataSource. However, it is best to treat it as a read-only object. options: We can pass any type of option in the dataRendererOptions option when we create our jqPlot object. For now, we will not be passing in any options: <script src="../js/jqplot.dateAxisRenderer.min.js"></script> <script> $(document).ready(function(){ var remoteDataSource = function(url, plot, options) { Next we create a new array to hold our formatted data. Then, we use the $.ajax method in jQuery to pull in our data. We set the async option to false. 
If we don't, the function will continue to run before getting the data and we'll have an empty chart: var data = new Array; $.ajax({ async: false, We set the url option to the url variable that jqPlot passed in. We also set the data type to json: url: url, dataType:"json", success: function(remoteData) { Then we will take the twitter object in our JSON and make that the first element of our data array and make facebook the second element. We then return the whole array back to jqPlot to finish rendering our chart: data.push(remoteData.twitter); data.push(remoteData.facebook); } }); return data; }; With our previous charts, after the id attribute, we would have passed in a data array. This time, instead of passing in a data array, we pass in a URL. Then, within the options, we declare the dataRenderer option and set remoteDataSource as the value. Now when our chart is created, it will call our renderer and pass in all the three parameters we discussed earlier: var socialPlot = $.jqplot ('socialMedia', "./data/social_shares.json", { title:'Social Media Shares', dataRenderer: remoteDataSource, We create labels for both our data series and enable the legend: series:[ { label: 'Twitter' }, { label: 'Facebook' } ], legend: { show: true, placement: 'outsideGrid' }, We enable DateAxisRenderer for the x axis and set min to 0 on the y axis, so jqPlot will not extend the axis below zero: axes:{ xaxis:{ renderer:$.jqplot.DateAxisRenderer, label: 'Days in November' }, yaxis: { min:0, label: 'Number of Shares' } } }); }); </script> <div id="socialMedia" style="width:600px;"></div> If you are running the code samples from your filesystem in Chrome, you will get an error message similar to this: No 'Access-Control-Allow-Origin' header is present on the requested resource. The security settings do not allow AJAX requests to be run against files on the filesystem. It is better to use a local web server such as MAMP, WAMP, or XAMPP. This way, we avoid the access control issues. 
Further information about cross-site HTTP requests can be found at the Mozilla Developer Network at https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS. We load this new chart in our browser and can see the result. We are likely to run into cross-domain issues when trying to access remote sources that do not allow cross-domain requests. The common practice to overcome this hurdle would be to use the JSONP data type in our AJAX call. jQuery will only run JSONP calls asynchronously. This keeps your web page from hanging if a remote source stops responding. However, because jqPlot requires all the data from the remote source before continuing, we can't use cross-domain sources with our data renderers. We start to think of ways we can use external APIs to pull in data from all kinds of sources. We make a note to contact the server guys to write some scripts to pull from the external APIs we want and pass along the data to our charts. By doing it in this way, we won't have to implement OAuth (OAuth is a standard framework used for authentication), http://oauth.net/2, in our web app or worry about which sources allow cross-domain access. Adding to the project's scope As we continue thinking up new ways to work with this data, Calvin stops by. "Hey guys, I've shown your work to a few of the regional vice-presidents and they love it." Your reply is that all of this is simply an experiment and was not designed for public consumption. Calvin holds up his hands as if to hold our concerns at bay. "Don't worry, they know it's all in beta. They did have a couple of ideas. Can you insert in the expenses with the revenue and profit reports? They also want to see those same charts but formatted differently." He continues, "One VP mentioned that maybe we could have one of those charts where everything under the line is filled in. Oh, and they would like to see these by Wednesday ahead of the meeting." 
With that, Calvin turns around and makes his customary abrupt exit. Adding a fill between two lines We talk through Calvin's comments. Adding in expenses won't be too much of an issue. We could simply add the expense line to one of our existing reports but that will likely not be what they want. Visually, the gap on our chart between profit and revenue should be the total amount of expenses. You mention that we could fill in the gap between the two lines. We decide to give this a try: We leave the plugins and the data arrays alone. We pass an empty array into our data array as a placeholder for our expenses. Next, we update our title. After this, we add a new series object and label it Expenses: ... var rev_profit = $.jqplot ('revPrfChart', [revenue, profit, [] ], { title:'Monthly Revenue & Profit with Highlighted Expenses', series:[ { label: 'Revenue' }, { label: 'Profit' }, { label: 'Expenses' } ], legend: { show: true, placement: 'outsideGrid' }, To fill in the gap between the two lines, we use the fillBetween option. The only two required options are series1 and series2. These require the positions of the two data series in the data array. So in our chart, series1 would be 0 and series2 would be 1. The other three optional settings are: baseSeries, color, and fill. The baseSeries option tells jqPlot to place the fill on a layer beneath the given series. It will default to 0. If you pick a series above zero, then the fill will hide any series below the fill layer: fillBetween: { series1: 0, series2: 1, We want to assign a different value to color because it will default to the color of the first data series option. The color option will accept either a hexadecimal value or the rgba option, which allows us to change the opacity of the fill. Even though the fill option defaults to true, we explicitly set it. 
This option also gives us the ability to turn off the fill after the chart is rendered: color: "rgba(232, 44, 12, 0.5)", fill: true }, The settings for the rest of the chart remain unchanged: axes:{ xaxis:{ renderer:$.jqplot.DateAxisRenderer, label: 'Months' }, yaxis:{ label: 'Totals Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id="revPrfChart" style="width:600px;"></div> We switch back to our web browser and load the new page. We see the result of our efforts in the following screenshot. This chart layout works but we think Calvin and the others will want something else. We decide we need to make an area chart. Understanding area and stacked area charts Area charts come in two varieties. The default type of area chart is simply a modification of a line chart. Everything from the data point on the y axis all the way to zero is shaded. In the event your numbers are negative, then the data above the line up to zero is shaded in. Each data series you have is laid upon the others. Area charts are best to use when we want to compare similar elements, for example, sales by each division in our company or revenue among product categories. The other variation of an area chart is the stacked area chart. The chart starts off being built in the same way as a normal area chart. The first line is plotted and shaded below the line to zero. The difference occurs with the remaining lines. We simply stack them. To understand what happens, consider this analogy. Each shaded line represents a wall built to the height given in the data series. Instead of building one wall behind another, we stack them on top of each other. What can be hard to understand is the y axis. It now denotes a cumulative total, not the individual data points. For example, if the first y value of a line is 4 and the first y value on the second line is 5, then the second point will be plotted at 9 on our y axis. 
Consider this more complicated example: if the y value in our first line is 2, 7 for our second line, and 4 for the third line, then the y value for our third line will be plotted at 13. That's why we need to compare similar elements. Creating an area chart We grab the quarterly report with the divisional profits we created this morning. We will extend the data to a year and plot the divisional profits as an area chart: We remove the data arrays for revenue and the overall profit array. We also add data to the three arrays containing the divisional profits: <script src="../js/jqplot.dateAxisRenderer.min.js"></script> <script> $(document).ready(function(){ var electronics = [["2011-11-20", 123487.87], ...]; var media = [["2011-11-20", 66449.15], ...]; var nerd_corral = [["2011-11-20", 2112.55], ...]; var div_profit = $.jqplot ('division_profit', [ media, nerd_corral, electronics ], { title:'12 Month Divisional Profits', Under seriesDefaults, we assign true to fill and fillToZero. Without setting fillToZero to true, the fill would continue to the bottom of the chart. With the option set, the fill will extend downward to zero on the y axis for positive values and stop. For negative data points, the fill will extend upward to zero: seriesDefaults: { fill: true, fillToZero: true }, series:[ { label: 'Media & Software' }, { label: 'Nerd Corral' }, { label: 'Electronics' } ], legend: { show: true, placement: 'outsideGrid' }, For our x axis, we set numberTicks to 6. The rest of our options we leave unchanged: axes:{ xaxis:{ label: 'Months', renderer:$.jqplot.DateAxisRenderer, numberTicks: 6, tickOptions: { formatString: "%B" } }, yaxis: { label: 'Total Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id="division_profit" style="width:600px;"></div> We review the results of our changes in our browser. We notice something is wrong: only the Electronics series, shown in brown, is showing. This goes back to how area charts are built. 
Revisiting our wall analogy, we have built a taller wall in front of our other two walls. We need to order our data series from largest to smallest: We move the Electronics series to be the first one in our data array: var div_profit = $.jqplot ('division_profit', [ electronics, media, nerd_corral ], It's also hard to see where some of the lines go when they move underneath another layer. Thankfully, jqPlot has a fillAlpha option. We pass in a percentage in the form of a decimal and jqPlot will change the opacity of our fill area: ... seriesDefaults: { fill: true, fillToZero: true, fillAlpha: .6 }, ... We reload our chart in our web browser and can see the updated changes. Creating a stacked area chart with revenue Calvin stops by while we're taking a break. "Hey guys, I had a VP call and they want to see revenue broken down by division. Can we do that?" We tell him we can. "Great" he says, before turning away and leaving. We discuss this new request and realize this would be a great chance to use a stacked area chart. We dig around and find the divisional revenue numbers Calvin wanted. We can reuse the chart we just created and just change out the data and some options. We use the same variable names for our divisional data and plug in revenue numbers instead of profit. We use a new variable name for our chart object and a new id attribute for our div. 
We update our title and add the stackSeries option and set it to true: var div_revenue = $.jqplot ('division_revenue', [electronics, media, nerd_corral], { title: '12 Month Divisional Revenue', stackSeries: true, We leave our series' options alone, and the only option we change on our x axis is numberTicks, which we set back to 3: seriesDefaults: { fill: true, fillToZero: true }, series:[ { label: 'Electronics' }, { label: 'Media & Software' }, { label: 'Nerd Corral' } ], legend: { show: true, placement: 'outsideGrid' }, axes:{ xaxis:{ label: 'Months', renderer:$.jqplot.DateAxisRenderer, numberTicks: 3, tickOptions: { formatString: "%B" } }, We finish our changes by updating the ID of our div container so that it matches the ID passed to $.jqplot: yaxis: { label: 'Total Dollars', tickOptions: { formatString: "$%'d" } } } }); }); </script> <div id="division_revenue" style="width:600px;"></div> With our changes complete, we load this new chart in our browser. As we can see in the following screenshot, we have a chart with each of the data series stacked on top of each other. Because of the nature of a stacked chart, the individual data points are no longer decipherable; however, with the visualization, this is less of an issue. We decide that this is a good place to stop for the day. We'll start on scatterplots and trend lines tomorrow morning. As we begin gathering our things, Calvin stops by on his way out and we show him our recent work. "This is amazing. You guys are making great progress." We tell him we're going to move on to trend lines tomorrow. "Oh, good," Calvin says. "I've had requests to show trending data for our revenue and profit. Someone else mentioned they would love to see trending data of shares on Twitter for our daily deals site. But, like you said, that can wait till tomorrow. Come on, I'll walk with you two."


Packt

26 Aug 2014

23 min read

Classifying Text



Packt

25 Aug 2014

18 min read

Camera Calibration


This article by Robert Laganière, author of OpenCV Computer Vision Application Programming Cookbook Second Edition, explains that images are generally produced using a digital camera, which captures a scene by projecting light going through its lens onto an image sensor. The fact that an image is formed by the projection of a 3D scene onto a 2D plane implies the existence of important relationships between a scene and its image and between different images of the same scene. Projective geometry is the tool that is used to describe and characterize, in mathematical terms, the process of image formation. In this article, we will introduce you to some of the fundamental projective relations that exist in multiview imagery and explain how these can be used in computer vision programming. You will learn how matching can be made more accurate through the use of projective constraints and how a mosaic from multiple images can be composited using two-view relations. Before we start the recipe, let's explore the basic concepts related to scene projection and image formation. (For more resources related to this topic, see here.) Image formation Fundamentally, the process used to produce images has not changed since the beginning of photography. The light coming from an observed scene is captured by a camera through a frontal aperture; the captured light rays hit an image plane (or an image sensor) located at the back of the camera. Additionally, a lens is used to concentrate the rays coming from the different scene elements. This process is illustrated by the following figure: Here, do is the distance from the lens to the observed object, di is the distance from the lens to the image plane, and f is the focal length of the lens. These quantities are related by the so-called thin lens equation: 1/f = 1/do + 1/di. In computer vision, this camera model can be simplified in a number of ways. 
First, we can neglect the effect of the lens by considering that we have a camera with an infinitesimal aperture since, in theory, this does not change the image appearance. (However, by doing so, we ignore the focusing effect by creating an image with an infinite depth of field.) In this case, therefore, only the central ray is considered. Second, since most of the time we have do >> di, we can assume that the image plane is located at the focal distance. Finally, we can note from the geometry of the system that the image on the plane is inverted. We can obtain an identical but upright image by simply positioning the image plane in front of the lens. Obviously, this is not physically feasible, but from a mathematical point of view, this is completely equivalent. This simplified model is often referred to as the pin-hole camera model, and it is represented as follows: From this model, and using the law of similar triangles, we can easily derive the basic projective equation that relates a pictured object with its image: hi = f × ho / do. The size (hi) of the image of an object (of height ho) is therefore inversely proportional to its distance (do) from the camera, which is naturally true. In general, this relation describes where a 3D scene point will be projected on the image plane given the geometry of the camera. Calibrating a camera From the introduction of this article, we learned that the essential parameters of a camera under the pin-hole model are its focal length and the size of the image plane (which defines the field of view of the camera). Also, since we are dealing with digital images, the number of pixels on the image plane (its resolution) is another important characteristic of a camera. Finally, in order to be able to compute the position of a scene point's image in pixel coordinates, we need one additional piece of information. 
Considering the line coming from the focal point that is orthogonal to the image plane, we need to know at which pixel position this line pierces the image plane. This point is called the principal point. It might be logical to assume that this principal point is at the center of the image plane, but in practice, this point might be off by a few pixels depending on the precision at which the camera has been manufactured. Camera calibration is the process by which the different camera parameters are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. Camera calibration will proceed by showing known patterns to the camera and analyzing the obtained images. An optimization process will then determine the optimal parameter values that explain the observations. This is a complex process that has been made easy by the availability of OpenCV calibration functions. How to do it... To calibrate a camera, the idea is to show it a set of scene points for which their 3D positions are known. Then, you need to observe where these points project on the image. With the knowledge of a sufficient number of 3D points and associated 2D image points, the exact camera parameters can be inferred from the projective equation. Obviously, for accurate results, we need to observe as many points as possible. One way to achieve this would be to take one picture of a scene with many known 3D points, but in practice, this is rarely feasible. A more convenient way is to take several images of a set of some 3D points from different viewpoints. This approach is simpler but requires you to compute the position of each camera view in addition to the computation of the internal camera parameters, which fortunately is feasible. OpenCV proposes that you use a chessboard pattern to generate the set of 3D scene points required for calibration. 
This pattern creates points at the corners of each square, and since this pattern is flat, we can freely assume that the board is located at Z=0, with the X and Y axes aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. Here is one example of a 6x4 calibration pattern image:

[Figure: a 6x4 chessboard calibration pattern]

The good thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (the number of horizontal and vertical inner corner points). The function will return the position of these chessboard corners on the image. If the function fails to find the pattern, then it simply returns false:

    // output vector of image points
    std::vector<cv::Point2f> imageCorners;
    // number of inner corners on the chessboard
    cv::Size boardSize(6,4);
    // Get the chessboard corners
    bool found = cv::findChessboardCorners(image,
                                           boardSize, imageCorners);

The output parameter, imageCorners, will simply contain the pixel coordinates of the detected inner corners of the shown pattern. Note that this function accepts additional parameters if you need to tune the algorithm; these are not discussed here. There is also a special function that draws the detected corners on the chessboard image, with lines connecting them in a sequence:

    // Draw the corners
    cv::drawChessboardCorners(image,
                              boardSize, imageCorners,
                              found); // corners have been found

The following image is obtained:

[Figure: the chessboard image with the detected corners drawn on it]

The lines that connect the points show the order in which the points are listed in the vector of detected image points. To perform a calibration, we now need to specify the corresponding 3D points. You can specify these points in the units of your choice (for example, in centimeters or in inches); however, the simplest is to assume that each square represents one unit.
In that case, the coordinates of the first point would be (0,0,0) (assuming that the board is located at a depth of Z=0), the coordinates of the second point would be (1,0,0), and so on, the last point being located at (5,3,0). There are a total of 24 points in this pattern, which is too few to obtain an accurate calibration. To get more points, you need to show more images of the same calibration pattern from various points of view. To do so, you can either move the pattern in front of the camera or move the camera around the board; from a mathematical point of view, this is completely equivalent. The OpenCV calibration function assumes that the reference frame is fixed on the calibration pattern and will calculate the rotation and translation of the camera with respect to the reference frame.

Let's now encapsulate the calibration process in a CameraCalibrator class. The attributes of this class are as follows:

    class CameraCalibrator {

        // input points:
        // the points in world coordinates
        std::vector<std::vector<cv::Point3f>> objectPoints;
        // the point positions in pixels
        std::vector<std::vector<cv::Point2f>> imagePoints;
        // output matrices
        cv::Mat cameraMatrix;
        cv::Mat distCoeffs;
        // flag to specify how calibration is done
        int flag;

Note that the input vectors of the scene and image points are in fact vectors of vectors of points; each element is a vector of the points from one view.
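The numbering of the 3D chessboard points described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name chessboard_object_points and its square parameter are hypothetical, and the points are listed row by row so that X (the column index) varies fastest, which is the order implied by the coordinates (0,0,0), (1,0,0), ..., (5,3,0) quoted above.

```python
def chessboard_object_points(cols, rows, square=1.0):
    """Return the 3D corner coordinates of a flat board located at Z=0.

    Hypothetical helper: points are listed row by row, so X (the column
    index) varies fastest and Y (the row index) varies slowest.
    """
    return [(col * square, row * square, 0.0)
            for row in range(rows)
            for col in range(cols)]


# A 6x4 pattern, with each square taken as one unit.
points = chessboard_object_points(6, 4)
print(len(points))   # 24
print(points[0])     # (0.0, 0.0, 0.0)
print(points[1])     # (1.0, 0.0, 0.0)
print(points[-1])    # (5.0, 3.0, 0.0)
```

Passing a physical square size (for example, square=2.5 for 2.5 cm squares) would simply scale the coordinates, and the translations recovered by the calibration would then be expressed in the same unit.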
Here, we decided to add the calibration points by specifying a vector of chessboard image filenames as input:

    // Open chessboard images and extract corner points
    int CameraCalibrator::addChessboardPoints(
             const std::vector<std::string>& filelist,
             cv::Size & boardSize) {

      // the points on the chessboard
      std::vector<cv::Point2f> imageCorners;
      std::vector<cv::Point3f> objectCorners;

      // 3D Scene Points:
      // Initialize the chessboard corners
      // in the chessboard reference frame
      // The corners are at 3D location (X,Y,Z)= (j,i,0),
      // so that the second point is (1,0,0) as described in the text
      for (int i=0; i<boardSize.height; i++) {
        for (int j=0; j<boardSize.width; j++) {
          objectCorners.push_back(cv::Point3f(j, i, 0.0f));
        }
      }

      // 2D Image points:
      cv::Mat image; // to contain chessboard image
      int successes = 0;
      // for all viewpoints
      for (int i=0; i<filelist.size(); i++) {

        // Open the image (0 = load as grayscale)
        image = cv::imread(filelist[i],0);

        // Get the chessboard corners
        bool found = cv::findChessboardCorners(
                         image, boardSize, imageCorners);

        // Get subpixel accuracy on the corners
        cv::cornerSubPix(image, imageCorners,
                 cv::Size(5,5),
                 cv::Size(-1,-1),
                 cv::TermCriteria(cv::TermCriteria::MAX_ITER +
                                  cv::TermCriteria::EPS,
                     30,    // max number of iterations
                     0.1)); // min accuracy

        // If we have a good board, add it to our data
        if (imageCorners.size() == boardSize.area()) {

          // Add image and scene points from one view
          addPoints(imageCorners, objectCorners);
          successes++;
        }
      }

      return successes;
    }

The first loop inputs the 3D coordinates of the chessboard, and the corresponding image points are the ones provided by the cv::findChessboardCorners function. This is done for all the available viewpoints. Moreover, in order to obtain a more accurate image point location, the cv::cornerSubPix function can be used, and as the name suggests, the image points will then be localized with subpixel accuracy. The termination criterion that is specified by the cv::TermCriteria object defines the maximum number of iterations and the minimum accuracy in subpixel coordinates.
The first of these two conditions that is reached will stop the corner refinement process. When a set of chessboard corners has been successfully detected, these points are added to our vectors of image and scene points using our addPoints method. Once a sufficient number of chessboard images have been processed (and consequently, a large number of 3D scene point / 2D image point correspondences are available), we can initiate the computation of the calibration parameters as follows:

    // Calibrate the camera
    // returns the re-projection error
    double CameraCalibrator::calibrate(cv::Size &imageSize)
    {
      // Output rotations and translations
      std::vector<cv::Mat> rvecs, tvecs;

      // start calibration
      return
         calibrateCamera(objectPoints, // the 3D points
                      imagePoints,     // the image points
                      imageSize,       // image size
                      cameraMatrix,    // output camera matrix
                      distCoeffs,      // output distortion matrix
                      rvecs, tvecs,    // Rs, Ts
                      flag);           // set options
    }

In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. These will be described in the next section.

How it works...

In order to explain the result of the calibration, we need to go back to the figure in the introduction, which describes the pin-hole camera model. More specifically, we want to demonstrate the relationship between a point in 3D at the position (X,Y,Z) and its image (x,y) on a camera specified in pixel coordinates. Let's redraw this figure by adding a reference frame that we position at the center of the projection as seen here:

[Figure: the pin-hole camera model with a reference frame at the center of projection]

Note that the y axis is pointing downward to get a coordinate system compatible with the usual convention that places the image origin at the upper-left corner. We learned previously that the point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z).
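As a quick numerical check of this projection, here is a minimal Python sketch. The function name project_pinhole and the sample values are assumptions made for illustration; the point is expressed in the camera reference frame and the result is in the same world units as the focal length f.

```python
def project_pinhole(point, f):
    """Project a 3D point (X, Y, Z), given in camera coordinates,
    onto the image plane of an ideal pin-hole camera."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("the point must lie in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)


# The same point seen at twice the distance lands half as far from the center.
near = project_pinhole((1.0, 2.0, 4.0), f=2.0)
far = project_pinhole((1.0, 2.0, 8.0), f=2.0)
print(near)  # (0.5, 1.0)
print(far)   # (0.25, 0.5)
```

Doubling Z halves both image coordinates, which is the inverse proportionality between image size and distance noted for the pin-hole model.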
Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by the pixel's width (px) and height (py), respectively. Note that by dividing the focal length given in world units (generally millimeters) by px, we obtain the focal length expressed in (horizontal) pixels. Let's then define this term as fx. Similarly, fy = f/py is defined as the focal length expressed in vertical pixel units. Therefore, the complete projective equation is as follows:

    x = fx * X/Z + u0
    y = fy * Y/Z + v0

Recall that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. These equations can be rewritten in the matrix form through the introduction of homogeneous coordinates, in which 2D points are represented by 3-vectors and 3D points are represented by 4-vectors (the extra coordinate is simply an arbitrary scale factor, S, that needs to be removed when a 2D coordinate needs to be extracted from a homogeneous 3-vector). Here is the rewritten projective equation:

      [x]   [fx  0  u0] [1 0 0 0] [X]
    S [y] = [ 0 fy  v0] [0 1 0 0] [Y]
      [1]   [ 0  0   1] [0 0 1 0] [Z]
                                  [1]

The second matrix is a simple projection matrix. The first matrix includes all of the camera parameters, which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that returns the value of the intrinsic parameters given by a calibration matrix.

More generally, when the reference frame is not at the projection center of the camera, we will need to add a rotation matrix (a 3x3 matrix) and a translation vector (a 3x1 matrix). Together, these describe the rigid transformation that must be applied to the 3D points in order to bring them back to the camera reference frame. Therefore, we can rewrite the projection equation in its most general form:

      [x]   [fx  0  u0] [r1 r2 r3 t1] [X]
    S [y] = [ 0 fy  v0] [r4 r5 r6 t2] [Y]
      [1]   [ 0  0   1] [r7 r8 r9 t3] [Z]
                                      [1]

Remember that in our calibration example, the reference frame was placed on the chessboard.
Therefore, there is a rigid transformation (made of a rotation component represented by the matrix entries r1 to r9 and a translation represented by t1, t2, and t3) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration, and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system. The intrinsic parameters of our test camera obtained from a calibration based on 20 chessboard images are fx=167, fy=178, u0=156, and v0=119. These results are obtained by cv::calibrateCamera through an optimization process aimed at finding the intrinsic and extrinsic parameters that will minimize the difference between the predicted image point position, as computed from the projection of the 3D scene points, and the actual image point position, as observed on the image. The sum of this difference for all the points specified during the calibration is called the re-projection error.

Let's now turn our attention to the distortion parameters. So far, we have mentioned that under the pin-hole camera model, we can neglect the effect of the lens. However, this is only possible if the lens that is used to capture an image does not introduce significant optical distortions. Unfortunately, this is not the case with lower quality lenses or with lenses that have a very short focal length. You may have already noted that the chessboard pattern shown in the image that we used for our example is clearly distorted: the edges of the rectangular board are curved in the image. Also, note that this distortion becomes more pronounced as we move away from the center of the image. This is a typical distortion observed with a fish-eye lens, and it is called radial distortion.
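To make the projection and the optimization criterion described earlier in this section more concrete, here is a minimal Python sketch. It assumes no lens distortion, and the helper names (project_to_pixels, rms_reprojection_error), the identity rotation, the translation, and the observed pixel positions are all made up for illustration; only the intrinsic values fx=167, fy=178, u0=156, and v0=119 come from the text. The error is computed as a root-mean-square distance, which is how the OpenCV documentation describes the value returned by cv::calibrateCamera.

```python
import math


def project_to_pixels(point, R, t, fx, fy, u0, v0):
    """Map a 3D scene point to pixel coordinates under the pin-hole model."""
    # Rigid transformation: bring the point into the camera reference frame.
    Xc, Yc, Zc = (sum(R[i][k] * point[k] for k in range(3)) + t[i]
                  for i in range(3))
    # Perspective division, then scaling and shifting by the intrinsics.
    return (fx * Xc / Zc + u0, fy * Yc / Zc + v0)


def rms_reprojection_error(scene_points, observed, R, t, fx, fy, u0, v0):
    """Root-mean-square distance between predicted and observed pixels."""
    total = 0.0
    for point, (u, v) in zip(scene_points, observed):
        x, y = project_to_pixels(point, R, t, fx, fy, u0, v0)
        total += (x - u) ** 2 + (y - v) ** 2
    return math.sqrt(total / len(scene_points))


# Hypothetical view: the board faces the camera, two units in front of it.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]
intrinsics = dict(fx=167.0, fy=178.0, u0=156.0, v0=119.0)

scene = [(0.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
print(project_to_pixels(scene[1], R, t, **intrinsics))  # (323.0, 208.0)

# Made-up detections, each five pixels away from its predicted position.
observed = [(159.0, 123.0), (320.0, 204.0)]
error = rms_reprojection_error(scene, observed, R, t, **intrinsics)
print(error)  # 5.0
```

A calibration routine adjusts the intrinsic parameters, and one rotation and translation per view, until this kind of error is as small as possible over all the views.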
The lenses used in common digital cameras usually do not exhibit such a high degree of distortion, but in the case of the lens used here, these distortions certainly cannot be ignored.

It is possible to compensate for these deformations by introducing an appropriate distortion model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be inverted in order to undo the distortions visible on the image. Fortunately, the exact parameters of the transformation that will correct the distortions can be obtained together with the other camera parameters during the calibration phase. Once this is done, any image from the newly calibrated camera can be undistorted. Therefore, we have added an additional method to our calibration class:

    // remove distortion in an image (after calibration)
    cv::Mat CameraCalibrator::remap(const cv::Mat &image) {

      cv::Mat undistorted;

      if (mustInitUndistort) { // called once per calibration

        cv::initUndistortRectifyMap(
          cameraMatrix,  // computed camera matrix
          distCoeffs,    // computed distortion matrix
          cv::Mat(),     // optional rectification (none)
          cv::Mat(),     // camera matrix to generate undistorted
          image.size(),  // size of undistorted
          CV_32FC1,      // type of output map
          map1, map2);   // the x and y mapping functions

        mustInitUndistort= false;
      }

      // Apply mapping functions
      cv::remap(image, undistorted, map1, map2,
                cv::INTER_LINEAR); // interpolation type

      return undistorted;
    }

Running this code results in the following image:

[Figure: the undistorted chessboard image]

As you can see, once the image is undistorted, we obtain a regular perspective image. To correct the distortion, OpenCV uses a polynomial function that is applied to the image points in order to move them to their undistorted position. By default, five coefficients are used; a model made of eight coefficients is also available.
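The five-coefficient model can be sketched in Python as follows. This follows the formulas given in the OpenCV documentation, where the coefficients are ordered (k1, k2, p1, p2, k3), with k1, k2, and k3 describing radial distortion and p1 and p2 tangential distortion. The function name distort and the coefficient values are assumptions made for illustration; the coordinates are normalized image coordinates (x = X/Z, y = Y/Z), that is, before the focal lengths and the principal point are applied.

```python
def distort(x, y, k1, k2, p1, p2, k3=0.0):
    """Apply radial and tangential distortion to normalized coordinates."""
    r2 = x * x + y * y
    # Radial term: grows with the distance from the optical axis.
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    # Tangential terms: model a lens that is not parallel to the sensor.
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d


# Hypothetical barrel distortion (negative k1): points are pulled toward the
# center, and the effect increases with the distance from the center.
center = distort(0.0, 0.0, k1=-0.2, k2=0.0, p1=0.0, p2=0.0)
edge = distort(0.5, 0.0, k1=-0.2, k2=0.0, p1=0.0, p2=0.0)
print(center)  # (0.0, 0.0)
print(edge)    # approximately (0.475, 0.0)
```

Note that these formulas go from ideal to distorted coordinates. This is the direction needed to build an undistortion map, since for each pixel of the corrected image one has to find where to read its value in the distorted input image.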
Once these coefficients are obtained, it is possible to compute two cv::Mat mapping functions (one for the x coordinate and one for the y coordinate) that will give the new undistorted position of an image point on a distorted image. This is computed by the cv::initUndistortRectifyMap function, and the cv::remap function remaps all the points of an input image to a new image. Note that because of the nonlinear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you will now obtain output pixels that have no values in the input image (they will then be displayed as black pixels).

There's more...

More options are available when it comes to camera calibration.

Calibration with known intrinsic parameters

When a good estimate of the camera's intrinsic parameters is known, it could be advantageous to input them in the cv::calibrateCamera function. They will then be used as initial values in the optimization process. To do so, you just need to add the CV_CALIB_USE_INTRINSIC_GUESS flag and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (CV_CALIB_FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (CV_CALIB_FIX_ASPECT_RATIO), in which case you assume that the pixels are square.

Using a grid of circles for calibration

Instead of the usual chessboard pattern, OpenCV also offers the possibility to calibrate a camera by using a grid of circles. In this case, the centers of the circles are used as calibration points.
The corresponding function is very similar to the function we used to locate the chessboard corners:

    cv::Size boardSize(7,7);
    std::vector<cv::Point2f> centers;
    bool found = cv::findCirclesGrid(
                     image, boardSize, centers);

See also

The article A flexible new technique for camera calibration by Z. Zhang, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000, is a classic paper on the problem of camera calibration.

Summary

In this article, we explored the projective relations that exist between two images of the same scene.

Resources for Article:

Further resources on this subject:
Creating an Application from Scratch [Article]
Wrapping OpenCV [Article]
New functionality in OpenCV 3.0 [Article]

Packt

25 Aug 2014

13 min read

Report Data Filtering

Packt

21 Aug 2014

4 min read

Recommender systems dissected

In this article by Rik Van Bruggen, the author of Learning Neo4j, we will take a look at so-called recommender systems. Such systems, broadly speaking, consist of two elementary parts: a pattern discovery system and a pattern application system.

A pattern discovery system: This is the system that somehow figures out what would be a useful recommendation for a particular target group. This discovery can be done in many different ways, but in general we see three ways to do so:

- A business expert who thoroughly understands the domain of the graph database application will use this understanding to determine useful recommendations. For example, the supervisor of a "do-it-yourself" retail outlet would understand a particular pattern: if someone came in to buy multiple pots of paint, they would probably also benefit from getting a gentle recommendation for a promotion of high-end brushes. The store has that promotion going on right then, so the recommendation would be very timely. This would be a discovery process where the pattern that the business expert has discovered is applied and used in graph databases in real time as part of a sophisticated recommendation.

- A visual discovery of a specific pattern in the graph representation of the business domain. We have found that in many different projects, business users have stumbled upon these kinds of patterns while using graph visualizations to look at their data. Specific patterns emerge, unexpected relations jump out, or, in more advanced visualization solutions, specific clusters of activity all of a sudden become visible and require further investigation. Graph databases such as Neo4j, and the visualization solutions that complement them, can play a wonderfully powerful role in this process.

- An algorithmic discovery of a pattern in a dataset of the business domain using machine learning algorithms to uncover previously unknown patterns in the data.
Typically, these processes require an iterative, raw number-crunching approach that has also been used on non-graph data formats in the past. It remains part-art part-science at this point, but can of course yield interesting insights if applied correctly.

[Figure: An overview of recommender systems]

A pattern application system: All recommender systems will be more or less successful based on not just the patterns that they are able to discover, but also based on the way that they are able to apply these patterns in business applications. We can look at different types of applications for these patterns:

- Batch-oriented applications: Some applications of these patterns are not as time-critical as one would expect. It does not really matter in any material kind of way if a bulk e-mail with recommendations, or worse, a printed voucher with recommended discounted products, gets delivered to the customer or prospect at 10 am or 11 am. Batch solutions can usually cope with these kinds of requests, even if they do so in the most inefficient way.

- Real-time oriented applications: Some pattern applications simply have to be delivered in real time, in between a web request and a web response, and cannot be precalculated. For these types of systems, which typically use more complex database queries in order to match for the appropriate recommendation to make, graph databases such as Neo4j are a fantastic tool to have. We will illustrate this going forward.

With this classification behind us, we will look at an example dataset and some example queries to make this topic come alive.

Using a graph model for recommendations

We will be using a very specific data model for our recommender system. All we have changed is that we added a couple of products and brands to the model, and inserted some data into the database correspondingly.
In total, we added the following:

- Ten products
- Three product brands
- Fifty relationships between existing person nodes and the mentioned products, highlighting that these persons bought these products

These are the products and brands that we added:

[Figure: Adding products and brands to the dataset]

The following diagram shows the resulting model:

[Figure: the resulting data model]

In Neo4j, that model will look something like the following:

[Figure: the model as displayed in Neo4j]

A dataset like this one, while of course a broad simplification, offers us some interesting possibilities for a recommender system. Let's take a look at some queries that could really match this use case, and that would allow us to either visually or in real time exploit the data in this dataset in a product recommendation application.

Summary

This article dissected the so-called recommender systems.

Resources for Article:

Further resources on this subject:
Working with a Neo4j Embedded Database [article]
Comparative Study of NoSQL Products [article]
Getting Started with CouchDB and Futon [article]

Packt

21 Aug 2014

9 min read

Using Canvas and D3

In this article, Pablo Navarro Castillo, author of Mastering D3.js, notes that most of the time we use D3 to create SVG-based charts and visualizations. If the number of elements to render is huge, or if we need to render raster images, it can be more convenient to render our visualizations using the HTML5 canvas element. In this article, we will learn how to use the force layout of D3 and the canvas element to create an animated network visualization with random data.

Creating figures with canvas

The HTML canvas element allows you to create raster graphics using JavaScript. It was first introduced in HTML5. It enjoys more widespread support than SVG, and can be used as a fallback option. Before diving deeper into integrating canvas and D3, we will construct a small example with canvas.

The canvas element should have the width and height attributes. This alone will create an invisible figure of the specified size:

    <!-- Canvas Element -->
    <canvas id="canvas-demo" width="650px" height="60px"></canvas>

If the browser supports the canvas element, it will ignore any element inside the canvas tags. On the other hand, if the browser doesn't support the canvas, it will ignore the canvas tags, but it will interpret the content of the element. This behavior provides a quick way to handle the lack of canvas support:

    <!-- Canvas Element -->
    <canvas id="canvas-demo" width="650px" height="60px">
      <img src="img/fallback-img.png" width="650" height="60"></img>
    </canvas>

If the browser doesn't support canvas, the fallback image will be displayed. Note that unlike the <img> element, the canvas closing tag (</canvas>) is mandatory. To create figures with canvas, we don't need special libraries; we can create the shapes using the canvas API:

    <script>
        // Graphic Variables
        var barw = 65,
            barh = 60;

        // Get the canvas element by its ID.
        var canvas = document.getElementById('canvas-demo');

        // Get the rendering context.
        var context = canvas.getContext('2d');

        // Array with colors, to have one rectangle of each color.
        var color = ['#5c3566', '#6c475b', '#7c584f', '#8c6a44', '#9c7c39',
                     '#ad8d2d', '#bd9f22', '#cdb117', '#ddc20b', '#edd400'];

        // Set the fill color and render ten rectangles.
        for (var k = 0; k < 10; k += 1) {
            // Set the fill color.
            context.fillStyle = color[k];
            // Create a rectangle in incremental positions.
            context.fillRect(k * barw, 0, barw, barh);
        }
    </script>

We use the DOM API to access the canvas element with the canvas-demo ID and to get the rendering context. Then we set the color using the fillStyle property, and use the fillRect canvas method to create a small rectangle. Note that we need to change fillStyle every time, or all the following shapes will have the same color. The script will render a series of rectangles, each one filled with a different color, shown as follows:

[Figure: A graphic created with canvas]

Canvas uses the same coordinate system as SVG, with the origin in the top-left corner, the horizontal axis increasing to the right, and the vertical axis increasing to the bottom. Instead of using the DOM API to get the canvas node, we could have used D3 to create the node, set its attributes, and create scales for the color and position of the shapes. Note that the shapes drawn with canvas don't exist in the DOM tree; so, we can't use the usual D3 pattern of creating a selection, binding the data items, and appending the elements if we are using canvas.

Creating shapes

Canvas has fewer primitives than SVG. In fact, almost all the shapes must be drawn with paths, and more steps are needed to create a path. To create a shape, we need to open the path, move the cursor to the desired location, create the shape, and close the path. Then, we can draw the path by filling the shape or rendering the outline.
For instance, to draw a red semicircle centered at (325, 30) and with a radius of 20, write the following code:

    // Create a red semicircle.
    context.beginPath();
    context.fillStyle = '#ff0000';
    context.moveTo(325, 30);
    context.arc(325, 30, 20, Math.PI / 2, 3 * Math.PI / 2);
    context.fill();

The moveTo method is a bit redundant here, because the arc method moves the cursor implicitly. The arguments of the arc method are the x and y coordinates of the arc center, the radius, and the starting and ending angle of the arc. There is also an optional Boolean argument to indicate whether the arc should be drawn counterclockwise. A basic shape created with the canvas API is shown in the following screenshot:

[Figure: a basic shape created with the canvas API]

Integrating canvas and D3

We will create a small network chart using the force layout of D3 and canvas instead of SVG. To make the graph look more interesting, we will randomly generate the data. We will generate 250 sparsely connected nodes. The nodes and links will be stored as the attributes of the data object:

    // Number of Nodes
    var nNodes = 250,
        createLink = false;

    // Dataset Structure
    var data = {nodes: [], links: []};

We will append nodes and links to our dataset. We will create nodes with a radius attribute, randomly assigning it a value of either 2 or 4 as follows:

    // Iterate in the nodes
    for (var k = 0; k < nNodes; k += 1) {
        // Create a node with a random radius.
        data.nodes.push({radius: (Math.random() > 0.3) ? 2 : 4});

        // Create random links between the nodes.
    }

We will create a link with a probability of 0.1 only if the difference between the source and target indexes is less than 8. The idea behind this way of creating links is to have only a few connections between the nodes. The following loop goes inside the previous one, after the node has been created:

    // Create random links between the nodes.
    for (var j = k + 1; j < nNodes; j += 1) {

        // Create a link with probability 0.1
        createLink = (Math.random() < 0.1) && (Math.abs(k - j) < 8);

        if (createLink) {
            // Append a link with variable distance between the nodes
            data.links.push({
                source: k,
                target: j,
                dist: 2 * Math.abs(k - j) + 10
            });
        }
    }

We will use the radius attribute to set the size of the nodes. The links will contain the distance between the nodes and the indexes of the source and target nodes. We will create variables to set the width and height of the figure:

    // Figure width and height
    var width = 650,
        height = 300;

We can now create and configure the force layout. As we did in the previous section, we will set the charge strength to be proportional to the area of each node. This time, we will also set the distance between the links, using the linkDistance method of the layout:

    // Create and configure the force layout
    var force = d3.layout.force()
        .size([width, height])
        .nodes(data.nodes)
        .links(data.links)
        .charge(function(d) { return -1.2 * d.radius * d.radius; })
        .linkDistance(function(d) { return d.dist; })
        .start();

We can create a canvas element now. Note that we should use the node method to get the canvas element, because the append and attr methods will both return a selection, which doesn't have the canvas API methods:

    // Create a canvas element and set its size.
    var canvas = d3.select('div#canvas-force').append('canvas')
        .attr('width', width + 'px')
        .attr('height', height + 'px')
        .node();

We get the rendering context. Each canvas element has its own rendering context. We will use the '2d' context to draw two-dimensional figures. At the time of writing this, there are some browsers that support the webgl context; more details are available at https://developer.mozilla.org/en-US/docs/Web/WebGL/Getting_started_with_WebGL. We get the '2d' context as follows:

    // Get the canvas context.
    var context = canvas.getContext('2d');

We register an event listener for the force layout's tick event.
As canvas doesn't remember previously created shapes, we need to clear the figure and redraw all the elements on each tick:

    force.on('tick', function() {
        // Clear the complete figure.
        context.clearRect(0, 0, width, height);

        // Draw the links ...
        // Draw the nodes ...
    });

The clearRect method cleans the figure under the specified rectangle. In this case, we clean the entire canvas. We can draw the links using the lineTo method. We iterate through the links, beginning a new path for each link, moving the cursor to the position of the source node, and creating a line towards the target node. We draw the line with the stroke method:

    // Draw the links
    data.links.forEach(function(d) {
        // Draw a line from source to target.
        context.beginPath();
        context.moveTo(d.source.x, d.source.y);
        context.lineTo(d.target.x, d.target.y);
        context.stroke();
    });

We iterate through the nodes and draw each one. We use the arc method to represent each node with a black circle:

    // Draw the nodes
    data.nodes.forEach(function(d, i) {
        // Draws a complete arc for each node.
        context.beginPath();
        context.arc(d.x, d.y, d.radius, 0, 2 * Math.PI, true);
        context.fill();
    });

We obtain a constellation of disconnected network graphs, slowly gravitating towards the center of the figure. Using the force layout and canvas to create a network chart is shown in the following screenshot:

[Figure: a network chart created with the force layout and canvas]

One might think that erasing all the shapes and redrawing each of them again and again would have a negative impact on performance. In fact, sometimes it's faster to draw the figures using canvas, because this way the browser doesn't have to manage the DOM tree of the SVG elements (but it still has to redraw them if the SVG elements are changed).

Summary

In this article, we learned how to use the force layout of D3 and the canvas element to create an animated network visualization with random data.
Resources for Article: Further resources on this subject: Interacting with Data for Dashboards [Article] Kendo UI DataViz – Advance Charting [Article] Visualizing my Social Graph with d3.js [Article]

Packt

20 Aug 2014

19 min read

Importing Dynamic Data

In this article by Chad Adams, author of the book Learning Python Data Visualization, we will go over the finer points of pulling data from the Web using the Python language and its built-in libraries, and cover parsing XML, JSON, and JSONP data.

Since we now have an understanding of how to work with the pygal library and how to build charts and graphics in general, this is the time to start looking at building an application using Python. In this article, we will take a look at the fundamentals of pulling data from the Web, parsing the data, adding it to our code base, and formatting it into a usable form, and we will look at how to carry those fundamentals over to our Python code. We will also cover parsing XML and JSON data.

Pulling data from the Web

For many non-developers, it may seem like witchcraft that developers are magically able to pull data from an online resource and integrate it with an iPhone app or a Windows Store app, or pull data to a cloud resource that is able to generate various versions of the data upon request. To be fair, they do have a general understanding: data is pulled from the Web and formatted for their app of choice. They just may not know in detail how that workflow happens. The same is true of some developers. Many work on technology that only runs in a locked-down environment, or generally don't use the Internet for their applications. Again, they understand the logic behind it: somehow an RSS feed gets pulled into an application. The same task is done in different ways, usually depending on which language is used. Let's take a look at a few examples using Packt's own news RSS feed, using an iOS app pulling in data via Objective-C.
If you're reading this and are not familiar with Objective-C, that's OK; the important thing is that the inner contents of an XML file show up in an iPhone application:

    #import "ViewController.h"

    @interface ViewController ()
    @property (weak, nonatomic) IBOutlet UITextView *output;
    @end

    @implementation ViewController

    - (void)viewDidLoad
    {
        [super viewDidLoad];
        // Do any additional setup after loading the view, typically from a nib.
        NSURL *packtURL = [NSURL URLWithString:@"http://www.packtpub.com/rss.xml"];
        NSURLRequest *request = [NSURLRequest requestWithURL:packtURL];
        NSURLConnection *connection = [[NSURLConnection alloc] initWithRequest:request delegate:self startImmediately:YES];
        [connection start];
    }

    - (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data
    {
        NSString *downloadstring = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
        [self.output setText:downloadstring];
    }

    - (void)didReceiveMemoryWarning
    {
        [super didReceiveMemoryWarning];
        // Dispose of any resources that can be recreated.
    }

    @end

Here, we can see in the iPhone Simulator that our XML output is pulled dynamically through HTTP from the Web. This is what we'll want to get started with doing in Python.

The XML refresher

Extensible Markup Language (XML) is a data markup language that sets a series of rules and a hierarchy for a group of data, often stored as a static file. Typically, servers update these XML files on the Web periodically so that they can be reused as data sources. XML is simple to pick up as it's similar to HTML. You can start with the document declaration, in this case:

    <?xml version="1.0" encoding="utf-8"?>

Next, a root node is set. A node is like an HTML tag (which is also called a node). You can tell it's a node by the angle brackets around the node's name. For example, here's a node named root:

    <root></root>

Note that we close the node with a same-named closing tag that starts with a forward slash (/).
We can also add an attribute (sometimes loosely called a parameter) to the node and assign it a value, as shown in the following root node:

    <root parameter="value"></root>

Data in XML is set through a hierarchy. To declare that hierarchy, we create another node and place it inside the parent node, as shown in the following code:

    <root parameter="value">
        <subnode>Subnode's value</subnode>
    </root>

In the preceding parent node, we created a subnode. Inside the subnode, we have an inner value called Subnode's value. In programming terms, getting data from an XML data file is a process called parsing. With parsing, we specify where in the XML structure we would like to get a specific value; for instance, we can walk the XML structure and get the inner contents like this:

    /root/subnode

This is commonly referred to as XPath syntax, a cross-language way of addressing parts of an XML file. For more on XML and XPath, check out the full specs at http://www.w3.org/TR/REC-xml/ and http://www.w3.org/TR/xpath/ respectively.

RSS and the ATOM

Really Simple Syndication (RSS) is simply a variation of XML. RSS is a spec that defines specific nodes that are common for sending data. Typically, many blog feeds include an RSS option for users to pull down the latest information from those sites. Some of the nodes used in RSS include rss, channel, item, title, description, pubDate, link, and guid. Looking at our iPhone example from the Pulling data from the Web section, we can see what a typical RSS structure entails. RSS feeds are usually easy to spot since the spec requires the root node to be named rss for it to be a true RSS file. Some websites and services use the .rss extension rather than .xml; this is typically fine since most readers for RSS content will pull in the RSS data like an XML file, just like in the iPhone example. Another form of XML is called ATOM. ATOM is another spec similar to RSS, but it was developed much later than RSS.
Because of this, ATOM has more features than RSS: XML namespacing, specified content formats (video, or audio-specific URLs), support for internationalization, and multilanguage support, just to name a few. ATOM does have a few different nodes compared to RSS; for instance, the root node of an RSS feed is <rss>, whereas ATOM's root is <feed>, so it's pretty easy to spot the difference. Another difference is that ATOM files can end in .atom or .xml. For more on the RSS and ATOM specs, check out the following sites:

http://www.rssboard.org/rss-specification
http://tools.ietf.org/html/rfc4287

Understanding HTTP

All these samples from the RSS feed of the Packt Publishing website show one commonality regardless of the technology used: the static files are pulled down through the Hypertext Transfer Protocol (HTTP). HTTP is the foundation of data communication on the Web. The protocol defines several request methods; the two we care about here are GET, a request for data, and POST, a push of data. Typically, when we download data using HTTP, we use the GET method. A GET request returns the body of the response, usually as a string, which we can either use directly or save to a variable. With a POST request, we send values to a service that handles any incoming values; say we created a new blog post's title and needed to add it to a list of current titles. A simple way of passing such a value is with URL parameters. A URL parameter is a key-value pair appended to an existing URL after a question mark. The following is a mock example of a request with a URL parameter:

    http://www.yourwebsite.com/blogtitles/?addtitle=Your%20New%20Title

If our service is set up correctly, it will scan the request for a key of addtitle and read the value, in this case: Your New Title. (In practice, POST data is more commonly sent in the body of the request than in the URL.) We may notice %20 in our title for our request.
This is an escape sequence that allows us to send a value with spaces; in this case, %20 is a placeholder for a space in our value. This scheme is known as URL encoding, or percent-encoding.

Using HTTP in Python

The RSS samples from the Packt Publishing website show a few commonalities we use in programming when working with HTTP: we always account for the possibility of something going wrong with a connection, and we always close our request when finished. Here's an example of how the same RSS feed request is done in Python using a built-in library called urllib2 (a Python 2 module; Python 3 provides the same functionality in urllib.request):

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import urllib2

    try:
        # Open the file via HTTP.
        response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
        # Read the file to a variable we named 'xml'.
        xml = response.read()
        # Print to the console.
        print(xml)
        # Finally, close our open network.
        response.close()
    except:
        # If we have an issue show a message and alert the user.
        print('Unable to connect to RSS...')

If we look at the console output, we can see the XML contents just as we saw in our iOS code example.

In the example, notice that we wrapped our HTTP request in a try except block. For those coming from another language, except can be considered the same as a catch statement. This allows us to detect if an error occurs, such as an incorrect URL or a lack of connectivity, and to programmatically set an alternate result for our Python script.

Parsing XML in Python with HTTP

With these examples, including our Python version of the script, we still get back one long string, which by itself isn't of much use when we want specific values. In order to grab specific strings and values from XML pulled through HTTP, we need to parse it. Luckily, Python's standard library has a built-in module for this called ElementTree, which is part of Python's xml package.
Let's incorporate ElementTree into our example and see how that works:

    # -*- coding: utf-8 -*-
    import urllib2
    from xml.etree import ElementTree

    try:
        # Open the file via HTTP.
        response = urllib2.urlopen('http://www.packtpub.com/rss.xml')
        tree = ElementTree.parse(response)
        root = tree.getroot()
        # Create an 'Element' group from our XPath using findall.
        news_post_title = root.findall("channel//title")
        # Iterate over all our searched elements and print the inner text for each.
        for title in news_post_title:
            print(title.text)
        # Finally, close our open network.
        response.close()
    except Exception as e:
        # If we have an issue show a message and alert the user.
        print(e)

Running the script prints the title of each blog post. Notice how we used channel//title for our findall() method. This is XPath (ElementTree supports a limited subset of it), which gives us a shorthand for walking an XML structure. It works like this; let's use the http://www.packtpub.com feed as an example. We have a root, followed by channel, then a series of title elements. The findall() method found each matching element and returned them in a list of Element objects, the type ElementTree uses to represent XML nodes. We can then use a for in loop to iterate over each one and print the inner text using the text property of the Element type.

You may notice in this example that I extended except with a bit of extra code, adding Exception as e. This helps us debug issues by printing them to a console or displaying more in-depth feedback to the user. Exception is the generic base class for the errors that the Python libraries raise, so catching it lets us print them to a console or handle them in code. Python also has more specific exceptions we can use, such as IOError, which is raised by failed input/output operations such as reading and writing files.
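
As a minimal sketch of the parsing step described above, the following Python 3 example parses a small RSS document held in a string, so it runs without a network connection. The feed contents and the item_titles helper are hypothetical and are not taken from the Packt feed. The path channel/item/title is narrower than the channel//title used above and selects only the titles of item nodes. The sketch also catches ElementTree's ParseError rather than a generic Exception, in line with the advice about more specific exceptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, embedded so the sketch runs without a network.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>"""


def item_titles(xml_text):
    """Return the text of every channel/item/title element, or [] on bad XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        # ParseError is the exception ElementTree raises for malformed XML.
        print("Could not parse feed:", error)
        return []
    # The root element is <rss>, so the path starts at its <channel> child.
    return [title.text for title in root.findall("channel/item/title")]


titles = item_titles(SAMPLE_RSS)
print(titles)
```

Because the channel's own title is a direct child of channel rather than of an item, it is left out of the result; switching the path back to channel//title would include it.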
About JSON

Another data type that's becoming more and more common when working with web data is JSON. JSON is an acronym for JavaScript Object Notation and, as the name implies, its syntax is derived from JavaScript object literals. It has become popular in recent years with the rise of mobile apps and Rich Internet Applications (RIA). JSON is great for JavaScript developers; it's easier to work with in HTML content than XML is. Because of this, JSON is becoming a more common data type for web and mobile application development.

Parsing JSON in Python with HTTP

Parsing JSON data in Python is a pretty similar process; however, in this case, our ElementTree library isn't needed, since that only works with XML. We need a library designed to parse JSON using Python. Luckily, the Python library creators already have a library for us, simply called json. Let's build an example similar to our XML code using the json import; of course, we need to use a different data source since we won't be working in XML. One thing we may note is that there aren't many open JSON feeds to use; many require a code (an API key) that gives a developer permission to generate a JSON feed through a developer API, such as Twitter's JSON API. To avoid this, we will use a sample URL from Google's Books API, which will show demo data for Pride and Prejudice by Jane Austen. We can view the JSON (or download the file) by typing in the following URL:

    https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ

Notice that the API uses HTTPS. Many JSON APIs are moving to secure methods of transmitting data, so be sure to use https rather than http in the URL. Let's take a look at the JSON output:

{ "kind": "books#volume", "id": "s1gVAAAAYAAJ", "etag": "yMBMZ85ENrc", "selfLink": "https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ", "volumeInfo": { "title": "Pride and Prejudice", "authors": [ "Jane Austen" ], "publisher": "C.
Scribner's Sons", "publishedDate": "1918", "description": "Austen's most celebrated novel tells the story of Elizabeth Bennet, a bright, lively young woman with four sisters, and a mother determined to marry them to wealthy men. At a party near the Bennets' home in the English countryside, Elizabeth meets the wealthy, proud Fitzwilliam Darcy. Elizabeth initially finds Darcy haughty and intolerable, but circumstances continue to unite the pair. Mr. Darcy finds himself captivated by Elizabeth's wit and candor, while her reservations about his character slowly vanish. The story is as much a social critique as it is a love story, and the prose crackles with Austen's wry wit.", "readingModes": { "text": true, "image": true }, "pageCount": 401, "printedPageCount": 448, "dimensions": { "height": "18.00 cm" }, "printType": "BOOK", "averageRating": 4.0, "ratingsCount": 433, "contentVersion": "1.1.5.0.full.3", "imageLinks": { "smallThumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec =frontcover&img=1&zoom=5&edge=curl&imgtk=AFLRE73F8btNqKpVjGX6q7V3XS77 QA2PftQUxcEbU3T3njKNxezDql_KgVkofGxCPD3zG1yq39u0XI8s4wjrqFahrWQ- 5Epbwfzfkoahl12bMQih5szbaOw&source=gbs_api", "thumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=1&edge=curl&imgtk=AFLRE70tVS8zpcFltWh_ 7K_5Nh8BYugm2RgBSLg4vr9tKRaZAYoAs64RK9aqfLRECSJq7ATs_j38JRI3D4P48-2g_ k4-EY8CRNVReZguZFMk1zaXlzhMNCw&source=gbs_api", "small": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec =frontcover&img=1&zoom=2&edge=curl&imgtk=AFLRE71qcidjIs37x0jN2dGPstn 6u2pgeXGWZpS1ajrGgkGCbed356114HPD5DNxcR5XfJtvU5DKy5odwGgkrwYl9gC9fo3y- GM74ZIR2Dc-BqxoDuUANHg&source=gbs_api", "medium": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=3&edge=curl&imgtk=AFLRE73hIRCiGRbfTb0uNIIXKW 4vjrqAnDBSks_ne7_wHx3STluyMa0fsPVptBRW4yNxNKOJWjA4Od5GIbEKytZAR3Nmw_ XTmaqjA9CazeaRofqFskVjZP0&source=gbs_api", "large": 
"http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=4&edge=curl&imgtk=AFLRE73mlnrDv-rFsL- n2AEKcOODZmtHDHH0QN56oG5wZsy9XdUgXNnJ_SmZ0sHGOxUv4sWK6GnMRjQm2eEwnxIV4dcF9eBhghMcsx -S2DdZoqgopJHk6Ts&source=gbs_api", "extraLarge": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=6&edge=curl&imgtk=AFLRE73KIXHChszn TbrXnXDGVs3SHtYpl8tGncDPX_7GH0gd7sq7SA03aoBR0mDC4-euzb4UCIDiDNLYZUBJwMJxVX_ cKG5OAraACPLa2QLDcfVkc1pcbC0&source=gbs_api" }, "language": "en", "previewLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api", "infoLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api", "canonicalVolumeLink": "http://books.google.com/books/about/ Pride_and_Prejudice.html?hl=&id=s1gVAAAAYAAJ" }, "layerInfo": { "layers": [ { "layerId": "geo", "volumeAnnotationsVersion": "6" } ] }, "saleInfo": { "country": "US", "saleability": "FREE", "isEbook": true, "buyLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&buy=&source=gbs_api" }, "accessInfo": { "country": "US", "viewability": "ALL_PAGES", "embeddable": true, "publicDomain": true, "textToSpeechPermission": "ALLOWED", "epub": { "isAvailable": true, "downloadLink": "http://books.google.com/books/download /Pride_and_Prejudice.epub?id=s1gVAAAAYAAJ&hl=&output=epub &source=gbs_api" }, "pdf": { "isAvailable": true, "downloadLink": "http://books.google.com/books/download/Pride_and_Prejudice.pdf ?id=s1gVAAAAYAAJ&hl=&output=pdf&sig=ACfU3U3dQw5JDWdbVgk2VRHyDjVMT4oIaA &source=gbs_api" }, "webReaderLink": "http://books.google.com/books/reader ?id=s1gVAAAAYAAJ&hl=&printsec=frontcover& output=reader&source=gbs_api", "accessViewStatus": "FULL_PUBLIC_DOMAIN", "quoteSharingAllowed": false } } One downside to JSON is that it can be hard to read complex data. So, if we run across a complex JSON feed, we can visualize it using a JSON Visualizer. 
Visual Studio includes one with all its editions, and a web search will also show multiple online sites where you can paste JSON and an easy-to-understand data structure will be displayed. Here's an example using http://jsonviewer.stack.hu/ loading our example JSON URL:

Next, let's reuse some of our existing Python code using the urllib2 library to request our JSON feed, and then parse it with Python's json library. Let's parse the volumeInfo node of the book by starting with the JSON (root) node, followed by volumeInfo as the subnode. Here's our example from the XML section, reworked using JSON to list all child elements:

    # -*- coding: utf-8 -*-
    import urllib2
    import json

    try:
        # Set a URL variable.
        url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ'
        # Open the file via HTTP.
        response = urllib2.urlopen(url)
        # Read the request as one string.
        bookdata = response.read()
        # Convert the JSON string to a Python dictionary.
        data = json.loads(bookdata)
        for r in data['volumeInfo']:
            print(r)
        # Close our response.
        response.close()
    except:
        # If we have an issue show a message and alert the user.
        print('Unable to connect to JSON API...')

The output should match the names of the child nodes of volumeInfo. Well done! Now, let's grab the value for title. Look at the following example and notice we have two sets of brackets: one for volumeInfo and another for title. This is similar to navigating our XML hierarchy:

    # -*- coding: utf-8 -*-
    import urllib2
    import json

    try:
        # Set a URL variable.
        url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ'
        # Open the file via HTTP.
        response = urllib2.urlopen(url)
        # Read the request as one string.
        bookdata = response.read()
        # Convert the JSON string to a Python dictionary.
        data = json.loads(bookdata)
        print(data['volumeInfo']['title'])
        # Close our response.
        response.close()
    except Exception as e:
        # If we have an issue show a message and alert the user.
        # 'Unable to connect to JSON API...'
        print(e)

The script returns one line, Pride and Prejudice, parsed from our JSON data.

About JSONP

JSONP, or JSON with Padding, is JSON that is set up differently from traditional JSON files: the data is wrapped in a JavaScript function call. JSONP is a workaround for the browser's same-origin policy, which restricts cross-domain requests made from scripts. Some web services serve JSONP rather than pure JSON. The issue is that JSONP isn't compatible with many Python-based JSON parsers, including the one covered here, so you will want to avoid JSONP-style JSON whenever possible. So how can we spot JSONP files; do they have a different extension? No, it's simply a wrapper for JSON data. Here's an example of regular JSON, without JSONP:

    { "authorname": "Chad Adams" }

The same example with JSONP:

    callback({ "authorname": "Chad Adams" });

Notice we wrapped our JSON data with a function wrapper, or a callback. Typically, this is what breaks our parsers and is a giveaway that this is a JSONP-formatted file. In JavaScript, the page defines the function that the response calls, like this:

    /* Using JSONP in JavaScript */
    callback = function (data) { alert(data.authorname); };

JSONP with Python

We can get around a JSONP data source though, if we need to; it just requires a bit of work. We can use the str.replace() method in Python to strip out the callback before running the string through our JSON parser. If we were parsing our example JSONP file in our JSON parser example, the line would look something like this:

    # Strip the callback wrapper, then convert the string to a Python dictionary.
    data = json.loads(bookdata.replace('callback(', '').replace(');', ''))

Note that this simple approach assumes the strings callback( and ); do not appear inside the data itself.

Summary

In this article, we covered HTTP concepts and methodologies for pulling strings and data from the Web. We learned how to do that with Python using the urllib2 library, and parsed XML data and JSON data.
We discussed the differences between JSON and JSONP, and how to work around JSONP if needed.
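
The str.replace() workaround from the JSONP with Python section can misfire when the data itself contains parentheses. As a hedged alternative, the sketch below strips the wrapper with a regular expression instead. The parse_jsonp helper, the pattern, and the sample string are assumptions made for illustration; the sample uses double-quoted keys because Python's json module accepts only strict JSON.

```python
import json
import re

# Matches a callback name, an opening parenthesis, the payload, and a closing
# parenthesis with an optional semicolon. The callback name itself is arbitrary.
JSONP_PATTERN = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(text):
    """Strip a JSONP callback wrapper, if present, and parse the JSON inside."""
    match = JSONP_PATTERN.match(text)
    # Fall back to the raw text when there is no wrapper (plain JSON).
    payload = match.group(1) if match else text
    return json.loads(payload)


# Hypothetical response body; the parentheses inside the value are preserved.
sample = 'callback({"authorname": "Chad Adams (author)"});'
data = parse_jsonp(sample)
print(data["authorname"])
```

Only the outermost wrapper is removed, so parentheses inside the payload survive, and plain JSON passes through unchanged.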

Packt

20 Aug 2014

7 min read

Amazon DynamoDB - Modelling relationships, Error handling

In this article by Tanmay Deshpande, author of Mastering DynamoDB, we revisit the core concepts of DynamoDB and look more closely at its features and implementation.

Amazon DynamoDB is a fully managed, cloud-hosted NoSQL database. It provides fast and predictable performance with the ability to scale seamlessly. It allows you to store and retrieve any amount of data and serve any level of network traffic without any operational burden. DynamoDB offers numerous other advantages, such as consistent and predictable performance, flexible data modeling, and durability. With just a few clicks in the Amazon Web Services console, you can create your own DynamoDB table and scale its provisioned throughput up or down without taking down your application even for a millisecond. DynamoDB stores data on solid state disks (SSDs), and it automatically replicates the data across other AWS Availability Zones, which provides built-in durability, high availability, and reliability.

Before we discuss DynamoDB in detail, let's try to understand what NoSQL databases are and when to choose DynamoDB over an RDBMS. With the rise in data volume, variety, and velocity, RDBMSs were neither designed to cope with the scale and flexibility challenges developers face in building modern applications, nor able to take advantage of cheap commodity hardware. They also require a schema to be defined before any data is added, which restricts developers from making their applications flexible. NoSQL databases, on the other hand, are fast, provide flexible schema operations, and make effective use of cheap storage. Considering all this, NoSQL is quickly becoming popular in the developer community. But one has to be very cautious about when to go for NoSQL and when to stick to an RDBMS.
Sticking to relational databases makes sense when you know the schema is more or less static, strong consistency is a must, and the data is not going to be that big in volume. But when you want to build an application that is Internet scalable, the schema is likely to evolve over time, the storage is going to be really big, and the operations involved can tolerate eventual consistency, then NoSQL is the way to go.

There are various types of NoSQL databases. The following is a list of NoSQL database types and popular examples:

Document Store – MongoDB, CouchDB, MarkLogic, etc.
Column Store – HBase, Cassandra, etc.
Key Value Store – DynamoDB, Azure, Redis, etc.
Graph Databases – Neo4j, DEX, etc.

Most of these NoSQL solutions are open source, except a few such as DynamoDB and Azure, which are available as services over the Internet. DynamoDB, being a key-value store, indexes data only on primary keys, and one needs to go through the primary key to access a certain attribute. Let's start learning more about DynamoDB by having a look at its history.

DynamoDB's History

Amazon's e-commerce platform had a huge set of decoupled services that were developed and managed individually, and each service exposed an API for others to consume. Earlier, each service had direct database access, which was a major bottleneck. In terms of scalability, Amazon's requirements were more than any third-party vendor could provide at that time. DynamoDB was built to address Amazon's needs for high availability, extreme scalability, and durability. Earlier, Amazon stored its production data in relational databases, and services had been provided for all required operations. But later they realized that most of the services accessed data only through its primary key and did not need complex queries to fetch the required data; in addition, maintaining these RDBMS systems required high-end hardware and skilled personnel.
So to overcome all such issues, Amazon's engineering team built a NoSQL database that addresses all the above-mentioned issues. In 2007, Amazon released a research paper on Dynamo that combined the best ideas from the database and key-value store worlds and was the inspiration for many open source projects at the time. Cassandra, Voldemort, and Riak were among them. You can find the paper at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Even though Dynamo had great features that took care of all engineering needs, it was not widely accepted at that time within Amazon itself, as it was not a fully managed service. When Amazon released S3 and SimpleDB, engineering teams were quite excited to adopt those rather than Dynamo, as DynamoDB was a bit expensive at that time due to SSDs. Finally, after rounds of improvement, Amazon released Dynamo as a cloud-based service, and since then it has been one of the most widely used NoSQL databases. Before its release to the public cloud in 2012, DynamoDB was the core storage service for Amazon's e-commerce platform, which started with the shopping cart and session management services. Any downtime or degradation in performance had a major impact on Amazon's business, and any financial impact was strictly not acceptable; DynamoDB proved itself to be the best choice in the end. Now let's try to understand DynamoDB in more detail.

What is DynamoDB?

DynamoDB is a fully managed, Internet scalable, easily administered, and cost-effective NoSQL database. It is part of the database-as-a-service offerings of Amazon Web Services. The above diagram shows how Amazon offers its various cloud services and where exactly DynamoDB is placed. AWS RDS is a relational database as a service over the Internet from Amazon, while SimpleDB and DynamoDB are NoSQL databases as services. Both SimpleDB and DynamoDB are fully managed, non-relational services. DynamoDB is built for fast, seamless scalability and high performance.
It runs on SSDs to provide faster responses and has no limits on request capacity and storage. It automatically partitions your data throughout the cluster to meet expectations, whereas in SimpleDB we have a storage limit of 10 GB and can serve only a limited number of requests per second. Also, in SimpleDB we have to manage our own partitions. So depending upon your needs, you have to choose the correct solution. To use DynamoDB, the first and foremost requirement is an AWS account. Through the easy-to-use AWS management console, you can directly create new tables by providing the necessary information, and can start loading data into the tables in a few minutes.

Modelling Relationships

As with any other database, modeling relationships is quite interesting even though DynamoDB is a NoSQL database. People often get confused about how to model the relationships between various tables; in this section, we try to simplify this problem. To understand relationships better, let's use our example of a book store, where we have entities such as Book, Author, Publisher, and so on.

One to One

In this type of relationship, one record of a table is related to only one record of another table. In our book store application, we have the BookInfo and BookDetails tables. The BookInfo table can hold short information about the book, which can be used to display book information on a web page, whereas the BookDetails table would be used when someone explicitly needs to see all the details of the book. This design helps keep our system healthy: even if there is a high request load on one table, the other table would still be up and running to fulfil what it is supposed to do. The following diagram shows what the table structure would look like.

One to many

In this type of relationship, one record from an entity is related to more than one record in another entity.
In the book store application, we can have a Publisher Book table that keeps information about the relationship between books and publishers. Here we can have Publisher Id as the hash key and Book Id as the range key. The following diagram shows what the table structure would look like.

Many to many

A many-to-many relationship means that many records from one entity are related to many records from another entity. In the book store application, a book can be written by multiple authors and an author can write multiple books. In this case, we should use two tables, each with both hash and range keys.
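
To make the key layout concrete, here is a small in-memory sketch in Python. It is not the DynamoDB API; the KeyedTable class, the table names, and the sample IDs are hypothetical stand-ins that only mimic a table addressed by a hash key and a range key.

```python
class KeyedTable:
    """In-memory stand-in for a table with a (hash key, range key) primary key."""

    def __init__(self):
        self._items = {}

    def put(self, hash_key, range_key, attributes):
        # The pair of keys identifies exactly one item.
        self._items[(hash_key, range_key)] = attributes

    def query(self, hash_key):
        """Return the range keys stored under one hash key, sorted."""
        return sorted(rk for (hk, rk) in self._items if hk == hash_key)


# One to many: publisher ID as the hash key, book ID as the range key.
publisher_book = KeyedTable()
publisher_book.put("pub-1", "book-1", {})
publisher_book.put("pub-1", "book-2", {})

# Many to many: two tables, one per direction of the relationship.
book_author = KeyedTable()
author_book = KeyedTable()
pairs = [("book-1", "auth-1"), ("book-1", "auth-2"), ("book-2", "auth-1")]
for book_id, author_id in pairs:
    book_author.put(book_id, author_id, {})
    author_book.put(author_id, book_id, {})

print(publisher_book.query("pub-1"))
print(book_author.query("book-1"))
print(author_book.query("auth-1"))
```

Keeping one table per direction means either side of the many-to-many relationship can be looked up by its own hash key, at the cost of writing each pair twice.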

Packt

21 Jul 2014

9 min read

Sharding in Action

