Data | Tech News, Tutorials & Expert Insights


26 Jun 2013

10 min read

Creating your first heat map in R


The following image shows one of the heat maps that we are going to create in this recipe from the total count of air passengers:

[Image]

Getting ready

Download the script 5644_01_01.r from your account at http://www.packtpub.com and save it to your hard disk. The first section of the script, below the comment line starting with ### loading packages, automatically checks for the availability of the R packages gplots and lattice, which are required for this recipe. If those packages are not already installed, you will be prompted to select an official server from the Comprehensive R Archive Network (CRAN) to allow the automatic download and installation of the required packages.

If you installed those two packages before executing the script, I recommend updating them to the most recent version by calling the following function in the R command line:

[code]

Use the source() function in the R command line to execute an external script from any location on your hard drive. If you start a new R session from the directory that contains the script, simply provide the name of the script as an argument in the function call as follows:

[code]

If you started your R session from a different directory, you have to provide the absolute or relative path to the script on your hard drive. Refer to the following example:

[code]

You can view the working directory of your current R session by executing the following command in the R command line:

[code]

How to do it...

Run the 5644OS_01_01.r script in R to execute the following code, and take a look at the output printed on the screen as well as the PDF file, first_heatmaps.pdf, that will be created by this script:

[code]

How it works...

There are different functions for drawing heat maps in R, and each has its own advantages and disadvantages.
In this recipe, we will take a look at the levelplot() function from the lattice package to draw our first heat map. Furthermore, we will use the advanced heatmap.2() function from gplots to apply a clustering algorithm to our data and add the resulting dendrograms to our heat maps. The following image shows an overview of the different plotting functions that we are using throughout this book:

[Image]

Now let us take a look, step by step, at how we read in and process the data:

Loading packages: The first eight lines preceding the ### loading data section make sure that R loads the lattice and gplots packages, which we need for the two heat map functions in this recipe: levelplot() and heatmap.2(). Each time we start a new session in R, we have to load the required packages in order to use these functions. To do so, enter the following function calls directly into the R command line or include them at the beginning of your script:

library(lattice)
library(gplots)

Loading the data set: R includes a package called datasets, which contains a variety of data sets for testing and exploration purposes. More information on the data sets contained in the datasets package can be found at http://stat.ethz.ch/R-manual/R-patched/library/datasets/. For this recipe, we are loading the AirPassengers data set, which is a collection of the total count of air passengers (in thousands) for international airlines from 1949 to 1960 in a time-series format.

[code]

Converting the data set into a numeric matrix: Before we can use the heat map functions, we need to convert the AirPassengers time-series data into a numeric matrix. Numeric matrices in R can have characters as row and column labels, but the content itself must consist of a single mode: numeric. We use the matrix() function to create a numeric matrix consisting of 12 columns, into which we pass the AirPassengers time-series data row by row.
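The row-by-row fill described above can be pictured with a short Python sketch. It is only an illustration of the reshaping idea, not the recipe's R code: the helper name to_matrix and the placeholder series 1..144 are assumptions made for this example, standing in for the 144 monthly values of the real data set.

```python
# Illustrative sketch: reshape a flat monthly series into a year-by-month grid.
# The values below are placeholders, NOT the real passenger counts.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
YEARS = [str(year) for year in range(1949, 1961)]


def to_matrix(series, ncol):
    """Split a flat series into rows of ncol values (row-by-row fill)."""
    if len(series) % ncol != 0:
        raise ValueError("series length must be a multiple of ncol")
    return [series[i:i + ncol] for i in range(0, len(series), ncol)]


series = list(range(1, 145))          # hypothetical stand-in for 12 years x 12 months
air_data = to_matrix(series, 12)      # 12 rows (years), 12 columns (months)

# Attach row and column labels, mirroring the role of dimnames in the text.
labelled = {year: dict(zip(MONTHS, row)) for year, row in zip(YEARS, air_data)}
```

Each row of air_data holds one year, so labelled["1950"]["Jan"] is the thirteenth value of the series. Filling column by column instead would scatter one year's months across different rows, which is why the fill direction matters for a year-by-month heat map.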
Using the argument dimnames = rowColNames, we provide the row and column names that we previously assigned to the variable rowColNames, which is a list of two vectors: a series of 12 strings representing the years 1949 to 1960, and a series of strings for the 12 three-letter abbreviations of the months from January to December, respectively.

[code]

A simple heat map using levelplot(): Now that we have converted the AirPassengers data into a numeric matrix format and assigned it to the variable air_data, we can go ahead and construct our first heat map using the levelplot() function from the lattice package:

[code]

The levelplot() function creates a simple heat map with a color key to the right-hand side of the map. We can use the argument col.regions = heat.colors to change the default color transition to yellow and red. The x and y axis labels are specified by the xlab and ylab parameters, respectively, and the main parameter gives our heat map its caption.

In contrast to most of the other plotting functions in R, the functions of the lattice package return plot objects, so we have to use the print() function in our script if we want to save the plot to a file. In an interactive R session, the print() call can be omitted: typing the name of the variable automatically displays the object it refers to on the screen.

Creating enhanced heat maps with heatmap.2(): Next, we will use the heatmap.2() function to apply a clustering algorithm to the AirPassengers data and to add row and column dendrograms to our heat map:

[code]

Hierarchical clustering is especially popular in gene expression analyses. It is a very powerful method for grouping data to reveal interesting trends and patterns in the data matrix.

Another neat feature of heatmap.2() is that you can display a histogram of the count of the individual values inside the color key by including the argument density.info = "histogram" in the function call. Alternatively, you can set density.info = "density" to display a density plot inside the color key. By adding the argument keysize = 1.8, we slightly increase the size of the color key; the default value of keysize is 1.5:

[code]

Did you notice the missing row dendrogram in the resulting heat map? This is due to the argument dendrogram = "column" that we passed to the heat map function. Similarly, you can type "row" instead of "column" to suppress the column dendrogram, or use "none" to draw no dendrogram at all.

There's more...

By default, levelplot() places the color key on the right-hand side of the heat map, but it can easily be moved to the top, bottom, or left-hand side of the map by modifying the space parameter of colorkey, for example by setting it to top. Replacing top by left or bottom will place the color key on the left-hand side or at the bottom of the heat map, respectively.

Moving the color key around for heatmap.2() can be a little more of a hassle. In this case, we have to modify the parameters of the layout() function. By default, heatmap.2() passes a matrix, lmat, to layout(), which has the following content:

[code]

The numbers in the preceding matrix specify the locations of the different visual elements on the plot (1 implies heat map, 2 implies row dendrogram, 3 implies column dendrogram, and 4 implies key). If we want to change the position of the key, we have to modify and rearrange those values of lmat that heatmap.2() passes to layout().
For example, if we want to place the color key at the bottom left-hand corner of the heat map, we need to create a new matrix for lmat as follows:

[code]

We can construct such a matrix by using the rbind() function and assigning it to lmat:

[code]

Furthermore, we have to pass an argument for the row height parameter lhei to heatmap.2(), which will allow us to use our modified lmat matrix for rearranging the color key:

[code]

If you don't need a color key for your heat map, you can turn it off by using the argument key = FALSE for heatmap.2() and colorkey = FALSE for levelplot(), respectively.

R also has a base function for creating heat maps that does not require you to install external packages and is most advantageous if you can go without a color key. The syntax is very similar to that of the heatmap.2() function, and many of the options for heatmap.2() that we have seen in this recipe also apply to heatmap():

[code]

More information on dendrograms and clustering

By default, the dendrograms of heatmap.2() are created by a hierarchical agglomerative clustering method, also known as bottom-up clustering. In this approach, all individual objects start as individual clusters and are successively merged until only one single cluster remains. The distance between a pair of clusters is calculated by the farthest neighbor method, also called the complete linkage method, which is based by default on the Euclidean distance between the two points, one from each cluster, that are farthest apart from each other. The computed dendrograms are then reordered based on the row and column means.

By modifying the default parameters of the dist() function, we can use another distance measure rather than the Euclidean distance.
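To make the bottom-up procedure more concrete, here is a minimal Python sketch of agglomerative clustering with complete linkage and a pluggable distance measure. It is a teaching illustration under simplifying assumptions (a naive search over all cluster pairs, tuples as points) and not the implementation used by R; the function names are chosen for this example only.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Grid-like distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def complete_linkage(cluster_a, cluster_b, dist):
    """Cluster distance = distance between the two farthest members."""
    return max(dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(points, dist=euclidean):
    """Merge the two closest clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram visualizes.
    """
    clusters = [[p] for p in points]   # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


points = [(0, 0), (0, 1), (5, 5), (6, 5)]
history_euclidean = agglomerate(points)
history_manhattan = agglomerate(points, dist=manhattan)
```

With these four points, both distance measures first join the two nearby pairs and then merge the resulting clusters, but the height of the final merge differs because the farthest pair is measured differently. Swapping the dist argument plays the same role here as changing the distance measure in the recipe.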
For example, if we want to use the Manhattan distance measure (based on a grid-like path rather than a direct connection between two objects), we would modify the method parameter of the dist() function and assign the result to a variable, distance, first:

[code]

Other options for the method parameter are: euclidean (default), maximum, canberra, binary, or minkowski.

To use agglomeration methods other than the complete linkage method, we modify the method parameter in the hclust() function and assign the result to another variable, cluster. Note the first argument, distance, that we pass to the hclust() function, which comes from our previous assignment:

[code]

By setting the method parameter to ward, R will use Joe H. Ward's minimum variance method for hierarchical clustering (in R 3.1.0 and later, this option is named ward.D). Other options for the method parameter that we can pass as arguments to hclust() are: complete (default), single, average, mcquitty, median, or centroid.

To use our modified clustering parameters, we simply call the as.dendrogram() function within heatmap.2() using the variable cluster that we assigned previously:

[code]

We can also draw the cluster dendrogram without the heat map by using the plot() function:

[code]

To turn off row and column reordering, we need to turn off the dendrograms and set the parameters Colv and Rowv to NA:

[code]

Summary

This article has helped us create our first heat maps from a small data set provided in R. We have used different heat map functions in R to get a first impression of their functionalities.



Packt

25 Jun 2013

3 min read

Linking Section Access to multiple dimensions


Getting ready

Load the following script:

Product:
LOAD * INLINE [
ProductID, ProductGroup, ProductName
1, GroupA, Great As
2, GroupC, Super Cs
3, GroupC, Mega Cs
4, GroupB, Good Bs
5, GroupB, Busy Bs
];

Customer:
LOAD * INLINE [
CustomerID, CustomerName, Country
1, Gatsby Gang, USA
2, Charly Choc, USA
3, Donnie Drake, USA
4, London Lamps, UK
5, Shylock Homes, UK
];

Sales:
LOAD * INLINE [
CustomerID, ProductID, Sales
1, 2, 3536
1, 3, 4333
1, 5, 2123
2, 2, 4556
2, 4, 1223
2, 5, 6789
3, 2, 1323
3, 3, 3245
3, 4, 6789
4, 2, 2311
4, 3, 1333
5, 1, 7654
5, 2, 3455
5, 3, 6547
5, 4, 2854
5, 5, 9877
];

CountryLink:
Load Distinct Country, Upper(Country) As COUNTRY_LINK
Resident Customer;

Load Distinct Country, 'ALL' As COUNTRY_LINK
Resident Customer;

ProductLink:
Load Distinct ProductGroup, Upper(ProductGroup) As PRODUCT_LINK
Resident Product;

Load Distinct ProductGroup, 'ALL' As PRODUCT_LINK
Resident Product;

//Section Access;

Access:
LOAD * INLINE [
ACCESS, USERID, PRODUCT_LINK, COUNTRY_LINK
ADMIN, ADMIN, *, *
USER, GM, ALL, ALL
USER, CM1, ALL, USA
USER, CM2, ALL, UK
USER, PM1, GROUPA, ALL
USER, PM2, GROUPB, ALL
USER, PM3, GROUPC, ALL
USER, SM1, GROUPB, UK
USER, SM2, GROUPA, USA
];

Section Application;

Note that there is a loop error generated on reload because there is a loop in the data structure.

How to do it…

Follow these steps to link Section Access to multiple dimensions:

Add list boxes to the layout for ProductGroup and Country.
Add a statistics box for Sales.
Remove // to uncomment the Section Access statement.
From the Settings menu, open Document Properties and select the Opening tab.
Turn on the Initial Data Reduction Based on Section Access option.
Reload and save the document.
Close QlikView.
Re-open QlikView and open the document.
Log in as the Country Manager user, CM1. Note that USA is the only country. Also, the product group GroupA is missing, because there are no sales of this product group in USA.
Close QlikView and then re-open it.
This time, log in as the Sales Manager, SM2. You will not be allowed access to the document.
Log into the document as the ADMIN user.
Edit the script. Add a second entry for the SM2 user in the Access table as follows:

USER, SM2, GROUPA, USA
USER, SM2, GROUPB, UK

Reload, save, and close the document and QlikView.
Re-open and log in as SM2. Note the selections.

How it works…

Section Access is really quite simple. The user is connected to the data and the data is reduced accordingly. QlikView allows Section Access tables to be connected to multiple dimensions in the main data structure without causing issues with loops. Each associated field acts in the same way as a selection in the layout.

The initial setting for the SM2 user contained values that were mutually exclusive: GROUPA and USA have no sales in common. Because of the default Strict Exclusion setting, the SM2 user cannot log in.

We changed the script and included multiple rows for the SM2 user. Intuitively, we might expect that, as the first row did not connect to the data, only the second row would connect to the data. However, each field value is treated as an individual selection and all of the values are included.

There's more…

If we wanted to include solely the composite association of Country and ProductGroup, we would need to derive a composite key in the data set and connect the user to that.

In this example, we used the USERID field to test using QlikView logins. However, we would normally use NTNAME to link the user to either a Windows login or a custom login.
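The difference between "each field value is an individual selection" and a composite (row-wise) association can be sketched in a few lines of Python. This is a simplified model built only from the recipe's inline tables and the behaviour described under How it works; it does not call or reproduce QlikView itself, and the function names reduce_by_field and reduce_by_row are hypothetical.

```python
# Data copied from the recipe's inline tables.
PRODUCT_GROUP = {1: "GroupA", 2: "GroupC", 3: "GroupC", 4: "GroupB", 5: "GroupB"}
CUSTOMER_COUNTRY = {1: "USA", 2: "USA", 3: "USA", 4: "UK", 5: "UK"}
SALES = [  # (CustomerID, ProductID, Sales)
    (1, 2, 3536), (1, 3, 4333), (1, 5, 2123),
    (2, 2, 4556), (2, 4, 1223), (2, 5, 6789),
    (3, 2, 1323), (3, 3, 3245), (3, 4, 6789),
    (4, 2, 2311), (4, 3, 1333),
    (5, 1, 7654), (5, 2, 3455), (5, 3, 6547), (5, 4, 2854), (5, 5, 9877),
]


def _matches(sale, group_link, country_link):
    """True if a sale is reachable from one PRODUCT_LINK/COUNTRY_LINK pair."""
    customer_id, product_id, _ = sale
    group_ok = group_link == "ALL" or PRODUCT_GROUP[product_id].upper() == group_link
    country_ok = country_link == "ALL" or CUSTOMER_COUNTRY[customer_id].upper() == country_link
    return group_ok and country_ok


def reduce_by_field(access_rows):
    """Model of the described behaviour: values are pooled per field first."""
    groups = {group for group, _ in access_rows}
    countries = {country for _, country in access_rows}
    return [s for s in SALES
            if any(_matches(s, g, "ALL") for g in groups)
            and any(_matches(s, "ALL", c) for c in countries)]


def reduce_by_row(access_rows):
    """Composite alternative: a sale must match one access row as a whole."""
    return [s for s in SALES if any(_matches(s, g, c) for g, c in access_rows)]


sm2_initial = [("GROUPA", "USA")]
sm2_updated = [("GROUPA", "USA"), ("GROUPB", "UK")]
```

In this model, sm2_initial reduces to no rows at all, which corresponds to the login being refused under Strict Exclusion. With sm2_updated, the per-field reduction keeps every GroupA or GroupB sale in either country, whereas the row-wise reduction keeps only the GroupB sales in the UK; the latter is what a composite key would be needed for.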



Packt

14 Jun 2013

5 min read

IBM Cognos Workspace Advanced



Packt

12 Jun 2013

8 min read

A quick start – OpenCV fundamentals


The OpenCV library has a modular structure, and the following diagram depicts the different modules available in it:

A brief description of all the modules is as follows:

Module: Feature
Core: A compact module defining basic data structures, including the dense multidimensional array Mat and basic functions used by all other modules.
Imgproc: An image processing module that includes linear and non-linear image filtering, geometrical image transformations (resize, affine and perspective warping, generic table-based remapping), color space conversion, histograms, and so on.
Video: A video analysis module that includes motion estimation, background subtraction, and object tracking algorithms.
Calib3d: Basic multiple-view geometry algorithms, single and stereo camera calibration, object pose estimation, stereo correspondence algorithms, and elements of 3D reconstruction.
Features2d: Salient feature detectors, descriptors, and descriptor matchers.
Objdetect: Detection of objects and instances of the predefined classes; for example, faces, eyes, mugs, people, cars, and so on.
Highgui: An easy-to-use interface to video capturing, image and video codecs, as well as simple UI capabilities.
Gpu: GPU-accelerated algorithms from different OpenCV modules.

Task 1 – image basics

When we recreate the physical world around us in digital format, via a camera for example, the computer sees the image only as a code consisting of the numbers 1 and 0. A digital image is nothing but a collection of pixels (picture elements), which are stored in matrices in OpenCV for further manipulation. In the matrices, each element contains information about a particular pixel in the image. The pixel value decides how bright or what color that pixel should be.
Based on this, we can classify images as:

Grayscale
Color/RGB

Grayscale

Here the pixel value can range from 0 to 255, and hence we can see the various shades of gray as shown in the following diagram. Here, 0 represents black and 255 represents white:

A special case of grayscale is the binary image or black and white image. Here every pixel is either black or white, as shown in the following diagram:

Color/RGB

Red, Blue, and Green are the primary colors, and by mixing them in different proportions we can get new colors. A pixel in a color image has three separate channels, one each for Red, Blue, and Green. The value ranges from 0 to 255 for each channel, as shown in the following diagram:

Task 2 – reading and displaying an image

We are now going to write a very simple and basic program using the OpenCV library to read and display an image. This will help you understand the basics.

Code

A simple program to read and display an image is as follows:

    // opencv header files
    #include "opencv2/highgui/highgui.hpp"
    #include "opencv2/core/core.hpp"

    // namespaces declaration
    using namespace cv;
    using namespace std;

    // create a variable to store the image
    Mat image;

    int main( int argc, char** argv )
    {
        // open the image and store it in the 'image' variable
        // Replace the path with where you have downloaded the image
        image=imread("<path to image>/lena.jpg");
        // create a window to display the image
        namedWindow( "Display window", CV_WINDOW_AUTOSIZE );
        // display the image in the window created
        imshow( "Display window", image );
        // wait for a keystroke
        waitKey(0);
        return 0;
    }

Code explanation

Now let us understand how the code works. Short comments have also been included in the code itself to increase the readability.

    #include "opencv2/highgui/highgui.hpp"
    #include "opencv2/core/core.hpp"

The preceding two header files will be a part of almost every program we write using the OpenCV library.
As explained earlier, the highgui header is used for window creation, management, and so on, while the core header is used to access the Mat data structure in OpenCV.

    using namespace cv;
    using namespace std;

The preceding two lines declare the required namespaces for this code so that we don't have to use the :: (scope resolution) operator every time we access the functions.

    Mat image;

With the preceding declaration, we have created a variable, image, of the datatype Mat, which is frequently used in OpenCV to store images.

    image=imread("<path to image>/lena.jpg");

In the previous command, we opened the image lena.jpg and stored it in the image variable. Replace <path to image> in the preceding command with the location of that picture on your PC.

    namedWindow( "Display window", CV_WINDOW_AUTOSIZE );

We now need a window to display our image, so we use the preceding function to create one. This function takes two parameters, the first of which is the name of the window. In our case, we would like to name our window Display window. The second parameter is optional; with CV_WINDOW_AUTOSIZE, the window is sized to fit the image so that the image is not cropped.

    imshow( "Display window", image );

Finally, we are ready to display our image in the window we just created by using the preceding function. This function takes two parameters, the first of which is the name of the window in which the image has to be displayed. In our case, that is Display window. The second parameter is the variable containing the image that we want to display. In our case, it's the image variable.

    waitKey(0);

Last but not least, it is advised that you use the preceding function in most of the programs that you write using the OpenCV library. If we don't call it, the image will be displayed for a fraction of a second and the program will terminate immediately. It happens so fast that you will not be able to see the image.
Essentially, this function waits for a keystroke from the user and hence delays the termination of the program. The argument is a delay in milliseconds; a value of 0 makes the function wait indefinitely for a key press.

Output

The image can be displayed as follows:

Task 3 – resizing and saving an image

We are now going to write a very simple and basic program using the OpenCV library to resize and save an image.

Code

The following code helps you to resize a given image:

    // opencv header files
    #include "opencv2/highgui/highgui.hpp"
    #include "opencv2/imgproc/imgproc.hpp"
    #include "opencv2/core/core.hpp"

    // namespaces declaration
    using namespace std;
    using namespace cv;

    int main(int argc, char** argv)
    {
        // create variables to store the images
        Mat org, resized, saved;
        // open the image and store it in the 'org' variable
        // Replace the path with where you have downloaded the image
        org=imread("<path to image>/lena.png");
        // create a window to display the image
        namedWindow("Original Image",CV_WINDOW_AUTOSIZE);
        // display the image
        imshow("Original Image",org);
        // resize the image to half its width and height
        resize(org,resized,Size(),0.5,0.5,INTER_LINEAR);
        namedWindow("Resized Image",CV_WINDOW_AUTOSIZE);
        imshow("Resized Image",resized);
        // save the image
        // Replace <path> with your desired location
        imwrite("<path>/saved.png",resized);
        namedWindow("Image saved",CV_WINDOW_AUTOSIZE);
        // read the saved file back and display it
        saved=imread("<path to image>/saved.png");
        imshow("Image saved",saved);
        // wait for a keystroke
        waitKey(0);
        return 0;
    }

Code explanation

Only the new functions and concepts will be explained in this case.

    #include "opencv2/imgproc/imgproc.hpp"

Imgproc is another useful header that gives us access to the various transformations, color conversions, filters, histograms, and so on.

    Mat org, resized, saved;

We have now created three variables, org, resized, and saved, to store the original image, the resized image, and the saved image after it is read back from disk, respectively.

    resize(org,resized,Size(),0.5,0.5,INTER_LINEAR);

We have used the preceding function to resize the image.
The preceding function takes six parameters, the first of which is the variable containing the source image to be modified. The second one is the variable to store the resized image. The third parameter is the output image size. In this case we have not specified it; instead, we have passed an empty Size(), which tells resize() to calculate the output size from the values of the fourth and fifth parameters. The fourth and fifth parameters are the scale factors along the horizontal and vertical axes respectively. The sixth parameter chooses the interpolation method. We have used bilinear interpolation, which is the default method.

    imwrite("<path>/saved.png",resized);

Finally, using the preceding function, you can save an image to a particular location on your PC. The function takes two parameters, the first of which is the location where you want to store the image, and the second is the variable in which the image is stored. This function is very useful when you want to perform multiple operations on an image and save the image on your PC for future reference. Replace <path> in the preceding function with your desired location.

Output

Resizing can be demonstrated through the following output:

Summary

This section showed you how to perform a few of the basic tasks in OpenCV as well as how to write your first OpenCV program.
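To make the resize() parameters from Task 3 more concrete, the following Python sketch shows how an output size can be derived from the two scale factors and how bilinear interpolation blends the four nearest source pixels. It is a plain-Python illustration on a small grayscale matrix, written under simplifying assumptions (pixel-center alignment, clamping at the borders); it is not OpenCV's implementation and will not necessarily match its output value for value. The function names are chosen for this example only.

```python
def output_size(width, height, fx, fy):
    """Output size when only the scale factors are given."""
    return round(width * fx), round(height * fy)


def resize_bilinear(img, fx, fy):
    """Resize a grayscale image (list of rows) by scale factors fx and fy."""
    src_h, src_w = len(img), len(img[0])
    dst_w, dst_h = output_size(src_w, src_h, fx, fy)
    out = []
    for y in range(dst_h):
        # Map the destination pixel center back into source coordinates.
        sy = min(max((y + 0.5) / fy - 0.5, 0.0), src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        wy = sy - y0
        row = []
        for x in range(dst_w):
            sx = min(max((x + 0.5) / fx - 0.5, 0.0), src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            wx = sx - x0
            # Blend horizontally on the two neighboring rows, then vertically.
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# A hypothetical 4x4 grayscale image used only for this illustration.
small = [
    [0, 10, 20, 30],
    [40, 50, 60, 70],
    [80, 90, 100, 110],
    [120, 130, 140, 150],
]
half = resize_bilinear(small, 0.5, 0.5)
```

With both scale factors set to 0.5, every output pixel falls exactly between four source pixels, so each value in half is the average of one 2x2 block of small. Other interpolation methods differ only in how many neighbors they consult and how they weight them.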



Packt

06 Jun 2013

10 min read

Implementing persistence in Redis (Intermediate)


Getting ready

Redis provides configuration settings for persistence and for enabling durability of data, depending on the project requirements:

If durability of data is critical
If durability of data is not important

You can achieve persistence of data using the snapshotting mode, which is the simplest mode in Redis. Depending on the configuration, Redis saves a dump of all the data sets in its memory into a single RDB file. The interval at which Redis dumps the memory can be configured to happen every X seconds or after Y operations.

Consider an example of a moderately busy server that receives 15,000 changes every minute over its 1 GB data set in memory. Based on the snapshotting rule, the data will be stored every 60 seconds or whenever there are at least 15,000 writes. So the snapshotting runs every minute and writes the entire 1 GB of data to the disk, which soon turns ugly and very inefficient.

To solve this particular problem, Redis provides another way of persistence, Append-only file (AOF), which is the main persistence option in Redis. This is similar to journal files, where all the operations performed are recorded and replayed in the same order to rebuild the exact state.

Redis's AOF persistence supports three different modes:

No fsync: In this mode, we take a chance and let the operating system decide when to flush the data. This is the fastest of the three modes.

fsync every second: This mode is a compromise between performance and durability. Data will be flushed using fsync every second. If the disk is not able to match the write speed, the fsync can take more than a second, in which case Redis delays the write by up to another second. So this mode guarantees that a write is committed to OS buffers and transferred to the disk within 2 seconds in the worst-case scenario.

fsync always: This is the last and safest mode. This provides complete durability of data at a heavy cost to performance.
In this mode, the data needs to be written to the file and synced with the disk using fsync before the client receives an acknowledgment. This is the slowest of all three modes.

How to do it...

First let us see how to configure snapshotting, followed by the Append-only file method:

In Redis, we can configure when a new snapshot of the data set will be performed. For example, Redis can be configured to dump the memory if the last dump was created more than 30 seconds ago and there are at least 100 keys that are modified or created. Snapshotting should be configured in the /etc/redis/6379.conf file. The configuration can be as follows:

save 900 1
save 60 10000

The first line translates to taking a snapshot of the data after 900 seconds if at least one key has changed, while the second line translates to snapshotting every 60 seconds if 10,000 keys have been modified in the meantime.

The configuration parameter rdbcompression defines whether the RDB file is to be compressed or not. There is a trade-off between CPU usage and the size of the RDB dump file.

We are interested in changing the dump's filename using the dbfilename parameter. Redis uses the current folder to create the dump files. For convenience, it is advised to store the RDB file in a separate folder.

dbfilename redis-snapshot.rdb
dir /var/lib/redis/

Let us run a small test to make sure the RDB dump is working. Start the server again. Connect to the server using redis-cli, as we did already. To test whether our snapshotting is working, issue the following commands:

SET Key Value
SAVE

After the SAVE command, a file should be created in the folder /var/lib/redis with the name redis-snapshot.rdb. This confirms that our installation is able to take a snapshot of our data into a file.

Now let us see how to configure persistence in Redis using the AOF method:

The configuration for persistence through AOF also goes into the same file, located at /etc/redis/6379.conf. By default, the Append-only mode is not enabled.
Enable it using the appendonly parameter.

appendonly yes

Also, if you would like to specify a filename for the AOF log, uncomment the line and change the filename.

appendfilename redis-aof.aof

The appendfsync everysec setting provides a good balance between performance and durability.

appendfsync everysec

Redis needs to know when it has to rewrite the AOF file. This is decided based on two configuration parameters, as follows:

auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

The AOF rewrite is performed only when the file has reached the minimum size and has grown by at least 100 percent compared to its size after the last rewrite.

How it works...

First let us see how snapshotting works. When one of the criteria is met, Redis forks the process. The child process starts writing the RDB file to the disk at the folder specified in our configuration file. Meanwhile, the parent process continues to serve the requests. The problem with this approach is that the parent process has to keep the keys that change while the child is snapshotting in extra memory. In the worst-case scenario, if all the keys are modified, the memory usage spikes to roughly double.

Caution

Be aware that the bigger the RDB file, the longer it takes Redis to restore the data on startup.

Corruption of the RDB file is not possible as it is created by the append-only method from the data in Redis's memory, by the child process. The new RDB file is created as a temporary file and is then renamed to the destination file using the atomic rename system call once the dump is completed.

AOF's working is simple. Every time a write operation is performed, the command is logged into a logfile. The format used in the logfile is the same as the format used by clients to communicate with the server. This makes AOF files easy to parse, which brings in the possibility of replaying the operations in another Redis instance.
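The logging-and-replay idea behind AOF can be sketched with a toy key-value store in Python. This is a conceptual model only: the class TinyStore, the tuple-based command format, and the two supported write commands are assumptions made for this illustration and do not reflect Redis's actual file format or internals.

```python
def apply_command(data, command):
    """Apply one write command, e.g. ("SET", "k", "v") or ("DEL", "k")."""
    name, *args = command
    if name == "SET":
        key, value = args
        data[key] = value
    elif name == "DEL":
        data.pop(args[0], None)
    else:
        raise ValueError(f"unsupported write command: {name}")


class TinyStore:
    """In-memory store that appends every write command to a log."""

    def __init__(self):
        self.data = {}
        self.log = []          # stands in for the append-only file

    def execute(self, command):
        if command[0] == "GET":
            return self.data.get(command[1])   # reads are not logged
        apply_command(self.data, command)
        self.log.append(command)
        return "OK"


def replay(log):
    """Rebuild the data set by re-running the logged commands in order."""
    data = {}
    for command in log:
        apply_command(data, command)
    return data


store = TinyStore()
store.execute(("SET", "key1", "a"))
store.execute(("SET", "key1", "b"))
store.execute(("GET", "key1"))
store.execute(("SET", "key2", "c"))
store.execute(("DEL", "key2"))
rebuilt = replay(store.log)
```

Replaying the log yields the same data as the live store, even though the log holds four entries for what ends up as a single key. That redundancy is the reason a log of this kind keeps growing and eventually needs to be rewritten.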
Only the operations that change the data set are written to the log. This log is used on startup to reconstruct the exact data. As we are continuously writing operations into the log, the AOF file grows rapidly with the number of operations performed. So, usually, the size of the AOF file is larger than the RDB dump. Redis manages the increasing size of the log by periodically compacting the file in a non-blocking manner.

For example, say a specific key, key1, has been changed 100 times using the SET command. In order to recreate the final state, only the last SET command is required. We do not need information about the previous 99 SET commands. This might look simple in theory, but it gets complex when dealing with complex data structures and operations such as union and intersection. Due to this complexity, it becomes very difficult to compress the existing file.

To reduce the complexity of compacting the AOF, Redis starts with the data in the memory and rewrites the AOF file from scratch. This is similar to the snapshotting method. Redis forks a child process that recreates the AOF file and performs an atomic rename to swap the old file with the new one. The same problem, the requirement of extra memory for operations performed during the rewrite, is present here. So the memory required can spike up to two times, depending on the operations performed while the AOF file is being rewritten.

There's more...

Both snapshotting and AOF have their own advantages and limitations, which makes it ideal to use both at the same time. Let us now discuss the major advantages and limitations of the snapshotting method.

Advantages of snapshotting

The advantages of configuring snapshotting in Redis are as follows:

RDB is a single compact file that cannot get corrupted due to the way it is created. It is very easy to implement.
This dump file is perfect for taking backups and for disaster recovery of remote servers.
The RDB file can just be copied and saved for future recoveries. In comparison, this approach has little or no influence on performance, as the only work the parent process needs to perform is forking a child process. The parent process never performs any disk operations; they are all performed by the child process. As an RDB file can be compressed, it provides a faster restart when compared to the append-only file method.

Limitations of snapshotting

Snapshotting, in spite of the advantages mentioned, has a few limitations that you should be aware of: The periodic background save can result in significant loss of data in case of server or hardware failure, because everything written since the last snapshot is lost. The fork() call used to save the data might take a moment, during which the server will stop serving clients. The larger the data set to be saved, the longer fork() takes to complete. The memory needed for the data set might double in the worst-case scenario, when all the keys in memory are modified while snapshotting is in progress.

What should we use?

Now that we have discussed both the modes of persistence Redis provides us with, the big question is what should we use? The answer depends entirely on our application and requirements. In cases where we expect good durability, both snapshotting and AOF can be turned on and made to work in unison, providing us with redundant persistence. When both are enabled, Redis restores the data from the AOF on startup, as it is expected to be the more durable of the two, with less loss of data. Both RDB and AOF files can be copied and stored for future use or for recovering another instance of Redis.

In a few cases, where performance is very critical, memory usage is limited, and durability is still paramount, persistence can be turned off completely on the instance that serves clients. In these cases, replication can be used to get durability. Replication is a process in which two Redis instances, one master and one slave, are kept in sync with the same data.
Clients are served by the master, and the master server syncs the data with a slave.

Replication setup for persistence

Consider a setup as shown in the preceding image, that is, a master instance with no persistence and a slave instance with AOF enabled. In this case, the master does not need to perform any background disk operations and is fully dedicated to serving client requests, apart from maintaining a trivial slave connection. The slave server configured with AOF performs the disk operations. As mentioned before, the AOF file can be used to restore the master in case of a disaster.

Persistence in Redis is a matter of configuration, balancing the trade-off between performance, disk I/O, and data durability. If you are looking for more information on persistence in Redis, you will find the article by Salvatore Sanfilippo at http://oldblog.antirez.com/post/redis-persistence-demystified.html interesting.

Summary

This article helps you understand the persistence options available in Redis, which could ease your efforts of adding Redis to your application stack.
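To make the AOF rewrite described under "How it works..." more concrete, here is a minimal sketch in plain Python. It is a toy model, not Redis code: the tuple-based log format and the helper names (apply_command, replay, rewrite_log, should_rewrite) are assumptions made for illustration, and the exact comparisons Redis uses internally for the rewrite trigger may differ.

```python
MB = 1024 * 1024


def apply_command(state, command):
    """Apply one logged write command to an in-memory dict (toy model)."""
    op, key, *args = command
    if op == "SET":
        state[key] = args[0]
    elif op == "DEL":
        state.pop(key, None)
    else:
        raise ValueError("unsupported command: %r" % (op,))


def replay(log):
    """Rebuild the data set by replaying every command in the log."""
    state = {}
    for command in log:
        apply_command(state, command)
    return state


def rewrite_log(state):
    """Write a fresh log from the data in memory, ignoring the old log."""
    return [("SET", key, value) for key, value in state.items()]


def should_rewrite(current_size, base_size, percentage=100, min_size=64 * MB):
    """Assumed trigger rule: minimum size reached AND enough growth."""
    if current_size < min_size:
        return False
    growth = (current_size - base_size) * 100.0 / max(base_size, 1)
    return growth >= percentage


# key1 is changed 100 times; key2 is created and then deleted.
log = [("SET", "key1", str(i)) for i in range(100)]
log += [("SET", "key2", "a"), ("DEL", "key2")]

state = replay(log)
compact = rewrite_log(state)
print(len(log), "commands before rewrite,", len(compact), "after")
```

Because the new log is generated from the final state rather than by analysing the old log, the 99 superseded SET commands and the deleted key simply never appear in it, which is the simplification the article describes.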


Packt

29 May 2013

6 min read

Optimizing Programs


Using transaction SAT to find problem areas

In this recipe, we will see the steps required to analyze the execution of any report, transaction, or function module using the transaction SAT.

Getting ready

For this recipe, we will analyze the runtime of the standard program RIBELF00 (Display Document Flow Program). The program selection screen contains a number of fields. We will execute the program for a range of order numbers (aufnr) and see how it behaves.

How to do it...

For carrying out runtime analysis using transaction SAT, proceed as follows:

Call transaction SAT. The screen appears as shown:

Enter a suitable name for the variant (in our case, YPERF_VARIANT) and click the Create button below it. This will take you to the Variant creation screen.

On the Duration and Type tab, switch on Aggregation by choosing the Per Call Position radio button. Then, click on the Statements tab.

On the Statements tab, make sure that the Read Operations and Change Operations checkboxes under Internal Tables, and the Open SQL checkbox under Database Access, are checked.

Save your variant and come back to the main screen of SAT.

Make sure that within Data Formatting on the initial screen of SAT, the checkbox for Determine Names of Internal Tables is selected.

Next, enter the name of the program that is to be traced in the field provided (in our case, it is RIBELF00). Then click the button.

The screen of the program appears as shown. We will enter an order number range and execute the program.

Once the program output is generated, click on the Back key to come back to the program selection screen.

Click on the Back key once again to generate the evaluation results.

How it works...

We executed the program through the transaction SAT, and the evaluation results were generated. On the left are the Trace Results (in tree form), listing the statements and events with the most runtime.
These act as a summary report of the entire measurement of the program. They are listed in descending order of the net time in microseconds and the percentage of the total time. For example, in our case, the OPEN CURSOR event takes 68 percent of the total runtime of the program.

Selecting the Hit List tab shows the components of the program that consume the most time. In this example, the access of database tables AFRU and VBAK takes most of the time.

Double-clicking any item in the Trace Results window on the left-hand side will display (in the Hit List area in the right-hand pane) details of the contained items along with the execution time of each item.

From the Hit List window, double-clicking a particular item will take us to the relevant line in the program code. For example, when we double-click the Open Cursor VBAK line, it will take us to the corresponding program code.

We have carried out the analysis with Aggregation switched on. With Aggregation on, multiple calls of a particular line of code are shown as one single entry. Because of this, the results are less detailed but easier to read, since the hit list and the call hierarchy in the results are much simpler.

Also, by default, the names of the internal tables used are not shown in the results. For the internal table names to appear in the evaluation result, the Determine Names of Internal Tables checkbox must be checked.

As a general recommendation, the runtime analysis should be carried out several times for best results. The reason is that the database measurement time can depend on a variety of factors, such as system load, network performance, and so on.

Creation of secondary indexes in database tables

Very often, the cause of a long-running report is a full scan of a database table specified within the code, mainly because no suitable index exists.
In this recipe, we will see the steps required to create a new secondary index on a database table for performance improvement. Creating indexes lets you optimize standard reports as well as your own reports. In this recipe, we will create a secondary index on a test table ZST9_VBAK (which is simply a copy of VBAK).

How to do it...

For creating a secondary index, proceed as follows:

Call transaction SE11. Enter the name of the table in the field provided, in our case, ZST9_VBAK. Then click the Display button. This will take you to the Display Table screen.

Next, choose the menu path Goto | Indexes. This will display all indexes that currently exist for the table. Click the Create button and then choose the option Create Extension Index.

A dialog box appears. Enter a three-character name for the index. Then, press Enter. This will take you to the extension index maintenance screen.

On the top part, enter the short description in the Short Description field provided.

We will create a non-unique index, so the Non-unique index radio button is selected (in the middle part of the screen).

On the lower part of the screen, specify the field names to be used in the index. In our case, we use MANDT and AUFNR.

Then, activate your index using the keys Ctrl + F3. The index will be created in the database, and an appropriate creation message is shown below Status.

How it works...

This creates the index on the database. Since we created an extension index, the index will not be overwritten by SAP during an upgrade. Now any report that accesses the ZST9_VBAK table specifying MANDT and AUFNR in the WHERE clause will take advantage of an index scan using our new secondary index.

There's more...

SAP recommends that the index first be created in the development system and then transported to the quality system and to the production system. Secondary indexes are not automatically generated on target systems after being transported.
We should check the status in the Activation Log on the target systems, and use the Database Utility to manually activate the index in question.

Preferably, a secondary index should share as few fields as possible with other indexes. Too many redundant secondary indexes (that is, too many common fields across several indexes) on a table have a negative impact on performance. An example of such redundancy is a table with 10 secondary indexes that have more than three fields in common. In addition, tables that are rarely modified (and very often read) are the ideal candidates for secondary indexes.

See also

http://help.sap.com/saphelp_erp2005/helpdata/EN/85/685a41cdbf8047e10000000a1550b0/content.htm
http://help.sap.com/saphelp_nw04/helpdata/en/cf/21eb2d446011d189700000e8322d00/frameset.htm
http://docs.oracle.com/cd/E17076_02/html/programmer_reference/am_second.html
http://forums.sdn.sap.com/thread.jspa?threadID=1469347
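The effect of the secondary index from the recipe above (a full table scan turning into an index lookup when MANDT and AUFNR appear in the WHERE clause) can be tried out on a small scale. The following is only an analogy, not SAP code: it uses SQLite because it ships with Python, and the reduced column list, the index name zst9_vbak_z01, and the sample values are all hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical, heavily reduced stand-in for the ZST9_VBAK test table.
conn.execute("CREATE TABLE zst9_vbak (mandt TEXT, vbeln TEXT, aufnr TEXT)")

sql = "SELECT * FROM zst9_vbak WHERE mandt = ? AND aufnr = ?"
params = ("100", "000000123456")

# Without a suitable index, the database has to scan the whole table.
plan_before = query_plan(conn, sql, params)

# Non-unique secondary index on the two fields used in the WHERE clause.
conn.execute("CREATE INDEX zst9_vbak_z01 ON zst9_vbak (mandt, aufnr)")
plan_after = query_plan(conn, sql, params)

print("Before:", plan_before)
print("After: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, so the script prints it rather than relying on a fixed format. On an SAP system, the comparable check is to trace the report before and after creating the index.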


Packt

17 May 2013

37 min read

Techniques for Creating a Multimedia Database


Packt

13 May 2013

7 min read

Move Further with NumPy Modules


Linear algebra

Linear algebra is an important branch of mathematics. The numpy.linalg package contains linear algebra functions. With this module, you can invert matrices, calculate eigenvalues, solve linear equations, and determine determinants, among other things.

Time for action – inverting matrices

The inverse of a matrix A in linear algebra is the matrix A^-1, which, when multiplied with the original matrix, is equal to the identity matrix I. This can be written as A * A^-1 = I. The inv function in the numpy.linalg package can do this for us. Let's invert an example matrix. To invert matrices, perform the following steps:

We will create the example matrix with the mat function.

    A = np.mat("0 1 2;1 0 3;4 -3 8")
    print "A\n", A

The A matrix is printed as follows:

    A
    [[ 0  1  2]
     [ 1  0  3]
     [ 4 -3  8]]

Now, we can see the inv function in action, using which we will invert the matrix.

    inverse = np.linalg.inv(A)
    print "inverse of A\n", inverse

The inverse matrix is shown as follows:

    inverse of A
    [[-4.5  7.  -1.5]
     [-2.   4.  -1. ]
     [ 1.5 -2.   0.5]]

If the matrix is singular or not square, a LinAlgError exception is raised. If you want, you can check the result manually. This is left as an exercise for the reader.

Let's check what we get when we multiply the original matrix with the result of the inv function:

    print "Check\n", A * inverse

The result is the identity matrix, as expected.

    Check
    [[ 1.  0.  0.]
     [ 0.  1.  0.]
     [ 0.  0.  1.]]

What just happened?

We calculated the inverse of a matrix with the inv function of the numpy.linalg package. We checked, with matrix multiplication, whether this is indeed the inverse matrix.

    import numpy as np

    A = np.mat("0 1 2;1 0 3;4 -3 8")
    print "A\n", A
    inverse = np.linalg.inv(A)
    print "inverse of A\n", inverse
    print "Check\n", A * inverse

Solving linear systems

A matrix transforms a vector into another vector in a linear way. This transformation mathematically corresponds to a system of linear equations.
The numpy.linalg function, solve, solves systems of linear equations of the form Ax = b; here A is a matrix, b can be a 1D or 2D array, and x is an unknown variable. We will see the dot function in action. This function returns the dot product of two floating-point arrays.

Time for action – solving a linear system

Let's solve an example of a linear system. To solve a linear system, perform the following steps:

Let's create the matrices A and b.

    A = np.mat("1 -2 1;0 2 -8;-4 5 9")
    print "A\n", A
    b = np.array([0, 8, -9])
    print "b\n", b

The matrices A and b are shown as follows:

    A
    [[ 1 -2  1]
     [ 0  2 -8]
     [-4  5  9]]
    b
    [ 0  8 -9]

Solve this linear system by calling the solve function.

    x = np.linalg.solve(A, b)
    print "Solution", x

The following is the solution of the linear system:

    Solution [ 29.  16.   3.]

Check whether the solution is correct with the dot function.

    print "Check\n", np.dot(A , x)

The result is as expected:

    Check
    [[ 0.  8. -9.]]

What just happened?

We solved a linear system using the solve function from the NumPy linalg module and checked the solution with the dot function.

    import numpy as np

    A = np.mat("1 -2 1;0 2 -8;-4 5 9")
    print "A\n", A
    b = np.array([0, 8, -9])
    print "b\n", b
    x = np.linalg.solve(A, b)
    print "Solution", x
    print "Check\n", np.dot(A , x)

Finding eigenvalues and eigenvectors

Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix and x is a one-dimensional vector. Eigenvectors are vectors corresponding to eigenvalues. The eigvals function in the numpy.linalg package calculates eigenvalues. The eig function returns a tuple containing eigenvalues and eigenvectors.

Time for action – determining eigenvalues and eigenvectors

Let's calculate the eigenvalues of a matrix. Perform the following steps to do so:

Create a matrix as follows:

    A = np.mat("3 -2;1 0")
    print "A\n", A

The matrix we created looks like the following:

    A
    [[ 3 -2]
     [ 1  0]]

Calculate eigenvalues by calling the eigvals function.

    print "Eigenvalues", np.linalg.eigvals(A)

The eigenvalues of the matrix are as follows:

    Eigenvalues [ 2.  1.]

Determine eigenvalues and eigenvectors with the eig function. This function returns a tuple, where the first element contains eigenvalues and the second element contains the corresponding eigenvectors, arranged column-wise.

    eigenvalues, eigenvectors = np.linalg.eig(A)
    print "First tuple of eig", eigenvalues
    print "Second tuple of eig\n", eigenvectors

The eigenvalues and eigenvectors will be shown as follows:

    First tuple of eig [ 2.  1.]
    Second tuple of eig
    [[ 0.89442719  0.70710678]
     [ 0.4472136   0.70710678]]

Check the result with the dot function by calculating the right- and left-hand sides of the eigenvalues equation Ax = ax.

    for i in range(len(eigenvalues)):
        print "Left", np.dot(A, eigenvectors[:,i])
        print "Right", eigenvalues[i] * eigenvectors[:,i]
        print

The output is as follows:

    Left [[ 1.78885438]
     [ 0.89442719]]
    Right [[ 1.78885438]
     [ 0.89442719]]

    Left [[ 0.70710678]
     [ 0.70710678]]
    Right [[ 0.70710678]
     [ 0.70710678]]

What just happened?

We found the eigenvalues and eigenvectors of a matrix with the eigvals and eig functions of the numpy.linalg module. We checked the result using the dot function.

    import numpy as np

    A = np.mat("3 -2;1 0")
    print "A\n", A
    print "Eigenvalues", np.linalg.eigvals(A)
    eigenvalues, eigenvectors = np.linalg.eig(A)
    print "First tuple of eig", eigenvalues
    print "Second tuple of eig\n", eigenvectors

    for i in range(len(eigenvalues)):
        print "Left", np.dot(A, eigenvectors[:,i])
        print "Right", eigenvalues[i] * eigenvectors[:,i]
        print

Singular value decomposition

Singular value decomposition is a type of factorization that decomposes a matrix into a product of three matrices. The singular value decomposition is a generalization of the previously discussed eigenvalue decomposition. The svd function in the numpy.linalg package can perform this decomposition.
This function returns three matrices – U, Sigma, and V – such that U and V are orthogonal and Sigma contains the singular values of the input matrix. The decomposition can be written as M = U Sigma V*, where M is the input matrix. The asterisk denotes the Hermitian conjugate or the conjugate transpose.

Time for action – decomposing a matrix

It's time to decompose a matrix with the singular value decomposition. In order to decompose a matrix, perform the following steps:

First, create a matrix as follows:

    A = np.mat("4 11 14;8 7 -2")
    print "A\n", A

The matrix we created looks like the following:

    A
    [[ 4 11 14]
     [ 8  7 -2]]

Decompose the matrix with the svd function.

    U, Sigma, V = np.linalg.svd(A, full_matrices=False)
    print "U"
    print U
    print "Sigma"
    print Sigma
    print "V"
    print V

The result is a tuple containing the two orthogonal matrices U and V on the left- and right-hand sides and the singular values of the middle matrix.

    U
    [[-0.9486833  -0.31622777]
     [-0.31622777  0.9486833 ]]
    Sigma
    [ 18.97366596   9.48683298]
    V
    [[-0.33333333 -0.66666667 -0.66666667]
     [ 0.66666667  0.33333333 -0.66666667]]

We do not actually have the middle matrix; we only have its diagonal values. The other values are all 0. We can form the middle matrix with the diag function. Multiply the three matrices. This is shown as follows:

    print "Product\n", U * np.diag(Sigma) * V

The product of the three matrices looks like the following:

    Product
    [[  4.  11.  14.]
     [  8.   7.  -2.]]

What just happened?

We decomposed a matrix and checked the result by matrix multiplication. We used the svd function from the NumPy linalg module.

    import numpy as np

    A = np.mat("4 11 14;8 7 -2")
    print "A\n", A
    U, Sigma, V = np.linalg.svd(A, full_matrices=False)
    print "U"
    print U
    print "Sigma"
    print Sigma
    print "V"
    print V
    print "Product\n", U * np.diag(Sigma) * V

Pseudoinverse

The Moore-Penrose pseudoinverse of a matrix can be computed with the pinv function of the numpy.linalg module (visit http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse). The pseudoinverse is calculated using the singular value decomposition.
The inv function only accepts square matrices; the pinv function does not have this restriction.
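The listings in this article use Python 2 print statements and the mat helper. As a hedged sketch for newer setups, the same checks can be written with plain arrays and the @ operator (Python 3.5 or later is assumed), followed by a pseudoinverse built by hand from the singular value decomposition. The variable names are illustrative, and the hand-built pseudoinverse assumes all singular values are non-zero, which holds for this example matrix.

```python
import numpy as np

# Inverse: A @ inv(A) should be the identity matrix.
A = np.array([[0, 1, 2], [1, 0, 3], [4, -3, 8]], dtype=float)
A_inv = np.linalg.inv(A)
print("A @ inv(A) is identity:", np.allclose(A @ A_inv, np.eye(3)))

# Linear system: solve B x = b, then compare B @ x with b.
B = np.array([[1, -2, 1], [0, 2, -8], [-4, 5, 9]], dtype=float)
b = np.array([0, 8, -9], dtype=float)
x = np.linalg.solve(B, b)
print("B @ x equals b:", np.allclose(B @ x, b))

# Eigenpairs: C v = a v must hold for every column v of eigenvectors.
C = np.array([[3, -2], [1, 0]], dtype=float)
eigenvalues, eigenvectors = np.linalg.eig(C)
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    print("Eigenpair", i, "ok:", np.allclose(C @ v, eigenvalues[i] * v))

# SVD: NumPy returns the third factor already conjugate-transposed,
# so the product of the three factors rebuilds the input directly.
M = np.array([[4, 11, 14], [8, 7, -2]], dtype=float)
U, sigma, Vh = np.linalg.svd(M, full_matrices=False)
reconstructed = U @ np.diag(sigma) @ Vh
print("SVD rebuilds M:", np.allclose(reconstructed, M))

# Pseudoinverse from the SVD: invert the singular values and
# transpose the two orthogonal factors (M is real, so .T suffices).
pinv_from_svd = Vh.T @ np.diag(1.0 / sigma) @ U.T
print("Matches np.linalg.pinv:", np.allclose(pinv_from_svd, np.linalg.pinv(M)))
```

Using np.allclose instead of printing the raw products avoids depending on floating-point rounding or on the print format of a particular NumPy version.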


Packt

13 May 2013

4 min read

OpenCV: Tracking Faces with Haar Cascades


Conceptualizing Haar cascades

When we talk about classifying objects and tracking their location, what exactly are we hoping to pinpoint? What constitutes a recognizable part of an object?

Photographic images, even from a webcam, may contain a lot of detail for our (human) viewing pleasure. However, image detail tends to be unstable with respect to variations in lighting, viewing angle, viewing distance, camera shake, and digital noise. Moreover, even real differences in physical detail might not interest us for the purpose of classification. I was taught in school that no two snowflakes look alike under a microscope. Fortunately, as a Canadian child, I had already learned how to recognize snowflakes without a microscope, as the similarities are more obvious in bulk.

Thus, some means of abstracting image detail is useful in producing stable classification and tracking results. The abstractions are called features, which are said to be extracted from the image data. There should be far fewer features than pixels, though any pixel might influence multiple features. The level of similarity between two images can be evaluated based on distances between the images' corresponding features. For example, distance might be defined in terms of spatial coordinates or color coordinates.

Haar-like features are one type of feature that is often applied to real-time face tracking. They were first used for this purpose by Paul Viola and Michael Jones in 2001. Each Haar-like feature describes the pattern of contrast among adjacent image regions. For example, edges, vertices, and thin lines each generate distinctive features.

For any given image, the features may vary depending on the regions' size, which may be called the window size. Two images that differ only in scale should be capable of yielding similar features, albeit for different window sizes. Thus, it is useful to generate features for multiple window sizes. Such a collection of features is called a cascade.
We may say a Haar cascade is scale-invariant or, in other words, robust to changes in scale. OpenCV provides a classifier and tracker for scale-invariant Haar cascades, which it expects to be in a certain file format.

Haar cascades, as implemented in OpenCV, are not robust to changes in rotation. For example, an upside-down face is not considered similar to an upright face, and a face viewed in profile is not considered similar to a face viewed from the front. A more complex and more resource-intensive implementation could improve Haar cascades' robustness to rotation by considering multiple transformations of images as well as multiple window sizes. However, we will confine ourselves to the implementation in OpenCV.

Getting Haar cascade data

As part of your OpenCV setup, you probably have a directory called haarcascades. It contains cascades that are trained for certain subjects using tools that come with OpenCV. The directory's full path depends on your system and method of setting up OpenCV, as follows:

Build from source archive: <unzip_destination>/data/haarcascades
Windows with self-extracting ZIP: <unzip_destination>/data/haarcascades
Mac with MacPorts: /opt/local/share/OpenCV/haarcascades
Mac with Homebrew: The haarcascades directory is not included; to get it, download the source archive
Ubuntu with apt or Software Center: The haarcascades directory is not included; to get it, download the source archive

If you cannot find haarcascades, then download the source archive from http://sourceforge.net/projects/opencvlibrary/files/opencv-unix/2.4.3/OpenCV-2.4.3.tar.bz2/download (or the Windows self-extracting ZIP from http://sourceforge.net/projects/opencvlibrary/files/opencvwin/2.4.3/OpenCV-2.4.3.exe/download), unzip it, and look for <unzip_destination>/data/haarcascades.
Once you find haarcascades, create a directory called cascades in the same folder as cameo.py and copy the following files from haarcascades into cascades:

haarcascade_frontalface_alt.xml
haarcascade_eye.xml
haarcascade_mcs_nose.xml
haarcascade_mcs_mouth.xml

As their names suggest, these cascades are for tracking faces, eyes, noses, and mouths. They require a frontal, upright view of the subject. With a lot of patience and a powerful computer, you can make your own cascades, trained for various types of objects.

Creating modules

We should continue to maintain good separation between application-specific code and reusable code. Let's make new modules for tracking classes and their helpers.

A file called trackers.py should be created in the same directory as cameo.py (and, equivalently, in the parent directory of cascades). Let's put the following import statements at the start of trackers.py:

    import cv2
    import rects
    import utils

Alongside trackers.py and cameo.py, let's make another file called rects.py containing the following import statement:

    import cv2

Our face tracker and a definition of a face will go in trackers.py, while various helpers will go in rects.py and our preexisting utils.py file.
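To illustrate the idea from "Conceptualizing Haar cascades" that a Haar-like feature describes the contrast among adjacent image regions, here is a minimal NumPy sketch of a single two-rectangle (vertical edge) feature. It is not OpenCV's implementation: the function names are hypothetical, and the summed-area table (integral image) used to add up rectangles quickly is a standard technique assumed here, not something the text above describes.

```python
import numpy as np


def integral_image(gray):
    """Summed-area table with a zero row and column prepended."""
    table = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    table[1:, 1:] = gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return table


def rect_sum(table, x, y, w, h):
    """Sum of the pixels in the w-by-h rectangle whose top-left is (x, y)."""
    return int(table[y + h, x + w] - table[y, x + w]
               - table[y + h, x] + table[y, x])


def vertical_edge_feature(table, x, y, w, h):
    """Two-rectangle feature: left half of the window minus right half."""
    half = w // 2
    left = rect_sum(table, x, y, half, h)
    right = rect_sum(table, x + half, y, half, h)
    return left - right


# Synthetic 8x8 image: bright left half, dark right half (a vertical edge).
edge_image = np.zeros((8, 8), dtype=np.uint8)
edge_image[:, :4] = 255

# Synthetic 8x8 image with no contrast at all.
flat_image = np.full((8, 8), 100, dtype=np.uint8)

edge_response = vertical_edge_feature(integral_image(edge_image), 0, 0, 8, 8)
flat_response = vertical_edge_feature(integral_image(flat_image), 0, 0, 8, 8)
print("Edge image response:", edge_response)
print("Flat image response:", flat_response)
```

The window with a strong vertical edge yields a large response, while the uniform window yields zero. Evaluating the same feature with different values of w and h corresponds to the multiple window sizes discussed above.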


Packt

02 May 2013

10 min read

Ten IPython essentials


Running the IPython console

If IPython has been installed correctly, you should be able to run it from a system shell with the ipython command. You can use this prompt like a regular Python interpreter as shown in the following screenshot:

Command-line shell on Windows

If you are on Windows and using the old cmd.exe shell, you should be aware that this tool is extremely limited. You could instead use a more powerful interpreter, such as Microsoft PowerShell, which is integrated by default in Windows 7 and 8. The simple fact that most common filesystem-related commands (namely, pwd, cd, ls, cp, ps, and so on) have the same name as in Unix should be a sufficient reason to switch.

Of course, IPython offers much more than that. For example, IPython ships with tens of little commands that considerably improve productivity. Some of these commands help you get information about any Python function or object. For instance, have you ever had a doubt about how to use the super function to access parent methods in a derived class? Just type super? (a shortcut for the command %pinfo super) and you will find all the information regarding the super function. Appending ? or ?? to any command or variable gives you all the information you need about it, as shown here:

    In [1]: super?
    Typical use to call a cooperative superclass method:
    class C(B):
        def meth(self, arg):
            super(C, self).meth(arg)

Using IPython as a system shell

You can use the IPython command-line interface as an extended system shell. You can navigate throughout your filesystem and execute any system command. For instance, the standard Unix commands pwd, ls, and cd are available in IPython and work on Windows too, as shown in the following example:

    In [1]: pwd
    Out[1]: u'C:\\'
    In [2]: cd windows
    C:\windows

These commands are in fact magic commands, which are central to the IPython shell.
There are dozens of magic commands and we will use a lot of them throughout this book. You can get a list of all magic commands with the %lsmagic command.

Using the IPython magic commands

Magic commands actually come with a % prefix, but the automagic system, enabled by default, allows you to conveniently omit this prefix. Using the prefix is always possible, particularly when the unprefixed command is shadowed by a Python variable with the same name. The %automagic command toggles the automagic system. In this book, we will generally use the % prefix to refer to magic commands, but keep in mind that you can omit it most of the time, if you prefer.

Using the history

Like the standard Python console, IPython offers a command history. However, unlike in Python's console, the IPython history spans your previous interactive sessions. In addition to this, several keystrokes and commands allow you to reduce repetitive typing.

In an IPython console prompt, use the up and down arrow keys to go through your whole input history. If you start typing before pressing the arrow keys, only the commands that match what you have typed so far will be shown.

In any interactive session, your input and output history is kept in the In and Out variables and is indexed by a prompt number. The _, __, ___ and _i, _ii, _iii variables contain the last three output and input objects, respectively. The _n and _in variables return the nth output and input history. For instance, let's type the following command:

    In [4]: a = 12
    In [5]: a ** 2
    Out[5]: 144
    In [6]: print("The result is {0:d}.".format(_))
    The result is 144.

In this example, we display the output of prompt 5, that is, 144, on line 6.

Tab completion

Tab completion is incredibly useful and you will find yourself using it all the time.
Whenever you start typing any command, variable name, or function, press the Tab key to let IPython either automatically complete what you are typing if there is no ambiguity, or show you the list of possible commands or names that match what you have typed so far. It also works for directories and file paths, just like in the system shell.

It is also particularly useful for dynamic object introspection. Type any Python object name followed by a dot and then press the Tab key; IPython will show you the list of existing attributes and methods, as shown in the following example:

    In [1]: import os
    In [2]: os.path.split<tab>
    os.path.split  os.path.splitdrive  os.path.splitext  os.path.splitunc

In the second line, as shown in the previous code, we press the Tab key after having typed os.path.split. IPython then displays all the possible commands.

Tab Completion and Private Variables

Tab completion shows you all the attributes and methods of an object, except those that begin with an underscore (_). The reason is that it is a standard convention in Python programming to prefix private variables with an underscore. To force IPython to show all private attributes and methods, type myobject._ before pressing the Tab key.

Nothing is really private or hidden in Python. It is part of a general Python philosophy, as expressed by the famous saying, "We are all consenting adults here."

Executing a script with the %run command

Although essential, the interactive console becomes limited when running sequences of multiple commands. Writing multiple commands in a Python script with the .py file extension (by convention) is quite common. A Python script can be executed from within the IPython console with the %run magic command followed by the script filename. The script is executed in a fresh, new Python namespace unless the -i option has been used, in which case the current interactive Python namespace is used for the execution.
In all cases, all variables defined in the script become available in the console at the end of script execution.

Let's write the following Python script in a file called script.py:

    print("Running script.")
    x = 12
    print("'x' is now equal to {0:d}.".format(x))

Now, assuming we are in the directory where this file is located, we can execute it in IPython by entering the following command:

    In [1]: %run script.py
    Running script.
    'x' is now equal to 12.
    In [2]: x
    Out[2]: 12

When running the script, the output of any print statement is displayed in the console. At the end of execution, the x variable defined in the script is then included in the interactive namespace, which is quite convenient.

Quick benchmarking with the %timeit command

You can do quick benchmarks in an interactive session with the %timeit magic command. It lets you estimate how much time the execution of a single command takes. The same command is executed multiple times within a loop, and this loop itself is repeated several times by default. The execution time of a single run is then estimated automatically from these repetitions; the output shown here reports the per-loop time of the best of the repeated loops. The -n option controls the number of executions in a loop, whereas the -r option controls the number of executed loops. For example, let's type the following command:

    In [1]: %timeit [x*x for x in range(100000)]
    10 loops, best of 3: 26.1 ms per loop

Here, it took about 26 milliseconds to compute the squares of all integers up to 100000.

Quick debugging with the %debug command

IPython ships with a powerful command-line debugger. Whenever an exception is raised in the console, use the %debug magic command to launch the debugger at the exception point. You then have access to all the local variables and to the full stack traceback in postmortem mode. Navigate up and down through the stack with the u and d commands and exit the debugger with the q command. See the list of all the available commands in the debugger by entering the ? command.
You can use the %pdb magic command to activate the automatic execution of the IPython debugger as soon as an exception is raised.

Interactive computing with Pylab

The %pylab magic command enables the scientific computing capabilities of the NumPy and matplotlib packages, namely efficient operations on vectors and matrices and plotting and interactive visualization features. It becomes possible to perform interactive computations in the console and plot graphs dynamically. For example, let's enter the following command:

    In [1]: %pylab
    Welcome to pylab, a matplotlib-based Python environment [backend: TkAgg].
    For more information, type 'help(pylab)'.
    In [2]: x = linspace(-10., 10., 1000)
    In [3]: plot(x, sin(x))

In this example, we first define a vector of 1000 values linearly spaced between -10 and 10. Then we plot the graph (x, sin(x)). A window with a plot appears as shown in the following screenshot, and the console is not blocked while this window is opened. This allows us to interactively modify the plot while it is open.

Using the IPython Notebook

The Notebook brings the functionality of IPython into the browser for multiline text-editing features, interactive session reproducibility, and so on. It is a modern and powerful way of using Python in an interactive and reproducible way.

To use the Notebook, call the ipython notebook command in a shell (make sure you have installed the required dependencies). This will launch a local web server on the default port 8888. Go to http://127.0.0.1:8888/ in a browser and create a new Notebook.

You can write one or several lines of code in the input cells.
Here are some of the most useful keyboard shortcuts:

Press the Enter key to create a new line in the cell and not execute the cell
Press Shift + Enter to execute the cell and go to the next cell
Press Alt + Enter to execute the cell and append a new empty cell right after it
Press Ctrl + Enter for quick instant experiments when you do not want to save the output
Press Ctrl + M and then the H key to display the list of all the keyboard shortcuts

Customizing IPython

You can save your user preferences in a Python file; this file is called an IPython profile. To create a default profile, type ipython profile create in a shell. This will create a folder named profile_default in the ~/.ipython or ~/.config/ipython directory. The file ipython_config.py in this folder contains preferences about IPython. You can create different profiles with different names using ipython profile create profilename, and then launch IPython with ipython --profile=profilename to use that profile.

The ~ directory is your home directory, for example, something like /home/yourname on Unix, or C:\Users\yourname or C:\Documents and Settings\yourname on Windows.

Summary

We have gone through 10 of the most interesting features offered by IPython in this article. They essentially concern the Python and shell interactive features, including the integrated debugger and profiler, and the interactive computing and visualization features brought by the NumPy and Matplotlib packages.

Resources for Article:

Further resources on this subject:
Advanced Matplotlib: Part 1 [Article]
Python Testing: Installing the Robot Framework [Article]
Running a simple game using Pygame [Article]

Packt

19 Apr 2013

15 min read

Big Data Analysis

Packt

09 Apr 2013

7 min read

Comparative Study of NoSQL Products

(For more resources related to this topic, see here.) Comparison Choosing a technology does not merely involve a technical comparison. Several other factors related to documentation, maintainability, stability and maturity, vendor support, developer community, license, price, and the future of the product or the organization behind it also play important roles. Having said that, I must also add that technical comparison should continue to play a pivotal role. We will start a deep technical comparison of the previously mentioned products and then look at the semi-technical and non-technical aspects for the same. Technical comparison From a technical perspective, we compare on the following parameters: Implementation language Engine types Speed Implementation language One of the more important factors that come into play is how can, if required, the product be extended; the programming language in which the product itself is written determines a large part of it. Some of the database may provide a different language for writing plugins but it may not always be true: Amazon SimpleDB: It is available in cloud and has a client SDK for Java, .NET, PHP, and Ruby. There are libraries for Android and iOS as well. BaseX: Written in Java. To extend, one must code in Java. Cassandra: Everything in Java. CouchDB: Written in Erlang. To extend use Erlang. Google Datastore: It is available in cloud and has SDK for Java, Python, and Go. HBase: It is Java all the way. MemcacheDB: Written in C. Uses the same language to extend. MongoDB: Written in C++. Client drivers are available in several languages including but not limited to JavaScript, Java, PHP, Python, and Ruby. Neo4j: Like several others, it is Java all the way Redis: Written in C. So you can extend using C. Great, so the first parameter itself may have helped you shortlist the products that you may be interested to use based on the developers available in your team or for hire. 
You may still be tempted to get smart people onboard and then build competency based on the choice that you make, based on subsequent dimensions. Note that for the databases written in high-level languages like Java, it may still be possible to write extensions in languages like C or C++ by using interfaces like JNI or otherwise. Amazon SimpleDB provides access via the HTTP protocol and has SDKs in multiple languages. If you do not find an SDK for yourself, say for example, in JavaScript for use with NodeJS, just write one. However, life is not as open with Google Datastore, which allows access only via its cloud platform App Engine and has SDKs only in Java, Python, and the Go languages. Since the access is provided natively from the cloud servers, you cannot do much about it. In fact, the top requested feature of the Google App Engine is support for PHP (see http://code.google.com/p/googleappengine/issues/list).

Engine types

Engine types define how you will structure the data and what data design expertise your team will need. NoSQL provides multiple options to choose from.

Database | Column oriented | Document store | Key value store | Graph
Amazon SimpleDB | No | No | Yes | No
BaseX | No | Yes | No | No
Cassandra | Yes | Yes | No | No
CouchDB | No | Yes | No | No
Google Datastore | Yes | No | No | No
HBase | Yes | No | No | No
MemcacheDB | No | No | Yes | No
MongoDB | No | Yes | No | No
Neo4j | No | No | No | Yes
Redis | No | Yes | Yes | No

You may notice two aspects of this table: a lot of No, and multiple Yes against some databases. I expect the table to be populated with a lot more Yes over the next couple of years. Specifically, I expect the open source databases written in Java to be developed and enhanced actively, providing multiple options to the developers.

Speed

One of the primary reasons for choosing a NoSQL solution is speed. Comparing and benchmarking the databases is a non-trivial task considering that each database has its own set of hardware and other configuration requirements.
Having said that, you can definitely find a whole gamut of benchmark results comparing one NoSQL database against another, with details of how the tests were executed. Of all that is available, my personal choice is the Yahoo! Cloud Serving Benchmark (YCSB) tool. It is open source and available on GitHub at https://github.com/brianfrankcooper/YCSB. It is written in Java, and clients are available for Cassandra, DynamoDB, HBase, HyperTable, MongoDB, and Redis, apart from several others that we have not discussed in this book. Before showing some results from the YCSB, I did a quick run on a couple of easy-to-set-up databases myself. I executed them without any optimizations, just to get a feel for how easily software can incorporate them without any expert help. I ran it on MongoDB on my personal box (server as well as the client on the same machine), DynamoDB connecting from a High-CPU Medium (c1.medium) box, and MySQL on the same High-CPU Medium box with both server and client on the same machine.
Detailed configurations with the results are shown as follows:

Server configuration:

Parameter | MongoDB | DynamoDB | MySQL
Processor | 5 EC2 Compute Units | N/A | 5 EC2 Compute Units
RAM | 1.7 GB with Apache HTTP server running (effective free: 200 MB, after database is up and running) | N/A | 1.7 GB with Apache HTTP server running (effective free: 500 MB, after database is up and running)
Hard disk | Non-SSD | N/A | Non-SSD
Network configuration | N/A | US-East-1 | N/A
Operating system | Ubuntu 10.04, 64 bit | N/A | Ubuntu 10.04, 64 bit
Database version | 1.2.2 | N/A | 5.1.41
Configuration | Default | Max write: 500, Max read: 500 | Default

Client configuration:

Parameter | MongoDB | DynamoDB | MySQL
Processor | 5 EC2 Compute Units | 5 EC2 Compute Units | 5 EC2 Compute Units
RAM | 1.7 GB with Apache HTTP server running (effective free: 200 MB, after database is up and running) | 1.7 GB with Apache HTTP server running (effective free: 500 MB, after database is up and running) | 1.7 GB with Apache HTTP server running (effective free: 500 MB, after database is up and running)
Hard disk | Non-SSD | Non-SSD | Non-SSD
Network configuration | Same machine as server | US-East-1 | Same machine as server
Operating system | Ubuntu 10.04, 64 bit | Ubuntu 10.04, 64 bit | Ubuntu 10.04, 64 bit
Record count | 1,000,000 | 1,000 | 1,000,000
Max connections | 1 | 5 | 1
Operation count (workload a) | 1,000,000 | 1,000 | 1,000,000
Operation count (workload f) | 1,000,000 | 100,000 | 1,000,000

Results:

Workload | Parameter | MongoDB | DynamoDB | MySQL
Workload-a (load) | Total time | 290 seconds | 16 seconds | 300 seconds
Workload-a (load) | Speed (operations/second) | 2363 to 4180 (approximately 3700), bump at 1278 | 50 to 82 | 3135 to 3517 (approximately 3300)
Workload-a (load) | Insert latency | 245 to 416 microseconds (approximately 260), bump at 875 microseconds | 12 to 19 milliseconds | 275 to 300 microseconds (approximately 290)
Workload-a (run) | Total time | 428 seconds | 17 seconds | 240 seconds
Workload-a (run) | Speed | 324 to 4653 | 42 to 78 | 3970 to 4212
Workload-a (run) | Update latency | 272 to 2946 microseconds | 13 to 23.7 microseconds | 219 to 225.5 microseconds
Workload-a (run) | Read latency | 112 to 5358 microseconds | 12.4 to 22.48 microseconds | 240.6 to 248.9 microseconds
Workload-f (load) | Total time | 286 seconds | Did not execute | 295 seconds
Workload-f (load) | Speed | 3708 to 4200 | | 3254 to 3529
Workload-f (load) | Insert latency | 228 to 265 microseconds | | 275 to 299 microseconds
Workload-f (run) | Total time | 412 seconds | Did not execute | 1022 seconds
Workload-f (run) | Speed | 192 to 4146 | | 224 to 2096
Workload-f (run) | Update latency | 219 to 336 microseconds | | 216 to 233 microseconds, with two bursts at 600 and 2303 microseconds
Workload-f (run) | Read latency | 119 to 5701 microseconds | | 1360 to 8246 microseconds
Workload-f (run) | Read Modify Write (RMW) latency | 346 to 9170 microseconds | | 1417 to 14648 microseconds

Do not read too much into these numbers as they are a result of the default configuration, out-of-the-box setup without any optimizations.

Some of the results from YCSB published by Brian F. Cooper (http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf) are shown next.

For update-heavy, 50-50 read-update:

For read-heavy, under varying hardware:

There are some more from Sergey Sverchkov at Altoros (http://altoros.com/nosql-research) who published their white paper recently.

Summary

In this article, we did a detailed comparative study of ten NoSQL databases on few parameters, both technical and non-technical.

Resources for Article:

Further resources on this subject:
Getting Started with CouchDB and Futon [Article]
Ruby with MongoDB for Web Development [Article]
An Introduction to Rhomobile [Article]

Packt

08 Apr 2013

6 min read

Advanced Hadoop MapReduce Administration

(For more resources related to this topic, see here.)

Tuning Hadoop configurations for cluster deployments

Getting ready

Shut down the Hadoop cluster if it is already running, by executing the bin/stop-dfs.sh and bin/stop-mapred.sh commands from HADOOP_HOME.

How to do it...

We can control Hadoop configurations through the following three configuration files:

conf/core-site.xml: This contains the configurations common to the whole Hadoop distribution
conf/hdfs-site.xml: This contains configurations for HDFS
conf/mapred-site.xml: This contains configurations for MapReduce

Each configuration file has name-value pairs expressed in an XML format, and they define the workings of different aspects of Hadoop. The following code snippet shows an example of a property in the configuration file. Here, the <configuration> tag is the top-level XML container, and the <property> tags that define individual properties go as child elements of the <configuration> tag.

    <configuration>
      <property>
        <name>mapred.reduce.parallel.copies</name>
        <value>20</value>
      </property>
      ...
    </configuration>

The following instructions show how to change the directory to which we write Hadoop logs and configure the maximum number of map and reduce tasks:

1. Create a directory to store the logfiles. For example, /root/hadoop_logs.
2. Uncomment the line that includes HADOOP_LOG_DIR in HADOOP_HOME/conf/hadoop-env.sh and point it to the new directory.
3. Add the following lines to the HADOOP_HOME/conf/mapred-site.xml file:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>

4. Restart the Hadoop cluster by running the bin/stop-mapred.sh and bin/start-mapred.sh commands from the HADOOP_HOME directory.
5. You can verify the number of processes created using OS process monitoring tools. On Linux, run the watch "ps -ef | grep hadoop" command. On Windows, use the Task Manager; on macOS, use Activity Monitor.
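The name-value structure described above can also be generated programmatically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name build_site_config is hypothetical and not part of Hadoop. It only builds the XML text and does not write to any Hadoop configuration file.

```python
import xml.etree.ElementTree as ET

def build_site_config(properties):
    """Return a <configuration> element with one <property> child per entry."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return root

# The two task limits set in this recipe.
config = build_site_config({
    "mapred.tasktracker.map.tasks.maximum": 2,
    "mapred.tasktracker.reduce.tasks.maximum": 2,
})
xml_text = ET.tostring(config, encoding="unicode")
print(xml_text)
```

Generating the file this way avoids stray whitespace inside <value> elements, which is easy to introduce when editing by hand.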
How it works...

HADOOP_LOG_DIR redefines the location to which Hadoop writes its logs. The mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties define the maximum number of map and reduce tasks that can run within a single TaskTracker at a given moment.

These and other server-side parameters are defined in the HADOOP_HOME/conf/*-site.xml files. Hadoop reloads these configurations after a restart.

There's more...

There are many similar configuration properties defined in Hadoop. You can see some of them in the following tables.

The configuration properties for conf/core-site.xml are listed in the following table:

Name | Default value | Description
fs.inmemory.size.mb | 100 | This is the amount of memory allocated to the in-memory filesystem that is used to merge map outputs at reducers in MBs.
io.sort.factor | 100 | This is the maximum number of streams merged while sorting files.
io.file.buffer.size | 131072 | This is the size of the read/write buffer used by sequence files.

The configuration properties for conf/mapred-site.xml are listed in the following table:

Name | Default value | Description
mapred.reduce.parallel.copies | 5 | This is the maximum number of parallel copies the reduce step will execute to fetch output from many parallel jobs.
mapred.map.child.java.opts | -Xmx200M | This is for passing Java options into the map JVM.
mapred.reduce.child.java.opts | -Xmx200M | This is for passing Java options into the reduce JVM.
io.sort.mb | 200 | The memory limit while sorting data in MBs.

The configuration properties for conf/hdfs-site.xml are listed in the following table:

Name | Default value | Description
dfs.block.size | 67108864 | This is the HDFS block size.
dfs.namenode.handler.count | 40 | This is the number of server threads to handle RPC calls in the NameNode.

Running benchmarks to verify the Hadoop installation

The Hadoop distribution comes with several benchmarks. We can use them to verify our Hadoop installation and measure Hadoop's performance.
This recipe introduces these benchmarks and explains how to run them.

Getting ready

Start the Hadoop cluster. You can run these benchmarks either on a cluster setup or on a pseudo-distributed setup.

How to do it...

Let us run the sort benchmark. The sort benchmark consists of two jobs. First, we generate some random data using the randomwriter Hadoop job and then sort it using the sort sample.

1. Change the directory to HADOOP_HOME.
2. Run the randomwriter Hadoop job using the following command:

    >bin/hadoop jar hadoop-examples-1.0.0.jar randomwriter -Dtest.randomwrite.bytes_per_map=100 -Dtest.randomwriter.maps_per_host=10 /data/unsorted-data

   Here the two parameters, test.randomwrite.bytes_per_map and test.randomwriter.maps_per_host, specify the size of data generated by a map and the number of maps respectively.
3. Run the sort program:

    >bin/hadoop jar hadoop-examples-1.0.0.jar sort /data/unsorted-data /data/sorted-data

4. Verify the final results by running the following command:

    >bin/hadoop jar hadoop-test-1.0.0.jar testmapredsort -sortInput /data/unsorted-data -sortOutput /data/sorted-data

Finally, when everything is successful, the following message will be displayed:

    The job took 66 seconds.
    SUCCESS! Validated the MapReduce framework's 'sort' successfully.

How it works...

First, the randomwriter application runs a Hadoop job to generate random data that can be used by the second sort program. Then, we verify the results through the testmapredsort job. If your computer has more capacity, you may run the initial randomwriter step with increased output sizes.

There's more...

Hadoop includes several other benchmarks.
TestDFSIO: This tests the input/output (I/O) performance of HDFS
nnbench: This checks the NameNode hardware
mrbench: This runs many small jobs
TeraSort: This sorts one terabyte of data

More information about these benchmarks can be found at http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/.

Reusing Java VMs to improve the performance

In its default configuration, Hadoop starts a new JVM for each map or reduce task. However, running multiple tasks from the same JVM can sometimes significantly speed up the execution. This recipe explains how to control this behavior.

How to do it...

Run the WordCount sample by passing the following option as an argument:

    >bin/hadoop jar hadoop-examples-1.0.0.jar wordcount -Dmapred.job.reuse.jvm.num.tasks=-1 /data/input1 /data/output1

Monitor the number of processes created by Hadoop (through the ps -ef | grep hadoop command on Unix or the Task Manager on Windows). Hadoop starts only a single JVM per task slot and then reuses it for an unlimited number of tasks in the job.

However, passing arguments through the -D option only works if the job implements the org.apache.hadoop.util.Tool interface. Otherwise, you should set the option through the JobConf.setNumTasksToExecutePerJvm(-1) method.

How it works...

By setting the job configuration property mapred.job.reuse.jvm.num.tasks, we can control how many tasks each JVM started by Hadoop runs. When the value is set to -1, Hadoop reuses the same JVM for an unlimited number of tasks.

Packt

05 Apr 2013

10 min read

Line, Area, and Scatter Charts

(For more resources related to this topic, see here.) Introducing line charts First let's start with a single series line chart. We will use one of the many data provided by The World Bank organization at www.worldbank.org. The following is the code snippet to create a simple line chart which shows the percentage of population ages, 65 and above, in Japan for the past three decades: var chart = new Highcharts.Chart({chart: {renderTo: 'container'},title: {text: 'Population ages 65 and over (% of total)',},credits: {position: {align: 'left',x: 20},text: 'Data from The World Bank'},yAxis: {title: {text: 'Percentage %'}},xAxis: {categories: ['1980', '1981','1982', ... ],labels: {step: 5}},series: [{name: 'Japan - 65 and over',data: [ 9, 9, 9, 10, 10, 10, 10 ... ]}]}); The following is the display of the simple chart: Instead of specifying the year number manually as strings in categories, we can use the pointStart option in the series config to initiate the x-axis value for the first point. So we have an empty xAxis config and series config, as follows: xAxis: {},series: [{pointStart: 1980,name: 'Japan - 65 and over',data: [ 9, 9, 9, 10, 10, 10, 10 ... ]}] Although this simplifies the example, the x-axis labels are automatically formatted by Highcharts utility method, numberFormat, which adds a comma after every three digits. The following is the outcome on the x axis: To resolve the x-axis label, we overwrite the label's formatter option by simply returning the value to bypass the numberFormat method being called. Also we need to set the allowDecimals option to false. The reason for that is when the chart is resized to elongate the x axis, decimal values are shown. The following is the final change to use pointStart for the year values: xAxis: {labels:{formatter: function() {// 'this' keyword is the label objectreturn this.value;}},allowDecimals: false},series: [{pointStart: 1980,name: 'Japan - 65 and over',data: [ 9, 9, 9, 10, 10, 10, 10 ... 
]}]

Extending to multiple series line charts

We can include several more line series and highlight the Japan series by increasing its line width to 6 pixels, as follows:

    series: [{
        lineWidth: 6,
        name: 'Japan',
        data: [ 9, 9, 9, 10, 10, 10, 10 ... ]
    }, {
        name: 'Singapore',
        data: [ 5, 5, 5, 5, ... ]
    }, {
        ...
    }]

The line series for the Japanese population becomes the focus in the chart, as shown in the following screenshot:

Let's move on to a more complicated line graph. For the sake of demonstrating inverted line graphs, we use the chart.inverted option to flip the y and x axes to opposite orientations. Then we change the line colors of the axes to match the series colors. We also disable data point markers for all the series and finally align the second series to the second entry in the y-axis array, as follows:

    chart: {
        renderTo: 'container',
        inverted: true
    },
    yAxis: [{
        title: { text: 'Percentage %' },
        lineWidth: 2,
        lineColor: '#4572A7'
    }, {
        title: { text: 'Age' },
        opposite: true,
        lineWidth: 2,
        lineColor: '#AA4643'
    }],
    plotOptions: {
        series: {
            marker: { enabled: false }
        }
    },
    series: [{
        name: 'Japan - 65 and over',
        type: 'spline',
        data: [ 9, 9, 9, ... ]
    }, {
        name: 'Japan - Life Expectancy',
        yAxis: 1,
        data: [ 76, 76, 77, ... ]
    }]

The following is the inverted graph with double y axes:

The data representation of the chart may look slightly odd as the usual time labels are swapped to the y axis and the data trend is awkward to comprehend. The inverted option is normally used for showing data in a noncontinuous form and in bar format. If we interpret the data from the graph, in 1990, 12 percent of the population was 65 and over, and the life expectancy was 79.

Setting plotOptions.series.marker.enabled to false switches off all the data point markers. If we want to display a point marker for a particular series, we can either switch off the marker globally and then set the marker on an individual series, or the other way round.
plotOptions: {series: {marker: {enabled: false}}},series: [{marker: {enabled: true},name: 'Japan - 65 and over',type: 'spline',data: [ 9, 9, 9, ... ]}, { The following graph demonstrates that only the 65 and over series has point markers: Sketching an area chart In this section, we are going to use our very first example and turn it into a more stylish graph (based on the design of wind energy poster by Kristin Clute), which is an area spline chart. An area spline chart is generated using the combined properties of area and spline charts. The main data line is plotted as a spline curve and the region underneath the line is filled in a similar color with a gradient and an opaque style. Firstly, we want to make the graph easier for viewers to look up the values for the current trend, so we move the y axis next to the latest year, that is, to the opposite side of the chart: yAxis: { ....opposite:true} The next thing is to remove the interval lines and have a thin axis line along the y axis: yAxis: { ....gridLineWidth: 0,lineWidth: 1,} Then we simplify the y-axis title with a percentage sign and align it to the top of the axis: yAxis: { ....title: {text: '(%)',rotation: 0,x: 10,y: 5,align: 'high'},} As for the x axis, we thicken the axis line with a red color and remove the interval ticks: xAxis: { ....lineColor: '#CC2929',lineWidth: 4,tickWidth: 0,offset: 2} For the chart title, we move the title to the right of the chart, increase the margin between the chart and the title, and then adopt a different font for the title: title: {text: 'Population ages 65 and over (% of total) -Japan ',margin: 40,align: 'right',style: {fontFamily: 'palatino'}} After that we are going to modify the whole series presentation, we first set the chart.type property from 'line' to 'areaspline'. Notice that setting the properties inside this series object will overwrite the same properties defined in plotOptions.areaspline and so on in plotOptions.series. 
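The override order just described (series settings over plotOptions.areaspline, which in turn override plotOptions.series) can be modeled in a few lines. This is only a sketch of the precedence rule written in Python, not Highcharts code: the function effective_options is hypothetical, the lineWidth values are made up for illustration, and the merge is shallow, whereas nested option objects would need a recursive merge.

```python
def effective_options(plot_options, series):
    """Resolve one series' options, applying the lowest precedence first."""
    merged = {}
    merged.update(plot_options.get("series", {}))        # defaults for every series
    merged.update(plot_options.get(series["type"], {}))  # defaults for this chart type
    merged.update(series)                                # the series' own settings win
    return merged

# Hypothetical settings, chosen only to show which level wins.
plot_options = {
    "series": {"lineWidth": 2, "showInLegend": True},
    "areaspline": {"lineWidth": 4},
}
series = {"type": "areaspline", "showInLegend": False, "lineColor": "#145252"}

resolved = effective_options(plot_options, series)
print(resolved["lineWidth"], resolved["showInLegend"])
```

Here the line width comes from the type-level defaults, while showInLegend is taken from the series itself because it is defined there.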
Since so far there is only one series in the graph, there is no need to display the legend box. We can disable it with the showInLegend property. We then smarten the area part with gradient color and the spline with a darker color: series: [{showInLegend: false,lineColor: '#145252',fillColor: {linearGradient: {x1: 0, y1: 0,x2: 0, y2: 1},stops:[ [ 0.0, '#248F8F' ] ,[ 0.7, '#70DBDB' ],[ 1.0, '#EBFAFA' ] ]},data: [ ... ]}] After that, we introduce a couple of data labels along the line to indicate that the ranking of old age population has increased over time. We use the values in the series data array corresponding to the year 1995 and 2010, and then convert the numerical value entries into data point objects. Since we only want to show point markers for these two years, we turn off markers globally in plotOptions.series. marker.enabled and set the marker on, individually inside the point objects accompanied with style settings: plotOptions: {series: {marker: {enabled: false}}},series: [{ ...,data:[ 9, 9, 9, ...,{ marker: {radius: 2,lineColor: '#CC2929',lineWidth: 2,fillColor: '#CC2929',enabled: true},y: 14}, 15, 15, 16, ... ]}] We then set a bounding box around the data labels with round corners (borderRadius) in the same border color (borderColor) as the x axis. The data label positions are then finely adjusted with the x and y options. Finally, we change the default implementation of the data label formatter. Instead of returning the point value, we print the country ranking. series: [{ ...,data:[ 9, 9, 9, ...,{ marker: {...},dataLabels: {enabled: true,borderRadius: 3,borderColor: '#CC2929',borderWidth: 1,y: -23,formatter: function() {return "Rank: 15th";}},y: 14}, 15, 15, 16, ... ]}] The final touch is to apply a gray background to the chart and add extra space into spacingBottom. The extra space for spacingBottom is to avoid the credit label and x-axis label getting too close together, because we have disabled the legend box. 
chart: {renderTo: 'container',spacingBottom: 30,backgroundColor: '#EAEAEA'}, When all these configurations are put together, it produces the exact chart, as shown in the screenshot at the start of this section. Mixing line and area series In this section we are going to explore different plots including line and area series together, as follows: Projection chart, where a single trend line is joined with two series in different line styles Plotting an area spline chart with another step line series Exploring a stacked area spline chart, where two area spline series are stacked on top of each other Simulating a projection chart The projection chart has spline area with the section of real data and continues in a dashed line with projection data. To do that we separate the data into two series, one for real data and the other for projection data. The following is the series configuration code for the future data up to 2024. This data is based on the National Institute of Population and Social Security Research report (http://www.ipss.go.jp/pp-newest/e/ppfj02/ppfj02.pdf). series: [{name: 'project data',type: 'spline',showInLegend: false,lineColor: '#145252',dashStyle: 'Dash',data: [ [ 2010, 23 ], [ 2011, 22.8 ],... [ 2024, 28.5 ] ]}] The future series is configured as a spline in a dashed line style and the legend box is disabled, because we want to show both series as being from the same series. Then we set the future (second) series color the same as the first series. The final part is to construct the series data. As we specify the x-axis time data with the pointStart property, we need to align the projection data after 2010. 
There are two approaches that we can use to specify the time data in a continuous form, as follows: Insert null values into the second series data array for padding to align with the real data series Specify the second series data in tuples, which is an array with both time and projection data Next we are going to use the second approach because the series presentation is simpler. The following is the screenshot only for the future data series: The real data series is exactly the same as the graph in the screenshot at the start of the Sketching an area chart section, except without the point markers and data label decorations. The next step is to join both series together, as follows: series: [{name: 'real data',type: 'areaspline',....}, {name: 'project data',type: 'spline',....}] Since there is no overlap between both series data, they produce a smooth projection graph: Contrasting spline with step line In this section we are going to plot an area spline series with another line series but in a step presentation. The step line transverses vertically and horizontally only according to the changes in series data. It is generally used for presenting discrete data, that is, data without continuous/gradual movement. For the purpose of showing a step line, we will continue from the first area spline example. First of all, we need to enable the legend by removing the disabled showInLegend setting and also remove dataLabels in the series data. Next is to include a new series, Ages 0 to 14, in the chart with a default line type. Then we will change the line style slightly differently into steps. 
The following is the configuration for both series:

    series: [{
        name: 'Ages 65 and over',
        type: 'areaspline',
        lineColor: '#145252',
        pointStart: 1980,
        fillColor: {
            ....
        },
        data: [ 9, 9, 9, 10, ...., 23 ]
    }, {
        name: 'Ages 0 to 14',
        // default type is line series
        step: true,
        pointStart: 1980,
        data: [ 24, 23, 23, 23, 22, 22, 21, 20, 20, 19, 18, 18, 17, 17, 16, 16, 16, 15, 15, 15, 15, 14, 14, 14, 14, 14, 14, 14, 14, 13, 13 ]
    }]

The following screenshot shows the second series in the stepped line style:

Packt

04 Apr 2013

6 min read

Obtaining a binary backup

Getting ready

Next we need to modify the postgresql.conf file for our database to run in the proper mode for this type of backup. Change the following configuration variables:

    wal_level = archive
    max_wal_senders = 5

Then we must allow a superuser to connect to the replication database, which is used by pg_basebackup. We do that by adding the following line to pg_hba.conf:

    local replication postgres peer

Finally, restart the database instance to apply the changes.

How to do it...

Though it is only one command, pg_basebackup requires at least one switch to obtain a binary backup, as shown in the following step:

Execute the following command to create the backup in a new directory named db_backup:

    $> pg_basebackup -D db_backup -x

How it works...

For PostgreSQL, WAL stands for Write Ahead Log. By changing wal_level to archive, those logs are written in a format compatible with pg_basebackup and other replication-based tools. By increasing max_wal_senders from the default of zero, the database will allow tools to connect and request data files. In this case, up to five streams can request data files simultaneously. This maximum should be sufficient for all but the most advanced systems.

The pg_hba.conf file is essentially a connection access control list (ACL). Since pg_basebackup uses the replication protocol to obtain data files, we need to allow local connections to request replication.

Next, we send the backup itself to a directory (-D) named db_backup. This directory will effectively contain a complete copy of the binary files that make up the database.

Finally, we added the -x flag to include transaction logs (xlogs), which the database will require to start, if we want to use this backup. When we get into more complex scenarios, we will exclude this option, but for now, it greatly simplifies the process.

There's more...

The pg_basebackup tool is actually fairly complicated. There is a lot more involved under the hood.
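Before running pg_basebackup, it can help to confirm that the two settings from the Getting ready section are in place. The following is a rough sketch of such a check in Python. The helper names parse_conf and ready_for_basebackup are hypothetical, the sample text is a made-up excerpt, and the parser is deliberately simplified: it handles only plain key = value lines and # comments, not quoting rules or include directives.

```python
def parse_conf(text):
    """Parse simple `key = value` lines, ignoring blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("'")
    return settings

def ready_for_basebackup(settings):
    """True when the two settings changed in this recipe are in place."""
    return (settings.get("wal_level") == "archive"
            and int(settings.get("max_wal_senders", "0")) > 0)

# Hypothetical excerpt of postgresql.conf, for illustration only.
sample = """
# replication-related settings
wal_level = archive
max_wal_senders = 5   # up to five concurrent streams
"""
settings = parse_conf(sample)
print(settings, ready_for_basebackup(settings))
```

A check like this only inspects the file on disk; the running instance still has to be restarted before the new values take effect.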
Viewing backup progress

For manually invoked backups, we may want to know how long the process might take, and its current status. Luckily, pg_basebackup has a progress indicator, which is enabled with the -P switch as follows:

    $> pg_basebackup -P -D db_backup

Like many of the other switches, -P can be combined with tape archive format, standalone backups, database clones, and so on. This is clearly not necessary for automated backup routines, but could be useful for one-off backups monitored by an administrator.

Compressed tape archive backups

Many binary backup files come in the TAR (Tape Archive) format, which we can activate using the -F flag and setting it to t for TAR. Several Unix backup tools can directly process this type of backup, and most administrators are familiar with it.

If we want compressed output, which is especially useful for large databases, we can set the -z flag. For our sample database, we should see almost a 20x compression ratio. Try the following command:

    $> pg_basebackup -Ft -z -D db_backup

The backup file itself will be named base.tar.gz within the db_backup directory, reflecting its status as a compressed tape archive. In case the database contains extra tablespaces, each becomes a separate compressed archive. Each file can be extracted to a separate location, such as a different set of disks, for very complicated database instances.

For the sake of this example, we ignored the possible presence of tablespaces other than the default pg_default included in every installation. User-created tablespaces will greatly complicate your backup process.

Making the backup standalone

By specifying -x, we tell the database that we want a "complete" backup. This means we could extract or copy the backup anywhere and start it as a fully qualified database. As we mentioned before, the flag means that you want to include transaction logs, which is how the database recovers from crashes, checks integrity, and performs other important tasks.
The following is the command again, for reference:

$> pg_basebackup -x -D db_backup

When combined with the TAR output format and compression, standalone binary backups are perfect for archiving to tape for later retrieval, as each backup is compressed and self-contained. By default, pg_basebackup does not include transaction logs, because many (possibly most) administrators back these up separately. These files have multiple uses, and putting them in the basic backup would duplicate efforts and make backups larger than necessary.

We include them at this point because it is still too early for such complicated scenarios. We will get there eventually, of course.

Database clones

Because pg_basebackup operates through PostgreSQL's replication protocol, it can execute remotely. For instance, if the database was on a server named Production, and we wanted a copy on a server named Recovery, we could execute the following command from Recovery:

$> pg_basebackup -h Production -x -D /full/db/path

For this to work, pg_hba.conf on Production would also need a line that admits replication connections from Recovery:

host replication postgres Recovery trust

Though we set the authentication method to trust, this is not recommended for a production server installation. However, it is sufficient to allow Recovery to copy all data from Production. With the -x flag, it also means that the copy on Recovery can be started and kept online in case of emergency. It is both a backup and a server that is ready to run.

Parallel compression

Compression is very CPU intensive, but there are some utilities capable of threading the process. Tools such as pbzip2 or pigz can do the compression instead. Unfortunately, this only works in the case of a single tablespace (the default one; if you create more, this will not work).
The following is the command for compression using pigz:

$> pg_basebackup -Ft -D - | pigz -p 4 > db_backup.tar.gz

It uses four compression threads (the -p option of pigz), and sets the backup directory to standard output (-) so that pigz can process the output itself.

Summary

In this article, we saw the process of obtaining a binary backup. Although this process is more complex and tedious, it is also much faster.

Further resources on this subject:

Introduction to PostgreSQL 9
Backup in PostgreSQL 9
Recovery in PostgreSQL 9
