How-To Tutorials - Data


Implementing persistence in Redis (Intermediate)

Packt
06 Jun 2013
10 min read
Getting ready

Redis provides configuration settings for persistence and for the durability of data; which ones you use depends on whether durability of the data is critical to your project or not.

You can achieve persistence of data using the snapshotting mode, which is the simplest mode in Redis. Depending on the configuration, Redis saves a dump of all the data sets in its memory into a single RDB file. The interval at which Redis dumps the memory can be configured to be every X seconds or after Y operations.

Consider an example of a moderately busy server that receives 15,000 changes every minute over its 1 GB data set in memory. Based on the snapshotting rule, the data will be stored every 60 seconds or whenever there are at least 15,000 writes. So the snapshotting runs every minute and writes the entire 1 GB of data to the disk, which soon turns ugly and very inefficient.

To solve this particular problem, Redis provides another form of persistence, the append-only file (AOF), which is the main persistence option in Redis. This is similar to journal files, where all the operations performed are recorded and replayed in the same order to rebuild the exact state. Redis's AOF persistence supports three different modes:

No fsync: In this mode, we take a chance and let the operating system decide when to flush the data. This is the fastest of the three modes.

fsync every second: This mode is a compromise between performance and durability. Data will be flushed using fsync every second. If the disk is not able to match the write speed, the fsync can take more than a second, in which case Redis delays the write by up to another second. So this mode guarantees that a write is committed to the OS buffers and transferred to the disk within two seconds in the worst-case scenario.

fsync always: This is the last and safest mode. It provides complete durability of data at a heavy cost to performance. In this mode, the data needs to be written to the file and synced with the disk using fsync before the client receives an acknowledgment. This is the slowest of all three modes.

How to do it...

First let us see how to configure snapshotting, followed by the append-only file method:

1. In Redis, we can configure when a new snapshot of the data set will be taken. For example, Redis can be configured to dump the memory if the last dump was created more than 30 seconds ago and at least 100 keys have been modified or created. Snapshotting is configured in the /etc/redis/6379.conf file. The configuration can be as follows:

    save 900 1
    save 60 10000

   The first line translates to taking a snapshot of the data after 900 seconds if at least one key has changed, while the second line translates to snapshotting every 60 seconds if 10,000 keys have been modified in the meantime.

2. The configuration parameter rdbcompression defines whether the RDB file is to be compressed or not. There is a trade-off between CPU usage and the size of the RDB dump file.

3. We are interested in changing the dump's filename using the dbfilename parameter. Redis uses the current folder to create the dump files; for convenience, it is advisable to store the RDB file in a separate folder:

    dbfilename redis-snapshot.rdb
    dir /var/lib/redis/
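The same settings can also be inspected and adjusted on a running server. The following sketch uses the redis-py client (an assumption; any Redis client exposing CONFIG GET/SET works) to read the current save rules, apply the rules shown above, and force a background dump:

    # Sketch: inspect and adjust snapshot settings at runtime (assumes redis-py and a local server).
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Show the save rules currently in effect, e.g. {'save': '900 1 60 10000'}
    print(r.config_get("save"))

    # Apply the rules discussed above: 900 s / 1 change and 60 s / 10000 changes.
    # Note: CONFIG SET changes the running server only; persist the rules in 6379.conf as well.
    r.config_set("save", "900 1 60 10000")

    # Force an immediate snapshot in the background and report when the last dump completed.
    r.bgsave()
    print("last successful save:", r.lastsave())

This complements, rather than replaces, the file-based configuration: settings changed with CONFIG SET are lost on restart unless they are also written to the configuration file.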
4. Let us run a small test to make sure the RDB dump is working. Start the server again and connect to it using redis-cli, as we did already. To test whether snapshotting is working, issue the following commands:

    SET Key Value
    SAVE

   After the SAVE command, a file named redis-snapshot.rdb should be created in the folder /var/lib/redis. This confirms that our installation is able to take a snapshot of our data into a file.

Now let us see how to configure persistence in Redis using the AOF method:

1. The configuration for persistence through AOF also goes into the same file, located at /etc/redis/6379.conf. By default, the append-only mode is not enabled. Enable it using the appendonly parameter:

    appendonly yes

2. If you would like to specify a filename for the AOF log, uncomment the line and change the filename:

    appendfilename redis-aof.aof

3. The appendfsync everysec setting provides a good balance between performance and durability:

    appendfsync everysec

4. Redis needs to know when it has to rewrite the AOF file. This is decided based on two configuration parameters:

    auto-aof-rewrite-percentage 100
    auto-aof-rewrite-min-size 64mb

   The AOF rewrite is performed only once the file has reached the minimum size and has grown by at least 100 percent since the last rewrite.

How it works...

First let us see how snapshotting works. When one of the criteria is met, Redis forks the process. The child process starts writing the RDB file to the disk, in the folder specified in our configuration file. Meanwhile, the parent process continues to serve requests. The problem with this approach is that the parent process stores, in extra memory, any keys that change while the child is taking the snapshot. In the worst-case scenario, if all the keys are modified, the memory usage spikes to roughly double.

Caution: be aware that the bigger the RDB file, the longer it takes Redis to restore the data on startup.

Corruption of the RDB file is not possible, as it is created by the child process in an append-only fashion from the data in Redis's memory. The new RDB file is created as a temporary file and is then renamed to the destination file using the atomic rename system call once the dump is completed.

AOF's working is simple. Every time a write operation is performed, the command gets logged into a logfile. The format used in the logfile is the same as the format clients use to communicate with the server. This makes AOF files easy to parse, which brings in the possibility of replaying the operations in another Redis instance. Only the operations that change the data set are written to the log. This log is used on startup to reconstruct the exact data.

As we are continuously writing operations into the log, the AOF file grows in proportion to the number of operations performed. So, usually, the size of the AOF file is larger than the RDB dump. Redis manages the increasing size of the log by periodically compacting the file in a non-blocking manner. For example, say a specific key, key1, has changed 100 times using the SET command. To recreate the final state, only the last SET command is required; we do not need information about the previous 99 SET commands. This might look simple in theory, but it gets complex when dealing with complex data structures and operations such as union and intersection. Due to this complexity, it becomes very difficult to compact the existing file. To reduce the complexity of compacting the AOF, Redis starts with the data in memory and rewrites the AOF file from scratch.
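The rewrite can also be observed and requested from a client. The sketch below (again using redis-py, as an assumption) reads the AOF statistics from INFO persistence and asks for a background rewrite once the file has grown past the thresholds configured above; the 64 MB / 100 percent figures simply mirror that configuration:

    # Sketch: monitor AOF growth and trigger a rewrite (assumes redis-py and AOF enabled).
    import redis

    r = redis.Redis(host="localhost", port=6379)
    info = r.info("persistence")

    if info.get("aof_enabled"):
        base = info.get("aof_base_size", 0)       # size right after the last rewrite
        current = info.get("aof_current_size", 0)
        # Re-implement the auto-aof-rewrite-percentage 100 / min-size 64mb rule by hand.
        if current >= 64 * 1024 * 1024 and base and current >= 2 * base:
            r.bgrewriteaof()    # ask Redis to compact the AOF in the background
            print("rewrite requested; current size:", current)
        else:
            print("no rewrite needed; current size:", current)
    else:
        print("AOF is not enabled on this server")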
The rewrite itself is similar to the snapshotting method: Redis forks a child process that recreates the AOF file and then performs an atomic rename to swap the old file with the new one. The same requirement for extra memory, to hold operations performed during the rewrite, is present here. So memory usage can spike up to two times, depending on the operations performed while the AOF file is being rewritten.

There's more...

Both snapshotting and AOF have their own advantages and limitations, which makes it ideal to use both at the same time. Let us now discuss the major advantages and limitations of the snapshotting method.

Advantages of snapshotting

The advantages of configuring snapshotting in Redis are as follows:

RDB is a single compact file that cannot get corrupted, due to the way it is created. It is very easy to implement.

The dump file is perfect for taking backups and for disaster recovery of remote servers. The RDB file can simply be copied and saved for future recoveries.

This approach has little or no influence on performance, as the only work the parent process needs to perform is forking a child process. The parent process never performs any disk operations; they are all performed by the child process.

As an RDB file can be compressed, it provides a faster restart when compared to the append-only file method.

Limitations of snapshotting

Snapshotting, in spite of the advantages mentioned, has a few limitations that you should be aware of:

The periodic background save can result in significant loss of data in case of a server or hardware failure.

The fork() call used to save the data might take a moment, during which the server will stop serving clients. The larger the data set to be saved, the longer the fork() call takes to complete.

The memory needed for the data set might double in the worst-case scenario, when all the keys in memory are modified while snapshotting is in progress.

What should we use?

Now that we have discussed both the modes of persistence Redis provides us with, the big question is: what should we use? The answer is entirely based on our application and requirements. In cases where we expect good durability, both snapshotting and AOF can be turned on and made to work in unison, providing us with redundant persistence. Redis always restores the data from the AOF wherever applicable, as it is supposed to have better durability with little loss of data. Both RDB and AOF files can be copied and stored for future use or for recovering another instance of Redis.

In a few cases, where performance is very critical, memory usage is limited, and persistence is also paramount, persistence can be turned off completely. In these cases, replication can be used to get durability. Replication is a process in which two Redis instances, one master and one slave, are kept in sync with the same data. Clients are served by the master, and the master server syncs the data with the slave.

Replication setup for persistence

Consider the following setup:

A master instance with no persistence

A slave instance with AOF enabled

In this case, the master does not need to perform any background disk operations and is fully dedicated to serving client requests, apart from the trivial cost of the slave connection. The slave server, configured with AOF, performs the disk operations. As mentioned before, this file can be used to restore the master in case of a disaster.
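A minimal sketch of that replication setup, once more with the redis-py client (assumed): the master runs with persistence switched off, while the slave enables AOF and attaches itself to the master. The hostnames and ports are placeholders:

    # Sketch: master with no persistence, slave with AOF (hosts/ports are illustrative).
    import redis

    master = redis.Redis(host="redis-master", port=6379)
    replica = redis.Redis(host="redis-replica", port=6379)

    # Master: disable snapshotting and AOF so it only serves clients.
    master.config_set("save", "")
    master.config_set("appendonly", "no")

    # Slave: enable AOF and point it at the master.
    replica.config_set("appendonly", "yes")
    replica.config_set("appendfsync", "everysec")
    replica.slaveof("redis-master", 6379)    # SLAVEOF host port

    print(replica.info("replication").get("role"))    # should report 'slave'

As with the earlier sketches, these runtime changes should also be written to the configuration files so that they survive a restart.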
Persistence in Redis is a matter of configuration, balancing the trade-off between performance, disk I/O, and data durability. If you are looking for more information on persistence in Redis, you will find the article by Salvatore Sanfilippo at http://oldblog.antirez.com/post/redis-persistence-demystified.html interesting.

Summary

This article helps you understand the persistence options available in Redis, which should ease the effort of adding Redis to your application stack.


Optimizing Programs

Packt
29 May 2013
6 min read
Using transaction SAT to find problem areas

In this recipe, we will see the steps required to analyze the execution of any report, transaction, or function module using the transaction SAT.

Getting ready

For this recipe, we will analyze the runtime of the standard program RIBELF00 (Display Document Flow Program). The program selection screen contains a number of fields. We will execute the program for an order number (AUFNR) range and observe the behavior.

How to do it...

To carry out runtime analysis using transaction SAT, proceed as follows:

1. Call transaction SAT.

2. Enter a suitable name for the variant (in our case, YPERF_VARIANT) and click the Create button below it. This takes you to the variant creation screen.

3. On the Duration and Type tab, switch on Aggregation by choosing the Per Call Position radio button. Then click on the Statements tab.

4. On the Statements tab, make sure that under Internal Tables the Read Operations and Change Operations checkboxes are selected, and that the Open SQL checkbox under Database Access is selected. Save your variant.

5. Come back to the main screen of SAT. Make sure that, within Data Formatting on the initial screen of SAT, the checkbox Determine Names of Internal Tables is selected.

6. Next, enter the name of the program that is to be traced in the field provided (in our case, RIBELF00) and execute the measurement. The selection screen of the program appears. We will enter an order number range and execute the program.

7. Once the program output is generated, click the Back key to return to the program selection screen. Click the Back key once again to generate the evaluation results.

How it works...

We carried out the execution of the program through the transaction SAT and the evaluation results were generated. On the left are the Trace Results (in tree form), listing the statements and events with the most runtime. These are like a summary report of the entire measurement of the program. They are listed in descending order of net time, in microseconds and as a percentage of the total time. For example, in our case, the OPEN CURSOR event takes 68 percent of the total runtime of the program.

Selecting the Hit List tab shows the top time-consuming components of the program. In this example, the access of database tables AFRU and VBAK takes most of the time. Double-clicking any item in the Trace Results window on the left-hand side will display, in the Hit List area on the right-hand pane, details of the contained items along with the execution time of each item. From the Hit List window, double-clicking a particular item takes us to the relevant line in the program code. For example, double-clicking the Open Cursor VBAK line takes us to the corresponding program code.

We carried out the analysis with Aggregation switched on. Switching on Aggregation shows one single entry for multiple calls of a particular line of code. Because of this, the results are less detailed and easier to read, since the hit list and the call hierarchy in the results are much more simplified. Also, by default, the names of the internal tables used are not shown in the results. For the internal table names to appear in the evaluation result, the Determine Names of Internal Tables checkbox is selected.

As a general recommendation, the runtime analysis should be carried out several times for best results.
The reason is that the DB measurement time can depend on a variety of factors, such as system load, network performance, and so on.

Creation of secondary indexes in database tables

Very often, the cause of a long-running report is a full scan of a database table specified within the code, mainly because no suitable index exists. In this recipe, we will see the steps required to create a new secondary index on a database table for performance improvement. Creating indexes lets you optimize standard reports as well as your own reports. In this recipe, we will create a secondary index on a test table ZST9_VBAK (which is simply a copy of VBAK).

How to do it...

To create a secondary index, proceed as follows:

1. Call transaction SE11. Enter the name of the table in the field provided, in our case ZST9_VBAK, then click the Display button. This takes you to the Display Table screen.

2. Next, choose the menu path Goto | Indexes. This displays all indexes that currently exist for the table.

3. Click the Create button and then choose the option Create Extension Index. A dialog box appears. Enter a three-digit name for the index, then press Enter. This takes you to the extension index maintenance screen.

4. On the top part, enter a short description in the Short Description field. We will create a non-unique index, so the Non-unique index radio button is selected (on the middle part of the screen).

5. On the lower part of the screen, specify the field names to be used in the index. In our case, we use MANDT and AUFNR. Then activate your index using Ctrl + F3. The index will be created in the database, with an appropriate creation message shown below Status.

How it works...

This creates the index on the database. Since we created an extension index, the index will not be overwritten by SAP during an upgrade. Now any report that accesses the ZST9_VBAK table specifying MANDT and AUFNR in the WHERE clause will take advantage of an index scan using our new secondary index. A simple illustration of why such an index access beats a full scan follows at the end of this recipe.

There's more...

SAP recommends that the index be first created in the development system and then transported to the quality and production systems. Secondary indexes are not automatically generated on target systems after being transported. We should check the status in the Activation Log on the target systems, and use the Database Utility to manually activate the index in question.

A secondary index should preferably contain fields that are not common (or as uncommon as possible) with other indexes. Too many redundant secondary indexes (that is, too many common fields across several indexes) on a table have a negative impact on performance; for instance, a table with 10 secondary indexes sharing more than three fields. In addition, tables that are rarely modified (and very often read) are the ideal candidates for secondary indexes.

See also

http://help.sap.com/saphelp_erp2005/helpdata/EN/85/685a41cdbf8047e10000000a1550b0/content.htm

http://help.sap.com/saphelp_nw04/helpdata/en/cf/21eb2d446011d189700000e8322d00/frameset.htm

http://docs.oracle.com/cd/E17076_02/html/programmer_reference/am_second.html

http://forums.sdn.sap.com/thread.jspa?threadID=1469347
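Returning to the index created in this recipe: the speed-up from a matching secondary index is essentially the difference between a keyed lookup and a full scan. The following Python sketch is purely illustrative (it is not ABAP and does not touch SAP); it mimics an index on (MANDT, AUFNR) with a dictionary and compares it against a linear scan of the same rows:

    # Sketch: why a secondary index on (MANDT, AUFNR) helps -- keyed lookup vs. full scan.
    import time

    # Fake order rows: (mandt, aufnr, payload); purely illustrative data.
    rows = [("100", f"{n:012d}", "...") for n in range(500_000)]

    # The "secondary index": a lookup structure keyed by the indexed fields.
    index = {(r[0], r[1]): r for r in rows}

    wanted = ("100", f"{250_000:012d}")

    start = time.perf_counter()
    hit_scan = next(r for r in rows if (r[0], r[1]) == wanted)   # full table scan
    scan_time = time.perf_counter() - start

    start = time.perf_counter()
    hit_index = index[wanted]                                    # index access
    index_time = time.perf_counter() - start

    print(f"full scan: {scan_time:.6f}s, index lookup: {index_time:.6f}s")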


Techniques for Creating a Multimedia Database

Packt
17 May 2013
37 min read
Tier architecture

The rules surrounding technology are constantly changing. Decisions and architectures based on current technology might easily become out of date with hardware changes. To best understand how multimedia and unstructured data fit and can adapt to the changing technology, it's important to understand how and why we arrived at our different current architectural positions. In some cases we have come full circle and reinvented concepts that were in use 20 years ago. Only by learning from the lessons of the past can we see how to move forward to deal with this complex environment.

In the past 20 years a variety of architectures have come about in an attempt to satisfy some core requirements:

Allow as many users as possible to access the system

Ensure those users had good performance for accessing the data

Enable those users to perform DML (insert/update/delete) safely and securely (safely implies the ability to restore data in the event of failure)

The goal of a database management system was to provide an environment where these points could be met. The first databases were not relational. They were heavily I/O focused, as the computers did not have much memory and the idea of caching data was deemed to be too expensive. The servers had kilobytes and then, eventually, megabytes of memory. This memory was required foremost by the programs to run in them. The most efficient architecture was to use pointers to link the data together. The architecture that emerged naturally was hierarchical, and a program would navigate the hierarchy to find rows related to each other.

Users connected in via a dumb terminal. This was a monitor with a keyboard that could process input and output from a basic protocol and display it on the screen. All the processing of information, including how the screen should display it (using simple escape sequence commands), was controlled in the server.

Traditional no tier

The mainframes used a block mode structure, where the user would enter a screen full of data and press the Enter key. After doing this the whole screen of information was sent to the server for processing. Other servers used asynchronous protocols, where each letter, as it was typed, was sent to the server for processing. This method was not as efficient as block mode because it required more server processing power to handle the data coming in. It did provide a friendlier interface for data entry, as mistakes made could be relayed immediately back to the user. Block mode could only display errors once the screen of data was sent, processed, and returned.

As more users started using these systems, the amount of data in them began to grow and the users wanted to get more intelligence out of the data entered. Requirements for reporting appeared, as well as the ability to do ad hoc querying. The databases were also very hard to maintain and enhance, as the pointer structure linked everything together tightly. It was very difficult to perform maintenance and changes to code.

In the 1970s the relational database concept was formulated, and it was based on sound mathematical principles. In the early 1980s the first conceptual relational databases appeared in the marketplace, with Oracle leading the way. The relational databases were not received well. They performed poorly and used a huge amount of server resources.
Though they achieved their stated goal of being flexible and adaptable, enabling more complex applications to be built more quickly, the performance overhead of performing joins proved to be a major issue. Benefits could be seen in them, but they could never be seen as usable in any environment that required tens to hundreds or thousands of concurrent users. The technology wasn't there to handle them.

To initially achieve better performance, the relational database vendors focused on exploiting a changing hardware feature, and that was memory. By the late 1980s computer servers were starting to move from 16 bit to 32 bit. Memory was increasing and there was a drop in price. By adapting to this, the vendors managed to take advantage of memory and improved join performance. The relational databases in effect achieved a balancing act between memory and disk I/O. Accessing a disk was about a thousand times slower than accessing memory. Memory was transient, meaning that if there was a power failure, any data stored in memory would be lost. Memory was also measured in megabytes, whereas disk was measured in gigabytes. Disk was not transient and generally reliable, but it still required safeguards to be put in place to protect from disk failure.

So the balancing act the databases performed involved caching frequently accessed data in memory, while ensuring any modifications made to that data were always stored to disk. Additionally, the database had to ensure no data was lost if a disk failed. To improve join performance the database vendors came up with their own solutions involving indexing, optimization techniques, locking, and specialized data storage structures. Databases were judged on the speed at which they could perform joins.

The flexibility and ease with which applications could be updated and modified, compared to the older systems, soon made the relational database popular and a must-have. As all relational databases conformed to an international SQL standard, there was a perception that a customer was never locked into a proprietary system and could move their data between different vendors. Though there were elements of truth to this, the reality has shown otherwise. The Oracle Database's key strength was that you were not locked into the hardware, and it offered the ability to move a database between a mainframe, Windows, or Unix. This portability across hardware effectively broke the stranglehold a number of hardware vendors had, and opened up the competition, enabling hardware vendors to focus on the physical architecture rather than the operating system within it.

In the early 1990s, with the rise in popularity of the Apple Macintosh, the rules changed dramatically and the concept of a user-friendly graphical environment appeared. The Graphical User Interface (GUI) screen offered a powerful interface for the user to perform data entry. Though it can be argued that data entry was not (and is still not) as fast as data entry via a dumb terminal interface, the use of colors, varying fonts, widgets, comboboxes, and a whole repository of specialized frontend data entry features made the interface easier to use, and more data could be entered with less typing. Arguably, the GUI opened up the computer to users who could not type well. The interface was easier to learn and less training was needed to use it.

Two tier

The GUI interface had one major drawback: it was expensive to run on the CPU.
Some vendors experimented with running the GUI directly on the server (the Solaris operating system offered this capability), but it became obvious that this solution would not scale. To address this, the two-tier architecture was born. This involved using the GUI, which was running on an Apple Macintosh, Microsoft Windows, or another windowing environment (Microsoft Windows wasn't the only GUI to run on Intel platforms), to handle the display processing. This was achieved by moving the application display to the computer the user was using, thus splitting the GUI presentation layer and application from the database. This seemed like an ideal solution, as the database could now just focus on handling and processing SQL queries and DML. It did not have to be burdened with application processing as well.

As there were no agreed network protocols, a number had to be used, including named pipes, LU6.2, DECNET, and TCP/IP. The database had to handle language conversion as the data moved between the client and the server. The client might be running on a 16-bit platform using US7ASCII as the character set, but the server might be running on 32 bit using EBCDIC as the character set. The network suddenly became very complex to manage.

What proved to be the ultimate show stopper with the architecture had nothing to do with the scalability of client or database performance, but rather something which is always neglected in any architecture, and that is the scalability of maintenance. Having an environment of a hundred users, each with their own computer accessing the server, requires a team of experts to manage those computers and ensure the software on them is correct. Application upgrades meant upgrading hundreds of computers at the same time. This was a time-consuming and manual task. Compounding this, if the client computer is running multiple applications, upgrading one might impact the other applications. Even applying an operating system patch could impact other applications. Users also might install their own software on their computer and impact the application running on it. A lot of time was spent supporting users and ensuring their computers were stable and could correctly communicate with the server.

Three tier

Specialized software vendors tried to come to the rescue by offering the ability to lock down a client computer from being modified and allowing remote access to the computer to perform remote updates. Even then, the maintenance side proved very difficult to deal with, and when the idea of a three tier architecture was pushed by vendors, it was very quickly adopted as the ideal solution to move towards, because it critically addressed the maintenance issue.

In the mid 1990s the rules changed again. The Internet started to gain in popularity and the web browser was invented. The browser opened up the concept of a smart presentation layer that is very flexible and is configured using a simple markup language. The browser ran on top of the protocol called HTTP, which uses TCP/IP as the underlying network protocol. The idea of splitting the presentation layer from the application became a reality as more applications appeared in the browser. The web browser was not an ideal platform for data entry, as the HTTP protocol was stateless, making it very hard to perform transactions in it. The HTTP protocol could scale, though. The actual usage involved the exact same concepts as block mode data entry performed on mainframe computers.
In a web browser all the data is entered on the screen and then sent in one go to the application handling the data. The web browser also pushed the idea that the operating system the client is running on is immaterial. The web browsers were ported to Apple computers, Windows, Solaris, and Unix platforms. The web browser also introduced the idea of a standard for the presentation layer. All vendors producing a web browser had to conform to the agreed HTML standard. This ensured that anyone building an application that conformed to HTML would be able to run it on any web browser. The web browser pushed the concept that the presentation layer had to run on any client computer (later on, any mobile device as well) irrespective of the operating system and whatever else was installed on it. The web browser was essentially immune from anything else running on the client computer. If all the client had to use was a browser, maintenance on the client machine would be simplified.

HTML had severe limitations and it was not designed for data entry. To address this, the Java language came about and provided the concept of an applet, which could run inside the browser, be safe, and provide an interface to the user for data entry. Different vendors came up with different architectures for splitting their two tier application into a three tier one. Oracle achieved this by taking their Oracle Forms product and moving it to the middle application tier, and providing a framework where the presentation layer would run as a Java applet inside the browser. The Java applet would communicate with a process on the application server, and that process would give it instructions for how to draw the display. When the Forms product was replaced with JDeveloper, the same concept was maintained and enhanced. The middle tier became more flexible, and multiple middle application tiers could be configured, enabling more concurrent users.

The three tier architecture has proven to be an ideal environment for legacy systems, giving them a new life and enabling them to be put in an environment where they can scale. The three tier environment has a major flaw preventing it from truly scaling, though. The flaw is the bottleneck between the application layer and the database. The three tier environment is also designed for relational databases; it is not designed for multimedia databases. In this architecture, if the digital objects are stored in the database, then to be delivered to the customer they need to pass through the application-database network (exaggerating the bottleneck capacity issues), and from there be passed to the presentation layer. Those building in this environment naturally lean towards the idea that the best location for the digital objects is the middle tier. This then leads to issues of security, backup, management, and all the issues previously cited as reasons why storing the digital objects in the database is ideal. The logical conclusion to this is to move the database to the middle tier to address this. In reality, the logical conclusion is to move the application tier back into the database tier.

Virtualized architecture

In the mid 2000s the idea of virtualization began to appear in the marketplace. Virtualization was not really a new idea and the concept had existed in the IBM MVS environment since the late 1980s. What made this virtualization concept powerful was that it could run Windows, Linux, Solaris, and Mac environments within it.
A virtualized environment was basically the ability to run a complete operating system within another operating system. If the computer server had sufficient power and memory, it could run multiple virtual machines (VMs). We can take a snapshot of a VM, which involves taking a view of the disk and memory and storing it; it then becomes possible to roll back to the snapshot. A VM can be easily cloned (copied) and backed up. VMs can also be easily transferred to different computer servers. The VM is not tied to a physical server, and the same environment can be moved to new servers as capacity increases.

A VM environment became attractive to administrators simply because it was easy to manage. Rather than running five separate servers, an administrator could have one server with five virtualizations in it.

The VM environment entered at a critical moment in the evolution of computer servers. Prior to 2005 most computer servers had one or two CPUs in them. Advanced ones could have as many as 64 (for example, the Sun E10000), but generally one or two was the simplest solution. The reason was that computer power was doubling every two years, following Moore's law. By around 2005 the market began to realize that there was a limit to the speed of an individual CPU, due to physical limitations in the size of the transistors in the chips. The solution was to grow the CPUs sideways, and the concept of cores came about. A CPU could be broken down into multiple cores, where each one acted like a separate CPU but was contained in one chip. With the introduction of smart threading, the number of virtual cores increased. A single CPU could now simulate eight or more CPUs.

This concept has changed the rules. A server can now run with a large number of cores, whereas 10 years ago it was physically limited to one or two CPUs. If a process went wild and consumed all the resources of one CPU, it impacted all users. In a multicore CPU environment, a rogue process will not impact the others. In a VM, the controlling operating system (which is also called a hypervisor, and can be hardware, firmware, or software centric) can constrain VMs to certain cores as well as to CPU thresholds within those cores. This allows a VM to be fenced in. This concept was taken up by Amazon and the concept of the cloud environment formed.

This architecture is now moving onto a new path where users can use remote desktop into their own VM on a server. The user now needs only a simple laptop (resulting in the demise of the tower computer) to use remote desktop (or equivalent) into the virtualization. They then become responsible for managing their own laptop, and in the event of an issue it can be replaced, or wiped and reinstalled with a base operating system on it. This simplifies the management. As all the business data and application logic is in the VM, the administrator can now control it, easily back it up, and access it.

Though this VM cloud environment seems like a good solution for resolving the maintenance scalability issue, a spanner has been thrown in the works: at the same time as VMs were becoming popular, so was the evolution of the mobile phone into a portable handheld device with applications running on it.

Mobile applications architecture

The iPhone, iPad, Android, Samsung, and other devices have caused a disruption in the marketplace as to how the relationship between the user and the application is perceived and managed.
These devices are simpler and, on the face of it, employ a variety of architectures, including two tier and three tier. Quality control of the application is managed by having an independent and separate environment where the user can obtain the application for their mobile device. The strict controls Apple employs for using iTunes are primarily to ensure that Trojan code or viruses are not embedded in the application, with the result that a mobile device does not require complex and constantly updating anti-virus software. Though the interface is not ideal for heavy data entry, the applications are naturally designed to be very friendly and use touch screen controls. The low cost combined with the simple interface has made these devices an ideal product for most people, and they are replacing the need for a laptop in a number of cases. Application vendors whose applications naturally lend themselves to this environment are taking full advantage of it to provide a powerful interface for clients to use.

The result is that there are two architectures today that exist and are moving in different directions. Each one is popular and resolves certain issues. Each has different interfaces, and when building and configuring a storage repository for digital objects, both these environments need to be taken into consideration.

For a multimedia environment the ideal solution for implementing the application is based on the Web. This is because the web environment over the last 15 years has evolved into one which is very flexible and adaptable for dealing with the display of those objects. From the display of digital images to streaming video, the web browser (sometimes with plugins to improve the display) is ideal. This includes the display of documents. The browser environment, though, is not strong for the editing of these digital objects. Adobe Photoshop, Gimp, Garage Band, Office, and a whole suite of other products are available that are designed to edit each type of digital object perfectly. This means that currently the editing of those digital objects requires a different solution from the loading, viewing, and delivery of those digital objects.

There is no right solution for the tier architecture to manage digital objects. The N-Tier model moves the application and database back into the database tier. An HTTP server can also be located in this tier, or for higher availability it can be located externally. Optimal performance is achieved by locating the application as close to the database as possible. This reduces the network bottleneck. By locating the application within the database (in Oracle this is done by using PL/SQL or Java) an ideal environment is configured where there is no overhead between the application and the database.

The N-Tier model also supports the concept of having the digital objects stored outside the environment and delivered using other methods. This could include a streaming server. The N-Tier model also supports the concept of transformation servers. Scalability is achieved by adding more tiers and spreading the database between them. The model also deals with the issue of the connection to the Internet becoming a bottleneck: a database server in the tier is moved to another network to help balance the load. For Oracle this can be done using RAC to achieve a form of transparent scalability. In most situations, though, scalability at the server is achieved using manual methods, using a form of application partitioning.
Basic database configuration concepts

When a database administrator first creates a database that they know will contain digital objects, they will be confronted with some basic database configuration questions covering key sizing features of the database.

When looking at the Oracle Database, there are a number of physical and logical structures built inside the database. To avoid confusion with other database management systems, it's important to note that an Oracle Database is a collection of schemas, whereas in other database management systems the term database equates to exactly one schema. This confusion has caused a lot of issues in the past. An Oracle Database administrator will say it can take 30 minutes to an hour to create a database, whereas a SQL Server administrator will say it takes seconds to create a database. In Oracle, creating a schema (the equivalent of a SQL Server database) also takes seconds to perform.

For the physical storage of tables, the Oracle Database is composed of logical structures called tablespaces. The tablespace is designed to provide a transparent layer between the developer creating a table and the physical disk system, and to ensure the two are independent. Data in a table that resides in a tablespace can span multiple disks and disk subsystems or a network storage system. A subsystem equating to a RAID structure is covered in greater detail at the end of this article.

A tablespace is composed of many physical datafiles. Each datafile equates to one physical file on the disk. The goal when creating a datafile is to ensure its allocation of storage is contiguous, in that the operating system doesn't split its location into different areas on the disk (RAID and NAS structures store the data in different locations based on their core structure, so this rule does not apply to them). A contiguous file will result in less disk activity when full tablespace scans are performed. In some cases, especially when reading in very large images, this can improve performance.

A datafile is divided (when using locally managed tablespaces, the default in Oracle) into fixed-size extents. Access to the extents is controlled via a bitmap, which is managed in the header of the tablespace (which resides on a datafile). An extent is based on the core Oracle block size. So if the extent is 128 KB and the database block size is 8 KB, 16 Oracle blocks will exist within the extent.

An Oracle block is the smallest unit of storage within the database. Blocks are read into memory for caching, updated, and changes stored in the redo logs. Even though the Oracle block is the smallest unit of storage, as a datafile is an operating system file, the unit of storage at the filesystem level can differ depending on the type of server filesystem (on UNIX it might be UFS, on Windows NTFS). The default block size in Windows was once 512 bytes, but with NTFS it can be as high as 64 KB. This means every time a request is made to the disk to retrieve data from the filesystem, it does a read to return this amount of data. So if the Oracle block size was 8 KB and the filesystem block size was 64 KB, when Oracle requests a block to be read in, the filesystem will read in 64 KB, return the 8 KB requested, and discard the rest. Most filesystems cache this data to improve performance, but this example highlights how, in some cases, not balancing the database block size with the filesystem block size can result in wasted I/O.
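The relationships described above are easy to check with a little arithmetic. The following sketch (plain Python, using the sizes quoted in the text) works out how many Oracle blocks fit in an extent and how much data a mismatched filesystem block size pulls in for a single block read:

    # Sketch: block/extent arithmetic using the sizes quoted in the text.
    KB = 1024

    oracle_block = 8 * KB          # database block size
    extent = 128 * KB              # extent size
    fs_block = 64 * KB             # e.g. NTFS formatted with 64 KB clusters

    print("Oracle blocks per extent:", extent // oracle_block)          # 16

    # A single-block read on a 64 KB filesystem block returns 64 KB, of which
    # only the requested 8 KB is used by Oracle.
    wasted = fs_block - oracle_block
    print(f"bytes transferred per 8 KB block read: {fs_block} ({wasted} discarded)")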
How much this matters in practice is operating system and filesystem dependent, and it also depends on whether Oracle is doing read-aheads (using the init.ora parameter db_file_multiblock_read_count).

When Oracle introduced the Exadata, they put forward the idea of putting smarts into the disk layer. Rather than the database working out how best to retrieve the physical blocks of data, the database passes a request for information to the disk system. As the Exadata knows about its own disk performance, channel speed, and I/O throughput, it is in a much better position to work out the optimal method for extracting the data. It then works out the best way of retrieving it based on the request (which can be a query). In some cases it might do a full table scan because it can process the blocks faster than if it used an index. It becomes a smart disk system rather than a dumb/blind one. This capability has changed the rules for how a database works with the underlying storage system.

ASM (Automated Storage Management)

In Oracle 10g, Oracle introduced ASM primarily to improve the performance of Oracle RAC (clustered systems, where multiple separate servers share the same database on the same disks). It replaces the server filesystem and can handle mirroring and load balancing of datafiles. ASM takes the filesystem and operating system out of the equation and enables the database administrator to have a different degree of control over the management of the disk system.

Block size

The database block size is the fundamental unit of storage within an Oracle Database. Though the database can support different block sizes, a tablespace is restricted to one fixed block size. The block sizes available are 4 KB, 8 KB, 16 KB, and 32 KB (a 32 KB block size is valid only on 64-bit platforms).

The current tuning mentality says it's best to have one block size for the whole database. This is based on the idea that one block size makes it easier to manage the SGA and ensures that memory isn't wasted. If multiple block sizes are used, the database administrator has to partition the SGA into multiple areas and assign each a block size. So if the administrator decided to have the database use 8 KB and 16 KB blocks, they would have to set up database startup parameters indicating the cache size for each:

    DB_8K_CACHE_SIZE = 2G
    DB_16K_CACHE_SIZE = 1G

The problem an administrator faces is that it can be hard to match memory usage with table usage. In the above scenario the tables residing in the 8 KB blocks might be accessed a lot more than the 16 KB ones, meaning the memory needs to be adjusted to deal with that. This balancing act of tuning invariably results in the decision that, unless exceptional situations warrant its use, it's best to keep the same database block size across the whole database. This makes the job of tuning simpler.

As is always the case when dealing with unstructured data, the rules change. The current thinking is that it's more efficient to store the data in a large block size. This ensures there is less wasted overhead and fewer block reads to read in a row of data. The challenge is that the size of the unstructured data can vary dramatically. It's realistic for an image thumbnail to be under 4 KB in size. This makes it an ideal candidate to be stored in the row with the other relational data. Even if an 8 KB block size is used, the thumbnail and other relational data might happily exist in the one block. A photo, however, might be 10 MB in size, requiring a large number of blocks to store it.
If a 16 KB block size is used, it requires about 64 blocks to store 1 MB (assuming some overhead for the block header requires extra storage). An 8 KB block size requires about 130 blocks. If you have to store 10 MB, the number of blocks increases ten times; for an 8 KB block that is over 1,300 reads for one small-sized 10 MB image. With images now coming close to 100 MB in size, this figure again increases by a factor of 10. It soon becomes obvious that a very large block size is needed. When storing video at over 4 GB in size, even a 32 KB block size seems too small.

As is covered later in the article, unstructured data stored in an Oracle BLOB does not have to be cached in the SGA. In fact, it's discouraged, because in most situations the data is not likely to be accessed on a frequent basis. This generally holds true, but there are cases, especially with video, where it does not; this situation is covered later. Under the assumption that the thumbnails are accessed frequently and should be cached, while the originals are accessed infrequently and should not be cached, the conclusion is that it now becomes practical to split the SGA in two. The unstructured, uncached data is stored in a tablespace using a large block size (32 KB) and the remaining data is stored in a more acceptable and reasonable 8 KB block. The SGA for the 32 KB block size is kept to a bare minimum as it will not be used, thus bypassing the issue of perceived wasted memory when splitting the SGA in two.

In the following table a simple test was done using three tablespace block sizes. The aim was to see if the block size would impact load and read times. The load involved reading in 67 TIF images totaling 3 GB in size. The result was that the tablespace block size made no statistically significant difference. The test was done using a 50 MB extent size and, as shown in the next segment, this size will impact performance. So to correctly understand how important block size can be, one has to look at not only the block size but also the extent size.

Details of the environment used to perform these tests:

    CREATE TABLESPACE tbls_name
      DATAFILE 'directory/datafile' SIZE 5G REUSE
      BLOCKSIZE 4096/8192/16384
      EXTENT MANAGEMENT LOCAL UNIFORM SIZE 50M
      SEGMENT SPACE MANAGEMENT AUTO;

The following table compares the various block sizes:

    Tablespace block size   Blocks   Extents   Load time      Read time
    4 KB                    819200   64        3.49 minutes   1.02 minutes
    8 KB                    403200   63        3.46 minutes   0.59 minutes
    16 KB                   201600   63        3.55 minutes   0.59 minutes

UNIFORM extent size and AUTOALLOCATE

When creating a tablespace to store the unstructured data, the next step after the block size is determined is to work out the most efficient extent size. As a table might contain data ranging from hundreds of gigabytes to terabytes, determining the extent size is important. The larger the extent, the greater the potential to waste space if the table doesn't use it all. The smaller the extent size, the greater the risk that the table will grow into tens or hundreds of thousands of extents. As a locally managed tablespace uses a bitmap to manage access to the extents and is generally quite fast, having it manage tens of thousands of extents might be pushing its performance capabilities. There are two methods available to the administrator when creating a tablespace: they can manually specify the fragment size using the UNIFORM extent size clause, or they can let the Oracle Database calculate it using the AUTOALLOCATE clause.
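To put these block and extent counts in perspective before looking at the test results, here is a small back-of-the-envelope sketch (plain Python; the sizes come from the discussion above, and the 2 percent header allowance is an assumption made purely for illustration):

    # Sketch: rough block and extent counts for the sizes discussed above.
    import math

    KB, MB, GB = 1024, 1024**2, 1024**3

    def blocks_needed(object_size, block_size, overhead=0.02):
        # ~2% allowance for block headers; an assumption for illustration only.
        return math.ceil(object_size * (1 + overhead) / block_size)

    for img in (10 * MB, 100 * MB):
        for blk in (8 * KB, 16 * KB, 32 * KB):
            print(f"{img // MB:>4} MB image, {blk // KB:>2} KB block: "
                  f"{blocks_needed(img, blk):>6} blocks")

    table_size = 300 * GB
    for extent in (1 * MB, 50 * MB, 200 * MB):
        print(f"300 GB of data in {extent // MB:>3} MB extents: "
              f"{table_size // extent} extents")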
Tests were done to determine the optimal fragment size when AUTOALLOCATE was not used. AUTOALLOCATE is more of a set-and-forget method, and one goal was to see if this clause was as efficient as manually setting the extent size.

Locally managed tablespace UNIFORM extent size

This covers testing performed to try to find an optimal extent and block size. The results showed that a block size of 16384 (16 KB) is ideal, though 8192 (8 KB) is acceptable. The block size of 32 KB was not tested. The administrator, who might be tempted to think that the larger the extent size, the better the performance, would be surprised that the results show this is not always the case, and an extent size between 50 MB and 200 MB is optimal. For reads with SECUREFILES the number of extents was not a major performance factor, but it was for writes. When compared to the AUTOALLOCATE clause, there was no real performance improvement or loss when it was used. The testing showed that an administrator can use this clause knowing they will get a good all-round result when it comes to performance. The syntax for configuration is as follows:

    EXTENT MANAGEMENT LOCAL AUTOALLOCATE
    SEGMENT SPACE MANAGEMENT AUTO

Repeated tests showed that this configuration produced optimal read/write times without the database administrator having to worry about what the extent size should be. For a 300 GB tablespace it produced a similar number of extents as when a 50 MB extent size was used.

As has been covered, once an image is loaded it is rarely updated. In a relational database, fragmentation within a tablespace is caused by the repeated creation and dropping of schema objects and extents of different sizes, resulting in physical storage gaps which are not easily reused. Storage is lost. This is analogous to the Microsoft Windows environment and its disk storage. After a period of time the disk becomes fragmented, making it hard to find contiguous storage and to locate similar items together. Locating all the pieces of a file as close together as possible can dramatically reduce the number of disk reads required to read it in. With NTFS (a Microsoft disk filesystem format) the system administrator can, on creation, determine whether extents are autoallocated or fragmented. This is similar in concept to Oracle tablespace creation. Testing was not done to check whether the fragmentation scenario is avoided with the AUTOALLOCATE clause. The database administrator should therefore be aware of the tablespace usage and whether it is likely to be stable once rows are added (in which case AUTOALLOCATE can be used, simplifying storage management). If it is volatile, the UNIFORM clause might be considered the better option.

Temporary tablespace

For working with unstructured data, the primary use of the TEMPORARY tablespace is to hold the contents of temporary tables and temporary LOBs. A temporary LOB is used for processing a temporary multimedia object. In the following example, a temporary BLOB is created. It is not cached in memory. A multimedia image type is created and loaded into it, information is extracted, and the BLOB is freed. This is useful if images are stored temporarily outside the database. This is not the same as using a BFILE, which Oracle Multimedia supports; the BFILE is a permanent pointer to an image stored outside the database.
    declare
      image ORDSYS.ORDImage;
      ctx   raw(4000);
    begin
      image := ordsys.ordimage.init();
      dbms_lob.createtemporary(image.source.localdata, FALSE);
      image.importfrom(ctx, 'file', 'LOADING_DIR', 'myimg.tif');
      image.setProperties;
      dbms_output.put_line('width x height = ' || image.width || 'x' || image.height);
      dbms_lob.freetemporary(image.source.localdata);
    end;
    /

    width x height = 2809x4176

It's important when using this tablespace to ensure that all code, especially on failure, calls dbms_lob.freetemporary, so that storage leakage doesn't occur. Leakage will result in the tablespace continuing to grow until it runs out of room. In that case the only way to clean it up is either to stop all database processes referencing it and then resize the datafile (or drop and recreate the temporary tablespace after creating another interim one), or to restart the database and mount it. The tablespace can then be resized or dropped and recreated.

UNDO tablespace

The UNDO tablespace is used by the database to store sufficient information to roll back a transaction. In a database containing a lot of digital objects, the size of the database just for storage of the objects can exceed terabytes. In this situation the UNDO tablespace can be sized larger, giving the database administrator added opportunity to perform flashback recovery from user error. It's reasonable to size the UNDO tablespace at 50 GB, even growing it to 100 GB in size. The larger the UNDO tablespace, the further back in time the administrator can go, and the greater the breathing space between user failure, the failure being detected and reported, and the database administrator doing the flashback recovery.

The following is an example flashback SQL statement. The as of timestamp clause tells Oracle to find rows that match the timestamp from the current time going back, so that we can have a look at the table as it was an hour ago:

    select t.vimg.source.srcname || '=' ||
           dbms_lob.getlength(t.vimg.source.localdata)
    from test_load as of timestamp systimestamp - (1/24) t;

SYSTEM tablespace

The SYSTEM tablespace contains the data dictionary. In Oracle 11g R2 it also contains any compiled PL/SQL code (where PLSQL_CODE_TYPE=NATIVE). The recommended initial starting size of the tablespace is 1500 MB.

Redo logs

The following test results highlight how important it is to get the size and placement of the redo logs correct. The goal was to determine what combination of database parameters and redo/undo size was optimal. In addition, an SSD was used as a comparison. Based on the result of each test, the parameters and/or storage were modified to see whether the results would improve. When it appeared an optimal parameter/storage setting was found, it was locked in while the other parameters were tested further. This enabled multiple concurrent configurations to be tested and an optimal result to be calculated.

The test involved loading 67 images into the database. Each image varied in size between 40 and 80 MB, resulting in 2.87 GB of data being loaded. As the test involved only image loading, no processing such as setting properties or extraction of metadata was performed. Archiving on the database was not enabled. All database files resided on hard disk unless specified otherwise. In between each test a full database reboot was done.
The test was run at least three times, with the range of results shown as follows.

Database parameter descriptions used:
Redo Buffer Size = LOG_BUFFER
Multiblock Read Count = db_file_multiblock_read_count

Source disk | Redo logs            | Redo buffer size | Multiblock read count | UNDO tablespace | Table datafile | Fastest time  | Slowest time
Hard disk   | Hard disk, 3 x 50 MB | 4 MB             | 64                    | HD (10 GB)      | HD             | 3 min 22 sec  | 3 min 53 sec
Hard disk   | Hard disk, 3 x 1 GB  | 4 MB             | 64                    | HD (10 GB)      | HD             | 2 min 49 sec  | 2 min 57 sec
Hard disk   | SSD, 3 x 1 GB        | 4 MB             | 64                    | HD (10 GB)      | HD             | 1 min 30 sec  | 1 min 41 sec
Hard disk   | SSD, 3 x 1 GB        | 64 MB            | 64                    | HD (10 GB)      | HD             | 1 min 23 sec  | 1 min 48 sec
Hard disk   | SSD, 3 x 1 GB        | 8 MB             | 64                    | HD (10 GB)      | HD             | 1 min 18 sec  | 1 min 29 sec
Hard disk   | SSD, 3 x 1 GB        | 16 MB            | 64                    | HD (10 GB)      | HD             | 1 min 19 sec  | 1 min 27 sec
Hard disk   | SSD, 3 x 1 GB        | 16 MB            | 256                   | HD (10 GB)      | HD             | 1 min 27 sec  | 1 min 41 sec
Hard disk   | SSD, 3 x 1 GB        | 8 MB             | 64                    | SSD (1 GB)      | HD             | 1 min 21 sec  | 1 min 49 sec
SSD         | SSD, 3 x 1 GB        | 8 MB             | 64                    | SSD (1 GB)      | HD             | 53 sec        | 54 sec
SSD         | SSD, 3 x 1 GB        | 8 MB             | 64                    | SSD (1 GB)      | SSD            | 1 min 20 sec  | 1 min 20 sec

Analysis
The tests show a huge improvement when the redo logs were moved to a Solid State Drive (SSD). However, the conclusion that moving the redo logs to SSD is the optimal step might be self-defeating. A number of SSD manufacturers acknowledge that there are limitations with SSDs when it comes to repeated writes. The Mean Time To Failure (MTTF) might be 2 million hours for reads; for writes the failure rate can be very high. Modern SSDs and flash cards offer much improved wear-leveling algorithms to reduce failures and make performance more consistent. No doubt improvements will continue in the future. A redo log by its nature is subject to constant, heavy writes, so moving the redo logs to the SSD might quickly result in it becoming damaged and failing. For an organization that performs one very large load of multimedia on configuration, the solution might be to keep the redo logs on SSD initially and, once the load is finished, move them to a hard drive.

Increasing the size of the redo logs from 50 MB to 1 GB improves performance, and all databases containing unstructured data should have a redo log size of at least 1 GB. The number of logs should be at least 10; 50 to 100 is preferred. As is covered later, disk is cheaper today than it once was, and 100 GB of redo logs is not as large a volume of data as it once was. The redo logs should always be mirrored.

The placement or size of the UNDO tablespace made no difference to performance. The redo buffer size (LOG_BUFFER) showed a minor improvement when it was increased in size, but the results were inconclusive as the figures varied.
A figure of LOG_BUFFER=8691712 showed the best results, and database administrators might use this figure as a starting point for tuning. Changing the multiblock read count (DB_FILE_MULTIBLOCK_READ_COUNT) from the default value of 64 to 256 showed no improvement. As the default value (in this case 64) is set by the database as optimal for the platform, the conclusion that can be drawn is that the database has already set this figure to a good size.

Moving the original images to an SSD showed another huge improvement in performance. This highlighted how critical the I/O bottleneck of reading from disk and writing to disk (redo logs) is for digital object loading.

The final test involved moving the datafile containing the table to the SSD. It highlighted a realistic issue that DBAs face in dealing with I/O. Disk speed and seek time might not be critical in tuning if the bottleneck is the time it takes to transfer the data between the disk and the server. In the test case the datafile was moved to the same SSD as the redo logs, resulting in I/O competition. In the previous tests the datafile was on the hard disk, and the database could write to the disk (separate I/O channel) and to the redo logs (separate I/O channel) without one impacting the other. Even though the SSD is a magnitude faster in performance than the disk, it quickly became swamped with calls for reads and writes. The lesson is that it's better to have multiple smaller SSDs on different I/O channels into the server than one larger SSD on a single channel. Sites using a SAN will soon realize that even though the SAN might offer speed, unless it offers multiple I/O channels into the server, its channel to the server will quickly become the bottleneck, especially if the datafiles and the images for loading are all located on it. The original tuning notion of separating datafiles onto separate disks, practiced more than 15 years ago, still makes sense when it comes to image loading into a multimedia database.

It's important to stress that this is a tuning issue when dealing with image loading, not when running the database in general. Tuning the database in general is a completely different story and might result in a completely different architecture.
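To make these starting points concrete, the following is a minimal SQL sketch of how they could be applied; the file paths and group number are illustrative assumptions, and LOG_BUFFER is a static parameter, so the change only takes effect after a restart:

-- Use the LOG_BUFFER figure from the tests as a tuning starting point (requires a restart)
ALTER SYSTEM SET log_buffer = 8691712 SCOPE=SPFILE;

-- Add a mirrored 1 GB redo log group; the paths and group number are illustrative only
ALTER DATABASE ADD LOGFILE GROUP 4
  ('/u01/oradata/redo04a.log', '/u02/oradata/redo04b.log') SIZE 1G;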

OpenCV: Tracking Faces with Haar Cascades
Packt
13 May 2013
4 min read
Conceptualizing Haar cascades
When we talk about classifying objects and tracking their location, what exactly are we hoping to pinpoint? What constitutes a recognizable part of an object?

Photographic images, even from a webcam, may contain a lot of detail for our (human) viewing pleasure. However, image detail tends to be unstable with respect to variations in lighting, viewing angle, viewing distance, camera shake, and digital noise. Moreover, even real differences in physical detail might not interest us for the purpose of classification. I was taught in school that no two snowflakes look alike under a microscope. Fortunately, as a Canadian child, I had already learned how to recognize snowflakes without a microscope, as the similarities are more obvious in bulk.

Thus, some means of abstracting image detail is useful in producing stable classification and tracking results. The abstractions are called features, which are said to be extracted from the image data. There should be far fewer features than pixels, though any pixel might influence multiple features. The level of similarity between two images can be evaluated based on distances between the images' corresponding features. For example, distance might be defined in terms of spatial coordinates or color coordinates.

Haar-like features are one type of feature that is often applied to real-time face tracking. They were first used for this purpose by Paul Viola and Michael Jones in 2001. Each Haar-like feature describes the pattern of contrast among adjacent image regions. For example, edges, vertices, and thin lines each generate distinctive features. For any given image, the features may vary depending on the regions' size, which may be called the window size. Two images that differ only in scale should be capable of yielding similar features, albeit for different window sizes. Thus, it is useful to generate features for multiple window sizes. Such a collection of features is called a cascade. We may say a Haar cascade is scale-invariant or, in other words, robust to changes in scale. OpenCV provides a classifier and tracker for scale-invariant Haar cascades, which it expects to be in a certain file format.

Haar cascades, as implemented in OpenCV, are not robust to changes in rotation. For example, an upside-down face is not considered similar to an upright face, and a face viewed in profile is not considered similar to a face viewed from the front. A more complex and more resource-intensive implementation could improve Haar cascades' robustness to rotation by considering multiple transformations of images as well as multiple window sizes. However, we will confine ourselves to the implementation in OpenCV.

Getting Haar cascade data
As part of your OpenCV setup, you probably have a directory called haarcascades. It contains cascades that are trained for certain subjects using tools that come with OpenCV.
The directory's full path depends on your system and method of setting up OpenCV, as follows:

Build from source archive: <unzip_destination>/data/haarcascades
Windows with self-extracting ZIP: <unzip_destination>/data/haarcascades
Mac with MacPorts: /opt/local/share/OpenCV/haarcascades
Mac with Homebrew: the haarcascades file is not included; to get it, download the source archive
Ubuntu with apt or Software Center: the haarcascades file is not included; to get it, download the source archive

If you cannot find haarcascades, then download the source archive from http://sourceforge.net/projects/opencvlibrary/files/opencv-unix/2.4.3/OpenCV-2.4.3.tar.bz2/download (or the Windows self-extracting ZIP from http://sourceforge.net/projects/opencvlibrary/files/opencvwin/2.4.3/OpenCV-2.4.3.exe/download), unzip it, and look for <unzip_destination>/data/haarcascades.

Once you find haarcascades, create a directory called cascades in the same folder as cameo.py and copy the following files from haarcascades into cascades:

haarcascade_frontalface_alt.xml
haarcascade_eye.xml
haarcascade_mcs_nose.xml
haarcascade_mcs_mouth.xml

As their names suggest, these cascades are for tracking faces, eyes, noses, and mouths. They require a frontal, upright view of the subject. With a lot of patience and a powerful computer, you can make your own cascades, trained for various types of objects.

Creating modules
We should continue to maintain good separation between application-specific code and reusable code. Let's make new modules for tracking classes and their helpers.

A file called trackers.py should be created in the same directory as cameo.py (and, equivalently, in the parent directory of cascades). Let's put the following import statements at the start of trackers.py:

import cv2
import rects
import utils

Alongside trackers.py and cameo.py, let's make another file called rects.py containing the following import statement:

import cv2

Our face tracker and a definition of a face will go in trackers.py, while various helpers will go in rects.py and our preexisting utils.py file.
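Before building those tracker classes, it can help to see the cascade files in action on their own. The following minimal sketch (not part of the book's Cameo project; the image filename is an illustrative assumption) loads one of the copied cascades and runs a detection with OpenCV's cv2 API:

import cv2

# Load the frontal face cascade copied into the local cascades folder
face_cascade = cv2.CascadeClassifier('cascades/haarcascade_frontalface_alt.xml')

# Read an image and convert it to grayscale, which the classifier works on
image = cv2.imread('group.jpg')  # illustrative filename
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces; scaleFactor and minNeighbors trade off speed against accuracy
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=4)

# Draw a rectangle around each detected face and save the result
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite('group_faces.jpg', image)
print('Found %d face(s)' % len(faces))

The same pattern, wrapped in classes, is what the trackers.py module will provide.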

Move Further with NumPy Modules
Packt
13 May 2013
7 min read
(For more resources related to this topic, see here.)

Linear algebra
Linear algebra is an important branch of mathematics. The numpy.linalg package contains linear algebra functions. With this module, you can invert matrices, calculate eigenvalues, solve linear equations, and determine determinants, among other things.

Time for action – inverting matrices
The inverse of a matrix A in linear algebra is the matrix A^-1, which, when multiplied with the original matrix, is equal to the identity matrix I. This can be written as A * A^-1 = I. The inv function in the numpy.linalg package can do this for us. Let's invert an example matrix. To invert matrices, perform the following steps:

We will create the example matrix with the mat function:

A = np.mat("0 1 2;1 0 3;4 -3 8")
print "A\n", A

The A matrix is printed as follows:

A
[[ 0  1  2]
 [ 1  0  3]
 [ 4 -3  8]]

Now, we can see the inv function in action, using which we will invert the matrix:

inverse = np.linalg.inv(A)
print "inverse of A\n", inverse

The inverse matrix is shown as follows:

inverse of A
[[-4.5  7.  -1.5]
 [-2.   4.  -1. ]
 [ 1.5 -2.   0.5]]

If the matrix is singular or not square, a LinAlgError exception is raised. If you want, you can check the result manually. This is left as an exercise for the reader. Let's check what we get when we multiply the original matrix with the result of the inv function:

print "Check\n", A * inverse

The result is the identity matrix, as expected:

Check
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]

What just happened?
We calculated the inverse of a matrix with the inv function of the numpy.linalg package. We checked, with matrix multiplication, whether this is indeed the inverse matrix:

import numpy as np

A = np.mat("0 1 2;1 0 3;4 -3 8")
print "A\n", A
inverse = np.linalg.inv(A)
print "inverse of A\n", inverse
print "Check\n", A * inverse

Solving linear systems
A matrix transforms a vector into another vector in a linear way. This transformation mathematically corresponds to a system of linear equations. The numpy.linalg function solve solves systems of linear equations of the form Ax = b; here A is a matrix, b can be a 1D or 2D array, and x is an unknown variable. We will see the dot function in action. This function returns the dot product of two floating-point arrays.

Time for action – solving a linear system
Let's solve an example of a linear system. To solve a linear system, perform the following steps:

Let's create the matrices A and b:

A = np.mat("1 -2 1;0 2 -8;-4 5 9")
print "A\n", A
b = np.array([0, 8, -9])
print "b\n", b

The matrices A and b are shown as follows:

Solve this linear system by calling the solve function:

x = np.linalg.solve(A, b)
print "Solution", x

The following is the solution of the linear system:

Solution [ 29.  16.   3.]

Check whether the solution is correct with the dot function:

print "Check\n", np.dot(A , x)

The result is as expected:

Check
[[ 0.  8. -9.]]

What just happened?
We solved a linear system using the solve function from the NumPy linalg module and checked the solution with the dot function:

import numpy as np

A = np.mat("1 -2 1;0 2 -8;-4 5 9")
print "A\n", A
b = np.array([0, 8, -9])
print "b\n", b
x = np.linalg.solve(A, b)
print "Solution", x
print "Check\n", np.dot(A , x)

Finding eigenvalues and eigenvectors
Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix and x is a one-dimensional vector. Eigenvectors are vectors corresponding to eigenvalues. The eigvals function in the numpy.linalg package calculates eigenvalues.
The eig function returns a tuple containing eigenvalues and eigenvectors.

Time for action – determining eigenvalues and eigenvectors
Let's calculate the eigenvalues of a matrix. Perform the following steps to do so:

Create a matrix as follows:

A = np.mat("3 -2;1 0")
print "A\n", A

The matrix we created looks like the following:

A
[[ 3 -2]
 [ 1  0]]

Calculate eigenvalues by calling the eigvals function:

print "Eigenvalues", np.linalg.eigvals(A)

The eigenvalues of the matrix are as follows:

Eigenvalues [ 2.  1.]

Determine eigenvalues and eigenvectors with the eig function. This function returns a tuple, where the first element contains eigenvalues and the second element contains the corresponding eigenvectors, arranged column-wise:

eigenvalues, eigenvectors = np.linalg.eig(A)
print "First tuple of eig", eigenvalues
print "Second tuple of eig\n", eigenvectors

The eigenvalues and eigenvectors will be shown as follows:

First tuple of eig [ 2.  1.]
Second tuple of eig
[[ 0.89442719  0.70710678]
 [ 0.4472136   0.70710678]]

Check the result with the dot function by calculating the right- and left-hand sides of the eigenvalues equation Ax = ax:

for i in range(len(eigenvalues)):
    print "Left", np.dot(A, eigenvectors[:,i])
    print "Right", eigenvalues[i] * eigenvectors[:,i]
    print

The output is as follows:

Left [[ 1.78885438]
 [ 0.89442719]]
Right [[ 1.78885438]
 [ 0.89442719]]
Left [[ 0.70710678]
 [ 0.70710678]]
Right [[ 0.70710678]
 [ 0.70710678]]

What just happened?
We found the eigenvalues and eigenvectors of a matrix with the eigvals and eig functions of the numpy.linalg module. We checked the result using the dot function:

import numpy as np

A = np.mat("3 -2;1 0")
print "A\n", A
print "Eigenvalues", np.linalg.eigvals(A)
eigenvalues, eigenvectors = np.linalg.eig(A)
print "First tuple of eig", eigenvalues
print "Second tuple of eig\n", eigenvectors
for i in range(len(eigenvalues)):
    print "Left", np.dot(A, eigenvectors[:,i])
    print "Right", eigenvalues[i] * eigenvectors[:,i]
    print

Singular value decomposition
Singular value decomposition is a type of factorization that decomposes a matrix into a product of three matrices. The singular value decomposition is a generalization of the previously discussed eigenvalue decomposition. The svd function in the numpy.linalg package can perform this decomposition. This function returns three matrices – U, Sigma, and V – such that U and V are orthogonal and Sigma contains the singular values of the input matrix; the decomposition of a matrix M can be written as M = U * Sigma * V*, where the asterisk denotes the Hermitian conjugate or the conjugate transpose.

Time for action – decomposing a matrix
It's time to decompose a matrix with the singular value decomposition. In order to decompose a matrix, perform the following steps:

First, create a matrix as follows:

A = np.mat("4 11 14;8 7 -2")
print "A\n", A

The matrix we created looks like the following:

A
[[ 4 11 14]
 [ 8  7 -2]]

Decompose the matrix with the svd function:

U, Sigma, V = np.linalg.svd(A, full_matrices=False)
print "U"
print U
print "Sigma"
print Sigma
print "V"
print V

The result is a tuple containing the two orthogonal matrices U and V on the left- and right-hand sides and the singular values of the middle matrix:

U
[[-0.9486833  -0.31622777]
 [-0.31622777  0.9486833 ]]
Sigma
[ 18.97366596   9.48683298]
V
[[-0.33333333 -0.66666667 -0.66666667]
 [ 0.66666667  0.33333333 -0.66666667]]

We do not actually have the middle matrix—we only have the diagonal values. The other values are all 0. We can form the middle matrix with the diag function. Multiply the three matrices.
This is shown as follows:

print "Product\n", U * np.diag(Sigma) * V

The product of the three matrices looks like the following:

Product
[[  4.  11.  14.]
 [  8.   7.  -2.]]

What just happened?
We decomposed a matrix and checked the result by matrix multiplication. We used the svd function from the NumPy linalg module:

import numpy as np

A = np.mat("4 11 14;8 7 -2")
print "A\n", A
U, Sigma, V = np.linalg.svd(A, full_matrices=False)
print "U"
print U
print "Sigma"
print Sigma
print "V"
print V
print "Product\n", U * np.diag(Sigma) * V

Pseudoinverse
The Moore-Penrose pseudoinverse of a matrix can be computed with the pinv function of the numpy.linalg module (visit http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse). The pseudoinverse is calculated using the singular value decomposition. The inv function only accepts square matrices; the pinv function does not have this restriction.
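The article does not demonstrate pinv itself, so here is a minimal sketch in the same Python 2 style, reusing the non-square matrix from the SVD example; for a matrix with full row rank, multiplying it by its pseudoinverse gives (approximately) the identity matrix:

import numpy as np

A = np.mat("4 11 14;8 7 -2")

# Compute the Moore-Penrose pseudoinverse; A does not need to be square
pseudoinv = np.linalg.pinv(A)
print "Pseudoinverse\n", pseudoinv

# Check: for this full row rank matrix, A * pinv(A) is close to the 2 x 2 identity
print "Check\n", A * pseudoinv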

Ten IPython essentials
Packt
02 May 2013
10 min read
(For more resources related to this topic, see here.)

Running the IPython console
If IPython has been installed correctly, you should be able to run it from a system shell with the ipython command. You can use this prompt like a regular Python interpreter, as shown in the following screenshot:

Command-line shell on Windows
If you are on Windows and using the old cmd.exe shell, you should be aware that this tool is extremely limited. You could instead use a more powerful interpreter, such as Microsoft PowerShell, which is integrated by default in Windows 7 and 8. The simple fact that most common filesystem-related commands (namely, pwd, cd, ls, cp, ps, and so on) have the same name as in Unix should be a sufficient reason to switch.

Of course, IPython offers much more than that. For example, IPython ships with tens of little commands that considerably improve productivity. Some of these commands help you get information about any Python function or object. For instance, have you ever had a doubt about how to use the super function to access parent methods in a derived class? Just type super? (a shortcut for the command %pinfo super) and you will find all the information regarding the super function. Appending ? or ?? to any command or variable gives you all the information you need about it, as shown here:

In [1]: super?
Typical use to call a cooperative superclass method:
class C(B):
    def meth(self, arg):
        super(C, self).meth(arg)

Using IPython as a system shell
You can use the IPython command-line interface as an extended system shell. You can navigate throughout your filesystem and execute any system command. For instance, the standard Unix commands pwd, ls, and cd are available in IPython and work on Windows too, as shown in the following example:

In [1]: pwd
Out[1]: u'C:\\'
In [2]: cd windows
C:\windows

These commands are particular magic commands that are central in the IPython shell. There are dozens of magic commands and we will use a lot of them throughout this book. You can get a list of all magic commands with the %lsmagic command.

Using the IPython magic commands
Magic commands actually come with a % prefix, but the automagic system, enabled by default, allows you to conveniently omit this prefix. Using the prefix is always possible, particularly when the unprefixed command is shadowed by a Python variable with the same name. The %automagic command toggles the automagic system. In this book, we will generally use the % prefix to refer to magic commands, but keep in mind that you can omit it most of the time, if you prefer.

Using the history
Like the standard Python console, IPython offers a command history. However, unlike in Python's console, the IPython history spans your previous interactive sessions. In addition to this, several key strokes and commands allow you to reduce repetitive typing.

In an IPython console prompt, use the up and down arrow keys to go through your whole input history. If you start typing before pressing the arrow keys, only the commands that match what you have typed so far will be shown.

In any interactive session, your input and output history is kept in the In and Out variables and is indexed by a prompt number. The _, __, ___ and _i, _ii, _iii variables contain the last three output and input objects, respectively. The _n and _in variables return the nth output and input history. For instance, let's type the following command:

In [4]: a = 12
In [5]: a ** 2
Out[5]: 144
In [6]: print("The result is {0:d}.".format(_))
The result is 144.
In this example, we display the output of prompt 5, that is, 144, on line 6.

Tab completion
Tab completion is incredibly useful and you will find yourself using it all the time. Whenever you start typing any command, variable name, or function, press the Tab key to let IPython either automatically complete what you are typing if there is no ambiguity, or show you the list of possible commands or names that match what you have typed so far. It also works for directories and file paths, just like in the system shell.

It is also particularly useful for dynamic object introspection. Type any Python object name followed by a point and then press the Tab key; IPython will show you the list of existing attributes and methods, as shown in the following example:

In [1]: import os
In [2]: os.path.split<tab>
os.path.split       os.path.splitdrive       os.path.splitext       os.path.splitunc

In the second line, as shown in the previous code, we press the Tab key after having typed os.path.split. IPython then displays all the possible commands.

Tab Completion and Private Variables
Tab completion shows you all the attributes and methods of an object, except those that begin with an underscore (_). The reason is that it is a standard convention in Python programming to prefix private variables with an underscore. To force IPython to show all private attributes and methods, type myobject._ before pressing the Tab key. Nothing is really private or hidden in Python. It is part of a general Python philosophy, as expressed by the famous saying, "We are all consenting adults here."

Executing a script with the %run command
Although essential, the interactive console becomes limited when running sequences of multiple commands. Writing multiple commands in a Python script with the .py file extension (by convention) is quite common. A Python script can be executed from within the IPython console with the %run magic command followed by the script filename. The script is executed in a fresh, new Python namespace unless the -i option has been used, in which case the current interactive Python namespace is used for the execution. In all cases, all variables defined in the script become available in the console at the end of the script execution.

Let's write the following Python script in a file called script.py:

print("Running script.")
x = 12
print("'x' is now equal to {0:d}.".format(x))

Now, assuming we are in the directory where this file is located, we can execute it in IPython by entering the following command:

In [1]: %run script.py
Running script.
'x' is now equal to 12.
In [2]: x
Out[2]: 12

When running the script, the standard output of the console displays any print statement. At the end of execution, the x variable defined in the script is then included in the interactive namespace, which is quite convenient.

Quick benchmarking with the %timeit command
You can do quick benchmarks in an interactive session with the %timeit magic command. It lets you estimate how much time the execution of a single command takes. The same command is executed multiple times within a loop, and this loop itself is repeated several times by default. The individual execution time of the command is then automatically estimated with an average. The -n option controls the number of executions in a loop, whereas the -r option controls the number of executed loops.
For example, let's type the following command:

In [1]: %timeit [x*x for x in range(100000)]
10 loops, best of 3: 26.1 ms per loop

Here, it took about 26 milliseconds to compute the squares of all integers up to 100000.

Quick debugging with the %debug command
IPython ships with a powerful command-line debugger. Whenever an exception is raised in the console, use the %debug magic command to launch the debugger at the exception point. You then have access to all the local variables and to the full stack traceback in postmortem mode. Navigate up and down through the stack with the u and d commands and exit the debugger with the q command. See the list of all the available commands in the debugger by entering the ? command. You can use the %pdb magic command to activate the automatic execution of the IPython debugger as soon as an exception is raised.

Interactive computing with Pylab
The %pylab magic command enables the scientific computing capabilities of the NumPy and matplotlib packages, namely efficient operations on vectors and matrices and plotting and interactive visualization features. It becomes possible to perform interactive computations in the console and plot graphs dynamically. For example, let's enter the following command:

In [1]: %pylab
Welcome to pylab, a matplotlib-based Python environment [backend: TkAgg].
For more information, type 'help(pylab)'.
In [2]: x = linspace(-10., 10., 1000)
In [3]: plot(x, sin(x))

In this example, we first define a vector of 1000 values linearly spaced between -10 and 10. Then we plot the graph (x, sin(x)). A window with a plot appears, and the console is not blocked while this window is open. This allows us to interactively modify the plot while it is open.

Using the IPython Notebook
The Notebook brings the functionality of IPython into the browser for multiline text-editing features, interactive session reproducibility, and so on. It is a modern and powerful way of using Python in an interactive and reproducible way.

To use the Notebook, call the ipython notebook command in a shell (make sure you have installed the required dependencies). This will launch a local web server on the default port 8888. Go to http://127.0.0.1:8888/ in a browser and create a new Notebook. You can write one or several lines of code in the input cells. Here are some of the most useful keyboard shortcuts:

Press the Enter key to create a new line in the cell and not execute the cell
Press Shift + Enter to execute the cell and go to the next cell
Press Alt + Enter to execute the cell and append a new empty cell right after it
Press Ctrl + Enter for quick instant experiments when you do not want to save the output
Press Ctrl + M and then the H key to display the list of all the keyboard shortcuts

Customizing IPython
You can save your user preferences in a Python file; this file is called an IPython profile. To create a default profile, type ipython profile create in a shell. This will create a folder named profile_default in the ~/.ipython or ~/.config/ipython directory. The file ipython_config.py in this folder contains preferences about IPython. You can create different profiles with different names using ipython profile create profilename, and then launch IPython with ipython --profile=profilename to use that profile. The ~ directory is your home directory, for example, something like /home/yourname on Unix, or C:\Users\yourname or C:\Documents and Settings\yourname on Windows.
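As an illustration, a minimal ipython_config.py sketch is shown below. The two options used here are commonly available ones, but exact option names can vary between IPython versions, so treat them as assumptions to check against the config file generated by ipython profile create:

# Sketch of ~/.ipython/profile_default/ipython_config.py (option names may vary by version)
c = get_config()

# Lines executed automatically at the start of every interactive session
c.InteractiveShellApp.exec_lines = [
    'import numpy as np',
]

# Editor launched by the %edit magic command
c.TerminalInteractiveShell.editor = 'vim'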
Summary
We have gone through 10 of the most interesting features offered by IPython in this article. They essentially concern the Python and shell interactive features, including the integrated debugger and profiler, and the interactive computing and visualization features brought by the NumPy and Matplotlib packages.

Big Data Analysis
Packt
19 Apr 2013
15 min read
(For more resources related to this topic, see here.)

Counting distinct IPs in weblog data using MapReduce and Combiners
This recipe will walk you through creating a MapReduce program to count distinct IPs in weblog data. We will demonstrate the application of a combiner to optimize data transfer overhead between the map and reduce stages. The code is implemented in a generic fashion and can be used to count distinct values in any tab-delimited dataset.

Getting ready
This recipe assumes that you have a basic familiarity with the Hadoop 0.20 MapReduce API. You will need access to the weblog_entries dataset supplied with this book and stored in an HDFS folder at the path /input/weblog. You will need access to a pseudo-distributed or fully-distributed cluster capable of running MapReduce jobs using the newer MapReduce API introduced in Hadoop 0.20. You will also need to package this code inside a JAR file to be executed by the Hadoop JAR launcher from the shell. Only the core Hadoop libraries are required to compile and run this example.

How to do it...
Perform the following steps to count distinct IPs using MapReduce:

Open a text editor/IDE of your choice, preferably one with Java syntax highlighting. Create a class named DistinctCounterJob.java in your JAR file at whatever source package is appropriate. The following code will serve as the Tool implementation for job submission:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.regex.Pattern;

public class DistinctCounterJob implements Tool {

    private Configuration conf;
    public static final String NAME = "distinct_counter";
    public static final String COL_POS = "col_pos";

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new DistinctCounterJob(), args);
    }

The run() method is where we set the input/output formats, mapper class configuration, combiner class, and key/value class configuration:

    public int run(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: distinct_counter <input> <output> <element_position>");
            System.exit(1);
        }
        conf.setInt(COL_POS, Integer.parseInt(args[2]));
        Job job = new Job(conf, "Count distinct elements at position");
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(DistinctMapper.class);
        job.setReducerClass(DistinctReducer.class);
        job.setCombinerClass(DistinctReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setJarByClass(DistinctCounterJob.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 1 : 0;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return conf;
    }
}

The map() function is implemented in the following code by extending mapreduce.Mapper:

public static class DistinctMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static int col_pos;
    private static final Pattern pattern = Pattern.compile("\t");
    private Text outKey = new Text();
    private static final IntWritable outValue = new IntWritable(1);

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        col_pos = context.getConfiguration().getInt(DistinctCounterJob.COL_POS, 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String field = pattern.split(value.toString())[col_pos];
        outKey.set(field);
        context.write(outKey, outValue);
    }
}

The reduce() function is implemented in the following code by extending mapreduce.Reducer:

public static class DistinctReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        count.set(total);
        context.write(key, count);
    }
}

The following command shows the sample usage against weblog data with column position number 4, which is the IP column:

hadoop jar myJobs.jar distinct_counter /input/weblog/ /output/weblog_distinct_counter 4

How it works...
First we set up DistinctCounterJob to implement a Tool interface for remote submission. The static constant NAME is of potential use in the Hadoop Driver class, which supports the launching of different jobs from the same JAR file. The static constant COL_POS is initialized to the third required argument from the command line, <element_position>. This value is set within the job configuration and should match the position of the column you wish to count for each distinct entry. Supplying 4 will match the IP column for the weblog data.

Since we are reading and writing text, we can use the supplied TextInputFormat and TextOutputFormat classes. We will set the Mapper and Reducer classes to match our DistinctMapper and DistinctReducer implemented classes respectively. We also supply DistinctReducer as a combiner class. This decision is explained in more detail as follows.

It's also very important to call setJarByClass() so that the TaskTrackers can properly unpack and find the Mapper and Reducer classes. The job uses the static helper methods on FileInputFormat and FileOutputFormat to set the input and output directories respectively. Now we're set up and ready to submit the job.

The Mapper class sets up a few member variables as follows:

col_pos: This is initialized to a value supplied in the configuration. It allows users to change which column to parse and apply the count distinct operation on.
pattern: This defines the column's split point for each row based on tabs.
outKey: This is a class member that holds output values. This avoids having to create a new instance for each output that is written.
outValue: This is an integer representing one occurrence of the given key. It is similar to the WordCount example.

The map() function splits each incoming line's value and extracts the string located at col_pos. We reset the internal value for outKey to the string found on that line's position.
For our example, this will be the IP value for the row. We emit the value of the newly reset outKey variable along with the value of outValue to mark one occurrence of that given IP address. Without the assistance of the combiner, this would present the reducer with an iterable collection of 1s to be counted.

The following is an example of a reducer {key, value:[]} without a combiner:

{10.10.1.1, [1,1,1,1,1,1]} = six occurrences of the IP "10.10.1.1"

The implementation of the reduce() method will sum the integers and arrive at the correct total, but there's nothing that requires the integer values to be limited to the number 1. We can use a combiner to process the intermediate key-value pairs as they are output from each mapper and help improve the data throughput in the shuffle phase. Since the combiner is applied against the local map output, we may see a performance improvement as the amount of data we need to transfer for an intermediate key/value can be reduced considerably. Instead of seeing {10.10.1.1, [1,1,1,1,1,1]}, the combiner can add the 1s and replace the intermediate value for that key with {10.10.1.1, [6]}. The reducer can then sum the various combined values for the intermediate key and arrive at the same correct total. This is possible because addition is both a commutative and associative operation. In other words:

Commutative: The order in which we process the addition operation against the values has no effect on the final result. For example, 1 + 2 + 3 = 3 + 1 + 2.
Associative: The order in which we apply the addition operation has no effect on the final result. For example, (1 + 2) + 3 = 1 + (2 + 3).

For counting the occurrences of distinct IPs, we can use the same code in our reducer as a combiner for output in the map phase. When applied to our problem, the normal output with no combiner from two separate, independently running map tasks might look like the following, where {key, value:[]} is the intermediate key-value collection:

Map Task A = {10.10.1.1, [1,1,1]} = three occurrences
Map Task B = {10.10.1.1, [1,1,1,1,1,1]} = six occurrences

Without the aid of a combiner, this will be merged in the shuffle phase and presented to a single reducer as the following key-value collection:

{10.10.1.1, [1,1,1,1,1,1,1,1,1]} = nine total occurrences

Now let's revisit what would happen when using a combiner against the exact same sample output:

Map Task A = {10.10.1.1, [1,1,1]} = three occurrences
Combiner = {10.10.1.1, [3]} = still three occurrences, but reduced for this mapper
Map Task B = {10.10.1.1, [1,1,1,1,1,1]} = six occurrences
Combiner = {10.10.1.1, [6]} = still six occurrences

Now the reducer will see the following for that key-value collection:

{10.10.1.1, [3,6]} = nine total occurrences

We arrived at the same total count for that IP address, but we used a combiner to limit the amount of network I/O during the MapReduce shuffle phase by pre-reducing the intermediate key-value output from each mapper.

There's more...
The combiner can be confusing to newcomers. Here are some useful tips:

The Combiner does not always have to be the same class as your Reducer
The previous recipe and the default WordCount example show the Combiner class being initialized to the same implementation as the Reducer class. This is not enforced by the API, but ends up being common for many types of distributed aggregate operations such as sum(), min(), and max().
One basic example might be the min() operation of the Reducer class that specifically formats output in a certain way for readability. This will take a slightly different form from that of the min() operator of the Combiner class, which does not care about the specific output formatting.

Combiners are not guaranteed to run
Whether or not the framework invokes your combiner during execution depends on the intermediate spill file size from each map output, and it is not guaranteed to run for every intermediate key. Your job should not depend on the combiner for correct results; it should be used only for optimization. You can control the spill file threshold at which MapReduce tries to combine intermediate values with the configuration property min.num.spills.for.combine.

Using Hive date UDFs to transform and sort event dates from geographic event data
This recipe will illustrate the efficient use of the Hive date UDFs to list the 20 most recent events and the number of days between the event date and the current system date.

Getting ready
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account. This recipe depends on having the Nigera_ACLED_cleaned.tsv dataset loaded into a Hive table named acled_nigeria_cleaned with the fields mapped to the respective datatypes. Issue the following command to the Hive client to see the mentioned fields:

describe acled_nigeria_cleaned

You should see the following response:

OK
loc         string
event_date  string
event_type  string
actor       string
latitude    double
longitude   double
source      string
fatalities  int

How to do it...
Perform the following steps to utilize Hive UDFs for sorting and transformation:

Open a text editor of your choice, ideally one with SQL syntax highlighting. Add the inline creation and transform syntax:

SELECT event_type, event_date, days_since FROM (
  SELECT event_type, event_date,
         datediff(to_date(from_unixtime(unix_timestamp())),
                  to_date(from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd')))) AS days_since
  FROM acled_nigeria_cleaned) date_differences
ORDER BY event_date DESC LIMIT 20;

Save the file as top_20_recent_events.sql in the active folder. Run the script from the operating system shell by supplying the -f option to the Hive client. You should see the following five rows appear first in the output console:

OK
Battle-No change of territory    2011-12-31    190
Violence against civilians       2011-12-27    194
Violence against civilians       2011-12-25    196
Violence against civilians       2011-12-25    196
Violence against civilians       2011-12-25    196

How it works...
Let's start with the nested SELECT subqueries. We select three fields from our Hive table acled_nigeria_cleaned: event_type, event_date, and the result of calling the UDF datediff(), which takes as arguments an end date and a start date. Both are expected in the form yyyy-MM-dd. The first argument to datediff() is the end date, with which we want to represent the current system date. Calling unix_timestamp() with no arguments will return the current system time as a Unix timestamp in seconds. We send that return value to from_unixtime() to get a formatted timestamp representing the current system date in the default Java 1.6 format (yyyy-MM-dd HH:mm:ss). We only care about the date portion, so calling to_date() with the output of this function strips the HH:mm:ss. The result is the current date in the yyyy-MM-dd form.
The second argument to datediff() is the start date, which for our query is event_date. The series of function calls operates in almost exactly the same manner as for the previous argument, except that when we call unix_timestamp(), we must tell the function that our argument is in the SimpleDateFormat format yyyy-MM-dd. Now we have both start_date and end_date arguments in the yyyy-MM-dd format and can perform the datediff() operation for the given row. We alias the output column of datediff() as days_since for each row.

The outer SELECT statement takes these three columns per row and sorts the entire output by event_date in descending order to get reverse chronological ordering. We arbitrarily limit the output to only the first 20. The net result is the 20 most recent events with the number of days that have passed since each event occurred.

There's more...
The date UDFs can help tremendously in performing string date comparisons. Here are some additional pointers:

Date format strings follow Java SimpleDateFormat guidelines
Check out the Javadocs for SimpleDateFormat to learn how your custom date strings can be used with the date transform UDFs.

Default date and time formats
Many of the UDFs operate under a default format assumption. For UDFs requiring only a date, your column values must be in the form yyyy-MM-dd. For UDFs that require date and time, your column values must be in the form yyyy-MM-dd HH:mm:ss.

Using Hive to build a per-month report of fatalities over geographic event data
This recipe will show a very simple analytic that uses Hive to count fatalities for every month appearing in the dataset and print the results to the console.

Getting ready
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account. This recipe depends on having the Nigera_ACLED_cleaned.tsv dataset loaded into a Hive table named acled_nigeria_cleaned with the following fields mapped to the respective datatypes. Issue the following command to the Hive client:

describe acled_nigeria_cleaned

You should see the following response:

OK
loc         string
event_date  string
event_type  string
actor       string
latitude    double
longitude   double
source      string
fatalities  int

How to do it...
Follow these steps to use Hive for report generation:

Open a text editor of your choice, ideally one with SQL syntax highlighting. Add the inline creation and transformation syntax:

SELECT from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'), 'yyyy-MMM'),
       COALESCE(CAST(sum(fatalities) AS STRING), 'Unknown')
FROM acled_nigeria_cleaned
GROUP BY from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'), 'yyyy-MMM');

Save the file as monthly_violence_totals.sql in the active folder. Run the script from the operating system shell by supplying the -f option to the Hive client. You should see the following three rows appear first in the output console. Note that the output is sorted lexicographically, and not in date order:

OK
1997-Apr    115
1997-Aug    4
1997-Dec    26

How it works...
The SELECT statement uses unix_timestamp() and from_unixtime() to reformat the event_date for each row as just a year-month concatenated field. This is also in the GROUP BY expression for totaling fatalities using sum(). The coalesce() method returns the first non-null argument passed to it. We pass, as the first argument, the value of fatalities summed for that given year-month, cast as a string.
If that value is NULL for any reason, the constant Unknown is returned; otherwise the string representing the total fatalities counted for that year-month combination is returned. Everything is printed to the console over stdout.

There's more...
The following are some additional helpful tips related to the code in this recipe:

The coalesce() method can take variable-length arguments
As mentioned in the Hive documentation, coalesce() supports one or more arguments. The first non-null argument will be returned. This can be useful for evaluating several different expressions for a given column before deciding the right one to choose. coalesce() will return NULL if no argument is non-null. It's not uncommon to provide a literal of the expected type to return if all other arguments are NULL.

Date reformatting code template
Having to reformat dates stored in your raw data is very common. Proper use of from_unixtime() and unix_timestamp() can make your life much easier. Remember this general code template for concise date format transformation in Hive:

from_unixtime(unix_timestamp(<col>, <in-format>), <out-format>);
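As a quick illustration of this template, the following sketch reuses the acled_nigeria_cleaned table from the recipes above; the output format chosen here (dd-MMM-yyyy) is an arbitrary example:

SELECT event_date,
       from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'), 'dd-MMM-yyyy') AS display_date
FROM acled_nigeria_cleaned
LIMIT 5;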

Comparative Study of NoSQL Products
Packt
09 Apr 2013
7 min read
(For more resources related to this topic, see here.)

Comparison
Choosing a technology does not merely involve a technical comparison. Several other factors related to documentation, maintainability, stability and maturity, vendor support, developer community, license, price, and the future of the product or the organization behind it also play important roles. Having said that, I must also add that technical comparison should continue to play a pivotal role. We will start with a deep technical comparison of the previously mentioned products and then look at the semi-technical and non-technical aspects.

Technical comparison
From a technical perspective, we compare the products on the following parameters:

Implementation language
Engine types
Speed

Implementation language
One of the more important factors that comes into play is how the product can, if required, be extended; the programming language in which the product itself is written determines a large part of it. Some databases may provide a different language for writing plugins, but this is not always the case:

Amazon SimpleDB: It is available in the cloud and has a client SDK for Java, .NET, PHP, and Ruby. There are libraries for Android and iOS as well.
BaseX: Written in Java. To extend, one must code in Java.
Cassandra: Everything in Java.
CouchDB: Written in Erlang. To extend, use Erlang.
Google Datastore: It is available in the cloud and has SDKs for Java, Python, and Go.
HBase: It is Java all the way.
MemcacheDB: Written in C. Uses the same language to extend.
MongoDB: Written in C++. Client drivers are available in several languages including, but not limited to, JavaScript, Java, PHP, Python, and Ruby.
Neo4j: Like several others, it is Java all the way.
Redis: Written in C. So you can extend using C.

Great, so the first parameter itself may have helped you shortlist the products that you may be interested in using, based on the developers available in your team or for hire. You may still be tempted to get smart people onboard and then build competency based on the choice that you make, based on subsequent dimensions. Note that for the databases written in high-level languages like Java, it may still be possible to write extensions in languages like C or C++ by using interfaces like JNI or otherwise.

Amazon SimpleDB provides access via the HTTP protocol and has SDKs in multiple languages. If you do not find an SDK for yourself, say for example, in JavaScript for use with NodeJS, just write one. However, life is not as open with Google Datastore, which allows access only via its cloud platform App Engine and has SDKs only in Java, Python, and the Go languages. Since the access is provided natively from the cloud servers, you cannot do much about it. In fact, the top requested feature of the Google App Engine is support for PHP (see http://code.google.com/p/googleappengine/issues/list).

Engine types
Engine types define how you will structure the data and what data design expertise your team will need. NoSQL provides multiple options to choose from.

Database         | Column oriented | Document store | Key value store | Graph
Amazon SimpleDB  | No  | No  | Yes | No
BaseX            | No  | Yes | No  | No
Cassandra        | Yes | Yes | No  | No
CouchDB          | No  | Yes | No  | No
Google Datastore | Yes | No  | No  | No
HBase            | Yes | No  | No  | No
MemcacheDB       | No  | No  | Yes | No
MongoDB          | No  | Yes | No  | No
Neo4j            | No  | No  | No  | Yes
Redis            | No  | Yes | Yes | No

You may notice two aspects of this table – a lot of No and multiple Yes against some databases. I expect the table to be populated with a lot more Yes over the next couple of years.
Specifically, I expect the open source databases written in Java to be developed and enhanced actively, providing multiple options to developers.

Speed
One of the primary reasons for choosing a NoSQL solution is speed. Comparing and benchmarking the databases is a non-trivial task considering that each database has its own set of hardware and other configuration requirements. Having said that, you can definitely find a whole gamut of benchmark results comparing one NoSQL database against the other, with details of how the tests were executed. Of all that is available, my personal choice is the Yahoo! Cloud Serving Benchmark (YCSB) tool. It is open source and available on Github at https://github.com/brianfrankcooper/YCSB. It is written in Java and clients are available for Cassandra, DynamoDB, HBase, HyperTable, MongoDB, and Redis, apart from several others that we have not discussed in this book.

Before showing some results from the YCSB, I did a quick run on a couple of easy-to-set-up databases myself. I executed them without any optimizations to just get a feel of how easy it is for software to incorporate them without needing any expert help. I ran it on MongoDB on my personal box (server as well as the client on the same machine), DynamoDB connecting from a High-CPU Medium (c1.medium) box, and MySQL on the same High-CPU Medium box with both server and client on the same machine. Detailed configurations with the results are shown as follows.

Server configuration:

Parameter             | MongoDB | DynamoDB | MySQL
Processor             | 5 EC2 Compute Units | N/A | 5 EC2 Compute Units
RAM                   | 1.7 GB with Apache HTTP server running (effective free: 200 MB after the database is up) | N/A | 1.7 GB with Apache HTTP server running (effective free: 500 MB after the database is up)
Hard disk             | Non-SSD | N/A | Non-SSD
Network configuration | N/A | US-East-1 | N/A
Operating system      | Ubuntu 10.04, 64-bit | N/A | Ubuntu 10.04, 64-bit
Database version      | 1.2.2 | N/A | 5.1.41
Configuration         | Default | Max write: 500, Max read: 500 | Default

Client configuration:

Parameter                    | MongoDB | DynamoDB | MySQL
Processor                    | 5 EC2 Compute Units | 5 EC2 Compute Units | 5 EC2 Compute Units
RAM                          | 1.7 GB with Apache HTTP server running (effective free: 200 MB) | 1.7 GB with Apache HTTP server running (effective free: 500 MB) | 1.7 GB with Apache HTTP server running (effective free: 500 MB)
Hard disk                    | Non-SSD | Non-SSD | Non-SSD
Network configuration        | Same machine as server | US-East-1 | Same machine as server
Operating system             | Ubuntu 10.04, 64-bit | Ubuntu 10.04, 64-bit | Ubuntu 10.04, 64-bit
Record count                 | 1,000,000 | 1,000 | 1,000,000
Max connections              | 1 | 5 | 1
Operation count (workload a) | 1,000,000 | 1,000 | 1,000,000
Operation count (workload f) | 1,000,000 | 100,000 | 1,000,000

Results:

Workload          | Parameter       | MongoDB | DynamoDB | MySQL
Workload-a (load) | Total time      | 290 seconds | 16 seconds | 300 seconds
Workload-a (load) | Speed (ops/sec) | 2363 to 4180 (approximately 3700), bump at 1278 | 50 to 82 | 3135 to 3517 (approximately 3300)
Workload-a (load) | Insert latency  | 245 to 416 microseconds (approximately 260), bump at 875 microseconds | 12 to 19 milliseconds | 275 to 300 microseconds (approximately 290)
Workload-a (run)  | Total time      | 428 seconds | 17 seconds | 240 seconds
Workload-a (run)  | Speed (ops/sec) | 324 to 4653 | 42 to 78 | 3970 to 4212
Workload-a (run)  | Update latency  | 272 to 2946 microseconds | 13 to 23.7 microseconds | 219 to 225.5 microseconds
Workload-a (run)  | Read latency    | 112 to 5358 microseconds | 12.4 to 22.48 microseconds | 240.6 to 248.9 microseconds
Workload-f (load) | Total time      | 286 seconds | Did not execute | 295 seconds
Workload-f (load) | Speed (ops/sec) | 3708 to 4200 | - | 3254 to 3529
Workload-f (load) | Insert latency  | 228 to 265 microseconds | - | 275 to 299 microseconds
Workload-f (run)  | Total time      | 412 seconds | Did not execute | 1022 seconds
Workload-f (run)  | Speed (ops/sec) | 192 to 4146 | - | 224 to 2096
Workload-f (run)  | Update latency  | 219 to 336 microseconds | - | 216 to 233 microseconds, with two bursts at 600 and 2303 microseconds
Workload-f (run)  | Read latency    | 119 to 5701 microseconds | - | 1360 to 8246 microseconds
Workload-f (run)  | Read Modify Write (RMW) latency | 346 to 9170 microseconds | - | 1417 to 14648 microseconds

Do not read too much into these numbers as they are the result of a default, out-of-the-box setup without any optimizations. Some of the results from YCSB published by Brian F. Cooper (http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf) are shown next.

For update-heavy, 50-50 read-update:

For read-heavy, under varying hardware:

There are some more from Sergey Sverchkov at Altoros (http://altoros.com/nosql-research), who published their white paper recently.

Summary
In this article, we did a detailed comparative study of ten NoSQL databases on a few parameters, both technical and non-technical.


Advanced Hadoop MapReduce Administration

Packt
08 Apr 2013
6 min read
(For more resources related to this topic, see here.)

Tuning Hadoop configurations for cluster deployments

Getting ready

Shut down the Hadoop cluster, if it is already running, by executing the bin/stop-dfs.sh and bin/stop-mapred.sh commands from HADOOP_HOME.

How to do it...

We can control Hadoop configurations through the following three configuration files:

conf/core-site.xml: This contains the configurations common to the whole Hadoop distribution
conf/hdfs-site.xml: This contains configurations for HDFS
conf/mapred-site.xml: This contains configurations for MapReduce

Each configuration file has name-value pairs expressed in an XML format, and they define the workings of different aspects of Hadoop. The following code snippet shows an example of a property in the configuration file. Here, the <configuration> tag is the top-level XML container, and the <property> tags that define individual properties go as child elements of the <configuration> tag.

<configuration>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
  </property>
  ...
</configuration>

The following instructions show how to change the directory to which we write Hadoop logs and how to configure the maximum number of map and reduce tasks:

Create a directory to store the logfiles, for example, /root/hadoop_logs.
Uncomment the line that includes HADOOP_LOG_DIR in HADOOP_HOME/conf/hadoop-env.sh and point it to the new directory.
Add the following lines to the HADOOP_HOME/conf/mapred-site.xml file:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>

Restart the Hadoop cluster by running the bin/stop-mapred.sh and bin/start-mapred.sh commands from the HADOOP_HOME directory.
You can verify the number of processes created using OS process monitoring tools. If you are in Linux, run the watch ps -ef|grep hadoop command. If you are in Windows or macOS, use the Task Manager.

How it works...

HADOOP_LOG_DIR redefines the location to which Hadoop writes its logs. The mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties define the maximum number of map and reduce tasks that can run within a single TaskTracker at a given moment.

These and other server-side parameters are defined in the HADOOP_HOME/conf/*-site.xml files. Hadoop reloads these configurations after a restart.

There's more...

There are many similar configuration properties defined in Hadoop. You can see some of them in the following tables.

The configuration properties for conf/core-site.xml are listed in the following table:

Name | Default value | Description
fs.inmemory.size.mb | 100 | The amount of memory (in MB) allocated to the in-memory filesystem that is used to merge map outputs at the reducers
io.sort.factor | 100 | The maximum number of streams merged while sorting files
io.file.buffer.size | 131072 | The size of the read/write buffer used by sequence files

The configuration properties for conf/mapred-site.xml are listed in the following table:

Name | Default value | Description
mapred.reduce.parallel.copies | 5 | The maximum number of parallel copies the reduce step will execute to fetch output from many parallel jobs
mapred.map.child.java.opts | -Xmx200M | For passing Java options into the map task JVM
mapred.reduce.child.java.opts | -Xmx200M | For passing Java options into the reduce task JVM
io.sort.mb | 200 | The memory limit (in MB) used while sorting data
The configuration properties for conf/hdfs-site.xml are listed in the following table:

Name | Default value | Description
dfs.block.size | 67108864 | The HDFS block size (in bytes)
dfs.namenode.handler.count | 40 | The number of server threads that handle RPC calls in the NameNode

Running benchmarks to verify the Hadoop installation

The Hadoop distribution comes with several benchmarks. We can use them to verify our Hadoop installation and measure Hadoop's performance. This recipe introduces these benchmarks and explains how to run them.

Getting ready

Start the Hadoop cluster. You can run these benchmarks either on a cluster setup or on a pseudo-distributed setup.

How to do it...

Let us run the sort benchmark. The sort benchmark consists of two jobs. First, we generate some random data using the randomwriter Hadoop job and then sort it using the sort sample.

Change the directory to HADOOP_HOME.
Run the randomwriter Hadoop job using the following command:

>bin/hadoop jar hadoop-examples-1.0.0.jar randomwriter \
    -Dtest.randomwrite.bytes_per_map=100 \
    -Dtest.randomwriter.maps_per_host=10 /data/unsorted-data

Here the two parameters, test.randomwrite.bytes_per_map and test.randomwriter.maps_per_host, specify the size of the data generated by each map and the number of maps per host, respectively.

Run the sort program:

>bin/hadoop jar hadoop-examples-1.0.0.jar sort /data/unsorted-data /data/sorted-data

Verify the final results by running the following command:

>bin/hadoop jar hadoop-test-1.0.0.jar testmapredsort -sortInput /data/unsorted-data -sortOutput /data/sorted-data

Finally, when everything is successful, the following message will be displayed:

The job took 66 seconds.
SUCCESS! Validated the MapReduce framework's 'sort' successfully.

How it works...

First, the randomwriter application runs a Hadoop job to generate random data that can be used by the second sort program. Then, we verify the results through the testmapredsort job. If your computer has more capacity, you may run the initial randomwriter step with increased output sizes.

There's more...

Hadoop includes several other benchmarks:

TestDFSIO: This tests the input/output (I/O) performance of HDFS
nnbench: This checks the NameNode hardware
mrbench: This runs many small jobs
TeraSort: This sorts one terabyte of data

More information about these benchmarks can be found at http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/.
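For example, TestDFSIO can be run as shown in the following sketch; the file count and file size are illustrative values (pick numbers that fit your cluster's free HDFS space), and the jar name assumes the same Hadoop 1.0.0 distribution used in the recipes above:

# Write ten 100 MB files to HDFS and report the write throughput.
>bin/hadoop jar hadoop-test-1.0.0.jar TestDFSIO -write -nrFiles 10 -fileSize 100

# Read the same files back and report the read throughput.
>bin/hadoop jar hadoop-test-1.0.0.jar TestDFSIO -read -nrFiles 10 -fileSize 100

# Clean up the benchmark files when finished.
>bin/hadoop jar hadoop-test-1.0.0.jar TestDFSIO -clean

The throughput and average I/O rate figures are printed to the console (and, by default, appended to a local results log file), which makes it easy to compare runs before and after a configuration change.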
Reusing Java VMs to improve the performance

In its default configuration, Hadoop starts a new JVM for each map or reduce task. However, running multiple tasks from the same JVM can sometimes significantly speed up the execution. This recipe explains how to control this behavior.

How to do it...

Run the WordCount sample by passing the following option as an argument:

>bin/hadoop jar hadoop-examples-1.0.0.jar wordcount -Dmapred.job.reuse.jvm.num.tasks=-1 /data/input1 /data/output1

Monitor the number of processes created by Hadoop (through the ps -ef|grep hadoop command in Unix or the Task Manager in Windows). Hadoop starts only a single JVM per task slot and then reuses it for an unlimited number of tasks in the job.

Passing arguments through the -D option only works if the job implements the org.apache.hadoop.util.Tool interface. Otherwise, you should set the option through the JobConf.setNumTasksToExecutePerJvm(-1) method.

How it works...

By setting the job configuration property mapred.job.reuse.jvm.num.tasks, we can control the number of tasks run per JVM spawned by Hadoop. When the value is set to -1, Hadoop reuses the same JVM for an unlimited number of tasks in the job.
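One rough way to see the effect is to time the same job with and without JVM reuse, as in the following sketch; the input and output paths are placeholders carried over from the earlier recipes, not values required by Hadoop:

# Baseline run: a fresh JVM is started for every task (the default).
time bin/hadoop jar hadoop-examples-1.0.0.jar wordcount \
    /data/input1 /data/output-no-reuse

# Same job with unlimited JVM reuse per task slot.
time bin/hadoop jar hadoop-examples-1.0.0.jar wordcount \
    -Dmapred.job.reuse.jvm.num.tasks=-1 \
    /data/input1 /data/output-reuse

On jobs made up of many short tasks, the reused-JVM run typically finishes noticeably faster, while jobs with a few long-running tasks usually show little difference.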


Line, Area, and Scatter Charts

Packt
05 Apr 2013
10 min read
(For more resources related to this topic, see here.)

Introducing line charts

First let's start with a single series line chart. We will use one of the many data sets provided by The World Bank organization at www.worldbank.org. The following is the code snippet to create a simple line chart which shows the percentage of the population ages 65 and above in Japan for the past three decades:

var chart = new Highcharts.Chart({
    chart: {
        renderTo: 'container'
    },
    title: {
        text: 'Population ages 65 and over (% of total)'
    },
    credits: {
        position: {
            align: 'left',
            x: 20
        },
        text: 'Data from The World Bank'
    },
    yAxis: {
        title: {
            text: 'Percentage %'
        }
    },
    xAxis: {
        categories: ['1980', '1981', '1982', ... ],
        labels: {
            step: 5
        }
    },
    series: [{
        name: 'Japan - 65 and over',
        data: [ 9, 9, 9, 10, 10, 10, 10 ... ]
    }]
});

The following is the display of the simple chart:

Instead of specifying the year numbers manually as strings in categories, we can use the pointStart option in the series config to initiate the x-axis value for the first point. So we have an empty xAxis config and the following series config:

xAxis: {},
series: [{
    pointStart: 1980,
    name: 'Japan - 65 and over',
    data: [ 9, 9, 9, 10, 10, 10, 10 ... ]
}]

Although this simplifies the example, the x-axis labels are automatically formatted by the Highcharts utility method numberFormat, which adds a comma after every three digits. The following is the outcome on the x axis:

To resolve the x-axis labels, we overwrite the label's formatter option by simply returning the value, which bypasses the call to numberFormat. We also need to set the allowDecimals option to false, because otherwise decimal values are shown when the chart is resized to elongate the x axis. The following is the final change to use pointStart for the year values:

xAxis: {
    labels: {
        formatter: function() {
            // 'this' keyword is the label object
            return this.value;
        }
    },
    allowDecimals: false
},
series: [{
    pointStart: 1980,
    name: 'Japan - 65 and over',
    data: [ 9, 9, 9, 10, 10, 10, 10 ... ]
}]

Extending to multiple series line charts

We can include several more line series and make the Japan series stand out by increasing its line width to 6 pixels, as follows:

series: [{
    lineWidth: 6,
    name: 'Japan',
    data: [ 9, 9, 9, 10, 10, 10, 10 ... ]
}, {
    name: 'Singapore',
    data: [ 5, 5, 5, 5, ... ]
}, {
    ...
}]

The line series for the Japanese population becomes the focus of the chart, as shown in the following screenshot:

Let's move on to a more complicated line graph. For the sake of demonstrating inverted line graphs, we use the chart.inverted option to flip the y and x axes to opposite orientations. Then we change the line colors of the axes to match the corresponding series colors. We also disable data point markers for all the series and finally align the second series to the second entry in the y-axis array, as follows:

chart: {
    renderTo: 'container',
    inverted: true
},
yAxis: [{
    title: {
        text: 'Percentage %'
    },
    lineWidth: 2,
    lineColor: '#4572A7'
}, {
    title: {
        text: 'Age'
    },
    opposite: true,
    lineWidth: 2,
    lineColor: '#AA4643'
}],
plotOptions: {
    series: {
        marker: {
            enabled: false
        }
    }
},
series: [{
    name: 'Japan - 65 and over',
    type: 'spline',
    data: [ 9, 9, 9, ... ]
}, {
    name: 'Japan - Life Expectancy',
    yAxis: 1,
    data: [ 76, 76, 77, ... ]
}]

The following is the inverted graph with double y axes:

The data representation of the chart may look slightly odd, as the usual time labels are swapped to the y axis and the data trend is awkward to comprehend. The inverted option is normally used for showing data in a noncontinuous form and in bar format.
If we interpret the data from the graph, 12 percent of the population is 65 and over, and the life expectancy is 79 in 1990. Setting plotOptions.series.marker.enabled to false switches off all the data point markers. If we want to display a point marker for a particular series, we can either switch off the markers globally and then enable the marker on an individual series, or the other way round.

plotOptions: {
    series: {
        marker: {
            enabled: false
        }
    }
},
series: [{
    marker: {
        enabled: true
    },
    name: 'Japan - 65 and over',
    type: 'spline',
    data: [ 9, 9, 9, ... ]
}, {

The following graph demonstrates that only the 65 and over series has point markers:

Sketching an area chart

In this section, we are going to use our very first example and turn it into a more stylish graph (based on the design of a wind energy poster by Kristin Clute), which is an area spline chart. An area spline chart is generated using the combined properties of area and spline charts. The main data line is plotted as a spline curve, and the region underneath the line is filled in a similar color, with a gradient and an opaque style.

Firstly, we want to make it easier for viewers to look up the values of the current trend, so we move the y axis next to the latest year, that is, to the opposite side of the chart:

yAxis: {
    ....
    opposite: true
}

The next thing is to remove the interval lines and have a thin axis line along the y axis:

yAxis: {
    ....
    gridLineWidth: 0,
    lineWidth: 1
}

Then we simplify the y-axis title with a percentage sign and align it to the top of the axis:

yAxis: {
    ....
    title: {
        text: '(%)',
        rotation: 0,
        x: 10,
        y: 5,
        align: 'high'
    }
}

As for the x axis, we thicken the axis line with a red color and remove the interval ticks:

xAxis: {
    ....
    lineColor: '#CC2929',
    lineWidth: 4,
    tickWidth: 0,
    offset: 2
}

For the chart title, we move the title to the right of the chart, increase the margin between the chart and the title, and then adopt a different font for the title:

title: {
    text: 'Population ages 65 and over (% of total) - Japan',
    margin: 40,
    align: 'right',
    style: {
        fontFamily: 'palatino'
    }
}

After that, we are going to modify the whole series presentation. We first set the chart.type property from 'line' to 'areaspline'. Notice that setting properties inside this series object will overwrite the same properties defined in plotOptions.areaspline and, in turn, in plotOptions.series. Since so far there is only one series in the graph, there is no need to display the legend box; we can disable it with the showInLegend property. We then smarten the area part with a gradient color and the spline with a darker color:

series: [{
    showInLegend: false,
    lineColor: '#145252',
    fillColor: {
        linearGradient: {
            x1: 0, y1: 0,
            x2: 0, y2: 1
        },
        stops: [
            [ 0.0, '#248F8F' ],
            [ 0.7, '#70DBDB' ],
            [ 1.0, '#EBFAFA' ]
        ]
    },
    data: [ ... ]
}]

After that, we introduce a couple of data labels along the line to indicate that the ranking of the old-age population has increased over time. We use the values in the series data array corresponding to the years 1995 and 2010 and convert those numerical entries into data point objects. Since we only want to show point markers for these two years, we turn markers off globally in plotOptions.series.marker.enabled and set the marker on individually inside the point objects, accompanied by style settings:

plotOptions: {
    series: {
        marker: {
            enabled: false
        }
    }
},
series: [{
    ...,
    data: [ 9, 9, 9, ...,
        {
            marker: {
                radius: 2,
                lineColor: '#CC2929',
                lineWidth: 2,
                fillColor: '#CC2929',
                enabled: true
            },
            y: 14
        }, 15, 15, 16, ...
    ]
}]

We then set a bounding box around the data labels with round corners (borderRadius) in the same border color (borderColor) as the x axis. The data label positions are then finely adjusted with the x and y options. Finally, we change the default implementation of the data label formatter; instead of returning the point value, we print the country ranking.

series: [{
    ...,
    data: [ 9, 9, 9, ...,
        {
            marker: {
                ...
            },
            dataLabels: {
                enabled: true,
                borderRadius: 3,
                borderColor: '#CC2929',
                borderWidth: 1,
                y: -23,
                formatter: function() {
                    return "Rank: 15th";
                }
            },
            y: 14
        }, 15, 15, 16, ...
    ]
}]

The final touch is to apply a gray background to the chart and add extra space to spacingBottom. The extra space for spacingBottom is to avoid the credits label and the x-axis labels getting too close together, because we have disabled the legend box.

chart: {
    renderTo: 'container',
    spacingBottom: 30,
    backgroundColor: '#EAEAEA'
},

When all these configurations are put together, they produce the exact chart shown in the screenshot at the start of this section.

Mixing line and area series

In this section we are going to explore different plots that mix line and area series together, as follows:

A projection chart, where a single trend line is joined with two series in different line styles
An area spline chart plotted with another step line series
A stacked area spline chart, where two area spline series are stacked on top of each other

Simulating a projection chart

The projection chart has a spline area for the section of real data and continues with a dashed line for the projection data. To do that, we separate the data into two series, one for the real data and the other for the projection data. The following is the series configuration code for the future data up to 2024. This data is based on the National Institute of Population and Social Security Research report (http://www.ipss.go.jp/pp-newest/e/ppfj02/ppfj02.pdf).

series: [{
    name: 'project data',
    type: 'spline',
    showInLegend: false,
    lineColor: '#145252',
    dashStyle: 'Dash',
    data: [ [ 2010, 23 ], [ 2011, 22.8 ], ... [ 2024, 28.5 ] ]
}]

The future series is configured as a spline in a dashed line style, and the legend box is disabled because we want both series to appear as one continuous series. We also set the future (second) series to the same color as the first series. The final part is to construct the series data. As we specify the x-axis time data with the pointStart property, we need to align the projection data after 2010. There are two approaches that we can use to specify the time data in a continuous form, as follows:

Insert null values into the second series data array as padding to align with the real data series
Specify the second series data in tuples, which are arrays with both the time and the projection data

Next we are going to use the second approach, because the series presentation is simpler. The following is the screenshot only for the future data series:

The real data series is exactly the same as the graph in the screenshot at the start of the Sketching an area chart section, except without the point markers and data label decorations. The next step is to join both series together, as follows:

series: [{
    name: 'real data',
    type: 'areaspline',
    ....
}, {
    name: 'project data',
    type: 'spline',
    ....
}]

Since there is no overlap between the two series' data, they produce a smooth projection graph:

Contrasting spline with step line

In this section we are going to plot an area spline series with another line series, but in a step presentation.
The step line traverses vertically and horizontally only, according to the changes in the series data. It is generally used for presenting discrete data, that is, data without continuous/gradual movement.

For the purpose of showing a step line, we will continue from the first area spline example. First of all, we need to enable the legend by removing the showInLegend: false setting and also remove dataLabels from the series data. Next, we include a new series, Ages 0 to 14, in the chart with the default line type. Then we change its line style into steps. The following is the configuration for both series:

series: [{
    name: 'Ages 65 and over',
    type: 'areaspline',
    lineColor: '#145252',
    pointStart: 1980,
    fillColor: {
        ....
    },
    data: [ 9, 9, 9, 10, ...., 23 ]
}, {
    name: 'Ages 0 to 14',
    // default type is line series
    step: true,
    pointStart: 1980,
    data: [ 24, 23, 23, 23, 22, 22, 21, 20, 20, 19, 18, 18, 17, 17, 16, 16, 16,
            15, 15, 15, 15, 14, 14, 14, 14, 14, 14, 14, 14, 13, 13 ]
}]

The following screenshot shows the second series in the stepped line style:

Obtaining a binary backup

Packt
04 Apr 2013
6 min read
Getting ready

Next we need to modify the postgresql.conf file for our database to run in the proper mode for this type of backup. Change the following configuration variables:

wal_level = archive
max_wal_senders = 5

Then we must allow a superuser to connect to the replication pseudo-database, which is used by pg_basebackup. We do that by adding the following line to pg_hba.conf:

local replication postgres peer

Finally, restart the database instance to commit the changes.

How to do it...

Though it is only one command, pg_basebackup requires at least one switch to obtain a binary backup, as shown in the following step:

Execute the following command to create the backup in a new directory named db_backup:

$> pg_basebackup -D db_backup -x

How it works...

For PostgreSQL, WAL stands for Write Ahead Log. By changing wal_level to archive, those logs are written in a format compatible with pg_basebackup and other replication-based tools.

By increasing max_wal_senders from the default of zero, the database will allow tools to connect and request data files. In this case, up to five streams can request data files simultaneously. This maximum should be sufficient for all but the most advanced systems.

The pg_hba.conf file is essentially a connection access control list (ACL). Since pg_basebackup uses the replication protocol to obtain data files, we need to allow local connections to request replication.

Next, we send the backup itself to a directory (-D) named db_backup. This directory will effectively contain a complete copy of the binary files that make up the database.

Finally, we added the -x flag to include the transaction logs (xlogs), which the database will require to start if we want to use this backup. When we get into more complex scenarios, we will exclude this option, but for now, it greatly simplifies the process.

There's more...

The pg_basebackup tool is actually fairly complicated. There is a lot more involved under the hood.

Viewing backup progress

For manually invoked backups, we may want to know how long the process might take and its current status. Luckily, pg_basebackup has a progress indicator, which we can enable with the following command:

$> pg_basebackup -P -D db_backup

Like many of the other switches, -P can be combined with the tape archive format, standalone backups, database clones, and so on. This is clearly not necessary for automated backup routines, but it can be useful for one-off backups monitored by an administrator.

Compressed tape archive backups

Many binary backup files come in the TAR (Tape Archive) format, which we can activate using the -F flag and setting it to t for TAR. Several Unix backup tools can directly process this type of backup, and most administrators are familiar with it. If we want a compressed output, we can also set the -z flag, which is especially useful in the case of large databases. For our sample database, we should see almost a 20x compression ratio. Try the following command:

$> pg_basebackup -Ft -z -D db_backup

The backup file itself will be named base.tar.gz within the db_backup directory, reflecting its status as a compressed tape archive. If the database contains extra tablespaces, each becomes a separate compressed archive. Each file can be extracted to a separate location, such as a different set of disks, for very complicated database instances.

For the sake of this example, we ignored the possible presence of tablespaces other than the pg_default tablespace included in every installation. User-created tablespaces will greatly complicate your backup process.
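As a quick sanity check of such an archive, the following sketch extracts it into a scratch directory; the paths and port are placeholders, the commands assume they are run as the postgres OS user, and the extracted copy is only directly startable if the backup included the transaction logs (the -x flag discussed in the next section):

# Extract the compressed base backup into an empty scratch directory.
mkdir -p /tmp/restore_test
chmod 700 /tmp/restore_test
tar -xzf db_backup/base.tar.gz -C /tmp/restore_test

# Inspect the extracted cluster files.
ls /tmp/restore_test

# If the backup was taken with -x, the copy can even be started on a
# spare port for verification.
pg_ctl -D /tmp/restore_test -o "-p 5433" start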
Making the backup standalone

By specifying -x, we tell the database that we want a "complete" backup. This means we could extract or copy the backup anywhere and start it as a fully qualified database. As we mentioned before, the flag means that you want to include the transaction logs, which are how the database recovers from crashes, checks integrity, and performs other important tasks. The following is the command again, for reference:

$> pg_basebackup -x -D db_backup

When combined with the TAR output format and compression, standalone binary backups are perfect for archiving to tape for later retrieval, as each backup is compressed and self-contained.

By default, pg_basebackup does not include transaction logs, because many (possibly most) administrators back these up separately. These files have multiple uses, and putting them in the basic backup would duplicate efforts and make backups larger than necessary. We include them at this point because it is still too early for such complicated scenarios. We will get there eventually, of course.

Database clones

Because pg_basebackup operates through PostgreSQL's replication protocol, it can execute remotely. For instance, if the database was on a server named Production, and we wanted a copy on a server named Recovery, we could execute the following command from Recovery:

$> pg_basebackup -h Production -x -D /full/db/path

For this to work, we would also need this line in pg_hba.conf on Production, allowing Recovery to connect:

host replication postgres Recovery trust

Though we set the authentication method to trust, this is not recommended for a production server installation. However, it is sufficient to allow Recovery to copy all data from Production. With the -x flag, it also means that the database can be started and kept online in case of emergency. It is a backup and a running server.

Parallel compression

Compression is very CPU intensive, but there are some utilities capable of threading the process. Tools such as pbzip2 or pigz can do the compression instead. Unfortunately, this only works in the case of a single tablespace (the default one; if you create more, this will not work). The following is the command for compression using pigz:

$> pg_basebackup -Ft -D - | pigz -j 4 > db_backup.tar.gz

It uses four threads of compression and sets the backup directory to standard output (-) so that pigz can process the output itself.

Summary

In this article we saw the process of obtaining a binary backup. Though this process is more complex and tedious, it is also much faster.

Further resources on this subject:
Introduction to PostgreSQL 9
Backup in PostgreSQL 9
Recovery in PostgreSQL 9


Ease the Chaos with Automated Patching

Packt
02 Apr 2013
19 min read
(For more resources related to this topic, see here.) We have seen how the provisioning capabilities of the Oracle Enterprise Manager's Database Lifecycle Management (DBLM) Pack enable you to deploy fully patched Oracle Database homes and databases, as replicas of the gold copy in the Software Library of Enterprise Manager. However, nothing placed in production should be treated as static. Software changes in development cycles, enhancements take place, or security/functional issues are found. For almost anything in the IT world, new patches are bound to be released. These will also need to be applied to production, testing, reporting, staging, and development environments in the data center on an ongoing basis. For the database side of things, Oracle releases quarterly a combination of security fixes known as the Critical Patch Update (CPU). Other patches are bundled together and released every quarter in the form of a Patch Set Update (PSU), and this also includes the CPU for that quarter. Oracle strongly recommends applying either the PSU or the CPU every calendar quarter. If you prefer to apply the CPU, continue doing so. If you wish to move to the PSU, you can do so, but in that case continue only with the PSU. The quarterly patching requirement, as a direct recommendation from Oracle, is followed by many companies that prefer to have their databases secured with the latest security fixes. This underscores the importance of patching. However, if there are hundreds of development, testing, staging, and production databases in the data center to be patched, the situation quickly turns into a major manual exercise every three months. DBAs and their managers start planning for the patch exercise in advance, and a lot of resources are allocated to make it happen—with the administrators working on each database serially, at times overnight and at times over the weekend. There are a number of steps involved in patching each database, such as locating the appropriate patch in My Oracle Support (MOS), downloading the patch, transferring it to each of the target servers, upgrading the OPATCH facility in each Oracle home, shutting down the databases and listeners running from that home, applying the patch, starting each of the databases in restricted mode, applying any supplied SQL scripts, restarting the databases in normal mode, and checking the patch inventory. These steps have to be manually repeated on every database home on every server, and on every database in that home. Dull repetition of these steps in patching the hundreds of servers in a data center is a very monotonous task, and it can lead to an increase in human errors. To avoid these issues inherent in manual patching, some companies decide not to apply the quarterly patches on their databases. They wait for a year, or a couple of years before they consider patching, and some even prefer to apply year-old patches instead of the latest patches. This is counter-productive and leads to their databases being insecure and vulnerable to attacks, since the latest recommended CPUs from Oracle have not been applied. What then is the solution, to convince these companies to apply patches regularly? If the patching process can be mostly automated (but still under the control of the DBAs), it would reduce the quarterly patching effort to a great extent. 
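To put the scale of that manual routine in perspective, a single pass over one Oracle home looks roughly like the following sketch; the patch number, paths, and SID are placeholders, the post-install SQL step varies by patch and release (always follow the patch README), and this is illustrative rather than a prescribed procedure:

# Stage the quarterly patch downloaded from My Oracle Support (placeholder paths).
export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
export ORACLE_SID=saiprod
cd /stage/patches/12345678

# Stop the listener and database running from this home, then apply the patch.
$ORACLE_HOME/bin/lsnrctl stop
echo "shutdown immediate" | sqlplus / as sysdba
$ORACLE_HOME/OPatch/opatch apply

# Restart, run the post-install SQL (for example, catbundle for an 11g CPU/PSU),
# and confirm the patch is registered in the inventory.
echo "startup" | sqlplus / as sysdba
echo "@?/rdbms/admin/catbundle.sql cpu apply" | sqlplus / as sysdba
$ORACLE_HOME/bin/lsnrctl start
$ORACLE_HOME/OPatch/opatch lsinventory

Multiply this by every home, every database, and every quarter, and the appeal of automating it becomes obvious.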
Companies would then have the confidence that their existing team of DBAs would be able to manage the patching of hundreds of databases in a controlled and automated manner, keeping human error to a minimum. The Database Lifecycle Management Pack of Enterprise Manager Cloud Control 12c is able to achieve this by using its Patch Automation capability. We will now look into Patch Automation and the close integration of Enterprise Manager with My Oracle Support. Recommended patches By navigating to Enterprise | Summary, a Patch Recommendations section will be visible in the lower left-hand corner, as shown in the following screenshot: The graph displays either the Classification output of the recommended patches, or the Target Type output. Currently for this system, more than five security patches are recommended as can be seen in this graph. This recommendation has been derived via a connection to My Oracle Support (the OMS can be connected either directly to the Internet, or by using a proxy server). Target configuration information is collected by the Enterprise Manager Agent and is stored in the Configuration Management Database (CMDB) within the repository. This configuration information is collated regularly by the Enterprise Manager's Harvester process and pushed to My Oracle Support. Thus, configuration information about your targets is known to My Oracle Support, and it is able to recommend appropriate patches as and when they are released. However, the recommended patch engine also runs within Enterprise Manager 12c at your site, working off the configuration data in the CMDB in Enterprise Manager, so recommendations can in fact be achieved without the configuration having been uploaded on MOS by the Harvester (this upload is more useful now for other purposes, such as attaching configuration details during SR creation). It is also possible to get metadata about the latest available patches from My Oracle Support in offline mode, but more manual steps are involved in this case, so Internet connectivity is recommended to get the full benefits of Enterprise Manager's integration with My Oracle Support. To view the details about the patches, click on the All Recommendations link or on the graph itself. This connects to My Oracle Support (you may be asked to log in to your company-specific MOS account) and brings up the list of the patches in the Patch Recommendations section. The database (and other types of) targets managed by the Enterprise Manager system are displayed on the screen, along with the recommended CPU (or other) patches. We select the CPU July patch for our saiprod database. This displays the details about the patch in the section in the lower part of the screen. We can see the list Bugs Resolved by This Patch, the Last Updated date and Size of the patch and also Read Me—which has important information about the patch. The number of total downloads for this patch is visible, as is the Community Discussion on this patch in the Oracle forums. You can add your own comment for this patch, if required, by selecting Reply to the Discussion. Thus, at a glance, you can find out how popular the patch is (number of downloads) and any experience of other Oracle DBAs regarding this patch—whether positive or negative. Patch plan You can view the information about the patch by clicking on the Full Screen button. You can download the patch either to the Software Library in Enterprise Manager or to your desktop. 
Finally, you can directly add this patch to a new or existing patch plan, which we will do next. Go to Add to Plan | Add to New, and enter Plan Name as Sainath_patchplan. Then click on Create Plan. If you would like to add multiple patches to the plan, select both the patches first and then add to the plan. (You can also add patches later to the plan). After the plan is created, click on View Plan. This brings up the following screen: A patch plan is nothing but a collection of patches that can be applied as a group to one or more targets. On the Create Plan page that appears, there are five steps that can be seen in the left-hand pane. By default, the second step appears first. In this step, you can see all the patches that have been added to the plan. It is possible to include more patches by clicking on the Add Patch... button. Besides the ability to manually add a patch to this list, the analysis process may also result in additional patches being added to the plan. If you click on the first step, Plan Information, you can put in a description for this plan. You can also change the plan permissions, either Full or View, for various Enterprise Manager roles. Note that the Full permission allows the role to validate the plan, however, the View permission does not allow validation. Move to step 3, Deployment Options. The following screen appears. Out-of-place patching A new mechanism for patching has been provided in the Enterprise Manager Cloud Control 12c version, known as out-of-place patching. This is now the recommended method and creates a new Oracle home which is then patched while the previous home is still operational. All this is done using an out of the box deployment procedure in Enterprise Manager. Using this mechanism means that the only downtime will take place when the databases from the previous home are switched to run from the new home. If there is any issue with the database patch, you can switch back to the previous unpatched home since it is still available. So, patch rollback is a lot faster. Also, if there are multiple databases running in the previous home, you can decide which ones to switch to the new patched home. This is obviously an advantage, otherwise you would be forced to simultaneously patch all the databases in a home. A disadvantage of this method would be the space requirements for a duplicate home. Also, if proper housekeeping is not carried out later on, it can lead to a proliferation of Oracle homes on a server where patches are being applied regularly using this mechanism. This kind of selective patching and minimal downtime is not possible if you use the previously available method of in-place patching, which uses a separate deployment procedure to shut down all databases running from an Oracle home before applying the patches on the same home. The databases can only be restarted normally after the patching process is over, and this obviously takes more downtime and affects all databases in a home. Depending on the method you choose, the appropriate deployment procedure will be automatically selected and used. We will now use the out-of-place method in this patch plan. On the Step 3: Deployment Options page, make sure the Out of Place (Recommended) option is selected. Then click on Create New Location. Type in the name and location of the new Oracle home, and click on the Validate button. This checks the Oracle home path on the Target server. After this is done, click on the Create button. 
The deployment options of the patch plan are successfully updated, and the new home appears on the Step 3 page. Click on the Credentials tab. Here you need to select or enter the normal and privileged credentials for the Oracle home. Click on the Next button. This moves us to step 4, the Validation step. Pre-patching analysis Click on the Analyze button. A job to perform prepatching analysis is started in the background. This will compare the installed software and patches on the targets with the new patches you have selected in your plan, and attempt to validate them. This validation may take a few minutes to complete, since it also checks the Oracle home for readiness, computes the space requirements for the home, and conducts other checks such as cluster node connectivity (if you are patching a RAC database). If you drill down to the analysis job itself by clicking on Show Detailed Progress here, you can see that it does a number of checks to validate if the targets are supported for patching, verifies the normal and super user credentials of the Oracle home, verifies the target tools, commands, and permissions, upgrades OPATCH to the latest version, stages the selected patches to Oracle homes, and then runs the prerequisite checks including those for cloning an Oracle home. If the prerequisite checks succeed, the analysis job skips the remaining steps and stops at this point with a successful status. The patch is seen as Ready for Deployment. If there are any issues, they will show up at this point. For example, if there is a conflict with any of the patches, a replacement patch or a merge patch may be suggested. If there is no replacement or merge patch and you want to request such a patch, it will allow you to make the request directly from the screen. If you are applying a PSU and the CPU for that same release is already applied to the Oracle home, for example, July 2011 CPU, then because the PSU is a superset of the CPU, the MOS analysis will stop and mention that the existing patch fixes the issues. Such a message can be seen in the Informational Messages section of the Validation page. Deployment In our case, the patch is Ready for Deployment. At this point, you can move directly to step 5, Review & Deploy, by clicking on it in the left-hand side pane. On the Review & Deploy page, the patch plan is described in detail along with Impacted Targets. Along with the database that is in the patch plan, a new impacted target has been found by the analysis process and added to the list of impacted targets. This is the listener that is running from the home that is to be cloned and patched. The patches that are to be applied are also listed on this review page, in our case the CPUJUL2011 patch is shown with the status Conflict Free. The deployment procedure that will be used is Clone and Patch Oracle Database, since out-of-place patching is being used, and all instances and listeners running in the previous Oracle home are being switched to the new home. Click on the Prepare button. The status on the screen changes to Preparation in Progress. A job for preparation of the out-of-place patching starts, including cloning of the original Oracle home and applying the patches to the cloned home. No downtime is required while this job is running; it can happen in the background. This preparation phase is like a pre-deploy and is only possible in the case of out-of-place patching, whereas in the case of in-place patching, there is no Prepare button and you deploy straightaway. 
Clicking on Show Detailed Progress here opens a new window showing the job details. When the preparation job has successfully completed (after about two hours in our virtual machine), we can see that it performs the cloning of the Oracle home, applies the patches on the new home, validates the patches, runs the post patch scripts, and then skips all the remaining steps. It also collects target properties for the Oracle home in order to refresh the configurations in Enterprise Manager. The Review & Deploy page now shows Preparation Successful!. The plan is now ready to be deployed. Click on the Deploy button. The status on the screen changes to Deployment in Progress. A job for deployment of the out-of-place patching starts. At this time, downtime will be required since the database instances using the previous Oracle home will be shut down and switched across. The deploy job successfully completes (after about 21 minutes in our virtual machine); we can see that it works iteratively over the list of hosts and Oracle homes in the patch plan. It starts a blackout for the database instances in the Oracle home (so that no alerts are raised), stops the instances, migrates them to the cloned Oracle home, starts them in upgrade mode, applies SQL scripts to patch the instance, applies post-SQL scripts, and then restarts the database in normal mode. The deploy job applies other SQL scripts and recompiles invalid objects (except in the case of patch sets). It then migrates the listener from the previous Oracle home using the Network Configuration Assistant (NetCA), updates the Target properties, stops the blackout, and detaches the previous Oracle home. Finally, the configuration information of the cloned Oracle home is refreshed. The Review & Deploy page of the patch plan now shows the status of Deployment Successful!, as can be seen in the following screenshot: Plan template On the Deployment Successful page, it is possible to click on Save as Template at the bottom of the screen in order to save a patch plan as a plan template. The patch plan should be successfully analyzed and deployable, or successfully deployed, before it can be saved as a template. The plan template, when thus created, will not have any targets included, and such a template can then be used to apply the successful patch plan to multiple other targets. Inside the plan template, the Create Plan button is used to create a new plan based on this template, and this can be done repeatedly for multiple targets. Go to Enterprise | Provisioning and Patching | Patches & Updates; this screen displays a list of all the patch plans and plan templates that have been created. The successfully deployed Sainath_patchplan and the new patch plan template also shows up here. To see a list of the saved patches in the Software Library, go to Enterprise | Provisioning and Patching | Saved Patches. This brings up the following screen: This page also allows you to manually upload patches to the Software Library. This scenario is mostly used when there is no connection to the Internet (either direct or via a proxy server) from the Enterprise Manager OMS servers, and consequently you need to download the patches manually. 
For more details on setting up the offline mode and downloading the patch recommendations and the latest patch information in the form of XML files from My Oracle Support, please refer to Oracle Enterprise Manager Lifecycle Management Administrator's Guide 12c Release 2 (12.1.0.2) at the following URL:

http://docs.oracle.com/cd/E24628_01/em.121/e27046/pat_mosem_new.htm#BABBIEAI

Patching roles

The new version of Enterprise Manager Cloud Control 12c supplies out-of-the-box administrator roles specifically for patching. These roles are EM_PATCH_ADMINISTRATOR, EM_PATCH_DESIGNER, and EM_PATCH_OPERATOR. You need to grant these roles to the appropriate administrators. Move to Setup | Security | Roles. On this page, search for the roles specifically meant for patching. The three roles appear as follows:

The EM_PATCH_ADMINISTRATOR role can create, edit, deploy, or delete any patch plan and can also grant privileges to other administrators after creating them. This role has full privileges on any patch plan or patch template in the Enterprise Manager system and maintains the patching infrastructure.

The EM_PATCH_DESIGNER role normally identifies patches to be used in the patching cycle across development, testing, and production. In real life, this role would typically belong to a senior DBA. The patch designer creates patch plans and plan templates, and grants privileges for these plan templates to the EM_PATCH_OPERATOR role. As an example, the patch designer will select a set of recommended and other manually selected patches for an Oracle 11g database and create a patch plan. This role will then test the patching process in a development environment, and save the successfully analyzed or deployed patch plan as a plan template. The patch designer will then publish the Oracle 11g database patching plan template to the patch operator, who in real life is probably a junior DBA or an application DBA.

Next, the patch operator creates new patch plans using the template (but cannot create a template) and adds a different list of targets, such as other Oracle 11g databases in the test, staging, or production environments. This role then schedules the deployment of the patches to all these environments, using the same template again and again.

Summary

Enterprise Manager Cloud Control 12c allows automation of the tedious patching procedure used in many organizations today to patch their Oracle databases and servers. This is achieved via the Database Lifecycle Management Pack, which is one of the main licensable packs of Enterprise Manager. Sophisticated Deployment Procedures are provided out of the box to fulfill many different types of patching tasks, and this helps you achieve mass patching of multiple targets with multiple patches in a fully automated manner, thus making tremendous savings in administrative time and effort. Some companies have estimated savings of up to 98 percent in patching tasks in their data centers. Different types of patches can be applied in this manner, including CPUs, PSUs, patch sets, and other one-off patches. Different versions of databases are supported, such as 9i, 10g, and 11g. For the first time, the upgrade of single-instance databases is also possible via Enterprise Manager Cloud Control 12c.

There is full integration of the patching capabilities of Enterprise Manager with My Oracle Support (MOS). The support site retains the configuration of all the components managed by Enterprise Manager inside the company.
Since the current version and patch information of the components are known, My Oracle Support is able to provide appropriate patch recommendations for many targets, including the latest security fixes. This ensures that the company is up to date with regard to security protection. A full division of roles is available, such as Patch Administrator, Designer, and Operator. It is possible to take the My Oracle Support recommendations, select patches for targets, put them into a patch plan, deploy the patch plan, and then create a plan template from it. The template can then be published to any operator, who can then create their own patch plans for other targets. In this way patching can be tested, verified, and then pushed to production.

In all, Enterprise Manager Cloud Control 12c offers valuable automation methods for mass patching, allowing administrators to ensure that their systems have the latest security patches, and enabling them to control the application of patches on development, test, and production servers from the centralized location of the Software Library.

Resources for Article :

Further resources on this subject:
Author Podcast - Bob Griesemer on Oracle Warehouse Builder 11g [Article]
Managing Oracle Business Intelligence [Article]
Author Podcast - Ronald Rood discusses the birth of Oracle Scheduler [Article]


Follow the Money

Packt
28 Mar 2013
13 min read
(For more resources related to this topic, see here.)

It starts with the Cost Worksheet

In PCM, the Cost Worksheet is the common element for all the monetary modules. The Cost Worksheet is a spreadsheet-like module with rows and columns. The following screenshot shows a small part of a typical Cost Worksheet register:

The columns are set by PCM, but the rows are determined by the organization. The rows are the Cost Codes. This is your cost breakdown structure, and it is the lowest level of detail at which money will be tracked in PCM. Cost Codes are also sometimes called Cost Accounts. All money entered into the documents in any of the monetary modules in PCM will be allocated to a Cost Code on the Cost Worksheet. Even if you do not specifically allocate the money to a Cost Code, the system will allocate it to a system-generated Cost Code called NOT COSTED.

The NOT COSTED Cost Code is important so that no money slips through the cracks: if you forget to assign money on a document to your project's Cost Codes, the system assigns it to this code. When reviewing the Cost Worksheet, a user can check the NOT COSTED Cost Code to see whether any money is associated with it. If there is, he can find the document where he forgot to allocate all the money to proper Cost Codes.

Users cannot edit any numbers directly on the Cost Worksheet; it is a reflection of the information entered on various documents on your project. This provides a high level of accountability, in that no money can be entered or changed without a document entered someplace within PCM (as can be done with a spreadsheet).

The Cost Code itself can be up to 30 characters in length and can be divided into segments to align with the cost breakdown structure, as shown in the following screenshot:

The number of Cost Codes and the level of breakdown are typically determined by the accounting or ERP system used by your organization, or the Cost Codes can be used as an extension of the ERP system's coding structure. When the Cost Code structure matches, integration between the two systems becomes easier. There are many other factors to consider when thinking about integrating systems, but the Cost Code structure is at the core of relating the two systems.

Defining the segments within the Cost Codes is done as part of the initial setup and implementation of PCM. This is done in the Cost Code Definitions screen, as shown in the following screenshot:

To set up a segment, you must tell PCM which character of the Cost Code the segment starts with and how long the segment is (the number of characters). Once this is done, you can also populate a dictionary of titles for each segment. A trick for having different segment titles for different projects is to create an identical segment dictionary for each project. For example, if you have a different list of Disciplines for every project, you can create and define a list of Disciplines for each project with the same starting character and length. Then you can use the proper Cost Code definitions in your layouts and reporting for that project. The following screenshot shows how this can be done:

Once the Cost Codes have been defined, the Cost Worksheet will need to be populated for your project. There are various ways to accomplish this:

Create a dummy project with the complete list of company Cost Codes you would ever use on a project. When you want to populate the Cost Code list on a new project, use the Copy Cost Codes function from the project tree.
Import a list of Cost Codes that have been developed in a spreadsheet (yes, I used the word "spreadsheet"; there are times when a spreadsheet comes in handy, and managing a multi-million dollar project is not one of them). PCM has an import function on the Cost Worksheet where you can import a comma-separated values (CSV) file of the Cost Codes and titles.
Enter the Cost Codes one at a time from the Cost Worksheet. If there are a small number of Cost Codes, this might be the fastest and easiest method.

Understanding the columns of the Cost Worksheet will help you understand how powerful and important the Cost Worksheet really is. The columns of the Cost Worksheet in PCM are set by the system. They are broken down into a few categories, as follows:

Budget
Commitment
Custom
Actuals
Procurement
Variances
Miscellaneous

Each of the categories has a corresponding color to help differentiate them when looking at the Cost Worksheet. Within each of these categories are a number of columns. The Budget, Commitment, and Custom categories have the same columns, while the other categories have their own sets of columns. These three categories work basically the same way. They can be defined in basic terms as follows:

Budget: This is the money that your company has available to spend and that is going to be received by the project. Examples depend on the perspective of the PCM setup. In the example of our cavemen Joe and David, David is the person working for Joe. If David were using PCM, the Budget category would hold the price agreed between Joe and David to make the chair, that is, the amount of money that David is going to be paid by Joe to make the chair.

Committed: This is the money that has been agreed to be spent on the project, not the money that has actually been spent. So in our example it would be the amount of money that David has agreed to pay his subcontractors to supply him with the goods and services needed to build the chair.

Custom: This is a category that is available to the user for another contracting type. It has its own set of columns, identical to those of the Budget and Commitment categories. It can be used, for example, for a Funding module, where you track the amount of money funded for the project, which can be quite different from the available budget for the project.

Money distributed to the Trends module can be posted to many of the columns, as determined by the user when adding the Trend. The Trend document is not referenced in the following explanations.

When money is entered on a document, it must be allocated or distributed to one or more Cost Codes. As stated before, if you do not allocate the money to a Cost Code, PCM will allocate it to the NOT COSTED Cost Code. The system knows which column to place the money in, but the user must tell PCM the proper row (Cost Code). If the Status Type of a document is set to Closed or Rejected, the money is removed from the Cost Worksheet, but the document is still available to be reviewed. This way, only documents that are in progress or approved are reflected on the Cost Worksheet.

Let's look at each of the columns individually and explain how money is posted. The only documents that affect the Cost Worksheet are as follows:

Contracts (three types)
Change Orders
Proposals
Payment Requisitions
Invoices
Trends
Procurement Contracts

Let's look at the first three categories, since they are the most complex. Following is a table of the columns associated with these categories.
Understand that the terminology used here is the standard, out-of-the-box terminology of PCM and may not match what has been set up in your organization. The third contract type (Custom) can be turned on or off using the Project Settings. It can be used for a variety of contract types, as it has its own set of columns in the Cost Worksheet. The Custom contract type can be used in the Change Management module; however, it utilizes the Commitment tab, which requires the user to understand exactly which category the change relates to.

The following tables show the various columns on the Cost Worksheet, starting with the Cost Code itself. The first table, Cost Worksheet Columns, lists all the columns used by each of the three contract categories; the columns listed there are affected by the Contracts, Purchase Orders, or any Change Document modules. Let's look at specific definitions of which document type can be posted to which column.

The Original Column

The Original column is used for money distributed from any of the Contract modules. If a Commitment contract is added under the Contracts – Committed module and the money is distributed to various Cost Codes (rows), the column used is the Original Commitment column in the worksheet. The same applies to the Contracts – Budgeted and Contracts – Custom modules. The Purchase Order module is posted to the Commitments category. Money can also be posted to this column for Budget and Commitment contracts from the Change Management module, where a phase has been assigned this column. This is not a typical practice, as the Original column should remain unchanged from the values on the original contract.

The Approved Column

The Approved Revisions column is used for money distributed from the Change Order module. If a Change Order is added under the Change Order module against a commitment contract, the money is distributed to various Cost Codes (rows), and the Change Order has been approved, the money on this document is posted to the Approved Commitment Revisions column in the worksheet. We will discuss what happens prior to approval later.

The Revised Column

The Revised column is a computed column adding the original money and the approved money. Money cannot be distributed to this column from any document in PCM.

The Pending Changes Column

The Pending Revisions column can be populated by several document types, as follows:

Change Orders: Prior to approval, all money associated with a Change Order document created from the Change Orders module is posted to the Pending Changes column.
Change Management: These are documents associated with a change management process where the change phase is associated with the Pending column. This can be from the Proposal module or the Change Order module.
Proposals: These are documents created in the Proposals module, either through the Change Management module or directly from the module itself.

The Estimated Changes Column

The Estimated Revisions column is populated from phases in Change Management that have been assigned to distribute money to this column.

The Adjustment Column

The Adjustment column is populated from phases in Change Management that have been assigned to distribute money to this column.

The Projected Column

The Projected column is a computed column summing all the columns associated with a category. This column is very powerful for understanding the potential cost at completion of a Cost Code.
Actuals

There are two columns that are associated with actual cost documents in PCM. The modules that affect these columns are as follows:

Payment Requisitions
Invoices

These columns are the Actuals Received and Actuals Issued columns. These column names can be confusing and should be considered for change during implementation. You can think of the money these columns include as follows:

Actuals Received: This column holds money where you have received a Payment Requisition or Invoice that is to be paid by you. This also includes the Custom category.

Actuals Issued: This column holds money where you have issued a Payment Requisition or Invoice that is to be paid to you.

As Payment Requisitions or Invoices are entered and the money is distributed to Cost Codes, this money is placed in one of these two columns depending on the contract relationship associated with these documents. Be aware that money is placed into these columns as soon as it is entered into Payment Requisitions or Invoices, regardless of approval or certification.

Procurement

There are many columns relating to the Procurement module. This book does not go into the details of the Procurement module. The column names related to Procurement are as follows:

Procurement Estimate
Original Estimate
Estimate Accuracy
Estimated Gross Profit
Buyout
Purchasing Buyout

Variances

There are many Variance columns, all of which are computed columns. These columns show the variance (or difference) between other columns on the worksheet, as follows:

Original Variance: The Original Budget minus the Original Commitment
Approved Variance: The Revised Budget minus the Revised Commitment
Pending Variance: The (Revised Budget plus Pending Budget Revisions) minus the (Revised Commitment plus Pending Commitment)
Projected Variance: The Projected Budget minus the Projected Commitment

These columns are very powerful for analyzing the relationship between the Budget category and the Commitment category.

Miscellaneous

There are a few miscellaneous columns worth noting so that you understand what the numbers mean:

Budget Percent: This represents the Actuals Issued column as a percentage of the Revised Budget column for that Cost Code.
Commitment Percentage: This represents the Actuals Received column as a percentage of the Revised Commitment column for that Cost Code.
Planned to Commit: This is the planned expenditure for the Cost Code. This value can only be populated from the Details tab of the Cost Code. It is also used for an estimator's value of the Cost Code.

Drilling down to the detail

The beauty of the Cost Worksheet is the ability to quickly review which documents have had an effect on which column of the worksheet. Look at the Cost Worksheet as a ten-thousand-foot view of the money on your project. There is a lot of information that can be gleaned from this high-level review, especially if you are using layouts properly. If you see some numbers that need further review, drilling down to the detail directly from the Cost Worksheet is quite simple.

To drill down to the detail, click on the Cost Code. This will bring up a new page with tabs for the different categories. Click on the tab you wish to review and the grid shows all the documents where some or all of the money has been posted to this Cost Code. This page shows all the columns affected by the selected category, with the rows representing each document and the corresponding value from that document that affects the selected Cost Code on the Cost Worksheet.
From this page you can click on the link under the Item column (as shown in the previous screenshot) to open the actual document that the row represents.

Summary

Understanding the concepts in this article is key to understanding how the money flows within PCM. Take the time to review this information so that the other articles on changes, payments, and forecasting make more sense. The ability to have all aspects of the money on your project accessible from one module is extremely powerful, and the Cost Worksheet should be one of the modules that you refer to on a regular basis.

Resources for Article:

Further resources on this subject:

Author Podcast - Ronald Rood discusses the birth of Oracle Scheduler [Article]
Author Podcast - Bob Griesemer on Oracle Warehouse Builder 11g [Article]
Oracle Integration and Consolidation Products [Article]

Generating Reports in Notebooks in RStudio

Packt
26 Mar 2013
7 min read
(For more resources related to this topic, see here.)

A very important feature of reproducible science is generating reports. The main idea of automatic report generation is that the results of analyses are not manually copied into the report. Instead, both the R code and the report's text are combined in one or more plain text files. The report is generated by a tool that executes the chunks of code, captures the results (including figures), and generates the report by weaving the report's text and results together. To achieve this, you need to learn a few special commands, called markup specifiers, that tell the report generator which part of your text is R code, and which parts you want in special typesetting such as boldface or italic. There are several markup languages to do this, but the following is a minimal example using the Markdown language:

A simple example with Markdown

The left panel shows the plain text file in RStudio's editor and the right panel shows the web page that is generated by clicking on the Knit HTML button. The markup specifiers used here are double asterisks for boldface, single underscores for slanted font, and backticks for code. By adding an r after the first backtick, the report generator executes the code following it. To reproduce this example, go to File | New | R Markdown, copy the text as shown in the preceding screenshot, and save it as one.Rmd. Next, click on Knit HTML.

The Markdown language is one of many markup languages in existence and RStudio supports several of them. RStudio has excellent support for interweaving code with Markdown, HTML, LaTeX, or even plain comments. Notebooks are useful to quickly share annotated lines of code or results, and there are a few ways to control the layout of a notebook. The Markdown language is easy to learn and has a fair number of layout options; it also allows you to include equations in the LaTeX format. The HTML option is really only useful if you aim to create a web page, and you should know, or be willing to learn, HTML to use it. The result of these three methods is always a web page (that is, an HTML file), although this can be exported to PDF. If you need ultimate control over your document's layout, and if you need features like automated bibliographies and equation numbering, LaTeX is the way to go. With this last option, it is possible to create papers for scientific journals straight from your analysis.

Depending on the chosen system, a text file with a different extension is used as the source file. The following table gives an overview:

Markup system    Input file type    Report file type
Notebook         .R                 .html (via .md)
Markdown         .Rmd               .html (via .md)
HTML             .Rhtml             .html
LaTeX            .Rnw               .pdf (via .tex)

Finally, we note that the interweaving of code and text (often referred to as literate programming) may serve two purposes. The first, described in this article, is to generate a data analysis report by executing code to produce the results. The second is to document the code itself, for example, by describing the purpose of a function and all its arguments.

Prerequisites for report generation

For notebooks, R Markdown, and Rhtml, RStudio relies on Yihui Xie's knitr package for executing code chunks and merging the results. The knitr package can be installed via RStudio's Packages tab or with the command install.packages("knitr"). For LaTeX/Sweave files, the default is to use R's native Sweave driver.
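As a brief aside, knitr can also be driven from the R console rather than the Knit HTML button. The following is only a minimal sketch (it assumes the markdown package is available for the final conversion step, and uses the one.Rmd file created above):

# Install knitr if it is not available yet.
if (!require("knitr")) install.packages("knitr")
library(knitr)

# Run the R chunks in one.Rmd and write the intermediate Markdown file one.md.
knit("one.Rmd")

# Convert the Markdown file to a web page using the markdown package.
markdown::markdownToHTML("one.md", "one.html")

This is roughly what happens behind the scenes when you click on Knit HTML.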
The knitr package is easier to use and has more options for fine-tuning, so in the rest of this article we assume that knitr is always used. To make sure that knitr is also used for Sweave files, go to Tools | Options | Sweave and choose knitr under Weave Rnw files. If you're working in an RStudio project, you can set this as a project option as well by navigating to Project | Project Options | Sweave.

When you work with LaTeX/Sweave, you need to have a working LaTeX distribution installed. Popular distributions are TeXLive for Linux, MikTeX for Windows, and MacTeX for Mac OS X.

Notebook

The easiest way to generate a quick, sharable report straight from your R script is by creating a notebook via File | Notebook, or by clicking on the Notebook button at the top right of the R script tab (right next to the Source button).

Notebook options

RStudio offers three ways to generate a notebook from an R script. The simplest are Default and knitr::stitch, which differ only a little in layout. The knitr::spin mode allows you to use the Markdown markup language to specify the text layout. The markup options are presented after navigating to File | Notebook or after clicking on the Notebook button.

Under the hood, the Default and knitr::stitch options use knitr to generate a Markdown file, which is then directly converted to a web page (an HTML file). The knitr::spin mode allows for using Markdown commands in your comments and will convert your .R file to a .Rmd (R Markdown) file before further processing.

In Default mode, R code and printed results are rendered as code blocks in a fixed-width font with a different background color. Figures are included in the output and the document is prepended with a title, an optional author name, and the date. The only option to include text in your output is to add it as an R comment (behind the # sign), and it will be rendered as such.

In knitr::stitch mode, instead of prepending the report with an author name and date, the report is appended with a call to Sys.time() and R's sessionInfo(). The latter is useful since it shows the context in which the code was executed, including R's version, locale settings, and loaded packages. The result of the knitr::stitch mode depends on a template file called knitr-template.Rnw, included with the knitr package. It is stored in a directory that you can find by typing system.file('misc', package='knitr').

The knitr::spin mode allows you to escape from the simple notebook and add text outside of code blocks, using special markup specifiers. In particular, all comment lines that are preceded by #' (hash and single quote) are interpreted as Markdown text. For example, the following code block:

# This is printed as comment in a code block
1 + 1
#' This will be rendered as main text
#' Markdown **specifiers** are also _recognized_

will be rendered in the knitr::spin mode as shown in the following screenshot:

Reading a notebook in the knitr::spin mode allows for escaping to Markdown

The knitr package has several general layout options for included code (these will be discussed in the next section). When generating a notebook in the knitr::spin mode, these options can be set by preceding them with #+ (hash and plus sign).
For example, the following code:

#' The code below is _not_ evaluated
#+ eval=FALSE
1 + 1

results in the following report:

Setting knitr options for a notebook in knitr::spin mode

Although it is convenient to be able to use Markdown commands in the knitr::spin mode, once you need such options it is often better to switch to R Markdown completely, as discussed in the next section. Note that a notebook is a valid R script and can be executed as such. This is in contrast with the other report generation options, which are text files that need knitr or Sweave to be processed.

Publishing a notebook

Notebooks are ideal for sharing examples or quick results from fairly simple data analyses. Since early 2012, the creators of RStudio have offered a website called RPubs.com, where you can upload your notebooks by clicking on the Publish button in the notebook preview window that automatically opens after a notebook has been generated. Do note that this means the results will be available for the world to see, so be careful when using personal or otherwise private data.

Summary

In this article we discussed the prerequisites for producing a report. We also learnt how to produce reports via notebooks that automatically include the results of an analysis.

Resources for Article:

Further resources on this subject:

Organizing, Clarifying and Communicating the R Data Analyses [Article]
Customizing Graphics and Creating a Bar Chart and Scatterplot in R [Article]
Graphical Capabilities of R [Article]

Creating the first Circos diagram

Packt
25 Mar 2013
6 min read
(For more resources related to this topic, see here.)

Getting ready

Let's start with the simple task of graphing the relationship between students' eye and hair color. We can expect some results: brown eyes are more common for students with brown or black hair, and blue eyes are more common amongst blondes. Circos is able to show these relationships with more clarity than a traditional table.

We will be using the hair and eye color data available in the book's supplemental materials (HairEyeColor.csv). The data contains information about the hair and eye color of University of Delaware students. Create a folder C:\Users\user_name\Circos Book\HairEyeColor, and place the data file in this location. Here, user_name denotes the user name that is used to log in to your computer.

The original data is in the raw form in which such data is typically stored in a data set: each line represents a student and their respective hair (black, brown, blonde, or red) and eye (blue, brown, green, or hazel) color. The following table shows the first 10 lines of data:

Hair     Eye
Brown    Brown
Red      Brown
Blonde   Blue
Brown    Hazel
Blonde   Blue
Brown    Blue
Black    Brown
Brown    Brown
Brown    Hazel

Before we start creating the specific diagram, let's prepare the data into a table. If you wish, you can use Microsoft Excel's PivotTable or OpenOffice's Data Pilot to transform it into a table as follows:

         Blue   Brown   Green   Hazel
Black      20      68       5      15
Blonde     94       7      15      11
Brown      84     119      29      54
Red        17      26      14      14

In order to use the data for Circos, we need a simpler format. Open a text file and create a table separated only by spaces. We will also change the row and column titles to make them clearer, as follows:

X Blue_Eyes Brown_Eyes Green_Eyes Hazel_Eyes
Black_Hair 20 68 5 15
Blonde_Hair 94 7 15 11
Brown_Hair 84 119 29 54
Red_Hair 17 26 14 14

The X is simply a placeholder. Save this file as HairEyeColorTable.txt; we are now ready to use Circos. You can skip the process of making the raw tables. We will be using the HairEyeColorTable.txt file to create the Circos diagram (an R sketch that builds this file directly from the CSV is shown at the end of this recipe).

How to do it…

1. Open the Command Prompt and change the directory to the location of the tableviewer tools in Circos\Circos Tools\tools\tableviewer\bin, as follows:

cd C:\Program Files (x86)\Circos\Circos Tools\tools\tableviewer\bin

2. Parse the text table (HairEyeColorTable.txt). This will create a new file, HairEyeColorTable-parsed.txt, which will be refined into a Circos diagram, as follows:

perl parse-table -file "C:\Users\user_name\Circos Book\HairEyeColor\HairEyeColorTable.txt" > "C:\Users\user_name\Circos Book\HairEyeColor\HairEyeColorTable-parsed.txt"

The parse command consists of a few parts. First, perl parse-table instructs Perl to execute the parse program on the HairEyeColorTable.txt file. Second, the > symbol instructs Windows to write the output into another text file called HairEyeColorTable-parsed.txt.

Linux Users
Linux users can use a simpler, shorter syntax. Steps 2 and 3 can be completed with this command:

cat "~/Documents/Circos Book/HairEyeColor/HairEyeColorTable.txt" | bin/parse-table | bin/make-conf -dir "~/Documents/user_name/Circos Book/HairEyeColor/HairEyeColorTable-parsed.txt"

3. Create the configuration files from the parsed table using the following command:

type "C:\Users\user_name\Circos Book\HairEyeColor\HairEyeColorTable-parsed.txt" | perl make-conf -dir "C:\Users\user_name\Circos Book\HairEyeColor"

This will create 11 new configuration files. These files contain the data and style information that is needed to create the final diagram. This command consists of two parts.
We are instructing Windows to pass the text in the HairEyeColorTable-parsed.txt file to the make-conf command. The | (pipe) character separates what we want passed along from the actual command. After the pipe, we are instructing Perl to execute the make-conf command and store the output in a new directory.

4. We need to create a final file that compiles all the information. This file will also tell Circos how the diagram should appear, such as its size, labels, and image style, and where the diagram will be saved. We will save this file as HairEyeColor.conf. The make-conf command gave us the colors.conf file, which associates colors with the final diagram. In addition, the Circos installation provides us with some other basic colors and fonts. The first several lines of code are:

<colors>
<<include colors.conf>>
<<include C:\Program Files (x86)\Circos\etc\colors.conf>>
</colors>
<fonts>
<<include C:\Program Files (x86)\Circos\etc\fonts.conf>>
</fonts>

The next segment is the ideogram. These are the parameters that set the details of the image. This first set of lines specifies the spacing, color, and size of the chromosomes:

<ideogram>
<spacing>
default=0.01r
break=200u
</spacing>
thickness = 100p
stroke_thickness = 2
stroke_color = black
fill = yes
fill_color = black
radius = 0.7r
show_label = yes
label_font = condensedbold
label_radius = dim(ideogram,radius) + 0.05r
label_size = 48
band_stroke_thickness = 2
show_bands = yes
fill_bands = yes
</ideogram>

Next, we will define the image, including where it is stored (this location is given by dir in the following code snippet), the file name, whether we want an SVG or PNG file, its size, background color, and any rotation:

dir = C:\Users\user_name\Circos Book\HairEyeColor
file = HairEyeColor
svg = yes
png = yes
24bit = yes
radius = 800p
background = white
angle_offset = +90

Lastly, we will input the data and define how the links (ribbons) should look:

chromosomes_units = 1
karyotype = karyotype.txt
<links>
z = 0
radius = 1r - 150p
bezier_radius = 0.2r
<link cell_>
ribbon = yes
flat = yes
show = yes
color = black
thickness = 2
file = cells.txt
</link>
show_bands = yes
<<include C:\Program Files (x86)\Circos\etc\housekeeping.conf>>

Save this file as HairEyeColor.conf with the other configuration files. Have a look at the next diagram, which explains this whole procedure:

The make-conf command outputs a few very important files. First, karyotype.txt defines each ideogram band's name, width, and color. Meanwhile, cells.txt is the segdup file containing the actual data. It is very different from our original table, but it dictates the width of each ribbon. Circos links the karyotype and segdup files to create the image. The other configuration files mostly set the aesthetics, placement, and size of the diagram.

5. Return to the Command Prompt and execute the following command:

cd C:\Users\user_name\Circos Book\HairEyeColor
perl "C:\Program Files (x86)\Circos\bin\circos" -conf HairEyeColor.conf

Several lines of text will scroll across the screen. At the conclusion, HairEyeColor.png and HairEyeColor.svg will appear in the folder, as shown in the next diagram:
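As promised in the Getting ready section, the cross-tabulation step can also be done in R instead of a spreadsheet. The following is only a sketch, not part of the original recipe: it assumes the raw CSV has the Hair and Eye columns shown earlier, with the same hair and eye labels, and it writes a space-separated file in the same layout as HairEyeColorTable.txt:

# A sketch for building HairEyeColorTable.txt from the raw CSV in R.
raw <- read.csv("HairEyeColor.csv", stringsAsFactors = FALSE)

# Cross-tabulate hair (rows) against eye color (columns), using labels
# such as Black_Hair and Blue_Eyes to match the table shown earlier.
tab <- table(paste0(raw$Hair, "_Hair"), paste0(raw$Eye, "_Eyes"))

# Write a space-separated table with "X" as the placeholder corner label.
out <- cbind(X = rownames(tab), as.data.frame.matrix(tab))
write.table(out, "HairEyeColorTable.txt",
            quote = FALSE, row.names = FALSE, sep = " ")

The resulting file can then be fed to parse-table and make-conf exactly as in steps 2 and 3.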