Data | Tech News, Tutorials & Expert Insights

article-image-machine-learning-using-spark-mllib

01 Apr 2015

22 min read

Machine Learning Using Spark MLlib

01 Apr 2015

0
0
7085

article-image-basic-concepts-machine-learning-and-logistic-regression-example-mahout

Packt

30 Mar 2015

33 min read

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

Packt

30 Mar 2015

33 min read

0
0
4995

Packt

30 Mar 2015

28 min read

PostgreSQL – New Features

Packt

30 Mar 2015

28 min read

In this article, Jayadevan Maymala, author of the book, PostgreSQL for Data Architects, you will see how to troubleshoot the initial hiccups faced by people who are new to PostgreSQL. We will look at a few useful, but not commonly used data types. We will also cover pgbadger, a nifty third-party tool that can run through a PostgreSQL log. This tool can tell us a lot about what is happening in the cluster. Also, we will look at a few key features that are part of PostgreSQL 9.4 release. We will cover a couple of useful extensions. (For more resources related to this topic, see here.) Interesting data types We will start with the data types. PostgreSQL does have all the common data types we see in databases. These include: The number data types (smallint, integer, bigint, decimal, numeric, real, and double) The character data types (varchar, char, and text) The binary data types The date/time data types (including date, timestamp without timezone, and timestamp with timezone) BOOLEAN data types However, this is all standard fare. Let's start off by looking at the RANGE data type. RANGE This is a data type that can be used to capture values that fall in a specific range. Let's look at a few examples of use cases. Cars can be categorized as compact, convertible, MPV, SUV, and so on. Each of these categories will have a price range. For example, the price range of a category of cars can start from $15,000 at the lower end and the price range at the upper end can start from $40,000. We can have meeting rooms booked for different time slots. Each room is booked during different time slots and is available accordingly. Then, there are use cases that involve shift timings for employees. Each shift begins at a specific time, ends at a specific time, and involves a specific number of hours on duty. We would also need to capture the swipe-in and swipe-out time for employees. These are some use cases where we can consider range types. Range is a high-level data type; we can use int4range as the appropriate subtype for the car price range scenario. For the booking the meeting rooms and shifting use cases, we can consider tsrange or tstzrange (if we want to capture time zone as well). It makes sense to explore the possibility of using range data types in most scenarios, which involve the following features: From and to timestamps/dates for room reservations Lower and upper limit for price/discount ranges Scheduling jobs Timesheets Let's now look at an example. We have three meeting rooms. The rooms can be booked and the entries for reservations made go into another table (basic normalization principles). How can we find rooms that are not booked for a specific time period, say, 10:45 to 11:15? We will look at this with and without the range data type: CREATE TABLE rooms(id serial, descr varchar(50)); INSERT INTO rooms(descr) SELECT concat('Room ', generate_series(1,3)); CREATE TABLE room_book (id serial , room_id integer, from_time timestamp, to_time timestamp , res tsrange); INSERT INTO room_book (room_id,from_time,to_time,res) values(1,'2014-7-30 10:00:00', '2014-7-30 11:00:00', '(2014-7-30 10:00:00,2014-7-30 11:00:00)'); INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 10:00:00', '2014-7-30 10:40:00', '(2014-7-30 10:00,2014-7-30 10:40:00)'); INSERT INTO room_book (room_id,from_time,to_time,res) values(2,'2014-7-30 11:20:00', '2014-7-30 12:00:00', '(2014-7-30 11:20:00,2014-7-30 12:00:00)'); INSERT INTO room_book (room_id,from_time,to_time,res) values(3,'2014-7-30 11:00:00', '2014-7-30 11:30:00', '(2014-7-30 11:00:00,2014-7-30 11:30:00)'); PostgreSQL has the OVERLAPS operator. This can be used to get all the reservations that overlap with the period for which we wanted to book a room: SELECT room_id FROM room_book WHERE (from_time,to_time) OVERLAPS ('2014-07-30 10:45:00','2014-07-30 11:15:00'); If we eliminate these room IDs from the master list, we have the list of rooms available. So, we prefix the following command to the preceding SQL: SELECT id FROM rooms EXCEPT We get a room ID that is not booked from 10:45 to 11:15. This is the old way of doing it. With the range data type, we can write the following SQL statement: SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE res && '(2014-07-30 10:45:00,2014-07-30 11:15:00)'; Do look up GIST indexes to improve the performance of queries that use range operators. Another way of achieving the same is to use the following command: SELECT id FROM rooms EXCEPT SELECT room_id FROM room_book WHERE '2014-07-30 10:45:00' < to_time AND '2014-07-30 11:15:00' > from_time; Now, let's look at the finer points of how a range is represented. The range values can be opened using [ or ( and closed with ] or ). [ means include the lower value and ( means exclude the lower value. The closing (] or )) has a similar effect on the upper values. When we do not specify anything, [) is assumed, implying include the lower value, but exclude the upper value. Note that the lower bound is 3 and upper bound is 6 when we mention 3,5, as shown here: SELECT int4range(3,5,'[)') lowerincl ,int4range(3,5,'[]') bothincl, int4range(3,5,'()') bothexcl , int4range(3,5,'[)') upperexcl; lowerincl | bothincl | bothexcl | upperexcl -----------+----------+----------+----------- [3,5) | [3,6) | [4,5) | [3,5) Using network address types The network address types are cidr, inet, and macaddr. These are used to capture IPv4, IPv6, and Mac addresses. Let's look at a few use cases. When we have a website that is open to public, a number of users from different parts of the world access it. We may want to analyze the access patterns. Very often, websites can be used by users without registering or providing address information. In such cases, it becomes even more important that we get some insight into the users based on the country/city and similar location information. When anonymous users access our website, an IP is usually all we get to link the user to a country or city. Often, this becomes our not-so-accurate unique identifier (along with cookies) to keep track of repeat visits, to analyze website-usage patterns, and so on. The network address types can also be useful when we develop applications that monitor a number of systems in different networks to check whether they are up and running, to monitor resource consumption of the systems in the network, and so on. While data types (such as VARCHAR or BIGINT) can be used to store IP addresses, it's recommended to use one of the built-in types PostgreSQL provides to store network addresses. There are three data types to store network addresses. They are as follows: inet: This data type can be used to store an IPV4 or IPV6 address along with its subnet. The format in which data is to be inserted is Address/y, where y is the number of bits in the netmask. cidr: This data type can also be used to store networks and network addresses. Once we specify the subnet mask for a cidr data type, PostgreSQL will throw an error if we set bits beyond the mask, as shown in the following example: CREATE TABLE nettb (id serial, intclmn inet, cidrclmn cidr); CREATE TABLE INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/32', '192.168.64.2/32'); INSERT 0 1 INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.2/24'); ERROR: invalid cidr value: "192.168.64.2/24" LINE 1: ...b (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.6... ^ DETAIL: Value has bits set to right of mask. INSERT INTO nettb (intclmn , cidrclmn) VALUES ('192.168.64.2/24', '192.168.64.0/24'); INSERT 0 1 SELECT * FROM nettb; id | intclmn | cidrclmn ----+-----------------+----------------- 1 | 192.168.64.2 | 192.168.64.2/32 2 | 192.168.64.2/24 | 192.168.64.0/24 Let's also look at a couple of useful operators available within network address types. Does an IP fall in a subnet? This can be figured out using <<=, as shown here: SELECT id,intclmn FROM nettb ; id | intclmn ----+-------------- 1 | 192.168.64.2 3 | 192.168.12.2 4 | 192.168.13.2 5 | 192.168.12.4 SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/24'; id | intclmn 3 | 192.168.12.2 5 | 192.168.12.4 SELECT id,intclmn FROM nettb where intclmn <<= inet'192.168.12.2/32'; id | intclmn 3 | 192.168.12.2 The operator used in the preceding command checks whether the column value is contained within or equal to the value we provided. Similarly, we have the equality operator, that is, greater than or equal to, bitwise AND, bitwise OR, and other standard operators. The macaddr data type can be used to store Mac addresses in different formats. hstore for key-value pairs A key-value store available in PostgreSQL is hstore. Many applications have requirements that make developers look for a schema-less data store. They end up turning to one of the NoSQL databases (Cassandra) or the simple and more prevalent stores such as Redis or Riak. While it makes sense to opt for one of these if the objective is to achieve horizontal scalability, it does make the system a bit complex because we now have more moving parts. After all, most applications do need a relational database to take care of all the important transactions along with the ability to write SQL to fetch data with different projections. If a part of the application needs to have a key-value store (and horizontal scalability is not the prime objective), the hstore data type in PostgreSQL should serve the purpose. It may not be necessary to make the system more complex by using different technologies that will also add to the maintenance overhead. Sometimes, what we want is not an entirely schema-less database, but some flexibility where we are certain about most of our entities and their attributes but are unsure about a few. For example, a person is sure to have a few key attributes such as first name, date of birth, and a couple of other attributes (irrespective of his nationality). However, there could be other attributes that undergo change. A U.S. citizen is likely to have a Social Security Number (SSN); someone from Canada has a Social Insurance Number (SIN). Some countries may provide more than one identifier. There can be more attributes with a similar pattern. There is usually a master attribute table (which links the IDs to attribute names) and a master table for the entities. Writing queries against tables designed on an EAV approach can get tricky. Using hstore may be an easier way of accomplishing the same. Let's see how we can do this using hstore with a simple example. The hstore key-value store is an extension and has to be installed using CREATE EXTENSION hstore. We will model a customer table with first_name and an hstore column to hold all the dynamic attributes: CREATE TABLE customer(id serial, first_name varchar(50), dynamic_attributes hstore); INSERT INTO customer (first_name ,dynamic_attributes) VALUES ('Michael','ssn=>"123-465-798" '), ('Smith','ssn=>"129-465-798" '), ('James','ssn=>"No data" '), ('Ram','uuid=>"1234567891" , npr=>"XYZ5678", ratnum=>"Somanyidentifiers" '); Now, let's try retrieving all customers with their SSN, as shown here: SELECT first_name, dynamic_attributes FROM customer WHERE dynamic_attributes ? 'ssn'; first_name | dynamic_attributes Michael | "ssn"=>"123-465-798" Smith | "ssn"=>"129-465-798" James | "ssn"=>"No data" Also, those with a specific SSN: SELECT first_name,dynamic_attributes FROM customer WHERE dynamic_attributes -> 'ssn'= '123-465-798'; first_name | dynamic_attributes - Michael | "ssn"=>"123-465-798" If we want to get records that do not contain a specific SSN, just use the following command: WHERE NOT dynamic_attributes -> 'ssn'= '123-465-798' Also, replacing it with WHERE NOT dynamic_attributes ? 'ssn'; gives us the following command: first_name | dynamic_attributes ------------+----------------------------------------------------- Ram | "npr"=>"XYZ5678", "uuid"=>"1234567891", "ratnum"=>"Somanyidentifiers" As is the case with all data types in PostgreSQL, there are a number of functions and operators available to fetch data selectively, update data, and so on. We must always use the appropriate data types. This is not just for the sake of doing it right, but because of the number of operators and functions available with a focus on each data type; hstore stores only text. We can use it to store numeric values, but these values will be stored as text. We can index the hstore columns to improve performance. The type of index to be used depends on the operators we will be using frequently. json/jsonb JavaScript Object Notation (JSON) is an open standard format used to transmit data in a human-readable format. It's a language-independent data format and is considered an alternative to XML. It's really lightweight compared to XML and has been steadily gaining popularity in the last few years. PostgreSQL added the JSON data type in Version 9.2 with a limited set of functions and operators. Quite a few new functions and operators were added in Version 9.3. Version 9.4 adds one more data type: jsonb.json, which is very similar to JSONB. The jsonb data type stores data in binary format. It also removes white spaces (which are insignificant) and avoids duplicate object keys. As a result of these differences, JSONB has an overhead when data goes in, while JSON has extra processing overhead when data is retrieved (consider how often each data point will be written and read). The number of operators available with each of these data types is also slightly different. As it's possible to cast one data type to the other, which one should we use depends on the use case. If the data will be stored as it is and retrieved without any operations, JSON should suffice. However, if we plan to use operators extensively and want indexing support, JSONB is a better choice. Also, if we want to preserve whitespace, key ordering, and duplicate keys, JSON is the right choice. Now, let's look at an example. Assume that we are doing a proof of concept project for a library management system. There are a number of categories of items (ranging from books to DVDs). We wouldn't have information about all the categories of items and their attributes at the piloting stage. For the pilot stage, we could use a table design with the JSON data type to hold various items and their attributes: CREATE TABLE items ( item_id serial, details json ); Now, we will add records. All DVDs go into one record, books go into another, and so on: INSERT INTO items (details) VALUES ('{ "DVDs" :[ {"Name":"The Making of Thunderstorms", "Types":"Educational", "Age-group":"5-10","Produced By":"National Geographic" }, {"Name":"My nightmares", "Types":"Movies", "Categories":"Horror", "Certificate":"A", "Director":"Dracula","Actors": [{"Name":"Meena"},{"Name":"Lucy"},{"Name":"Van Helsing"}] }, {"Name":"My Cousin Vinny", "Types":"Movies", "Categories":"Suspense", "Certificate":"A", "Director": "Jonathan Lynn","Actors": [{"Name":"Joe "},{"Name":"Marissa"}] }] }' ); A better approach would be to have one record for each item. Now, let's take a look at a few JSON functions: SELECT details->>'DVDs' dvds, pg_typeof(details->>'DVDs') datatype FROM items; SELECT details->'DVDs' dvds ,pg_typeof(details->'DVDs') datatype FROM items; Note the difference between ->> and -> in the following screenshot. We are using the pg_typeof function to clearly see the data type returned by the functions. Both return the JSON object field. The first function returns text and the second function returns JSON: Now, let's try something a bit more complex: retrieve all movies in DVDs in which Meena acted with the following SQL statement: WITH tmp (dvds) AS (SELECT json_array_elements(details->'DVDs') det FROM items) SELECT * FROM tmp , json_array_elements(tmp.dvds#>'{Actors}') as a WHERE a->>'Name'='Meena'; We get the record as shown here: We used one more function and a couple of operators. The json_array_elements expands a JSON array to a set of JSON elements. So, we first extracted the array for DVDs. We also created a temporary table, which ceases to exist as soon as the query is over, using the WITH clause. In the next part, we extracted the elements of the array actors from DVDs. Then, we checked whether the Name element is equal to Meena. XML PostgreSQL added the xml data type in Version 8.3. Extensible Markup Language (XML) has a set of rules to encode documents in a format that is both human-readable and machine-readable. This data type is best used to store documents. XML became the standard way of data exchanging information across systems. XML can be used to represent complex data structures such as hierarchical data. However, XML is heavy and verbose; it takes more bytes per data point compared to the JSON format. As a result, JSON is referred to as fat-free XML. XML structure can be verified against XML Schema Definition Documents (XSD). In short, XML is heavy and more sophisticated, whereas JSON is lightweight and faster to process. We need to configure PostgreSQL with libxml support (./configure --with-libxml) and then restart the cluster for XML features to work. There is no need to reinitialize the database cluster. Inserting and verifying XML data Now, let's take a look at what we can do with the xml data type in PostgreSQL: CREATE TABLE tbl_xml(id serial, docmnt xml); INSERT INTO tbl_xml(docmnt ) VALUES ('Not xml'); INSERT INTO tbl_xml (docmnt) SELECT query_to_xml( 'SELECT now()',true,false,'') ; SELECT xml_is_well_formed_document(docmnt::text), docmnt FROM tbl_xml; Then, take a look at the following screenshot: First, we created a table with a column to store the XML data. Then, we inserted a record, which is not in the XML format, into the table. Next, we used the query_to_xml function to get the output of a query in the XML format. We inserted this into the table. Then, we used a function to check whether the data in the table is well-formed XML. Generating XML files for table definitions and data We can use the table_to_xml function if we want to dump the data from a table in the XML format. Append and_xmlschema so that the function becomes table_to_xml_and_xmlschema, which will also generate the schema definition before dumping the content. If we want to generate just the definitions, we can use table_to_xmlschema. PostgreSQL also provides the xpath function to extract data as follows: SELECT xpath('/table/row/now/text()',docmnt) FROM tbl_xml WHERE id = 2; xpath ------------------------------------ {2014-07-29T16:55:00.781533+05:30} Using properly designed tables with separate columns to capture each attribute is always the best approach from a performance standpoint and update/write-options perspective. Data types such as json/xml are best used to temporarily store data when we need to provide feeds/extracts/views to other systems or when we get data from external systems. They can also be used to store documents. The maximum size for a field is 1 GB. We must consider this when we use the database to store text/document data. pgbadger Now, we will look at a must-have tool if we have just started with PostgreSQL and want to analyze the events taking place in the database. For those coming from an Oracle background, this tool provides reports similar to AWR reports, although the information is more query-centric. It does not include data regarding host configuration, wait statistics, and so on. Analyzing the activities in a live cluster provides a lot of insight. It tells us about load, bottlenecks, which queries get executed frequently (we can focus more on them for optimization). It even tells us if the parameters are set right, although a bit indirectly. For example, if we see that there are many temp files getting created while a specific query is getting executed, we know that we either have a buffer issue or have not written the query right. For pgbadger to effectively scan the log file and produce useful reports, we should get our logging configuration right as follows: log_destination = 'stderr' logging_collector = on log_directory = 'pg_log' log_filename = 'postgresql-%Y-%m-%d.log' log_min_duration_statement = 0 log_connections = on log_disconnections = on log_duration = on log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d ' log_lock_waits = on track_activity_query_size = 2048 It might be necessary to restart the cluster for some of these changes to take effect. We will also ensure that there is some load on the database using pgbench. It's a utility that ships with PostgreSQL and can be used to benchmark PostgreSQL on our servers. We can initialize the tables required for pgbench by executing the following command at shell prompt: pgbench -i pgp This creates a few tables on the pgp database. We can log in to psql (database pgp) and check: \dt List of relations Schema | Name | Type | Owner --------+------------------+-------+---------- public | pgbench_accounts | table | postgres public | pgbench_branches | table | postgres public | pgbench_history | table | postgres public | pgbench_tellers | table | postgres Now, we can run pgbench to generate load on the database with the following command: pgbench -c 5 -T10 pgp The T option passes the duration for which pgbench should continue execution in seconds, c passes the number of clients, and pgp is the database. At shell prompt, execute: wget https://github.com/dalibo/pgbadger/archive/master.zip Once the file is downloaded, unzip the file using the following command: unzip master.zip Use cd to the directory pgbadger-master as follows: cd pgbadger-master Execute the following command: ./pgbadger /pgdata/9.3/pg_log/postgresql-2014-07-31.log –o myoutput.html Replace the log file name in the command with the actual name. It will generate a myoutput.html file. The HTML file generated will have a wealth of information about what happened in the cluster with great charts/tables. In fact, it takes quite a bit of time to go through the report. Here is a sample chart that provides the distribution of queries based on execution time: The following screenshot gives an idea about the number of performance metrics provided by the report: If our objective is to troubleshoot performance bottlenecks, the slowest individual queries and most frequent queries under the top drop-down list is the right place to start. Once the queries are identified, locks, temporary file generation, and so on can be studied to identify the root cause. Of course, EXPLAIN is the best option when we want to refine individual queries. If the objective is to understand how busy the cluster is, the Overview section and Sessions are the right places to explore. The logging configuration used may create huge log files in systems with a lot of activity. Tweak the parameters appropriately to ensure that this does not happen. With this, we covered most of the interesting data types, an interesting extension and a must-use tool from PostgreSQL ecosystem. Now, let's cover a few interesting features in PostgreSQL Version 9.4. Features over time Applying filters in Versions 8.0, 9.0, and 9.4 gives us a good idea about how quickly features are getting added to the database. Interesting features in 9.4 Each version of PostgreSQL adds many features grouped into different categories (such as performance, backend, data types, and so on). We will look at a few features that are more likely to be of interest (because they help us improve performance or they make maintenance and configuration easy). Keeping the buffer ready As we saw earlier, reads from disk have a significant overhead compared to those from memory. There are quite a few occasions when disk reads are unavoidable. Let's see a few examples. In a data warehouse, the Extract, Transform, Load (ETL) process, which may happen once a day usually, involves a lot of raw data getting processed in memory before being loaded into the final tables. This data is mostly transactional data. The master data, which does not get processed on a regular basis, may be evicted from memory as a result of this churn. Reports typically depend a lot on master data. When users refresh their reports after ETL, it's highly likely that the master data will be read from disk, resulting in a drop in the response time. If we could ensure that the master data as well as the recently processed data is in the buffer, it can really improve user experience. In a transactional system like an airline reservation system, a change to the fare rule may result in most of the fares being recalculated. This is a situation similar to the one described previously, ensuring that the fares and availability data for the most frequently searched routes in the buffer can provide a better user experience. This applies to an e-commerce site selling products also. If the product/price/inventory data is always available in memory, it can be retrieved very fast. You must use PostgreSQL 9.4 for trying out the code in the following sections. So, how can we ensure that the data is available in the buffer? A pg_prewarm module has been added as an extension to provide this functionality. The basic syntax is very simple: SELECT pg_prewarm('tablename');. This command will populate the buffers with data from the table. It's also possible to mention the blocks that should be loaded into the buffer from the table. We will install the extension in a database, create a table, and populate some data. Then, we will stop the server, drop buffers (OS), and restart the server. We will see how much time a SELECT count(*) takes. We will repeat the exercise, but we will use pg_prewarm before executing SELECT count(*) at psql: CREATE EXTENSION pg_prewarm; CREATE TABLE myt(id SERIAL, name VARCHAR(40)); INSERT INTO myt(name) SELECT concat(generate_series(1,10000),'name'); Now, stop the server using pg_ctl at the shell prompt: pg_ctl stop -m immediate Clean OS buffers using the following command at the shell prompt (will need to use sudo to do this): echo 1 > /proc/sys/vm/drop_caches The command may vary depending on the OS. Restart the cluster using pg_ctl start. Then, execute the following command: SELECT COUNT(*) FROM myt; Time: 333.115 ms We should repeat the steps of shutting down the server, dropping the cache, and starting PostgreSQL. Then, execute SELECT pg_prewarm('myt'); before SELECT count(*). The response time goes down significantly. Executing pg_prewarm does take some time, which is close to the time taken to execute the SELECT count(*) against a cold cache. However, the objective is to ensure that the user does not experience a delay. SELECT COUNT(*) FROM myt; count ------- 10000 (1 row) Time: 7.002 ms Better recoverability A new parameter called recovery_min_apply_delay has been added in 9.4. This will go to the recovery.conf file of the slave server. With this, we can control the replay of transactions on the slave server. We can set this to approximately 5 minutes and then the standby will replay the transaction from the master when the standby system time is 5 minutes past the time of commit at the master. This provides a bit more flexibility when it comes to recovering from mistakes. When we keep the value at 1 hour, the changes at the master will be replayed at the slave after one hour. If we realize that something went wrong on the master server, we have about 1 hour to stop the transaction replay so that the action that caused the issue (for example, accidental dropping of a table) doesn't get replayed at the slave. Easy-to-change parameters An ALTER SYSTEM command has been introduced so that we don't have to edit postgresql.conf to change parameters. The entry will go to a file named postgresql.auto.conf. We can execute ALTER SYSTEM SET work_mem='12MB'; and then check the file at psql: \! more postgresql.auto.conf # Do not edit this file manually! # It will be overwritten by ALTER SYSTEM command. work_mem = '12MB' We must execute SELECT pg_reload_conf(); to ensure that the changes are propagated. Logical decoding and consumption of changes Version 9.4 introduces physical and logical replication slots. We will look at logical slots as they let us track changes and filter specific transactions. This lets us pick and choose from the transactions that have been committed. We can grab some of the changes, decode, and possibly replay on a remote server. We do not have to have an all-or-nothing replication. As of now, we will have to do a lot of work to decode/move the changes. Two parameter changes are necessary to set this up. These are as follows: The max_replication_slots parameter (set to at least 1) and wal_level (set to logical). Then, we can connect to a database and create a slot as follows: SELECT * FROM pg_create_logical_replication_slot('myslot','test_decoding'); The first parameter is the name we give to our slot and the second parameter is the plugin to be used. Test_decoding is the sample plugin available, which converts WAL entries into text representations as follows: INSERT INTO myt(id) values (4); INSERT INTO myt(name) values ('abc'); Now, we will try retrieving the entries: SELECT * FROM pg_logical_slot_peek_changes('myslot',NULL,NULL); Then, check the following screenshot: This function lets us take a look at the changes without consuming them so that the changes can be accessed again: SELECT * FROM pg_logical_slot_get_changes('myslot',NULL,NULL); This is shown in the following screenshot: This function is similar to the peek function, but the changes are no longer available to be fetched again as they get consumed. Summary In this article, we covered a few data types that data architects will find interesting. We also covered what is probably the best utility available to parse the PostgreSQL log file to produce excellent reports. We also looked at some of the interesting features in PostgreSQL version 9.4, which will be of interest to data architects. Resources for Article: Further resources on this subject: PostgreSQL as an Extensible RDBMS [article] Getting Started with PostgreSQL [article] PostgreSQL Cookbook - High Availability and Replication [article]

0
0
3012

article-image-storm-real-time-high-velocity-computation

Packt

27 Mar 2015

10 min read

Storm for Real-time High Velocity Computation

Packt

27 Mar 2015

10 min read

In this article by Shilpi Saxena, author of the book Real-time Analytics with Storm and Cassandra, we will cover the following topics: What's possible with data analysis? Real-time analytics—why is it becoming the need of the hour Why storm—the power of high speed distributed computations We will get you to think about some interesting problems along the lines of Air Traffic Controller (ATC), credit card fraud detection, and so on. First and foremost, you will understand what is big data. Well, big data is the buzzword of the software industry but it's much more than the buzz in reality, it's really a huge amount of data. (For more resources related to this topic, see here.) What is big data? Big data is equal to volume, veracity, variety, and velocity. The descriptions of these are as follows: Volume: Enterprises are awash with ever growing data of all types, easily amassing terabytes even petabytes of information (for example, convert 12 terabytes of tweets created each day into an improved product sentiment analysis or convert 350 billion annual meter readings to better predict power consumption). Velocity: Sometimes, 2 minutes is too late. For time-sensitive processes, such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value (for example, scrutinize 5 million trade events created each day to identify potential fraud or analyze 500 million call detail records daily in real time to predict the customer churn faster). Variety: Big data is any type of data, structured and unstructured data, such as text, sensor data, audio, video, click streams, log files, and many more. New insights are found when analyzing these data types together (for example, monitor hundreds of live video feeds from surveillance cameras to target points of interest or exploit the 80 percent data growth in images, videos, and documents to improve customer satisfaction). Well now that I have described big data, let's have a quick look at where is this data generated and how does it come into existence. The following figure demonstrates a quick snapshot of what all can happen in one second in the world of the internet and social media. Now, we need the power to process all this data at the same rate at which it is generated to gain some meaningful insight out of it, as shown: The power of computation comes with the Storm and Cassandra combination. This technological combo let's us cater to the following use cases: Credit card fraud detection Security breaches Bandwidth allocation Machine failures Supply chain Personalized content Recommendations Get acquainted to few problems that require distributed computing solution Let's do a deep dive and identify some of the problems which require distributed solutions. Real-time business solution for credit or debit card fraud detection Let's get acquainted to the problem depicted in the following figure; when we make any transaction using plastic money and swipe our debit or credit card for payment, the duration within which the bank has to validate or reject the transaction is less than 5 seconds. During this less than 5 seconds, data or transaction details have to be encrypted, travel over secure network from servicing back bank to issuing back bank, then at the issuing back bank the entire fuzzy logic for acceptance or decline of the transaction has to computed, and the result has to travel back over the secure network: The challenges such as network latency and delay can be optimized to some extent, but to achieve the preceding featuring transaction in less than 5 seconds, one has to design an application that is able to churn a considerable amount of data and generate results in 1 to 2 seconds. Aircraft Communications Addressing and Reporting system It is another typical use case that cannot be implemented without having a reliable real-time processing system in place. These systems use Satellite communication (SATCOM), and as per the following figure, they gather voice and packet data from all phases of flight in real-time and are able to generate analytics and alerts on the same data in real-time. Let's take the example from the figure in the preceding case. A flight encounters some real hazardous weather, say, electric Storms on a route, then that information is sent through satellite links and voice or data gateways to the air controller, which in real-time detects and raises the alerts to deviate routes for all other flights passing through that area. Healthcare This is another very important domain where real-time analytics over high volume and velocity data has equipped the healthcare professionals with accurate and exact information in real-time to take informed life-saving actions. The preceding figure depicts the use case where the doctors can take informed action to handle the medical situation of the patients. Data is collated from historic patient database, drug database, and patient records. Once data is collected it is processed, and live statistics and key parameters of the patient are plotted against the same collated data. This data can be used to further generate reports and alerts to aid the health care professionals in real-time. Other applications There are varieties of other applications where power of real-time computing can either optimize or help people take informed decisions. It has become a great utility and aid in following industries: Manufacturing Application performance monitoring Customer relationship management Transportation industry Network optimization Complexity of existing solutions Now that we understand the power that real-time solutions can get into various industry verticals, let's explore and find out what options do we have to process vast amount of data being generated at a very fast pace. The Hadoop Solution The Hadoop solution is a tried, tested, and proven solution in industry which we use the MapReduce jobs in clustered setup to execute jobs and generate results. MapReduce is a programming paradigm where we process large data sets by using a mapper function that processes a key and value pair and thus generate intermediate output again in form of key-value pair. Then a reduce function operates on the mapper output and merges the values associated with same intermediate key and generates result. In the preceding figure, we demonstrate the simple word count MapReduce job where: There is a huge big data store which can go up to zettabytes and petabytes Blocks of the input data are split and replicated onto each of the nodes in Hadoop cluster Each mapper job counts the number of words on the data blocks allocated to it Once the mapper is done, the words (which are actually the keys) and the counts are sent to reducers Reducers combine the mapper output and the results are generated Big data, as we know, did provide a solution to processing and generating results out of humongous volume of data, but that's predominantly a batch processing system and has almost no utility on real-time use case. A custom solution Here we talk about a solution of the kinds twitter used before the advent of Storm. The simplistic version of the problem could be that you need a real-time count of the tweets by each user; Twitter solved the problem by following mechanism shown in the following figure: Here is the detailed information of how the preceding mechanism works: They created a fire hose or queue onto which all the tweets are pushed. A set of workers' nodes read from the queue and decipher the tweet Json and maintain the count of tweets by each user by different workers. At first set of workers the data or the number of tweets are equally distributed amongst the workers, so they are shared randomly. These workers assimilate these first level count into next set of queues. From these queues (the ones mentioned at level 1) second level of workers pick from these queues. Here the sharding is not random an algorithm is in place which ensures that tweet count of one user always goes to same worker. Then the counts are dumped into data store. The queue-worker solution is described in the following: Very complex and specific to the use case Redeployment and reconfiguration is a huge task Scaling is very tedious System is not fault tolerant Paid solution Well this is always an option, lot of big companies have invested in products which let us do this kind of computing but that comes at a heavy license cost. Few solutions to name are from companies such as: IBM Oracle Vertica Gigaspace Open real-time processing tools There are few other technologies which have some similar traits and features such as Apache Storm and S4 from Yahoo, but it lacks guaranteed processing. Spark is one is essentially a batch processing system with some features on micro-batching, which could be utilized as real-time. So finally after evaluation of all these problems, we still find Storm as the best open-source candidate to handle these use cases. Storm persistence Storm processes the streaming data at very high velocity. Cassandra complements the Storms ability to process by providing support to write and read to NoSQL at a very high rate. There are variety of API's available for connecting with Cassandra. In general the API's we are talking are wrappers written over core thrift API, which offer various crud operations over Cassandra cluster using programmer friendly packages. Thrift protocol: The most basic and core of all APIs for access to Cassandra it is the RPC protocol, which provides a language neutral interface and thus exposes flexibility to communicate using Python, Java and so on. Please note almost all other API's we'd discuss are using Thrift under the hood. It is simple to use and provides basic functionality out of the box such as ring discovery, and native access. Complex features such as retry, connection pooling, and so on are not supported out of the box. We have variety of libraries which have extended Thrift and added these much required features, we'd like to touch upon a few widely used ones in this article. Hector: This is has the privilege of being one of the most stable and extensively used API for java based client applications to access the Cassandra. As said earlier it uses Thrift underneath, so it can't essentially offer any feature or functionality not supported by Thrift protocol. The reasons for its wide spread use are number of essential features ready to use and available out of the box. It has implementation for connection pooling It has ring discovery feature with an add on of automatic failover support It has a retry for downed hosts in Cassandra ring Datastax Java Driver: This one is again a recent addition to the stack of client access options to Cassandra and hence gels well with newer version of Cassandra. Here are the salient features: Connection pooling Reconnection policies Load balancing Cursor support Astyanax: It is a very recent addition to bouquet of Cassandra client API's and has been developed by Netflix, which definitely makes it more fabled than others. Let's have a look at its credentials to see where does it qualifies: It supports all Hector functions and is much more easier to use Promises better connection pooling than hector Has a better failover handling than Hector It gives me some out of the box database like features (now that's a big news) At API level it provides me functionality called Recipes in its terms which provides:Parallel all row query executionMessaging queue functionalityObject storePagination It has numerous frequently required utilities such as following: JSON Writer CVS importer Summary In this article, we reviewed the what is big data, how it is analysed, the applications in which it it used, the complexity of the solutions and the monitoring tools of Storm. Resources for Article: Further resources on this subject: Deploying Storm on Hadoop for Advertising Analysis [article] An overview of architecture and modeling in Cassandra [article] Getting Up and Running with Cassandra [article]

0
0
1959

article-image-big-data-analysis-r-and-hadoop

Packt

26 Mar 2015

37 min read

Big Data Analysis (R and Hadoop)

Packt

26 Mar 2015

37 min read

0
0
4746

Packt

26 Mar 2015

35 min read

Cassandra Architecture

Packt

26 Mar 2015

35 min read

In this article by Nishant Neeraj, the author of the book Mastering Apache Cassandra - Second Edition, aims to set you into a perspective where you will be able to see the evolution of the NoSQL paradigm. It will start with a discussion of common problems that an average developer faces when the application starts to scale up and software components cannot keep up with it. Then, we'll see what can be assumed as a thumb rule in the NoSQL world: the CAP theorem that says to choose any two out of consistency, availability, and partition-tolerance. As we discuss this further, we will realize how much more important it is to serve the customers (availability), than to be correct (consistency) all the time. However, we cannot afford to be wrong (inconsistent) for a long time. The customers wouldn't like to see that the items are in stock, but that the checkout is failing. Cassandra comes into picture with its tunable consistency. (For more resources related to this topic, see here.) Problems in the RDBMS world RDBMS is a great approach. It keeps data consistent, it's good for OLTP (http://en.wikipedia.org/wiki/Online_transaction_processing), it provides access to good grammar, and manipulates data supported by all the popular programming languages. It has been tremendously successful in the last 40 years (the relational data model was proposed in its first incarnation by Codd, E.F. (1970) in his research paper A Relational Model of Data for Large Shared Data Banks). However, in early 2000s, big companies such as Google and Amazon, which have a gigantic load on their databases to serve, started to feel bottlenecked with RDBMS, even with helper services such as Memcache on top of them. As a response to this, Google came up with BigTable (http://research.google.com/archive/bigtable.html), and Amazon with Dynamo (http://www.cs.ucsb.edu/~agrawal/fall2009/dynamo.pdf). If you have ever used RDBMS for a complicated web application, you must have faced problems such as slow queries due to complex joins, expensive vertical scaling, and problems in horizontal scaling. Due to these problems, indexing takes a long time. At some point, you may have chosen to replicate the database, but there was still some locking, and this hurts the availability of the system. This means that under a heavy load, locking will cause the user's experience to deteriorate. Although replication gives some relief, a busy slave may not catch up with the master (or there may be a connectivity glitch between the master and the slave). Consistency of such systems cannot be guaranteed. Consistency, the property of a database to remain in a consistent state before and after a transaction, is one of the promises made by relational databases. It seems that one may need to make compromises on consistency in a relational database for the sake of scalability. With the growth of the application, the demand to scale the backend becomes more pressing, and the developer teams may decide to add a caching layer (such as Memcached) at the top of the database. This will alleviate some load off the database, but now the developers will need to maintain the object states in two places: the database, and the caching layer. Although some Object Relational Mappers (ORMs) provide a built-in caching mechanism, they have their own issues, such as larger memory requirement, and often mapping code pollutes application code. In order to achieve more from RDBMS, we will need to start to denormalize the database to avoid joins, and keep the aggregates in the columns to avoid statistical queries. Sharding or horizontal scaling is another way to distribute the load. Sharding in itself is a good idea, but it adds too much manual work, plus the knowledge of sharding creeps into the application code. Sharded databases make the operational tasks (backup, schema alteration, and adding index) difficult. To find out more about the hardships of sharding, visit http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/. There are ways to loosen up consistency by providing various isolation levels, but concurrency is just one part of the problem. Maintaining relational integrity, difficulties in managing data that cannot be accommodated on one machine, and difficult recovery, were all making the traditional database systems hard to be accepted in the rapidly growing big data world. Companies needed a tool that could support hundreds of terabytes of data on the ever-failing commodity hardware reliably. This led to the advent of modern databases like Cassandra, Redis, MongoDB, Riak, HBase, and many more. These modern databases promised to support very large datasets that were hard to maintain in SQL databases, with relaxed constrains on consistency and relation integrity. Enter NoSQL NoSQL is a blanket term for the databases that solve the scalability issues which are common among relational databases. This term, in its modern meaning, was first coined by Eric Evans. It should not be confused with the database named NoSQL (http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page). NoSQL solutions provide scalability and high availability, but may not guarantee ACID: atomicity, consistency, isolation, and durability in transactions. Many of the NoSQL solutions, including Cassandra, sit on the other extreme of ACID, named BASE, which stands for basically available, soft-state, eventual consistency. Wondering about where the name, NoSQL, came from? Read Eric Evans' blog at http://blog.sym-link.com/2009/10/30/nosql_whats_in_a_name.html. The CAP theorem In 2000, Eric Brewer (http://en.wikipedia.org/wiki/Eric_Brewer_%28scientist%29), in his keynote speech at the ACM Symposium, said, "A distributed system requiring always-on, highly-available operations cannot guarantee the illusion of coherent, consistent single-system operation in the presence of network partitions, which cut communication between active servers". This was his conjecture based on his experience with distributed systems. This conjecture was later formally proved by Nancy Lynch and Seth Gilbert in 2002 (Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, published in ACMSIGACT News, Volume 33 Issue 2 (2002), page 51 to 59 available at http://webpages.cs.luc.edu/~pld/353/gilbert_lynch_brewer_proof.pdf). Let's try to understand this. Let's say we have a distributed system where data is replicated at two distinct locations and two conflicting requests arrive, one at each location, at the time of communication link failure between the two servers. If the system (the cluster) has obligations to be highly available (a mandatory response, even when some components of the system are failing), one of the two responses will be inconsistent with what a system with no replication (no partitioning, single copy) would have returned. To understand it better, let's take an example to learn the terminologies. Let's say you are planning to read George Orwell's book Nineteen Eighty-Four over the Christmas vacation. A day before the holidays start, you logged into your favorite online bookstore to find out there is only one copy left. You add it to your cart, but then you realize that you need to buy something else to be eligible for free shipping. You start to browse the website for any other item that you might buy. To make the situation interesting, let's say there is another customer who is trying to buy Nineteen Eighty-Four at the same time. Consistency A consistent system is defined as one that responds with the same output for the same request at the same time, across all the replicas. Loosely, one can say a consistent system is one where each server returns the right response to each request. In our example, we have only one copy of Nineteen Eighty-Four. So only one of the two customers is going to get the book delivered from this store. In a consistent system, only one can check out the book from the payment page. As soon as one customer makes the payment, the number of Nineteen Eighty-Four books in stock will get decremented by one, and one quantity of Nineteen Eighty-Four will be added to the order of that customer. When the second customer tries to check out, the system says that the book is not available any more. Relational databases are good for this task because they comply with the ACID properties. If both the customers make requests at the same time, one customer will have to wait till the other customer is done with the processing, and the database is made consistent. This may add a few milliseconds of wait to the customer who came later. An eventual consistent database system (where consistency of data across the distributed servers may not be guaranteed immediately) may have shown availability of the book at the time of check out to both the customers. This will lead to a back order, and one of the customers will be paid back. This may or may not be a good policy. A large number of back orders may affect the shop's reputation and there may also be financial repercussions. Availability Availability, in simple terms, is responsiveness; a system that's always available to serve. The funny thing about availability is that sometimes a system becomes unavailable exactly when you need it the most. In our example, one day before Christmas, everyone is buying gifts. Millions of people are searching, adding items to their carts, buying, and applying for discount coupons. If one server goes down due to overload, the rest of the servers will get even more loaded now, because the request from the dead server will be redirected to the rest of the machines, possibly killing the service due to overload. As the dominoes start to fall, eventually the site will go down. The peril does not end here. When the website comes online again, it will face a storm of requests from all the people who are worried that the offer end time is even closer, or those who will act quickly before the site goes down again. Availability is the key component for extremely loaded services. Bad availability leads to bad user experience, dissatisfied customers, and financial losses. Partition-tolerance Network partitioning is defined as the inability to communicate between two or more subsystems in a distributed system. This can be due to someone walking carelessly in a data center and snapping the cable that connects the machine to the cluster, or may be network outage between two data centers, dropped packages, or wrong configuration. Partition-tolerance is a system that can operate during the network partition. In a distributed system, a network partition is a phenomenon where, due to network failure or any other reason, one part of the system cannot communicate with the other part(s) of the system. An example of network partition is a system that has some nodes in a subnet A and some in subnet B, and due to a faulty switch between these two subnets, the machines in subnet A will not be able to send and receive messages from the machines in subnet B. The network will be allowed to lose many messages arbitrarily sent from one node to another. This means that even if the cable between the two nodes is chopped, the system will still respond to the requests. The following figure shows the database classification based on the CAP theorem: An example of a partition-tolerant system is a system with real-time data replication with no centralized master(s). So, for example, in a system where data is replicated across two data centers, the availability will not be affected, even if a data center goes down. The significance of the CAP theorem Once you decide to scale up, the first thing that comes to mind is vertical scaling, which means using beefier servers with a bigger RAM, more powerful processor(s), and bigger disks. For further scaling, you need to go horizontal. This means adding more servers. Once your system becomes distributed, the CAP theorem starts to play, which means, in a distributed system, you can choose only two out of consistency, availability, and partition-tolerance. So, let's see how choosing two out of the three options affects the system behavior as follows: CA system: In this system, you drop partition-tolerance for consistency and availability. This happens when you put everything related to a transaction on one machine or a system that fails like an atomic unit, like a rack. This system will have serious problems in scaling. CP system: The opposite of a CA system is a CP system. In a CP system, availability is sacrificed for consistency and partition-tolerance. What does this mean? If the system is available to serve the requests, data will be consistent. In an event of a node failure, some data will not be available. A sharded database is an example of such a system. AP system: An available and partition-tolerant system is like an always-on system that is at risk of producing conflicting results in an event of network partition. This is good for user experience, your application stays available, and inconsistency in rare events may be alright for some use cases. In our example, it may not be such a bad idea to back order a few unfortunate customers due to inconsistency of the system than having a lot of users return without making any purchases because of the system's poor availability. Eventual consistent (also known as BASE system): The AP system makes more sense when viewed from an uptime perspective—it's simple and provides a good user experience. But, an inconsistent system is not good for anything, certainly not good for business. It may be acceptable that one customer for the book Nineteen Eighty-Four gets a back order. But if it happens more often, the users would be reluctant to use the service. It will be great if the system could fix itself (read: repair) as soon as the first inconsistency is observed; or, maybe there are processes dedicated to fixing the inconsistency of a system when a partition failure is fixed or a dead node comes back to life. Such systems are called eventual consistent systems. The following figure shows the life of an eventual consistent system: Quoting Wikipedia, "[In a distributed system] given a sufficiently long time over which no changes [in system state] are sent, all updates can be expected to propagate eventually through the system and the replicas will be consistent". (The page on eventual consistency is available at http://en.wikipedia.org/wiki/Eventual_consistency.) Eventual consistent systems are also called BASE, a made-up term to represent that these systems are on one end of the spectrum, which has traditional databases with ACID properties on the opposite end. Cassandra is one such system that provides high availability and partition-tolerance at the cost of consistency, which is tunable. The preceding figure shows a partition-tolerant eventual consistent system. Cassandra Cassandra is a distributed, decentralized, fault tolerant, eventually consistent, linearly scalable, and column-oriented data store. This means that Cassandra is made to easily deploy over a cluster of machines located at geographically different places. There is no central master server, so no single point of failure, no bottleneck, data is replicated, and a faulty node can be replaced without any downtime. It's eventually consistent. It is linearly scalable, which means that with more nodes, the requests served per second per node will not go down. Also, the total throughput of the system will increase with each node being added. And finally, it's column oriented, much like a map (or better, a map of sorted maps) or a table with flexible columns where each column is essentially a key-value pair. So, you can add columns as you go, and each row can have a different set of columns (key-value pairs). It does not provide any relational integrity. It is up to the application developer to perform relation management. So, if Cassandra is so good at everything, why doesn't everyone drop whatever database they are using and jump start with Cassandra? This is a natural question. Some applications require strong ACID compliance, such as a booking system. If you are a person who goes by statistics, you'd ask how Cassandra fares with other existing data stores. TilmannRabl and others in their paper, Solving Big Data Challenges for Enterprise Application Performance Management (http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf), said that, "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from one to 12 nodes. This comes at the price of a high write and read latency. Cassandra's performance is best for high insertion rates". If you go through the paper, Cassandra wins in almost all the criteria. Equipped with proven concepts of distributed computing, made to reliably serve from commodity servers, and simple and easy maintenance, Cassandra is one of the most scalable, fastest, and very robust NoSQL database. So, the next natural question is, "What makes Cassandra so blazing fast?". Let's dive deeper into the Cassandra architecture. Understanding the architecture of Cassandra Cassandra is a relative latecomer in the distributed data-store war. It takes advantage of two proven and closely similar data-store mechanisms, namely Bigtable: A Distributed Storage System for Structured Data, 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) and Amazon Dynamo: Amazon's Highly Available Key-value Store, 2007 (http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf). The following diagram displays the read throughputs that show linear scaling of Cassandra: Like BigTable, it has a tabular data presentation. It is not tabular in the strictest sense. It is rather a dictionary-like structure where each entry holds another sorted dictionary/map. This model is more powerful than the usual key-value store and it is named a table, formerly known as a column family. The properties such as eventual consistency and decentralization are taken from Dynamo. Now, assume a column family is a giant spreadsheet, such as MS Excel. But unlike spreadsheets, each row is identified by a row key with a number (token), and unlike spreadsheets, each cell may have its own unique name within the row. In Cassandra, the columns in the rows are sorted by this unique column name. Also, since the number of partitions is allowed to be very large (1.7*1038), it distributes the rows almost uniformly across all the available machines by dividing the rows in equal token groups. Tables or column families are contained within a logical container or name space called keyspace. A keyspace can be assumed to be more or less similar to database in RDBMS. A word on max number of cells, rows, and partitions A cell in a partition can be assumed as a key-value pair. The maximum number of cells per partition is limited by the Java integer's max value, which is about 2 billion. So, one partition can hold a maximum of 2 billion cells. A row, in CQL terms, is a bunch of cells with predefined names. When you define a table with a primary key that has just one column, the primary key also serves as the partition key. But when you define a composite primary key, the first column in the definition of the primary key works as the partition key. So, all the rows (bunch of cells) that belong to one partition key go into one partition. This means that every partition can have a maximum of X rows, where X = (2*109/number_of_columns_in_a_row). Essentially, rows * columns cannot exceed 2 billion per partition. Finally, how many partitions can Cassandra hold for each table or column family? As we know, column families are essentially distributed hashmaps. The keys or row keys or partition keys are generated by taking a consistent hash of the string that you pass. So, the number of partitioned keys is bounded by the number of hashes these functions generate. This means that if you are using the default Murmur3 partitioner (range -263 to +263), the maximum number of partitions that you can have is 1.85*1019. If you use the Random partitioner, the number of partitions that you can have is 1.7*1038. Ring representation A Cassandra cluster is called a ring. The terminology is taken from Amazon Dynamo. Cassandra 1.1 and earlier versions used to have a token assigned to each node. Let's call this value the initial token. Each node is responsible for storing all the rows with token values (a token is basically a hash value of a row key) ranging from the previous node's initial token (exclusive) to the node's initial token (inclusive). This way, the first node, the one with the smallest initial token, will have a range from the token value of the last node (the node with the largest initial token) to the first token value. So, if you jump from node to node, you will make a circle, and this is why a Cassandra cluster is called a ring. Let's take an example. Assume that there is a hashing algorithm (partitioner) that generates tokens from 0 to 127 and you have four Cassandra machines to create a cluster. To allocate equal load, we need to assign each of the four nodes to bear an equal number of tokens. So, the first machine will be responsible for tokens one to 32, the second will hold 33 to 64, third 65 to 96, and fourth 97 to 127 and 0. If you mark each node with the maximum token number that it can hold, the cluster looks like a ring, as shown in the following figure: Token ownership and distribution in a balanced Cassandra ring Virtual nodes In Cassandra 1.1 and previous versions, when you create a cluster or add a node, you manually assign its initial token. This is extra work that the database should handle internally. Apart from this, adding and removing nodes requires manual resetting token ownership for some or all nodes. This is called rebalancing. Yet another problem was replacing a node. In the event of replacing a node with a new one, the data (rows that the to-be-replaced node owns) is required to be copied to the new machine from a replica of the old machine. For a large database, this could take a while because we are streaming from one machine. To solve all these problems, Cassandra 1.2 introduced virtual nodes (vnodes). The following figure shows 16 vnodes distributed over four servers: In the preceding figure, each node is responsible for a single continuous range. In the case of a replication factor of 2 or more, the data is also stored on other machines than the one responsible for the range. (Replication factor (RF) represents the number of copies of a table that exist in the system. So, RF=2, means there are two copies of each record for the table.) In this case, one can say one machine, one range. With vnodes, each machine can have multiple smaller ranges and these ranges are automatically assigned by Cassandra. How does this solve those issues? Let's see. If you have a 30 ring cluster and a node with 256 vnodes had to be replaced. If nodes are well-distributed randomly across the cluster, each physical node in remaining 29 nodes will have 8 or 9 vnodes (256/29) that are replicas of vnodes on the dead node. In older versions, with a replication factor of 3, the data had to be streamed from three replicas (10 percent utilization). In the case of vnodes, all the nodes can participate in helping the new node get up. The other benefit of using vnodes is that you can have a heterogeneous ring where some machines are more powerful than others, and change the vnodes ' settings such that the stronger machines will take proportionally more data than others. This was still possible without vnodes but it needed some tricky calculation and rebalancing. So, let's say you have a cluster of machines with similar hardware specifications and you have decided to add a new server that is twice as powerful as any machine in the cluster. Ideally, you would want it to work twice as harder as any of the old machines. With vnodes, you can achieve this by setting twice as many num_tokens as on the old machine in the new machine's cassandra.yaml file. Now, it will be allotted double the load when compared to the old machines. Yet another benefit of vnodes is faster repair. Node repair requires the creation of a Merkle tree for each range of data that a node holds. The data gets compared with the data on the replica nodes, and if needed, data re-sync is done. Creation of a Merkle tree involves iterating through all the data in the range followed by streaming it. For a large range, the creation of a Merkle tree can be very time consuming while the data transfer might be much faster. With vnodes, the ranges are smaller, which means faster data validation (by comparing with other nodes). Since the Merkle tree creation process is broken into many smaller steps (as there are many small nodes that exist in a physical node), the data transmission does not have to wait till the whole big range finishes. Also, the validation uses all other machines instead of just a couple of replica nodes. As of Cassandra 2.0.9, the default setting for vnodes is "on" with default vnodes per machine as 256. If for some reason you do not want to use vnodes and want to disable this feature, comment out the num_tokens variable and uncomment and set the initial_token variable in cassandra.yaml. If you are starting with a new cluster or migrating an old cluster to the latest version of Cassandra, vnodes are highly recommended. The number of vnodes that you specify on a Cassandra node represents the number of vnodes on that machine. So, the total vnodes on a cluster is the sum total of all the vnodes across all the nodes. One can always imagine a Cassandra cluster as a ring of lots of vnodes. How Cassandra works Diving into various components of Cassandra without having any context is a frustrating experience. It does not make sense why you are studying SSTable, MemTable, and log structured merge (LSM) trees without being able to see how they fit into the functionality and performance guarantees that Cassandra gives. So first we will see Cassandra's write and read mechanism. It is possible that some of the terms that we encounter during this discussion may not be immediately understandable. A rough overview of the Cassandra components is as shown in the following figure: Main components of the Cassandra service The main class of Storage Layer is StorageProxy. It handles all the requests. The messaging layer is responsible for inter-node communications, such as gossip. Apart from this, process-level structures keep a rough idea about the actual data containers and where they live. There are four data buckets that you need to know. MemTable is a hash table-like structure that stays in memory. It contains actual cell data. SSTable is the disk version of MemTables. When MemTables are full, they are persisted to hard disk as SSTable. Commit log is an append only log of all the mutations that are sent to the Cassandra cluster. Mutations can be thought of as update commands. So, insert, update, and delete operations are mutations, since they mutate the data. Commit log lives on the disk and helps to replay uncommitted changes. These three are basically core data. Then there are bloom filters and index. The bloom filter is a probabilistic data structure that lives in the memory. They both live in memory and contain information about the location of data in the SSTable. Each SSTable has one bloom filter and one index associated with it. The bloom filter helps Cassandra to quickly detect which SSTable does not have the requested data, while the index helps to find the exact location of the data in the SSTable file. With this primer, we can start looking into how write and read works in Cassandra. We will see more explanation later. Write in action To write, clients need to connect to any of the Cassandra nodes and send a write request. This node is called the coordinator node. When a node in a Cassandra cluster receives a write request, it delegates the write request to a service called StorageProxy. This node may or may not be the right place to write the data. StorageProxy's job is to get the nodes (all the replicas) that are responsible for holding the data that is going to be written. It utilizes a replication strategy to do this. Once the replica nodes are identified, it sends the RowMutation message to them, the node waits for replies from these nodes, but it does not wait for all the replies to come. It only waits for as many responses as are enough to satisfy the client's minimum number of successful writes defined by ConsistencyLevel. ConsistencyLevel is basically a fancy way of saying how reliable a read or write you want to be. Cassandra has tunable consistency, which means you can define how much reliability is wanted. Obviously, everyone wants a hundred percent reliability, but it comes with latency as the cost. For instance, in a thrice-replicated cluster (replication factor = 3), a write time consistency level TWO, means the write will become successful only if it is written to at least two replica nodes. This request will be faster than the one with the consistency level THREE or ALL, but slower than the consistency level ONE or ANY. The following figure is a simplistic representation of the write mechanism. The operations on node N2 at the bottom represent the node-local activities on receipt of the write request: The following steps show everything that can happen during a write mechanism: If the failure detector detects that there aren't enough live nodes to satisfy ConsistencyLevel, the request fails. If the failure detector gives a green signal, but writes time-out after the request is sent due to infrastructure problems or due to extreme load, StorageProxy writes a local hint to replay when the failed nodes come back to life. This is called hinted hand off. One might think that hinted handoff may be responsible for Cassandra's eventual consistency. But it's not entirely true. If the coordinator node gets shut down or dies due to hardware failure and hints on this machine cannot be forwarded, eventual consistency will not occur. The anti-entropy mechanism is responsible for consistency, rather than hinted hand-off. Anti-entropy makes sure that all replicas are in sync. If the replica nodes are distributed across data centers, it will be a bad idea to send individual messages to all the replicas in other data centers. Rather, it sends the message to one replica in each data center with a header, instructing it to forward the request to other replica nodes in that data center. Now the data is received by the node which should actually store that data. The data first gets appended to the commit log, and pushed to a MemTable appropriate column family in the memory. When the MemTable becomes full, it gets flushed to the disk in a sorted structure named SSTable. With lots of flushes, the disk gets plenty of SSTables. To manage SSTables, a compaction process runs. This process merges data from smaller SSTables to one big sorted file. Read in action Similar to a write case, when StorageProxy of the node that a client is connected to gets the request, it gets a list of nodes containing this key based on the replication strategy. The node's StorageProxy then sorts the nodes based on their proximity to itself. The proximity is determined by the snitch function that is set up for this cluster. Basically, the following types of snitches exist: SimpleSnitch: A closer node is the one that comes first when moving clockwise in the ring. (A ring is when all the machines in the cluster are placed in a circular fashion with each machine having a token number. When you walk clockwise, the token value increases. At the end, it snaps back to the first node.) PropertyFileSnitch: This snitch allows you to specify how you want your machines' location to be interpreted by Cassandra. You do this by assigning a data center name and rack name for all the machines in the cluster in the $CASSANDRA_HOME/conf/cassandra-topology.properties file. Each node has a copy of this file and you need to alter this file each time you add or remove a node. This is what the file looks like: # Cassandra Node IP=Data Center:Rack 192.168.1.100=DC1:RAC1 192.168.2.200=DC2:RAC2 10.0.0.10=DC1:RAC1 10.0.0.11=DC1:RAC1 10.0.0.12=DC1:RAC2 10.20.114.10=DC2:RAC1 10.20.114.11=DC2:RAC1 GossipingPropertyFileSnitch: The PropertyFileSnitch is kind of a pain, even when you think about it. Each node has the locations of all nodes manually written and updated every time a new node joins or an old node retires. And then, we need to copy it on all the servers. Wouldn't it be better if we just specify each node's data center and rack on just that one machine, and then have Cassandra somehow collect this information to understand the topology? This is exactly what GossipingPropertyFileSnitch does. Similar to PropertyFileSnitch, you have a file called $CASSANDRA_HOME/conf/cassandra-rackdc.properties, and in this file you specify the data center and the rack name for that machine. The gossip protocol makes sure that this information gets spread to all the nodes in the cluster (and you do not have to edit properties of files on all the nodes when a new node joins or leaves). Here is what a cassandra-rackdc.properties file looks like: # indicate the rack and dc for this node dc=DC13 rack=RAC42 RackInferringSnitch: This snitch infers the location of a node based on its IP address. It uses the third octet to infer rack name, and the second octet to assign data center. If you have four nodes 10.110.6.30, 10.110.6.4, 10.110.7.42, and 10.111.3.1, this snitch will think the first two live on the same rack as they have the same second octet (110) and the same third octet (6), while the third lives in the same data center but on a different rack as it has the same second octet but the third octet differs. Fourth, however, is assumed to live in a separate data center as it has a different second octet than the three. EC2Snitch: This is meant for Cassandra deployments on Amazon EC2 service. EC2 has regions and within regions, there are availability zones. For example, us-east-1e is an availability zone in the us-east region with availability zone named 1e. This snitch infers the region name (us-east, in this case) as the data center and availability zone (1e) as the rack. EC2MultiRegionSnitch: The multi-region snitch is just an extension of EC2Snitch where data centers and racks are inferred the same way. But you need to make sure that broadcast_address is set to the public IP provided by EC2 and seed nodes must be specified using their public IPs so that inter-data center communication can be done. DynamicSnitch: This Snitch determines closeness based on a recent performance delivered by a node. So, a quick responding node is perceived as being closer than a slower one, irrespective of its location closeness, or closeness in the ring. This is done to avoid overloading a slow performing node. DynamicSnitch is used by all the other snitches by default. You can disable it, but it is not advisable. Now, with knowledge about snitches, we know the list of the fastest nodes that have the desired row keys, it's time to pull data from them. The coordinator node (the one that the client is connected to) sends a command to the closest node to perform a read (we'll discuss local reads in a minute) and return the data. Now, based on ConsistencyLevel, other nodes will send a command to perform a read operation and send just the digest of the result. If we have read repairs (discussed later) enabled, the remaining replica nodes will be sent a message to compute the digest of the command response. Let's take an example. Let's say you have five nodes containing a row key K (that is, RF equals five), your read ConsistencyLevel is three; then the closest of the five nodes will be asked for the data and the second and third closest nodes will be asked to return the digest. If there is a difference in the digests, full data is pulled from the conflicting node and the latest of the three will be sent. These replicas will be updated to have the latest data. We still have two nodes left to be queried. If read repairs are not enabled, they will not be touched for this request. Otherwise, these two will be asked to compute the digest. Depending on the read_repair_chance setting, the request to the last two nodes is done in the background, after returning the result. This updates all the nodes with the most recent value, making all replicas consistent. Let's see what goes on within a node. Take a simple case of a read request looking for a single column within a single row. First, the attempt is made to read from MemTable, which is rapid fast, and since there exists only one copy of data, this is the fastest retrieval. If all required data is not found there, Cassandra looks into SSTable. Now, remember from our earlier discussion that we flush MemTables to disk as SSTables and later when the compaction mechanism wakes up, it merges those SSTables. So, our data can be in multiple SSTables. The following figure represents a simplified representation of the read mechanism. The bottom of the figure shows processing on the read node. The numbers in circles show the order of the event. BF stands for bloom filter. Each SSTable is associated with its bloom filter built on the row keys in the SSTable. Bloom filters are kept in the memory and used to detect if an SSTable may contain (false positive) the row data. Now, we have the SSTables that may contain the row key. The SSTables get sorted in reverse chronological order (latest first). Apart from the bloom filter for row keys, there exists one bloom filter for each row in the SSTable. This secondary bloom filter is created to detect whether the requested column names exist in the SSTable. Now, Cassandra will take SSTables one by one from younger to older, and use the index file to locate the offset for each column value for that row key and the bloom filter associated with the row (built on the column name). On the bloom filter being positive for the requested column, it looks into the SSTable file to read the column value. Note that we may have a column value in other yet-to-be-read SSTables, but that does not matter, because we are reading the most recent SSTables first, and any value that was written earlier to it does not matter. So, the value gets returned as soon as the first column in the most recent SSTable is allocated. Summary By now, you should be familiar with all the nuts and bolts of Cassandra. We have discussed how the pressure to make data stores to web scale inspired a rather not-so-common database mechanism to become mainstream, and how the CAP theorem governs the behavior of such databases. We have seen that Cassandra shines out among its peers. Then, we dipped our toes into the big picture of Cassandra read and write mechanisms. This left us with lots of fancy terms. It is understandable that it may be a lot to take in for someone new to NoSQL systems. It is okay if you do not have complete clarity at this point. As you start working with Cassandra, tweaking it, experimenting with it, and going through the Cassandra mailing list discussions or talks, you will start to come across stuff that you have read in this article and it will start to make sense, and perhaps you may want to come back and refer to this article to improve your clarity. It is not required that you understand this article fully to be able to write queries, set up clusters, maintain clusters, or do anything else related to Cassandra. A general sense of this article will take you far enough to work extremely well with Cassandra-based projects. How does this knowledge help us in building an application? Isn't it just about learning Thrift or CQL API and getting going? You might be wondering why you need to know about the compaction and storage mechanism, when all you need to do is to deliver an application that has a fast backend. It may not be obvious at this point why you are learning this, but as we move ahead with developing an application, we will come to realize that knowledge about underlying storage mechanism helps. Later, if you will deploy a cluster, performance tuning, maintenance, and integrating with other tools such as Apache Hadoop, you may find this article useful. Resources for Article: Further resources on this subject: About Cassandra [article] Replication [article] Getting Up and Running with Cassandra [article]

0
0
5947

article-image-classifying-real-world-examples

Packt

24 Mar 2015

32 min read

Classifying with Real-world Examples

Packt

24 Mar 2015

32 min read

This article by the authors, Luis Pedro Coelho and Willi Richert, of the book, Building Machine Learning Systems with Python - Second Edition, focuses on the topic of classification. (For more resources related to this topic, see here.) You have probably already used this form of machine learning as a consumer, even if you were not aware of it. If you have any modern e-mail system, it will likely have the ability to automatically detect spam. That is, the system will analyze all incoming e-mails and mark them as either spam or not-spam. Often, you, the end user, will be able to manually tag e-mails as spam or not, in order to improve its spam detection ability. This is a form of machine learning where the system is taking examples of two types of messages: spam and ham (the typical term for "non spam e-mails") and using these examples to automatically classify incoming e-mails. The general method of classification is to use a set of examples of each class to learn rules that can be applied to new examples. This is one of the most important machine learning modes and is the topic of this article. Working with text such as e-mails requires a specific set of techniques and skills. For the moment, we will work with a smaller, easier-to-handle dataset. The example question for this article is, "Can a machine distinguish between flower species based on images?" We will use two datasets where measurements of flower morphology are recorded along with the species for several specimens. We will explore these small datasets using a few simple algorithms. At first, we will write classification code ourselves in order to understand the concepts, but we will quickly switch to using scikit-learn whenever possible. The goal is to first understand the basic principles of classification and then progress to using a state-of-the-art implementation. The Iris dataset The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The dataset is a collection of morphological measurements of several Iris flowers. These measurements will enable us to distinguish multiple species of the flowers. Today, species are identified by their DNA fingerprints, but in the 1930s, DNA's role in genetics had not yet been discovered. The following four attributes of each plant were measured: sepal length sepal width petal length petal width In general, we will call the individual numeric measurements we use to describe our data features. These features can be directly measured or computed from intermediate data. This dataset has four features. Additionally, for each plant, the species was recorded. The problem we want to solve is, "Given these examples, if we see a new flower out in the field, could we make a good prediction about its species from its measurements?" This is the supervised learning or classification problem: given labeled examples, can we design a rule to be later applied to other examples? A more familiar example to modern readers who are not botanists is spam filtering, where the user can mark e-mails as spam, and systems use these as well as the non-spam e-mails to determine whether a new, incoming message is spam or not. For the moment, the Iris dataset serves our purposes well. It is small (150 examples, four features each) and can be easily visualized and manipulated. Visualization is a good first step Datasets will grow to thousands of features. With only four in our starting example, we can easily plot all two-dimensional projections on a single page. We will build intuitions on this small example, which can then be extended to large datasets with many more features. Visualizations are excellent at the initial exploratory phase of the analysis as they allow you to learn the general features of your problem as well as catch problems that occurred with data collection early. Each subplot in the following plot shows all points projected into two of the dimensions. The outlying group (triangles) are the Iris Setosa plants, while Iris Versicolor plants are in the center (circle) and Iris Virginica are plotted with x marks. We can see that there are two large groups: one is of Iris Setosa and another is a mixture of Iris Versicolor and Iris Virginica. In the following code snippet, we present the code to load the data and generate the plot: >>> from matplotlib import pyplot as plt >>> import numpy as np >>> # We load the data with load_iris from sklearn >>> from sklearn.datasets import load_iris >>> data = load_iris() >>> # load_iris returns an object with several fields >>> features = data.data >>> feature_names = data.feature_names >>> target = data.target >>> target_names = data.target_names >>> for t in range(3): ... if t == 0: ... c = 'r' ... marker = '>' ... elif t == 1: ... c = 'g' ... marker = 'o' ... elif t == 2: ... c = 'b' ... marker = 'x' ... plt.scatter(features[target == t,0], ... features[target == t,1], ... marker=marker, ... c=c) Building our first classification model If the goal is to separate the three types of flowers, we can immediately make a few suggestions just by looking at the data. For example, petal length seems to be able to separate Iris Setosa from the other two flower species on its own. We can write a little bit of code to discover where the cut-off is: >>> # We use NumPy fancy indexing to get an array of strings: >>> labels = target_names[target] >>> # The petal length is the feature at position 2 >>> plength = features[:, 2] >>> # Build an array of booleans: >>> is_setosa = (labels == 'setosa') >>> # This is the important step: >>> max_setosa =plength[is_setosa].max() >>> min_non_setosa = plength[~is_setosa].min() >>> print('Maximum of setosa: {0}.'.format(max_setosa)) Maximum of setosa: 1.9. >>> print('Minimum of others: {0}.'.format(min_non_setosa)) Minimum of others: 3.0. Therefore, we can build a simple model: if the petal length is smaller than 2, then this is an Iris Setosa flower; otherwise it is either Iris Virginica or Iris Versicolor. This is our first model and it works very well in that it separates Iris Setosa flowers from the other two species without making any mistakes. In this case, we did not actually do any machine learning. Instead, we looked at the data ourselves, looking for a separation between the classes. Machine learning happens when we write code to look for this separation automatically. The problem of recognizing Iris Setosa apart from the other two species was very easy. However, we cannot immediately see what the best threshold is for distinguishing Iris Virginica from Iris Versicolor. We can even see that we will never achieve perfect separation with these features. We could, however, look for the best possible separation, the separation that makes the fewest mistakes. For this, we will perform a little computation. We first select only the non-Setosa features and labels: >>> # ~ is the boolean negation operator >>> features = features[~is_setosa] >>> labels = labels[~is_setosa] >>> # Build a new target variable, is_virginica >>> is_virginica = (labels == 'virginica') Here we are heavily using NumPy operations on arrays. The is_setosa array is a Boolean array and we use it to select a subset of the other two arrays, features and labels. Finally, we build a new boolean array, virginica, by using an equality comparison on labels. Now, we run a loop over all possible features and thresholds to see which one results in better accuracy. Accuracy is simply the fraction of examples that the model classifies correctly. >>> # Initialize best_acc to impossibly low value >>> best_acc = -1.0 >>> for fi in range(features.shape[1]): ... # We are going to test all possible thresholds ... thresh = features[:,fi] ... for t in thresh: ... # Get the vector for feature `fi` ... feature_i = features[:, fi] ... # apply threshold `t` ... pred = (feature_i > t) ... acc = (pred == is_virginica).mean() ... rev_acc = (pred == ~is_virginica).mean() ... if rev_acc > acc: ... reverse = True ... acc = rev_acc ... else: ... reverse = False ... ... if acc > best_acc: ... best_acc = acc ... best_fi = fi ... best_t = t ... best_reverse = reverse We need to test two types of thresholds for each feature and value: we test a greater than threshold and the reverse comparison. This is why we need the rev_acc variable in the preceding code; it holds the accuracy of reversing the comparison. The last few lines select the best model. First, we compare the predictions, pred, with the actual labels, is_virginica. The little trick of computing the mean of the comparisons gives us the fraction of correct results, the accuracy. At the end of the for loop, all the possible thresholds for all the possible features have been tested, and the variables best_fi, best_t, and best_reverse hold our model. This is all the information we need to be able to classify a new, unknown object, that is, to assign a class to it. The following code implements exactly this method: def is_virginica_test(fi, t, reverse, example): "Apply threshold model to a new example" test = example[fi] > t if reverse: test = not test return test What does this model look like? If we run the code on the whole data, the model that is identified as the best makes decisions by splitting on the petal width. One way to gain intuition about how this works is to visualize the decision boundary. That is, we can see which feature values will result in one decision versus the other and exactly where the boundary is. In the following screenshot, we see two regions: one is white and the other is shaded in grey. Any datapoint that falls on the white region will be classified as Iris Virginica, while any point that falls on the shaded side will be classified as Iris Versicolor. In a threshold model, the decision boundary will always be a line that is parallel to one of the axes. The plot in the preceding screenshot shows the decision boundary and the two regions where points are classified as either white or grey. It also shows (as a dashed line) an alternative threshold, which will achieve exactly the same accuracy. Our method chose the first threshold it saw, but that was an arbitrary choice. Evaluation – holding out data and cross-validation The model discussed in the previous section is a simple model; it achieves 94 percent accuracy of the whole data. However, this evaluation may be overly optimistic. We used the data to define what the threshold will be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we tried on this dataset. The reasoning is circular. What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance in instances that the algorithm has not seen at training. Therefore, we are going to do a more rigorous evaluation and use held-out data. For this, we are going to break up the data into two groups: on one group, we'll train the model, and on the other, we'll test the one we held out of training. The full code, which is an adaptation of the code presented earlier, is available on the online support repository. Its output is as follows: Training accuracy was 96.0%. Testing accuracy was 90.0% (N = 50). The result on the training data (which is a subset of the whole data) is apparently even better than before. However, what is important to note is that the result in the testing data is lower than that of the training error. While this may surprise an inexperienced machine learner, it is expected that testing accuracy will be lower than the training accuracy. To see why, look back at the plot that showed the decision boundary. Consider what would have happened if some of the examples close to the boundary were not there or that one of them between the two lines was missing. It is easy to imagine that the boundary will then move a little bit to the right or to the left so as to put them on the wrong side of the border. The accuracy on the training data, the training accuracy, is almost always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing accuracy, which is the accuracy on a collection of examples that were not used for training. These concepts will become more and more important as the models become more complex. In this example, the difference between the accuracy measured on training data and on testing data is not very large. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing! One possible problem with what we did previously, which was to hold out data from training, is that we only used half the data for training. Perhaps it would have been better to use more training data. On the other hand, if we then leave too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well, which is impossible. We can achieve a good approximation of this impossible ideal by a method called cross-validation. One simple form of cross-validation is leave-one-out cross-validation. We will take an example out of the training data, learn a model without this example, and then test whether the model classifies this example correctly. This process is then repeated for all the elements in the dataset. The following code implements exactly this type of cross-validation: >>> correct = 0.0 >>> for ei in range(len(features)): # select all but the one at position `ei`: training = np.ones(len(features), bool) training[ei] = False testing = ~training model = fit_model(features[training], is_virginica[training]) predictions = predict(model, features[testing]) correct += np.sum(predictions == is_virginica[testing]) >>> acc = correct/float(len(features)) >>> print('Accuracy: {0:.1%}'.format(acc)) Accuracy: 87.0% At the end of this loop, we will have tested a series of models on all the examples and have obtained a final average result. When using cross-validation, there is no circularity problem because each example was tested on a model which was built without taking that datapoint into account. Therefore, the cross-validated estimate is a reliable estimate of how well the models would generalize to new data. The major problem with leave-one-out cross-validation is that we are now forced to perform many times more work. In fact, you must learn a whole new model for each and every example and this cost will increase as our dataset grows. We can get most of the benefits of leave-one-out at a fraction of the cost by using x-fold cross-validation, where x stands for a small number. For example, to perform five-fold cross-validation, we break up the data into five groups, so-called five folds. Then you learn five models: each time you will leave one fold out of the training data. The resulting code will be similar to the code given earlier in this section, but we leave 20 percent of the data out instead of just one element. We test each of these models on the left-out fold and average the results. The preceding figure illustrates this process for five blocks: the dataset is split into five pieces. For each fold, you hold out one of the blocks for testing and train on the other four. You can use any number of folds you wish. There is a trade-off between computational efficiency (the more folds, the more computation is necessary) and accurate results (the more folds, the closer you are to using the whole of the data for training). Five folds is often a good compromise. This corresponds to training with 80 percent of your data, which should already be close to what you will get from using all the data. If you have little data, you can even consider using 10 or 20 folds. In the extreme case, if you have as many folds as datapoints, you are simply performing leave-one-out cross-validation. On the other hand, if computation time is an issue and you have more data, 2 or 3 folds may be the more appropriate choice. When generating the folds, you need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, then the results will not be representative. We will not go into the details of how to do this, because the machine learning package scikit-learn will handle them for you. We have now generated several models instead of just one. So, "What final model do we return and use for new data?" The simplest solution is now to train a single overall model on all your training data. The cross-validation loop gives you an estimate of how well this model should generalize. A cross-validation schedule allows you to use all your data to estimate whether your methods are doing well. At the end of the cross-validation loop, you can then use all your data to train a final model. Although it was not properly recognized when machine learning was starting out as a field, nowadays, it is seen as a very bad sign to even discuss the training accuracy of a classification system. This is because the results can be very misleading and even just presenting them marks you as a newbie in machine learning. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation scheme. Building more complex classifiers In the previous section, we used a very simple model: a threshold on a single feature. Are there other types of systems? Yes, of course! Many others. To think of the problem at a higher abstraction level, "What makes up a classification model?" We can break it up into three parts: The structure of the model: How exactly will a model make decisions? In this case, the decision depended solely on whether a given feature was above or below a certain threshold value. This is too simplistic for all but the simplest problems. The search procedure: How do we find the model we need to use? In our case, we tried every possible combination of feature and threshold. You can easily imagine that as models get more complex and datasets get larger, it rapidly becomes impossible to attempt all combinations and we are forced to use approximate solutions. In other cases, we need to use advanced optimization methods to find a good solution (fortunately, scikit-learn already implements these for you, so using them is easy even if the code behind them is very advanced). The gain or loss function: How do we decide which of the possibilities tested should be returned? Rarely do we find the perfect solution, the model that never makes any mistakes, so we need to decide which one to use. We used accuracy, but sometimes it will be better to optimize so that the model makes fewer errors of a specific kind. For example, in spam filtering, it may be worse to delete a good e-mail than to erroneously let a bad e-mail through. In that case, we may want to choose a model that is conservative in throwing out e-mails rather than the one that just makes the fewest mistakes overall. We can discuss these issues in terms of gain (which we want to maximize) or loss (which we want to minimize). They are equivalent, but sometimes one is more convenient than the other. We can play around with these three aspects of classifiers and get different systems. A simple threshold is one of the simplest models available in machine learning libraries and only works well when the problem is very simple, such as with the Iris dataset. In the next section, we will tackle a more difficult classification task that requires a more complex structure. In our case, we optimized the threshold to minimize the number of errors. Alternatively, we might have different loss functions. It might be that one type of error is much costlier than the other. In a medical setting, false negatives and false positives are not equivalent. A false negative (when the result of a test comes back negative, but that is false) might lead to the patient not receiving treatment for a serious disease. A false positive (when the test comes back positive even though the patient does not actually have that disease) might lead to additional tests to confirm or unnecessary treatment (which can still have costs, including side effects from the treatment, but are often less serious than missing a diagnostic). Therefore, depending on the exact setting, different trade-offs can make sense. At one extreme, if the disease is fatal and the treatment is cheap with very few negative side-effects, then you want to minimize false negatives as much as you can. What the gain/cost function should be is always dependent on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes, achieving the highest accuracy. However, if some mistakes are costlier than others, it might be better to accept a lower overall accuracy to minimize the overall costs. A more complex dataset and a more complex classifier We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas. Learning about the Seeds dataset We now look at another agricultural dataset, which is still small, but already too large to plot exhaustively on a page as we did with Iris. This dataset consists of measurements of wheat seeds. There are seven features that are present, which are as follows: area A perimeter P compactness C = 4pA/P² length of kernel width of kernel asymmetry coefficient length of kernel groove There are three classes, corresponding to three wheat varieties: Canadian, Koma, and Rosa. As earlier, the goal is to be able to classify the species based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset and its features were automatically computed from digital images. This is how image pattern recognition can be implemented: you can take images, in digital form, compute a few relevant features from them, and use a generic classification system. For the moment, we will work with the features that are given to us. UCI Machine Learning Dataset Repository The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, they list 233 datasets). Both the Iris and the Seeds dataset used in this article were taken from there. The repository is available online at http://archive.ics.uci.edu/ml/. Features and feature engineering One interesting aspect of these features is that the compactness feature is not actually a new measurement, but a function of the previous two features, area and perimeter. It is often very useful to derive new combined features. Trying to create new features is generally called feature engineering. It is sometimes seen as less glamorous than algorithms, but it often matters more for performance (a simple algorithm on well-chosen features will perform better than a fancy algorithm on not-so-good features). In this case, the original researchers computed the compactness, which is a typical feature for shapes. It is also sometimes called roundness. This feature will have the same value for two kernels, one of which is twice as big as the other one, but with the same shape. However, it will have different values for kernels that are very round (when the feature is close to one) when compared to kernels that are elongated (when the feature is closer to zero). The goals of a good feature are to simultaneously vary with what matters (the desired output) and be invariant with what does not. For example, compactness does not vary with size, but varies with the shape. In practice, it might be hard to achieve both objectives perfectly, but we want to approximate this ideal. You will need to use background knowledge to design good features. Fortunately, for many problem domains, there is already a vast literature of possible features and feature-types that you can build upon. For images, all of the previously mentioned features are typical and computer vision libraries will compute them for you. In text-based problems too, there are standard solutions that you can mix and match. When possible, you should use your knowledge of the problem to design a specific feature or to select which ones from the literature are more applicable to the data at hand. Even before you have data, you must decide which data is worthwhile to collect. Then, you hand all your features to the machine to evaluate and compute the best classifier. A natural question is whether we can select good features automatically. This problem is known as feature selection. There are many methods that have been proposed for this problem, but in practice very simple ideas work best. For the small problems we are currently exploring, it does not make sense to use feature selection, but if you had thousands of features, then throwing out most of them might make the rest of the process much faster. Nearest neighbor classification For use with this dataset, we will introduce a new classifier: the nearest neighbor classifier. The nearest neighbor classifier is very simple. When classifying a new element, it looks at the training data for the object that is closest to it, its nearest neighbor. Then, it returns its label as the answer. Notice that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples with different labels have exactly the same feature values, which will indicate that the features you are using are not very descriptive). Therefore, it is essential to test the classification using a cross-validation protocol. The nearest neighbor method can be generalized to look not at a single neighbor, but to multiple ones and take a vote amongst the neighbors. This makes the method more robust to outliers or mislabeled data. Classifying with scikit-learn We have been using handwritten classification code, but Python is a very appropriate language for machine learning because of its excellent libraries. In particular, scikit-learn has become the standard library for many machine learning tasks, including classification. We are going to use its implementation of nearest neighbor classification in this section. The scikit-learn classification API is organized around classifier objects. These objects have the following two essential methods: fit(features, labels): This is the learning step and fits the parameters of the model predict(features): This method can only be called after fit and returns a prediction for one or more inputs Here is how we could use its implementation of k-nearest neighbors for our data. We start by importing the KneighborsClassifier object from the sklearn.neighbors submodule: >>> from sklearn.neighbors import KNeighborsClassifier The scikit-learn module is imported as sklearn (sometimes you will also find that scikit-learn is referred to using this short name instead of the full name). All of the sklearn functionality is in submodules, such as sklearn.neighbors. We can now instantiate a classifier object. In the constructor, we specify the number of neighbors to consider, as follows: >>> classifier = KNeighborsClassifier(n_neighbors=1) If we do not specify the number of neighbors, it defaults to 5, which is often a good choice for classification. We will want to use cross-validation (of course) to look at our data. The scikit-learn module also makes this easy: >>> from sklearn.cross_validation import KFold >>> kf = KFold(len(features), n_folds=5, shuffle=True) >>> # `means` will be a list of mean accuracies (one entry per fold) >>> means = [] >>> for training,testing in kf: ... # We fit a model for this fold, then apply it to the ... # testing data with `predict`: ... classifier.fit(features[training], labels[training]) ... prediction = classifier.predict(features[testing]) ... ... # np.mean on an array of booleans returns fraction ... # of correct decisions for this fold: ... curmean = np.mean(prediction == labels[testing]) ... means.append(curmean) >>> print("Mean accuracy: {:.1%}".format(np.mean(means))) Mean accuracy: 90.5% Using five folds for cross-validation, for this dataset, with this algorithm, we obtain 90.5 percent accuracy. As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but this is a more credible estimate of the performance of the model. Looking at the decision boundaries We will now examine the decision boundary. In order to plot these on paper, we will simplify and look at only two dimensions. Take a look at the following plot: Canadian examples are shown as diamonds, Koma seeds as circles, and Rosa seeds as triangles. Their respective areas are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22, while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x is actually much larger than a small change in y. So, when we compute the distance between points, we are, for the most part, only taking the x axis into account. This is also a good example of why it is a good idea to visualize our data and look for red flags or surprises. If you studied physics (and you remember your lessons), you might have already noticed that we had been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to z-scores. The z-score of a value is how far away from the mean it is, in units of standard deviation. It comes down to this operation: In this formula, f is the old feature value, f' is the normalized feature value, μ is the mean of the feature, and σ is the standard deviation. Both μ and σ are estimated from training data. Independent of what the original values were, after z-scoring, a value of zero corresponds to the training mean, positive values are above the mean, and negative values are below it. The scikit-learn module makes it very easy to use this normalization as a preprocessing step. We are going to use a pipeline of transformations: the first element will do the transformation and the second element will do the classification. We start by importing both the pipeline and the feature scaling classes as follows: >>> from sklearn.pipeline import Pipeline >>> from sklearn.preprocessing import StandardScaler Now, we can combine them. >>> classifier = KNeighborsClassifier(n_neighbors=1) >>> classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)]) The Pipeline constructor takes a list of pairs (str,clf). Each pair corresponds to a step in the pipeline: the first element is a string naming the step, while the second element is the object that performs the transformation. Advanced usage of the object uses these names to refer to different steps. After normalization, every feature is in the same units (technically, every feature is now dimensionless; it has no units) and we can more confidently mix dimensions. In fact, if we now run our nearest neighbor classifier, we obtain 93 percent accuracy, estimated with the same five-fold cross-validation code shown previously! Look at the decision space again in two dimensions: The boundaries are now different and you can see that both dimensions make a difference for the outcome. In the full dataset, everything is happening on a seven-dimensional space, which is very hard to visualize, but the same principle applies; while a few dimensions are dominant in the original data, after normalization, they are all given the same importance. Binary and multiclass classification The first classifier we used, the threshold classifier, was a simple binary classifier. Its result is either one class or the other, as a point is either above the threshold value or it is not. The second classifier we used, the nearest neighbor classifier, was a natural multiclass classifier, its output can be one of the several classes. It is often simpler to define a simple binary method than the one that works on multiclass problems. However, we can reduce any multiclass problem to a series of binary decisions. This is what we did earlier in the Iris dataset, in a haphazard way: we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions: Is it an Iris Setosa (yes or no)? If not, check whether it is an Iris Virginica (yes or no). Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction. The simplest is to use a series of one versus the rest classifiers. For each possible label ℓ, we build a classifier of the type is this ℓ or something else? When applying the rule, exactly one of the classifiers will say yes and we will have our solution. Unfortunately, this does not always happen, so we have to decide how to deal with either multiple positive answers or no positive answers. Alternatively, we can build a classification tree. Split the possible labels into two, and build a classifier that asks, "Should this example go in the left or the right bin?" We can perform this splitting recursively until we obtain a single label. The preceding diagram depicts the tree of reasoning for the Iris dataset. Each diamond is a single binary classifier. It is easy to imagine that we could make this tree larger and encompass more decisions. This means that any classifier that can be used for binary classification can also be adapted to handle any number of classes in a simple way. There are many other possible ways of turning a binary method into a multiclass one. There is no single method that is clearly better in all cases. The scikit-learn module implements several of these methods in the sklearn.multiclass submodule. Some classifiers are binary systems, while many real-life problems are naturally multiclass. Several simple protocols reduce a multiclass problem to a series of binary decisions and allow us to apply the binary models to our multiclass problem. This means methods that are apparently only for binary data can be applied to multiclass data with little extra effort. Summary Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine. In a sense, this was a very theoretical article, as we introduced generic concepts with simple examples. We went over a few operations with the Iris dataset. This is a small dataset. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples. The intuitions we gained here will all still be valid. You also learned that the training error is a misleading, over-optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that has not been used for training. In order to not waste too many examples in testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation). We also had a look at the problem of feature engineering. Features are not predefined for you, but choosing and designing features is an integral part of designing a machine learning pipeline. In fact, it is often the area where you can get the most improvements in accuracy, as better data beats fancier methods. Resources for Article: Further resources on this subject: Ridge Regression [article] The Spark programming model [article] Using cross-validation [article]

0
0
17882

article-image-introducing-interactive-plotting

Packt

20 Mar 2015

29 min read

Introducing Interactive Plotting

Packt

20 Mar 2015

29 min read

This article is written by Benjamin V. Root, the author of Interactive Applications using Matplotlib. The goal of any interactive application is to provide as much information as possible while minimizing complexity. If it can't provide the information the users need, then it is useless to them. However, if the application is too complex, then the information's signal gets lost in the noise of the complexity. A graphical presentation often strikes the right balance. The Matplotlib library can help you present your data as graphs in your application. Anybody can make a simple interactive application without knowing anything about draw buffers, event loops, or even what a GUI toolkit is. And yet, the Matplotlib library will cede as much control as desired to allow even the most savvy GUI developer to create a masterful application from scratch. Like much of the Python language, Matplotlib's philosophy is to give the developer full control, but without being stupidly unhelpful and tedious. (For more resources related to this topic, see here.) Installing Matplotlib There are many ways to install Matplotlib on your system. While the library used to have a reputation for being difficult to install on non-Linux systems, it has come a long way since then, along with the rest of the Python ecosystem. Refer to the following command: $ pip install matplotlib Most likely, the preceding command would work just fine from the command line. Python Wheels (the next-generation Python package format that has replaced "eggs") for Matplotlib are now available from PyPi for Windows and Mac OS X systems. This method would also work for Linux users; however, it might be more favorable to install it via the system's built-in package manager. While the core Matplotlib library can be installed with few dependencies, it is a part of a much larger scientific computing ecosystem known as SciPy. Displaying your data is often the easiest part of your application. Processing it is much more difficult, and the SciPy ecosystem most likely has the packages you need to do that. For basic numerical processing and N-dimensional data arrays, there is NumPy. For more advanced but general data processing tools, there is the SciPy package (the name was so catchy, it ended up being used to refer to many different things in the community). For more domain-specific needs, there are "Sci-Kits" such as scikit-learn for artificial intelligence, scikit-image for image processing, and statsmodels for statistical modeling. Another very useful library for data processing is pandas. This was just a short summary of the packages available in the SciPy ecosystem. Manually managing all of their installations, updates, and dependencies would be difficult for many who just simply want to use the tools. Luckily, there are several distributions of the SciPy Stack available that can keep the menagerie under control. The following are Python distributions that include the SciPy Stack along with many other popular Python packages or make the packages easily available through package management software: Anaconda from Continuum Analytics Canopy from Enthought SciPy Superpack Python(x, y) (Windows only) WinPython (Windows only) Pyzo (Python 3 only) Algorete Loopy from Dartmouth College Show() your work With Matplotlib installed, you are now ready to make your first simple plot. Matplotlib has multiple layers. Pylab is the topmost layer, often used for quick one-off plotting from within a live Python session. Start up your favorite Python interpreter and type the following: >>> from pylab import * >>> plot([1, 2, 3, 2, 1]) Nothing happened! This is because Matplotlib, by default, will not display anything until you explicitly tell it to do so. The Matplotlib library is often used for automated image generation from within Python scripts, with no need for any interactivity. Also, most users would not be done with their plotting yet and would find it distracting to have a plot come up automatically. When you are ready to see your plot, use the following command: >>> show() Interactive navigation A figure window should now appear, and the Python interpreter is not available for any additional commands. By default, showing a figure will block the execution of your scripts and interpreter. However, this does not mean that the figure is not interactive. As you mouse over the plot, you will see the plot coordinates in the lower right-hand corner. The figure window will also have a toolbar: From left to right, the following are the tools: Home, Back, and Forward: These are similar to that of a web browser. These buttons help you navigate through the previous views of your plot. The "Home" button will take you back to the first view when the figure was opened. "Back" will take you to the previous view, while "Forward" will return you to the previous views. Pan (and zoom): This button has two modes: pan and zoom. Press the left mouse button and hold it to pan the figure. If you press x or y while panning, the motion will be constrained to just the x or y axis, respectively. Press the right mouse button to zoom. The plot will be zoomed in or out proportionate to the right/left and up/down movements. Use the X, Y, or Ctrl key to constrain the zoom to the x axis or the y axis or preserve the aspect ratio, respectively. Zoom-to-rectangle: Press the left mouse button and drag the cursor to a new location and release. The axes view limits will be zoomed to the rectangle you just drew. Zoom out using your right mouse button, placing the current view into the region defined by the rectangle you just drew. Subplot configuration: This button brings up a tool to modify plot spacing. Save: This button brings up a dialog that allows you to save the current figure. The figure window would also be responsive to the keyboard. The default keymap is fairly extensive (and will be covered fully later), but some of the basic hot keys are the Home key for resetting the plot view, the left and right keys for back and forward actions, p for pan/zoom mode, o for zoom-to-rectangle mode, and Ctrl + s to trigger a file save. When you are done viewing your figure, close the window as you would close any other application window, or use Ctrl + w. Interactive plotting When we did the previous example, no plots appeared until show() was called. Furthermore, no new commands could be entered into the Python interpreter until all the figures were closed. As you will soon learn, once a figure is closed, the plot it contains is lost, which means that you would have to repeat all the commands again in order to show() it again, perhaps with some modification or additional plot. Matplotlib ships with its interactive plotting mode off by default. There are a couple of ways to turn the interactive plotting mode on. The main way is by calling the ion() function (for Interactive ON). Interactive plotting mode can be turned on at any time and turned off with ioff(). Once this mode is turned on, the next plotting command will automatically trigger an implicit show() command. Furthermore, you can continue typing commands into the Python interpreter. You can modify the current figure, create new figures, and close existing ones at any time, all from the current Python session. Scripted plotting Python is known for more than just its interactive interpreters; it is also a fully fledged programming language that allows its users to easily create programs. Having a script to display plots from daily reports can greatly improve your productivity. Alternatively, you perhaps need a tool that can produce some simple plots of the data from whatever mystery data file you have come across on the network share. Here is a simple example of how to use Matplotlib's pyplot API and the argparse Python standard library tool to create a simple CSV plotting script called plotfile.py. Code: chp1/plotfile.py#!/usr/bin/env python from argparse import ArgumentParserimport matplotlib.pyplot as pltif __name__ == '__main__': parser = ArgumentParser(description="Plot a CSV file") parser.add_argument("datafile", help="The CSV File") # Require at least one column name parser.add_argument("columns", nargs='+', help="Names of columns to plot") parser.add_argument("--save", help="Save the plot as...") parser.add_argument("--no-show", action="store_true", help="Don't show the plot") args = parser.parse_args() plt.plotfile(args.datafile, args.columns) if args.save: plt.savefig(args.save) if not args.no_show: plt.show() Note the two optional command-line arguments: --save and --no-show. With the --save option, the user can have the plot automatically saved (the graphics format is determined automatically from the filename extension). Also, the user can choose not to display the plot, which when coupled with the --save option might be desirable if the user is trying to plot several CSV files. When calling this script to show a plot, the execution of the script will stop at the call to plt.show(). If the interactive plotting mode was on, then the execution of the script would continue past show(), terminating the script, thus automatically closing out any figures before the user has had a chance to view them. This is why the interactive plotting mode is turned off by default in Matplotlib. Also note that the call to plt.savefig() is before the call to plt.show(). As mentioned before, when the figure window is closed, the plot is lost. You cannot save a plot after it has been closed. Getting help We have covered how to install Matplotlib and went over how to make very simple plots from a Python session or a Python script. Most likely, this went very smoothly for you.. You may be very curious and want to learn more about the many kinds of plots this library has to offer, or maybe you want to learn how to make new kinds of plots. Help comes in many forms. The Matplotlib website (http://matplotlib.org) is the primary online resource for Matplotlib. It contains examples, FAQs, API documentation, and, most importantly, the gallery. Gallery Many users of Matplotlib are often faced with the question, "I want to make a plot that has this data along with that data in the same figure, but it needs to look like this other plot I have seen." Text-based searches on graphing concepts are difficult, especially if you are unfamiliar with the terminology. The gallery showcases the variety of ways in which one can make plots, all using the Matplotlib library. Browse through the gallery, click on any figure that has pieces of what you want in your plot, and see the code that generated it. Soon enough, you will be like a chef, mixing and matching components to produce that perfect graph. Mailing lists and forums When you are just simply stuck and cannot figure out how to get something to work or just need some hints on how to get started, you will find much of the community at the Matplotlib-users mailing list. This mailing list is an excellent resource of information with many friendly members who just love to help out newcomers. Be persistent! While many questions do get answered fairly quickly, some will fall through the cracks. Try rephrasing your question or with a plot showing your attempts so far. The people at Matplotlib-users love plots, so an image that shows what is wrong often gets the quickest response. A newer community resource is StackOverflow, which has many very knowledgeable users who are able to answer difficult questions. From front to backend So far, we have shown you bits and pieces of two of Matplotlib's topmost abstraction layers: pylab and pyplot. The layer below them is the object-oriented layer (the OO layer). To develop any type of application, you will want to use this layer. Mixing the pylab/pyplot layers with the OO layer will lead to very confusing behaviors when dealing with multiple plots and figures. Below the OO layer is the backend interface. Everything above this interface level in Matplotlib is completely platform-agnostic. It will work the same regardless of whether it is in an interactive GUI or comes from a driver script running on a headless server. The backend interface abstracts away all those considerations so that you can focus on what is most important: writing code to visualize your data. There are several backend implementations that are shipped with Matplotlib. These backends are responsible for taking the figures represented by the OO layer and interpreting it for whichever "display device" they implement. The backends are chosen automatically but can be explicitly set, if needed. Interactive versus non-interactive There are two main classes of backends: ones that provide interactive figures and ones that don't. Interactive backends are ones that support a particular GUI, such as Tcl/Tkinter, GTK, Qt, Cocoa/Mac OS X, wxWidgets, and Cairo. With the exception of the Cocoa/Mac OS X backend, all interactive backends can be used on Windows, Linux, and Mac OS X. Therefore, when you make an interactive Matplotlib application that you wish to distribute to users of any of those platforms, unless you are embedding Matplotlib, you will not have to concern yourself with writing a single line of code for any of these toolkits—it has already been done for you! Non-interactive backends are used to produce image files. There are backends to produce Postscript/EPS, Adobe PDF, and Scalable Vector Graphics (SVG) as well as rasterized image files such as PNG, BMP, and JPEGs. Anti-grain geometry The open secret behind the high quality of Matplotlib's rasterized images is its use of the Anti-Grain Geometry (AGG) library (http://agg.sourceforge.net/antigrain.com/index.html). The quality of the graphics generated from AGG is far superior than most other toolkits available. Therefore, not only is AGG used to produce rasterized image files, but it is also utilized in most of the interactive backends as well. Matplotlib maintains and ships with its own fork of the library in order to ensure you have consistent, high quality image products across all platforms and toolkits. What you see on your screen in your interactive figure window will be the same as the PNG file that is produced when you call savefig(). Selecting your backend When you install Matplotlib, a default backend is chosen for you based upon your OS and the available GUI toolkits. For example, on Mac OS X systems, your installation of the library will most likely set the default interactive backend to MacOSX or CocoaAgg for older Macs. Meanwhile, Windows users will most likely have a default of TkAgg or Qt5Agg. In most situations, the choice of interactive backends will not matter. However, in certain situations, it may be necessary to force a particular backend to be used. For example, on a headless server without an active graphics session, you would most likely need to force the use of the non-interactive Agg backend: import matplotlibmatplotlib.use("Agg") When done prior to any plotting commands, this will avoid loading any GUI toolkits, thereby bypassing problems that occur when a GUI fails on a headless server. Any call to show() effectively becomes a no-op (and the execution of the script is not blocked). Another purpose of setting your backend is for scenarios when you want to embed your plot in a native GUI application. Therefore, you will need to explicitly state which GUI toolkit you are using. Finally, some users simply like the look and feel of some GUI toolkits better than others. They may wish to change the default backend via the backend parameter in the matplotlibrc configuration file. Most likely, your rc file can be found in the .matplotlib directory or the .config/matplotlib directory under your home folder. If you can't find it, then use the following set of commands: >>> import matplotlib >>> matplotlib.matplotlib_fname() u'/home/ben/.config/matplotlib/matplotlibrc' Here is an example of the relevant section in my matplotlibrc file: #### CONFIGURATION BEGINS HERE # the default backend; one of GTK GTKAgg GTKCairo GTK3Agg # GTK3Cairo CocoaAgg MacOSX QtAgg Qt4Agg TkAgg WX WXAgg Agg Cairo # PS PDF SVG # You can also deploy your own backend outside of matplotlib by # referring to the module name (which must be in the PYTHONPATH) # as 'module://my_backend' #backend : GTKAgg #backend : QT4Agg backend : TkAgg # If you are using the Qt4Agg backend, you can choose here # to use the PyQt4 bindings or the newer PySide bindings to # the underlying Qt4 toolkit. #backend.qt4 : PyQt4 # PyQt4 | PySide This is the global configuration file that is used if one isn't found in the current working directory when Matplotlib is imported. The settings contained in this configuration serves as default values for many parts of Matplotlib. In particular, we see that the choice of backends can be easily set without having to use a single line of code. The Matplotlib figure-artist hierarchy Everything that can be drawn in Matplotlib is called an artist. Any artist can have child artists that are also drawable. This forms the basis of a hierarchy of artist objects that Matplotlib sends to a backend for rendering. At the root of this artist tree is the figure. In the examples so far, we have not explicitly created any figures. The pylab and pyplot interfaces will create the figures for us. However, when creating advanced interactive applications, it is highly recommended that you explicitly create your figures. You will especially want to do this if you have multiple figures being displayed at the same time. This is the entry into the OO layer of Matplotlib: fig = plt.figure() Canvassing the figure The figure is, quite literally, your canvas. Its primary component is the FigureCanvas instance upon which all drawing occurs. Unless you are embedding your Matplotlib figures into a GUI application, it is very unlikely that you will need to interact with this object directly. Instead, as plotting commands are issued, artist objects are added to the canvas automatically. While any artist can be added directly to the figure, usually only Axes objects are added. A figure can have many axes objects, typically called subplots. Much like the figure object, our examples so far have not explicitly created any axes objects to use. This is because the pylab and pyplot interfaces will also automatically create and manage axes objects for a figure if needed. For the same reason as for figures, you will want to explicitly create these objects when building your interactive applications. If an axes or a figure is not provided, then the pyplot layer will have to make assumptions about which axes or figure you mean to apply a plotting command to. While this might be fine for simple situations, these assumptions get hairy very quickly in non-trivial applications. Luckily, it is easy to create both your figure and its axes using a single command: fig, axes = plt.subplots(2, 1) # 2x1 grid of subplots These objects are highly advanced complex units that most developers will utilize for their plotting needs. Once placed on the figure canvas, the axes object will provide the ticks, axis labels, axes title(s), and the plotting area. An axes is an artist that manages all of its scale and coordinate transformations (for example, log scaling and polar coordinates), automated tick labeling, and automated axis limits. In addition to these responsibilities, an axes object provides a wide assortment of plotting functions. A sampling of plotting functions is as follows: Function Description bar Make a bar plot barbs Plot a two-dimensional field of barbs boxplot Make a box and whisker plot cohere Plot the coherence between x and y contour Plot contours errorbar Plot an errorbar graph hexbin Make a hexagonal binning plot hist Plot a histogram imshow Display an image on the axes pcolor Create a pseudocolor plot of a two-dimensional array pcolormesh Plot a quadrilateral mesh pie Plot a pie chart plot Plot lines and/or markers quiver Plot a two-dimensional field of arrows sankey Create a Sankey flow diagram scatter Make a scatter plot of x versus y stem Create a stem plot streamplot Draw streamlines of a vector flow This application will be a storm track editing application. Given a series of radar images, the user can circle each storm cell they see in the radar image and link those storm cells across time. The application will need the ability to save and load track data and provide the user with mechanisms to edit the data. Along the way, we will learn about Matplotlib's structure, its artists, the callback system, doing animations, and finally, embedding this application within a larger GUI application. So, to begin, we first need to be able to view a radar image. There are many ways to load data into a Python program but one particular favorite among meteorologists is the Network Common Data Form (NetCDF) file. The SciPy package has built-in support for NetCDF version 3, so we will be using an hour's worth of radar reflectivity data prepared using this format from a NEXRAD site near Oklahoma City, OK on the evening of May 10, 2010, which produced numerous tornadoes and severe storms. The NetCDF binary file is particularly nice to work with because it can hold multiple data variables in a single file, with each variable having an arbitrary number of dimensions. Furthermore, metadata can be attached to each variable and to the dataset itself, allowing you to self-document data files. This particular data file has three variables, namely Reflectivity, lat, and lon to record the radar reflectivity values and the latitude and longitude coordinates of each pixel in the reflectivity data. The reflectivity data is three-dimensional, with the first dimension as time and the other two dimensions as latitude and longitude. The following code example shows how easy it is to load this data and display the first image frame using SciPy and Matplotlib. Code: chp1/simple_radar_viewer.py import matplotlib.pyplot as plt from scipy.io import netcdf_file ncf = netcdf_file('KTLX_20100510_22Z.nc') data = ncf.variables['Reflectivity'] lats = ncf.variables['lat'] lons = ncf.variables['lon'] i = 0 cmap = plt.get_cmap('gist_ncar') cmap.set_under('lightgrey') fig, ax = plt.subplots(1, 1) im = ax.imshow(data[i], origin='lower', extent=(lons[0], lons[-1], lats[0], lats[-1]), vmin=0.1, vmax=80, cmap='gist_ncar') cb = fig.colorbar(im) cb.set_label('Reflectivity (dBZ)') ax.set_xlabel('Longitude') ax.set_ylabel('Latitude') plt.show() Running this script should result in a figure window that will display the first frame of our storms. The plot has a colorbar and the axes ticks label the latitudes and longitudes of our data. What is probably most important in this example is the imshow() call. Being an image, traditionally, the origin of the image data is shown in the upper-left corner and Matplotlib follows this tradition by default. However, this particular dataset was saved with its origin in the lower-left corner, so we need to state this with the origin parameter. The extent parameter is a tuple describing the data extent of the image. By default, it is assumed to be at (0, 0) and (N – 1, M – 1) for an MxN shaped image. The vmin and vmax parameters are a good way to ensure consistency of your colormap regardless of your input data. If these two parameters are not supplied, then imshow() will use the minimum and maximum of the input data to determine the colormap. This would be undesirable as we move towards displaying arbitrary frames of radar data. Finally, one can explicitly specify the colormap to use for the image. The gist_ncar colormap is very similar to the official NEXRAD colormap for radar data, so we will use it here: The gist_ncar colormap, along with some other colormaps packaged with Matplotlib such as the default jet colormap, are actually terrible for visualization. See the Choosing Colormaps page of the Matplotlib website for an explanation of why, and guidance on how to choose a better colormap. The menagerie of artists Whenever a plotting function is called, the input data and parameters are processed to produce new artists to represent the data. These artists are either primitives or collections thereof. They are called primitives because they represent basic drawing components such as lines, images, polygons, and text. It is with these primitives that your data can be represented as bar charts, line plots, errorbars, or any other kinds of plots. Primitives There are four drawing primitives in Matplotlib: Line2D, AxesImage, Patch, and Text. It is through these primitive artists that all other artist objects are derived from, and they comprise everything that can be drawn in a figure. A Line2D object uses a list of coordinates to draw line segments in between. Typically, the individual line segments are straight, and curves can be approximated with many vertices; however, curves can be specified to draw arcs, circles, or any other Bezier-approximated curves. An AxesImage class will take two-dimensional data and coordinates and display an image of that data with a colormap applied to it. There are actually other kinds of basic image artists available besides AxesImage, but they are typically for very special uses. AxesImage objects can be very tricky to deal with, so it is often best to use the imshow() plotting method to create and return these objects. A Patch object is an arbitrary two-dimensional object that has a single color for its "face." A polygon object is a specific instance of the slightly more general patch. These objects have a "path" (much like a Line2D object) that specifies segments that would enclose a face with a single color. The path is known as an "edge," and can have its own color as well. Besides the Polygons that one sees for bar plots and pie charts, Patch objects are also used to create arrows, legend boxes, and the markers used in scatter plots and elsewhere. Finally, the Text object takes a Python string, a point coordinate, and various font parameters to form the text that annotates plots. Matplotlib primarily uses TrueType fonts. It will search for fonts available on your system as well as ship with a few FreeType2 fonts, and it uses Bitstream Vera by default. Additionally, a Text object can defer to LaTeX to render its text, if desired. While specific artist classes will have their own set of properties that make sense for the particular art object they represent, there are several common properties that can be set. The following table is a listing of some of these properties. Property Meaning alpha 0 represents transparent and 1 represents opaque color Color name or other color specification visible boolean to flag whether to draw the artist or not zorder value of the draw order in the layering engine Let's extend the radar image example by loading up already saved polygons of storm cells in the tutorial.py file. Code: chp1/simple_storm_cell_viewer.py import matplotlib.pyplot as plt from scipy.io import netcdf_file from matplotlib.patches import Polygon from tutorial import polygon_loader ncf = netcdf_file('KTLX_20100510_22Z.nc') data = ncf.variables['Reflectivity'] lats = ncf.variables['lat'] lons = ncf.variables['lon'] i = 0 cmap = plt.get_cmap('gist_ncar') cmap.set_under('lightgrey') fig, ax = plt.subplots(1, 1) im = ax.imshow(data[i], origin='lower', extent=(lons[0], lons[-1], lats[0], lats[-1]), vmin=0.1, vmax=80, cmap='gist_ncar') cb = fig.colorbar(im) polygons = polygon_loader('polygons.shp') for poly in polygons[i]: p = Polygon(poly, lw=3, fc='k', ec='w', alpha=0.45) ax.add_artist(p) cb.set_label("Reflectivity (dBZ)") ax.set_xlabel("Longitude") ax.set_ylabel("Latitude") plt.show() The polygon data returned from polygon_loader() is a dictionary of lists keyed by a frame index. The list contains Nx2 numpy arrays of vertex coordinates in longitude and latitude. The vertices form the outline of a storm cell. The Polygon constructor, like all other artist objects, takes many optional keyword arguments. First, lw is short for linewidth, (referring to the outline of the polygon), which we specify to be three points wide. Next is fc, which is short for facecolor, and is set to black ('k'). This is the color of the filled-in region of the polygon. Then edgecolor (ec) is set to white ('w') to help the polygons stand out against a dark background. Finally, we set the alpha argument to be slightly less than half to make the polygon fairly transparent so that one can still see the reflectivity data beneath the polygons. Note a particular difference between how we plotted the image using imshow() and how we plotted the polygons using polygon artists. For polygons, we called a constructor and then explicitly called ax.add_artist() to add each polygon instance as a child of the axes. Meanwhile, imshow() is a plotting function that will do all of the hard work in validating the inputs, building the AxesImage instance, making all necessary modifications to the axes instance (such as setting the limits and aspect ratio), and most importantly, adding the artist object to the axes. Finally, all plotting functions in Matplotlib return artists or a list of artist objects that it creates. In most cases, you will not need to save this return value in a variable because there is nothing else to do with them. In this case, we only needed the returned AxesImage so that we could pass it to the fig.colorbar() method. This is so that it would know what to base the colorbar upon. The plotting functions in Matplotlib exist to provide convenience and simplicity to what can often be very tricky to get right by yourself. They are not magic! They use the same OO interface that is accessible to application developers. Therefore, anyone can write their own plotting functions to make complicated plots easier to perform. Collections Any artist that has child artists (such as a figure or an axes) is called a container. A special kind of container in Matplotlib is called a Collection. A collection usually contains a list of primitives of the same kind that should all be treated similarly. For example, a CircleCollection would have a list of Circle objects, all with the same color, size, and edge width. Individual values for artists in the collection can also be set. A collection makes management of many artists easier. This becomes especially important when considering the number of artist objects that may be needed for scatter plots, bar charts, or any other kind of plot or diagram. Some collections are not just simply a list of primitives, but are artists in their own right. These special kinds of collections take advantage of various optimizations that can be assumed when rendering similar or identical things. RegularPolyCollection, for example, just needs to know the points of a single polygon relative to its center (such as a star or box) and then just needs a list of all the center coordinates, avoiding the need to store all the vertices of every polygon in its collection in memory. In the following example, we will display storm tracks as LineCollection. Note that instead of using ax.add_artist() (which would work), we will use ax.add_collection() instead. This has the added benefit of performing special handling on the object to determine its bounding box so that the axes object can incorporate the limits of this collection with any other plotted objects to automatically set its own limits which we trigger with the ax.autoscale(True) call. Code: chp1/linecoll_track_viewer.py import matplotlib.pyplot as plt from matplotlib.collections import LineCollection from tutorial import track_loader tracks = track_loader('polygons.shp') # Filter out non-tracks (unassociated polygons given trackID of -9) tracks = {tid: t for tid, t in tracks.items() if tid != -9} fig, ax = plt.subplots(1, 1) lc = LineCollection(tracks.values(), color='b') ax.add_collection(lc) ax.autoscale(True) ax.set_xlabel("Longitude") ax.set_ylabel("Latitude") plt.show() Much easier than the radar images, Matplotlib took care of all the limit setting automatically. Such features are extremely useful for writing generic applications that do not wish to concern themselves with such details. Summary In this article, we introduced you to the foundational concepts of Matplotlib. Using show(), you showed your first plot with only three lines of Python. With this plot up on your screen, you learned some of the basic interactive features built into Matplotlib, such as panning, zooming, and the myriad of key bindings that are available. Then we discussed the difference between interactive and non-interactive plotting modes and the difference between scripted and interactive plotting. You now know where to go online for more information, examples, and forum discussions of Matplotlib when it comes time for you to work on your next Matplotlib project. Next, we discussed the architectural concepts of Matplotlib: backends, figures, axes, and artists. Then we started our construction project, an interactive storm cell tracking application. We saw how to plot a radar image using a pre-existing plotting function, as well as how to display polygons and lines as artists and collections. While creating these objects, we had a glimpse of how to customize the properties of these objects for our display needs, learning some of the property and styling names. We also learned some of the steps one needs to consider when creating their own plotting functions, such as autoscaling. Resources for Article: Further resources on this subject: The plot function [article] First Steps [article] Machine Learning in IPython with scikit-learn [article]

0
2
7670

Packt

19 Mar 2015

18 min read

Finding People and Things

Packt

19 Mar 2015

18 min read

0
0
3134

article-image-entity-framework-code-first-accessing-database-views-and-stored-procedures

Packt

18 Mar 2015

15 min read

Entity Framework Code-First: Accessing Database Views and Stored Procedures

Packt

18 Mar 2015

15 min read

0
0
10826

Robi Sen

16 Mar 2015

7 min read

Text Mining with R: Part 1

Robi Sen

16 Mar 2015

7 min read

R is rapidly becoming the platform of choice for programmers, scientists, and others who need to perform statistical analysis and data mining. In part this is because R is incredibly easy to learn and with just a few commands you can perform data mining and analysis functions that would be very hard in more general purpose languages like Ruby, .Net, Java, or C++. To demonstrate R’s ease, flexibility, and power we will look at how to use R to look at a collection of tweets from the 2014 super bowl, clear up data via R, turn that data it a document matrix so we can analyze the data, then create a “word cloud” so we can visualize our analysis to look for interesting words. Getting Started To get started you need to download both R and R studio. R can be found here and RStudio can be found here. R and RStudio are available for most major operating systems and you should follow the up to date installation guides on their respective websites. For this example we are going to be using a data set from Techtunk which is rather large. For this article I have taken a small excerpt of techtrunks SuperBowl 2014, over 8 million tweets, and cleaned it up for the article. You can download it from the original data source here. Finally you will need to install the R packages text mining package (tm ) and word cloud package (wordcloud). You can use standard library method to install the packages or just use RStudio to install the packets. Preparing our Data As already stated you can find the total SuperBowl 2014 dataset. That being said, it's very large and broken up into many sets of Pipe Delimited files, and they have the .csv file extension but are not .csv, which can be somewhat awkward to work with. This though is a common problem when working with large data sets. Luckily the data set is broken up into fragments in that usually when you are working with large datasets you do not want to try to start developing against the whole data set rather a small and workable dataset the will let you quickly develop your scripts without being so large and unwieldy that it delays development. Indeed you will find that working the large files provided by Techtunk can take 10’s of minutes to process as is. In cases like this is good to look at the data, figure out what data you want, take a sample set of data, massage it as needed, and then work in it from there until you have your coding working exactly how you want. In our cases I took a subset of 4600 tweets from one of the pipe delimited files, converted the file format to Commas Separated Value, .csv, and saved it as a sample file to work from. You can do the same thing, you should consider using files smaller than 5000 records, however you would like or use the file created for this post here. Visualizing our Data For this post all we want to do is get a general sense of what the more common words are that are being tweeted during the superbowl. A common way to visualize this is with a word cloud which will show us the frequency of a term by representing it in greater size to other words in comparison to how many times it is mentioned in the body of tweets being analyzed. To do this we need to a few things first with our data. First we need to create read in our file and turn our collection of tweets into a Corpus. In general a Corpus is a large body of text documents. In R’s textming package it’s an object that will be used to hold our tweets in memory. So to load our tweets as a corpus into R you can do as shown here: # change this file location to suit your machine file_loc <- "yourfilelocation/largerset11.csv" # change TRUE to FALSE if you have no column headings in the CSV myfile <- read.csv(file_loc, header = TRUE, stringsAsFactors=FALSE) require(tm) mycorpus <- Corpus(DataframeSource(myfile[c("username","tweet")])) You can now simply print your Corpus to get a sense of it. > print(mycorpus) <<VCorpus (documents: 4688, metadata (corpus/indexed): 0/0)>> In this case, VCorpus is an automatic assignment meaning that the Corpus is a Volatile object stored in memory. If you want you can make the Corpus permanent using PCorpus. You might do this if you were doing analysis on actual documents such as PDF’s or even databases and in this case R makes pointers to the documents instead of full document structures in memory. Another method you can use to look at your corpus is inspect() which provides a variety of ways to look at the documents in your corpus. For example using: inspect(mycorpus[1,2]) This might give you a result like: > inspect(mycorpus[1:2]) <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>> [[1]] <<PlainTextDocument (metadata: 7)>> sstanley84 wow rt thestanchion news fleury just made fraudulent investment karlsson httptco5oi6iwashg [[2]] <<PlainTextDocument (metadata: 7)>> judemgreen 2 hour wait train superbowl time traffic problemsnice job chris christie As such inspect can be very useful in quickly getting a sense of data in your corpus without having to try to print the whole corpus. Now that we have our corpus in memory let's clean it up a little before we do our analysis. Usually you want to do this on documents that you are analyzing to remove words that are not relevant to your analysis such as “stopwords” or words such as and, like, but, if, the, and the like which you don’t care about. To do this with the textmining package you want to use transforms. Transforms essentially various functions to all documents in a corpus and that the form of tm_map(your corpus, some function). For example we can use tm_map like this: mycorpus <- tm_map(mycorpus, removePunctuation) Which will now remove all the punctuation marks from our tweets. We can do some other transforms to clean up our data by converting all the text to lower case, removing stop words, extra whitespace, and the like. mycorpus <- tm_map(mycorpus, removePunctuation) mycorpus <- tm_map(mycorpus, content_transformer(tolower)) mycorpus <- tm_map(mycorpus, stripWhitespace) mycorpus <- tm_map(mycorpus, removeWords, c(stopwords("english"), "news")) Note the last line. In that case we are using the stopwords() method but also adding our own word to it; news. You could append your own list of stopwords in this manner. Summary In this post we have looked at the basics of doing text mining in R by selecting data, preparing it, cleaning, then performing various operations on it to visualize that data. In the next post, Part 2, we look at a simple use case showing how we can derive real meaning and value from a visualization by seeing how a simple word cloud and help you understand the impact of an advertisement. About the author Robi Sen, CSO at Department 13, is an experienced inventor, serial entrepreneur, and futurist whose dynamic twenty-plus year career in technology, engineering, and research has led him to work on cutting edge projects for DARPA, TSWG, SOCOM, RRTO, NASA, DOE, and the DOD. Robi also has extensive experience in the commercial space, including the co-creation of several successful start-up companies. He has worked with companies such as UnderArmour, Sony, CISCO, IBM, and many others to help build out new products and services. Robi specializes in bringing his unique vision and thought process to difficult and complex problems allowing companies and organizations to find innovative solutions that they can rapidly operationalize or go to market with.

0
0
3404

article-image-pricing-double-no-touch-option

Packt

10 Mar 2015

19 min read

Pricing the Double-no-touch option

Packt

10 Mar 2015

19 min read

In this article by Balázs Márkus, coauthor of the book Mastering R for Quantitative Finance, you will learn about pricing and life of Double-no-touch (DNT) option. (For more resources related to this topic, see here.) A Double-no-touch (DNT) option is a binary option that pays a fixed amount of cash at expiry. Unfortunately, the fExoticOptions package does not contain a formula for this option at present. We will show two different ways to price DNTs that incorporate two different pricing approaches. In this section, we will call the function dnt1, and for the second approach, we will use dnt2 as the name for the function. Hui (1996) showed how a one-touch double barrier binary option can be priced. In his terminology, "one-touch" means that a single trade is enough to trigger the knock-out event, and "double barrier" binary means that there are two barriers and this is a binary option. We call this DNT as it is commonly used on the FX markets. This is a good example for the fact that many popular exotic options are running under more than one name. In Haug (2007a), the Hui-formula is already translated into the generalized framework. S, r, b, s, and T have the same meaning. K means the payout (dollar amount) while L and U are the lower and upper barriers. Where Implementing the Hui (1996) function to R starts with a big question mark: what should we do with an infinite sum? How high a number should we substitute as infinity? Interestingly, for practical purposes, small number like 5 or 10 could often play the role of infinity rather well. Hui (1996) states that convergence is fast most of the time. We are a bit skeptical about this since a will be used as an exponent. If b is negative and sigma is small enough, the (S/L)a part in the formula could turn out to be a problem. First, we will try with normal parameters and see how quick the convergence is: dnt1 <- function(S, K, U, L, sigma, T, r, b, N = 20, ploterror = FALSE){ if ( L > S | S > U) return(0) Z <- log(U/L) alpha <- -1/2*(2*b/sigma^2 - 1) beta <- -1/4*(2*b/sigma^2 - 1)^2 - 2*r/sigma^2 v <- rep(0, N) for (i in 1:N) v[i] <- 2*pi*i*K/(Z^2) * (((S/L)^alpha - (-1)^i*(S/U)^alpha ) / (alpha^2+(i*pi/Z)^2)) * sin(i*pi/Z*log(S/L)) * exp(-1/2 * ((i*pi/Z)^2-beta) * sigma^2*T) if (ploterror) barplot(v, main = "Formula Error"); sum(v) } print(dnt1(100, 10, 120, 80, 0.1, 0.25, 0.05, 0.03, 20, TRUE)) The following screenshot shows the result of the preceding code: The Formula Error chart shows that after the seventh step, additional steps were not influencing the result. This means that for practical purposes, the infinite sum can be quickly estimated by calculating only the first seven steps. This looks like a very quick convergence indeed. However, this could be pure luck or coincidence. What about decreasing the volatility down to 3 percent? We have to set N as 50 to see the convergence: print(dnt1(100, 10, 120, 80, 0.03, 0.25, 0.05, 0.03, 50, TRUE)) The preceding command gives the following output: Not so impressive? 50 steps are still not that bad. What about decreasing the volatility even lower? At 1 percent, the formula with these parameters simply blows up. First, this looks catastrophic; however, the price of a DNT was already 98.75 percent of the payout when we used 3 percent volatility. Logic says that the DNT price should be a monotone-decreasing function of volatility, so we already know that the price of the DNT should be worth at least 98.75 percent if volatility is below 3 percent. Another issue is that if we choose an extreme high U or extreme low L, calculation errors emerge. However, similar to the problem with volatility, common sense helps here too; the price of a DNT should increase if we make U higher or L lower. There is still another trick. Since all the problem comes from the a parameter, we can try setting b as 0, which will make a equal to 0.5. If we also set r to 0, the price of a DNT converges into 100 percent as the volatility drops. Anyway, whenever we substitute an infinite sum by a finite sum, it is always good to know when it will work and when it will not. We made a new code that takes into consideration that convergence is not always quick. The trick is that the function calculates the next step as long as the last step made any significant change. This is still not good for all the parameters as there is no cure for very low volatility, except that we accept the fact that if implied volatilities are below 1 percent, than this is an extreme market situation in which case DNT options should not be priced by this formula: dnt1 <- function(S, K, U, L, sigma, Time, r, b) { if ( L > S | S > U) return(0) Z <- log(U/L) alpha <- -1/2*(2*b/sigma^2 - 1) beta <- -1/4*(2*b/sigma^2 - 1)^2 - 2*r/sigma^2 p <- 0 i <- a <- 1 while (abs(a) > 0.0001){ a <- 2*pi*i*K/(Z^2) * (((S/L)^alpha - (-1)^i*(S/U)^alpha ) / (alpha^2 + (i *pi / Z)^2) ) * sin(i * pi / Z * log(S/L)) * exp(-1/2*((i*pi/Z)^2-beta) * sigma^2 * Time) p <- p + a i <- i + 1 } p } Now that we have a nice formula, it is possible to draw some DNT-related charts to get more familiar with this option. Later, we will use a particular AUDUSD DNT option with the following parameters: L equal to 0.9200, U equal to 0.9600, K (payout) equal to USD 1 million, T equal to 0.25 years, volatility equal to 6 percent, r_AUD equal to 2.75 percent, r_USD equal to 0.25 percent, and b equal to -2.5 percent. We will calculate and plot all the possible values of this DNT from 0.9200 to 0.9600; each step will be one pip (0.0001), so we will use 2,000 steps. The following code plots a graph of price of underlying: x <- seq(0.92, 0.96, length = 2000) y <- z <- rep(0, 2000) for (i in 1:2000){ y[i] <- dnt1(x[i], 1e6, 0.96, 0.92, 0.06, 0.25, 0.0025, -0.0250) z[i] <- dnt1(x[i], 1e6, 0.96, 0.92, 0.065, 0.25, 0.0025, -0.0250) } matplot(x, cbind(y,z), type = "l", lwd = 2, lty = 1, main = "Price of a DNT with volatility 6% and 6.5% ", cex.main = 0.8, xlab = "Price of underlying" ) The following output is the result of the preceding code: It can be clearly seen that even a small change in volatility can have a huge impact on the price of a DNT. Looking at this chart is an intuitive way to find that vega must be negative. Interestingly enough even just taking a quick look at this chart can convince us that the absolute value of vega is decreasing if we are getting closer to the barriers. Most end users think that the biggest risk is when the spot is getting close to the trigger. This is because end users really think about binary options in a binary way. As long as the DNT is alive, they focus on the positive outcome. However, for a dynamic hedger, the risk of a DNT is not that interesting when the value of the DNT is already small. It is also very interesting that since the T-Bill price is independent of the volatility and since the DNT + DOT = T-Bill equation holds, an increasing volatility will decrease the price of the DNT by the exact same amount just like it will increase the price of the DOT. It is not surprising that the vega of the DOT should be the exact mirror of the DNT. We can use the GetGreeks function to estimate vega, gamma, delta, and theta. For gamma we can use the GetGreeks function in the following way: GetGreeks <- function(FUN, arg, epsilon,...) { all_args1 <- all_args2 <- list(...) all_args1[[arg]] <- as.numeric(all_args1[[arg]] + epsilon) all_args2[[arg]] <- as.numeric(all_args2[[arg]] - epsilon) (do.call(FUN, all_args1) - do.call(FUN, all_args2)) / (2 * epsilon) } Gamma <- function(FUN, epsilon, S, ...) { arg1 <- list(S, ...) arg2 <- list(S + 2 * epsilon, ...) arg3 <- list(S - 2 * epsilon, ...) y1 <- (do.call(FUN, arg2) - do.call(FUN, arg1)) / (2 * epsilon) y2 <- (do.call(FUN, arg1) - do.call(FUN, arg3)) / (2 * epsilon) (y1 - y2) / (2 * epsilon) } x = seq(0.9202, 0.9598, length = 200) delta <- vega <- theta <- gamma <- rep(0, 200) for(i in 1:200){ delta[i] <- GetGreeks(FUN = dnt1, arg = 1, epsilon = 0.0001, x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.02, -0.02) vega[i] <- GetGreeks(FUN = dnt1, arg = 5, epsilon = 0.0005, x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.0025, -0.025) theta[i] <- - GetGreeks(FUN = dnt1, arg = 6, epsilon = 1/365, x[i], 1000000, 0.96, 0.92, 0.06, 0.5, 0.0025, -0.025) gamma[i] <- Gamma(FUN = dnt1, epsilon = 0.0001, S = x[i], K = 1e6, U = 0.96, L = 0.92, sigma = 0.06, Time = 0.5, r = 0.02, b = -0.02) } windows() plot(x, vega, type = "l", xlab = "S",ylab = "", main = "Vega") The following chart is the result of the preceding code: After having a look at the value chart, the delta of a DNT is also very close to intuitions; if we are coming close to the higher barrier, our delta gets negative, and if we are coming closer to the lower barrier, the delta gets positive as follows: windows() plot(x, delta, type = "l", xlab = "S",ylab = "", main = "Delta") This is really a non-convex situation; if we would like to do a dynamic delta hedge, we will lose money for sure. If the spot price goes up, the delta of the DNT decreases, so we should buy some AUDUSD as a hedge. However, if the spot price goes down, we should sell some AUDUSD. Imagine a scenario where AUDUSD goes up 20 pips in the morning and then goes down 20 pips in the afternoon. For a dynamic hedger, this means buying some AUDUSD after the price moved up and selling this very same amount after the price comes down. The changing of the delta can be described by the gamma as follows: windows() plot(x, gamma, type = "l", xlab = "S",ylab = "", main = "Gamma") Negative gamma means that if the spot goes up, our delta is decreasing, but if the spot goes down, our delta is increasing. This doesn't sound great. For this inconvenient non-convex situation, there is some compensation, that is, the value of theta is positive. If nothing happens, but one day passes, the DNT will automatically worth more. Here, we use theta as minus 1 times the partial derivative, since if (T-t) is the time left, we check how the value changes as t increases by one day: windows() plot(x, theta, type = "l", xlab = "S",ylab = "", main = "Theta") The more negative the gamma, the more positive our theta. This is how time compensates for the potential losses generated by the negative gamma. Risk-neutral pricing also implicates that negative gamma should be compensated by a positive theta. This is the main message of the Black-Scholes framework for vanilla options, but this is also true for exotics. See Taleb (1997) and Wilmott (2006). We already introduced the Black-Scholes surface before; now, we can go into more detail. This surface is also a nice interpretation of how theta and delta work. It shows the price of an option for different spot prices and times to maturity, so the slope of this surface is the theta for one direction and delta for the other. The code for this is as follows: BS_surf <- function(S, Time, FUN, ...) { n <- length(S) k <- length(Time) m <- matrix(0, n, k) for (i in 1:n) { for (j in 1:k) { l <- list(S = S[i], Time = Time[j], ...) m[i,j] <- do.call(FUN, l) } } persp3D(z = m, xlab = "underlying", ylab = "Time", zlab = "option price", phi = 30, theta = 30, bty = "b2") } BS_surf(seq(0.92,0.96,length = 200), seq(1/365, 1/48, length = 200), dnt1, K = 1000000, U = 0.96, L = 0.92, r = 0.0025, b = -0.0250, sigma = 0.2) The preceding code gives the following output: We can see what was already suspected; DNT likes when time is passing and the spot is moving to the middle of the (L,U) interval. Another way to price the Double-no-touch option Static replication is always the most elegant way of pricing. The no-arbitrage argument will let us say that if, at some time in the future, two portfolios have the same value for sure, then their price should be equal any time before this. We will show how double-knock-out (DKO) options could be used to build a DNT. We will need to use a trick; the strike price could be the same as one of the barriers. For a DKO call, the strike price should be lower than the upper barrier because if the strike price is not lower than the upper barrier, the DKO call would be knocked out before it could become in-the-money, so in this case, the option would be worthless as nobody can ever exercise it in-the-money. However, we can choose the strike price to be equal to the lower barrier. For a put, the strike price should be higher than the lower barrier, so why not make it equal to the upper barrier. This way, the DKO call and DKO put option will have a very convenient feature; if they are still alive, they will both expiry in-the-money. Now, we are almost done. We just have to add the DKO prices, and we will get a DNT that has a payout of (U-L) dollars. Since DNT prices are linear in the payout, we only have to multiply the result by K*(U-L): dnt2 <- function(S, K, U, L, sigma, T, r, b) { a <- DoubleBarrierOption("co", S, L, L, U, T, r, b, sigma, 0, 0,title = NULL, description = NULL) z <- a@price b <- DoubleBarrierOption("po", S, U, L, U, T, r, b, sigma, 0, 0,title = NULL, description = NULL) y <- b@price (z + y) / (U - L) * K } Now, we have two functions for which we can compare the results: dnt1(0.9266, 1000000, 0.9600, 0.9200, 0.06, 0.25, 0.0025, -0.025) [1] 48564.59 dnt2(0.9266, 1000000, 0.9600, 0.9200, 0.06, 0.25, 0.0025, -0.025) [1] 48564.45 For a DNT with a USD 1 million contingent payout and an initial market value of over 48,000 dollars, it is very nice to see that the difference in the prices is only 14 cents. Technically, however, having a second pricing function is not a big help since low volatility is also an issue for dnt2. We will use dnt1 for the rest of the article. The life of a Double-no-touch option – a simulation How has the DNT price been evolving during the second quarter of 2014? We have the open-high-low-close type time series with five minute frequency for AUDUSD, so we know all the extreme prices: d <- read.table("audusd.csv", colClasses = c("character", rep("numeric",5)), sep = ";", header = TRUE) underlying <- as.vector(t(d[, 2:5])) t <- rep( d[,6], each = 4) n <- length(t) option_price <- rep(0, n) for (i in 1:n) { option_price[i] <- dnt1(S = underlying[i], K = 1000000, U = 0.9600, L = 0.9200, sigma = 0.06, T = t[i]/(60*24*365), r = 0.0025, b = -0.0250) } a <- min(option_price) b <- max(option_price) option_price_transformed = (option_price - a) * 0.03 / (b - a) + 0.92 par(mar = c(6, 3, 3, 5)) matplot(cbind(underlying,option_price_transformed), type = "l", lty = 1, col = c("grey", "red"), main = "Price of underlying and DNT", xaxt = "n", yaxt = "n", ylim = c(0.91,0.97), ylab = "", xlab = "Remaining time") abline(h = c(0.92, 0.96), col = "green") axis(side = 2, at = pretty(option_price_transformed), col.axis = "grey", col = "grey") axis(side = 4, at = pretty(option_price_transformed), labels = round(seq(a/1000,1000,length = 7)), las = 2, col = "red", col.axis = "red") axis(side = 1, at = seq(1,n, length=6), labels = round(t[round(seq(1,n, length=6))]/60/24)) The following is the output for the preceding code: The price of a DNT is shown in red on the right axis (divided by 1000), and the actual AUDUSD price is shown in grey on the left axis. The green lines are the barriers of 0.9200 and 0.9600. The chart shows that in 2014 Q2, the AUDUSD currency pair was traded inside the (0.9200; 0.9600) interval; thus, the payout of the DNT would have been USD 1 million. This DNT looks like a very good investment; however, reality is just one trajectory out of an a priori almost infinite set. It could have happened differently. For example, on May 02, 2014, there were still 59 days left until expiry, and AUDUSD was traded at 0.9203, just three pips away from the lower barrier. At this point, the price of this DNT was only USD 5,302 dollars which is shown in the following code: dnt1(0.9203, 1000000, 0.9600, 0.9200, 0.06, 59/365, 0.0025, -0.025) [1] 5302.213 Compare this USD 5,302 to the initial USD 48,564 option price! In the following simulation, we will show some different trajectories. All of them start from the same 0.9266 AUDUSD spot price as it was on the dawn of April 01, and we will see how many of them stayed inside the (0.9200; 0.9600) interval. To make it simple, we will simulate geometric Brown motions by using the same 6 percent volatility as we used to price the DNT: library(matrixStats) DNT_sim <- function(S0 = 0.9266, mu = 0, sigma = 0.06, U = 0.96, L = 0.92, N = 5) { dt <- 5 / (365 * 24 * 60) t <- seq(0, 0.25, by = dt) Time <- length(t) W <- matrix(rnorm((Time - 1) * N), Time - 1, N) W <- apply(W, 2, cumsum) W <- sqrt(dt) * rbind(rep(0, N), W) S <- S0 * exp((mu - sigma^2 / 2) * t + sigma * W ) option_price <- matrix(0, Time, N) for (i in 1:N) for (j in 1:Time) option_price[j,i] <- dnt1(S[j,i], K = 1000000, U, L, sigma, 0.25-t[j], r = 0.0025, b = -0.0250)*(min(S[1:j,i]) > L & max(S[1:j,i]) < U) survivals <- sum(option_price[Time,] > 0) dev.new(width = 19, height = 10) par(mfrow = c(1,2)) matplot(t,S, type = "l", main = "Underlying price", xlab = paste("Survived", survivals, "from", N), ylab = "") abline( h = c(U,L), col = "blue") matplot(t, option_price, type = "l", main = "DNT price", xlab = "", ylab = "")} set.seed(214) system.time(DNT_sim()) The following is the output for the preceding code: Here, the only surviving trajectory is the red one; in all other cases, the DNT hits either the higher or the lower barrier. The line set.seed(214) grants that this simulation will look the same anytime we run this. One out of five is still not that bad; it would suggest that for an end user or gambler who does no dynamic hedging, this option has an approximate value of 20 percent of the payout (especially since the interest rates are low, the time value of money is not important). However, five trajectories are still too few to jump to such conclusions. We should check the DNT survivorship ratio for a much higher number of trajectories. The ratio of the surviving trajectories could be a good estimator of the a priori real-world survivorship probability of this DNT; thus, the end user value of it. Before increasing N rapidly, we should keep in mind how much time this simulation took. For my computer, it took 50.75 seconds for N = 5, and 153.11 seconds for N = 15. The following is the output for N = 15: Now, 3 out of 15 survived, so the estimated survivorship ratio is still 3/15, which is equal to 20 percent. Looks like this is a very nice product; the price is around 5 percent of the payout, while 20 percent is the estimated survivorship ratio. Just out of curiosity, run the simulation for N equal to 200. This should take about 30 minutes. The following is the output for N = 200: The results are shocking; now, only 12 out of 200 survive, and the ratio is only 6 percent! So to get a better picture, we should run the simulation for a larger N. The movie Whatever Works by Woody Allen (starring Larry David) is 92 minutes long; in simulation time, that is N = 541. For this N = 541, there are only 38 surviving trajectories, resulting in a survivorship ratio of 7 percent. What is the real expected survivorship ratio? Is it 20 percent, 6 percent, or 7 percent? We simply don't know at this point. Mathematicians warn us that the law of large numbers requires large numbers, where large is much more than 541, so it would be advisable to run this simulation for as large an N as time allows. Of course, getting a better computer also helps to do more N during the same time. Anyway, from this point of view, Hui's (1996) relatively fast converging DNT pricing formula gets some respect. Summary We started this article by introducing exotic options. In a brief theoretical summary, we explained how exotics are linked together. There are many types of exotics. We showed one possible way of classification that is consistent with the fExoticOptions package. We showed how the Black-Scholes surface (a 3D chart that contains the price of a derivative dependent on time and the underlying price) can be constructed for any pricing function. Resources for Article: Further resources on this subject: What is Quantitative Finance? [article] Learning Option Pricing [article] Derivatives Pricing [article]

0
0
8088

Packt

05 Mar 2015

43 min read

Hadoop and MapReduce

Packt

05 Mar 2015

43 min read

0
0
7562

article-image-learning-random-forest-using-mahout

Packt

05 Mar 2015

11 min read

Learning Random Forest Using Mahout

Packt

05 Mar 2015

11 min read

0
1
4176

Packt

04 Mar 2015

20 min read

Writing Consumers

Packt

04 Mar 2015

20 min read

0
0
3687

How-To Tutorials - Data

Machine Learning Using Spark MLlib

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

PostgreSQL – New Features

Storm for Real-time High Velocity Computation

Big Data Analysis (R and Hadoop)

Cassandra Architecture

Classifying with Real-world Examples

Introducing Interactive Plotting

Finding People and Things

Entity Framework Code-First: Accessing Database Views and Stored Procedures

Trending Topics

Text Mining with R: Part 1

Pricing the Double-no-touch option

Hadoop and MapReduce

Learning Random Forest Using Mahout

Writing Consumers

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access