How-To Tutorials - Data

1229 Articles
Packt
15 Jul 2014
6 min read

Securing the WAL Stream

The primary mechanism that PostgreSQL uses to provide a data durability guarantee is its Write Ahead Log (WAL). All transactional data is written to this location before ever being committed to database files. Once WAL files are no longer necessary for crash recovery, PostgreSQL will either delete or archive them. For the purposes of a highly available server, we recommend that you keep these important files as long as possible. There are several reasons for this:

  • Archived WAL files can be used for Point In Time Recovery (PITR)
  • If you are using streaming replication, interrupted streams can be re-established by applying WAL files until the replica has caught up
  • WAL files can be reused to service multiple server copies

In order to gain these benefits, we need to enable PostgreSQL WAL archiving and save these files until we no longer need them. This section will address our recommendations for long-term storage of WAL files.

Getting ready

In order to properly archive WAL files, we recommend that you provision a server dedicated to backups or file storage. Depending on the transaction volume, an active PostgreSQL database might produce thousands of these files on a daily basis. At 16 MB apiece, this is not an idle concern. For instance, for a 1 TB database, we recommend at least 3 TB of storage space.

In addition, we will be using rsync as a daemon on this archive server. To install this on a Debian-based server, execute the following command as a root-level user:

    sudo apt-get install rsync

Red-Hat-based systems will need this command instead:

    sudo yum install rsync xinetd

How to do it...

Our archive server has a 3 TB mount at the /db directory and is named arc_server on our network. The PostgreSQL source server resides at 192.168.56.10. Follow these steps for long-term storage of important WAL files on an archive server:

1. Enable rsync to run as a daemon on the archive server. On Debian-based systems, edit the /etc/default/rsync file and change the RSYNC_ENABLE variable to true. On Red-Hat-based systems, edit the /etc/xinetd.d/rsync file and change the disable parameter to no.

2. Create a directory to store archived WAL files and hand it over to the postgres user with these commands:

    sudo mkdir /db/pg_archived
    sudo chown postgres:postgres /db/pg_archived

3. Create a file named /etc/rsyncd.conf and fill it with the following contents:

    [wal_store]
    path = /db/pg_archived
    comment = DB WAL Archives
    uid = postgres
    gid = postgres
    read only = false
    hosts allow = 192.168.56.10
    hosts deny = *

4. Start the rsync daemon. Debian-based systems should execute the following command:

    sudo service rsync start

   Red-Hat-based systems can start rsync with this command instead:

    sudo service xinetd start

5. Change the archive_mode and archive_command parameters in postgresql.conf to read the following:

    archive_mode = on
    archive_command = 'rsync -aq %p arc_server::wal_store/%f'

6. Restart the PostgreSQL server with a command similar to this:

    pg_ctl -D $PGDATA restart

How it works

The rsync utility is normally used to transfer files between two servers. However, we can run it as a daemon to avoid the connection overhead imposed by using SSH as the rsync transport. Our first step is to ensure that the service is not disabled in some manner, which would make the rest of this guide moot.

Next, we need a place to store archived WAL files on the archive server. Assuming that we have 3 TB of space in the /db directory, we simply claim /db/pg_archived as the desired storage location. There should be enough space to use /db for backups as well, but we won't discuss that here.

Next, we create a file named /etc/rsyncd.conf, which configures how rsync operates as a daemon. Here, we name the /db/pg_archived directory wal_store so that we can address the path by its name when sending files. We give it a human-readable comment and ensure that files are owned by the postgres user, as this user also controls most of the PostgreSQL-related services.

The next, and possibly the most important, step is to block all hosts but the primary PostgreSQL server from writing to this location. We set hosts deny to *, which blocks every server. Then, we set hosts allow to the primary database server's IP address so that only it has access. If everything goes well, we can start the rsync (or xinetd on Red Hat systems) service.

Next, we enable archive_mode by setting it to on. With archive mode enabled, we can specify a command that will execute when PostgreSQL no longer needs a WAL file for crash recovery. In this case, we invoke the rsync command with the -a parameter to preserve elements such as file ownership, timestamps, and so on. In addition, we specify the -q setting to suppress output, as PostgreSQL only checks the command exit status to determine its success. In the archive_command setting, the %p value represents the full path to the WAL file, and %f resolves to the filename. In this context, we're sending the WAL file to the archive server at the wal_store module we defined in rsyncd.conf.

Once we restart PostgreSQL, it will start storing all the old WAL files by sending them to the archive server. If an rsync command fails because the archive server is unreachable, PostgreSQL will keep trying to send the file until it succeeds. If the archive server is unreachable for too long, we suggest that you change the archive_command setting to store files elsewhere. This prevents unsent WAL files from accidentally filling the PostgreSQL server's storage.

There's more...

As we will likely want to use the WAL files on other servers, we suggest that you make a list of all the servers that could need WAL files. Then, modify the rsyncd.conf file on the archive server and add this section:

    [wal_fetch]
    path = /db/pg_archived
    comment = DB WAL Archive Retrieval
    uid = postgres
    gid = postgres
    read only = true
    hosts allow = host1, host2, host3
    hosts deny = *

Now, we can fetch WAL files from any of the hosts in hosts allow. As these are dedicated PostgreSQL replicas, recovery servers, or other defined roles, this makes the archive server a central location for all our WAL needs.

See also

We suggest that you read more about the archive_mode and archive_command settings on the PostgreSQL site. We've included a link here: http://www.postgresql.org/docs/9.3/static/runtime-config-wal.html

The rsyncd.conf file should also have its own manual page. Read it with this command to learn more about the available settings:

    man rsyncd.conf

Summary

In this article, we learned how to secure the WAL stream by archiving WAL files to a dedicated server through an rsync daemon.
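The sizing concern raised under Getting ready (thousands of 16 MB files per day) can be turned into a rough estimate. The following is a minimal sketch, not a sizing formula from the article: the function name, the 2,000 segments per day, and the 30-day retention window are all hypothetical values chosen for illustration. Only the 16 MB segment size comes from the text.

```python
WAL_SEGMENT_MB = 16  # segment size quoted in the article


def wal_archive_size_gb(segments_per_day: int, retention_days: int) -> float:
    """Estimate archive space in GB for a WAL volume and retention window."""
    total_mb = segments_per_day * retention_days * WAL_SEGMENT_MB
    return total_mb / 1024


if __name__ == "__main__":
    # Hypothetical workload: 2,000 segments per day, kept for 30 days.
    print(f"{wal_archive_size_gb(2000, 30):.1f} GB")
```

An estimate like this only covers the WAL files themselves; space for base backups on the same mount would have to be added separately.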

Kristen Hardwick
30 Jun 2014
6 min read

Making the Most of Your Hadoop Data Lake, Part 1: Data Compression

In the world of big data, the Data Lake concept reigns supreme. Hadoop users are encouraged to keep all data in order to prepare for future use cases and as-yet-unknown data integration points. This concept is part of what makes Hadoop and HDFS so appealing, so it is important to make sure that the data is being stored in a way that prolongs that behavior. In the first part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address one important factor in improving manageability: data compression.

Data compression is an area that is often overlooked in the context of Hadoop. In many cluster environments, compression is disabled by default, putting the burden on the user. In this post, we will discuss the tradeoffs involved in deciding how to take advantage of compression techniques and the advantages and disadvantages of specific compression codec options with respect to Hadoop.

To compress or not to compress

Whenever data is converted to something other than its raw format, the conversion itself carries some overhead. When data compression is being discussed, it is important to weigh that overhead against the benefits of reducing the data footprint.

One obvious benefit is that compressed data will reduce the amount of disk space that is required for storage of a particular dataset. In the big data environment, this benefit is especially significant: either the Hadoop cluster will be able to keep data for a larger time range, or storing data for the same time range will require fewer nodes, or the disk usage ratios will remain lower for longer. In addition, the smaller file sizes will mean lower data transfer times, either internally for MapReduce jobs or when performing exports of data results.

The cost of these benefits, however, is that the data must be decompressed at every point where it needs to be read, and compressed before being inserted into HDFS. With respect to MapReduce jobs, this processing overhead at both the map phase and the reduce phase will increase the CPU processing time. Fortunately, by making informed choices about the specific compression codecs used at any given phase in the data transformation process, the cluster administrator or user can ensure that the advantages of compression outweigh the disadvantages.

Choosing the right codec for each phase

Hadoop provides the user with some flexibility on which compression codec is used at each step of the data transformation process. It is important to realize that certain codecs are optimal for some stages, and non-optimal for others. In the next sections, we will cover some important notes for each choice.

zlib

The major benefit of using this codec is that it is the easiest way to get the benefits of data compression from a cluster and job configuration standpoint: the zlib codec is the default compression option. From the data transformation perspective, this codec will decrease the data footprint on disk, but will not provide much of a benefit in terms of job performance.

gzip

The gzip codec available in Hadoop is the same one that is used outside of the Hadoop ecosystem. It is common practice to use this as the codec for compressing the final output from a job, simply for the benefit of being able to share the compressed result with others (possibly outside of Hadoop) using a standard file format.

bzip2

There are two important benefits for the bzip2 codec. First, if reducing the data footprint is a high priority, this algorithm will compress the data more than the default zlib option. Second, this is the only supported codec that produces “splittable” compressed data. A major characteristic of Hadoop is the idea of splitting the data so that it can be handled on each node independently. With the other compression codecs, there is an initial requirement to gather all parts of the compressed file in order to have all information necessary to decompress the data. With this format, the data can be decompressed in parallel. This splittable quality makes this format ideal for compressing data that will be used as input to a map function, either in a single step or as part of a series of chained jobs.

LZO, LZ4, Snappy

These three codecs are ideal for compressing intermediate data, that is, the data output from the mappers that will be immediately read in by the reducers. All three codecs heavily favor compression speed over compression ratio, but the detailed specifications for each algorithm should be examined based on the specific licensing, cluster, and job requirements.

Enabling compression

Once the appropriate compression codec for any given transformation phase has been selected, there are a few configuration properties that need to be adjusted in order to have the changes take effect in the cluster.

Intermediate data to reducer

    mapreduce.map.output.compress = true
    (Optional) mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec

Final output from a job

    mapreduce.output.fileoutputformat.compress = true
    (Optional) mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.BZip2Codec

These compression codecs are also available within some of the ecosystem tools like Hive and Pig. In most cases, the tools will default to the Hadoop-configured values for particular codecs, but the tools also provide the option to compress the data generated between steps.

Pig

    pig.tmpfilecompression = true
    (Optional) pig.tmpfilecompression.codec = snappy

Hive

    hive.exec.compress.intermediate = true
    hive.exec.compress.output = true

Conclusion

This post detailed the benefits and disadvantages of data compression, along with some helpful guidelines on how to choose a codec and enable it at various stages in the data transformation workflow. In the next post, we will go through some additional techniques that can be used to ensure that users can make the most of the Hadoop Data Lake.

About the author

Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.
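The tradeoff described under "To compress or not to compress" (smaller footprint versus extra CPU time) can be observed outside of Hadoop. The following is a rough sketch that uses Python's standard zlib and bz2 modules as stand-ins for the corresponding Hadoop codecs; it does not call Hadoop at all, and the sample log line and helper names are made up for illustration. Ratios and timings will vary with the data and the machine.

```python
import bz2
import time
import zlib


def measure(name, compress, decompress, data):
    """Compress data once and report the size ratio and elapsed time."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must restore the original bytes.
    assert decompress(packed) == data
    return {"codec": name, "ratio": len(packed) / len(data), "seconds": elapsed}


def compare_codecs(data):
    """Run the same input through zlib and bzip2."""
    return [
        measure("zlib", zlib.compress, zlib.decompress, data),
        measure("bzip2", bz2.compress, bz2.decompress, data),
    ]


if __name__ == "__main__":
    # Hypothetical, highly repetitive log data.
    sample = b"2014-06-30,node-01,GET,/index.html,200\n" * 50_000
    for row in compare_codecs(sample):
        print(f"{row['codec']:>5}: ratio={row['ratio']:.4f} time={row['seconds']:.4f}s")
```

Running a comparison like this on a representative slice of real data is one way to decide which codec suits which phase before changing cluster configuration.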

Kristen Hardwick
30 Jun 2014
5 min read

Making the Most of Your Hadoop Data Lake, Part 2: Optimized File Formats

One major factor in making the move to Hadoop is the concept of the Data Lake. That idea suggests that users keep as much data as possible in HDFS in order to prepare for future use cases and as-yet-unknown data integration points. As your data grows, it is important to make sure that the data is being stored in a way that prolongs that behavior.

Data compression is not the only technique that can be used to speed up job performance and improve cluster organization. In addition to the Text and Sequence File options that are typically used by default, Hadoop offers a few more optimized file formats that are specifically designed to improve the process of interacting with the data. In the second part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address another important factor in improving manageability: optimized file formats.

Using a smarter file for your data: RCFile

RCFile stands for Record Columnar File, and it serves as an ideal format for storing relational data that will be accessed through Hive. This format offers performance improvements by storing the data in an optimized way. First, the data is partitioned horizontally, into groups of rows. Then each row group is partitioned vertically, into collections of columns. Finally, the data in each column collection is compressed and stored in column-row format, as if it were a column-oriented database.

The first benefit of this altered storage mechanism is apparent at the row level. All HDFS blocks used to form RCFiles will be made up of the horizontally partitioned collections of rows. This is significant because it ensures that no row of data will be split across multiple blocks, so a row will always be on the same machine. This is not the case for traditional HDFS file formats, which typically use data size to split the file. This optimized data storage will reduce the amount of network bandwidth that is required to serve queries.

The second benefit comes from optimizations at the column level, in the form of disk I/O reduction. Since the columns are stored vertically within each row group, the system will be able to seek directly to the required column position in the file, rather than being required to scan across all columns and filter out data that is not necessary. This is extremely useful in queries that only require access to a small subset of the existing columns.

RCFiles can be used natively in both Hive and Pig with very little configuration.

In Hive:

    CREATE TABLE … STORED AS RCFILE;
    ALTER TABLE … SET FILEFORMAT RCFILE;
    SET hive.default.fileformat=RCFile;

In Pig:

    register …/piggybank.jar;
    a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.hiverc.HiveRCInputFormat(…);

The Pig jar file referenced here is just one option for enabling the RCFile. At the time of writing, there was also an RCFilePigStorage class available through Twitter’s Elephant Bird open source library.

Hortonworks’ ORCFile and Cloudera’s Parquet formats

RCFiles provide optimization for relational files primarily by implementing modifications at the storage level. Newer formats have improved on the RCFile format, namely the ORCFile format from Hortonworks and the Parquet format from Cloudera. When storing data using the Optimized Row Columnar (ORC) file or Parquet formats, several pieces of metadata are automatically written at the column level within each row group; for example, minimum and maximum values for numeric data types and dictionary-style metadata for text data types. The specific metadata is also configurable. One such use case would be for a user to configure the dataset to be sorted on a given set of columns for efficient access.

This extra metadata allows queries to take advantage of an improvement on the original RCFiles: predicate pushdown. That technique allows Hive to evaluate the WHERE clause during the record gathering process, instead of filtering data after all records have been collected. The predicate pushdown technique will evaluate the conditions of the query against the metadata associated with a particular row group, allowing it to skip over entire file blocks if possible, or to seek directly to the correct row. One major benefit of this process is that the more complex a particular WHERE clause is, the more potential there is for row groups and columns to be filtered as irrelevant to the final result.

Cloudera’s Parquet format is typically used in conjunction with Impala, but just like with RCFiles, ORCFiles can be incorporated into both Hive and Pig. HCatalog can be used as the primary method to read and write ORCFiles using Pig. The commands for Hive are provided below.

In Hive:

    CREATE TABLE … STORED AS ORC;
    ALTER TABLE … SET FILEFORMAT ORC;
    SET hive.default.fileformat=Orc;

Conclusion

This post has detailed the alternatives to the default file formats that can be used in Hadoop in order to optimize data access and storage. This information, combined with the compression techniques described in the previous post (part 1), will provide some guidelines that can be used to ensure that users can make the most of the Hadoop Data Lake.

About the author

Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.
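The predicate pushdown idea described in the ORCFile and Parquet section can be illustrated with a small sketch. The code below is not how the ORC or Parquet readers are implemented; it is a simplified model, with hypothetical class and function names, in which each row group stores only a minimum and maximum for one numeric column and a "greater than" filter skips groups whose maximum rules them out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A group of rows for one numeric column, plus min/max metadata."""
    min_value: int
    max_value: int
    rows: List[int]


def build_row_groups(values: List[int], group_size: int) -> List[RowGroup]:
    """Partition values horizontally and record min/max per group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # metadata proves no row can match, so skip the group
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


if __name__ == "__main__":
    groups = build_row_groups(list(range(100)), group_size=10)
    matches, groups_read = scan_greater_than(groups, threshold=84)
    print(f"{len(matches)} matches from {groups_read} of {len(groups)} row groups")
```

The sketch also hints at why sorting on the filtered column helps: sorted data gives each row group a narrow value range, so more groups can be skipped from metadata alone.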

Lars Butler
30 Jun 2014
6 min read

MapReduce with OpenStack Swift and ZeroVM

Originally coined in a 2004 research paper produced by Google, the term “MapReduce” was defined as a “programming model and an associated implementation for processing and generating large datasets”. While Google’s proprietary implementation of this model is known simply as “MapReduce”, the term has since become overloaded to refer to any software that follows the same general computation model.

The philosophy behind the MapReduce computation model is based on a divide and conquer approach: the input dataset is divided into small pieces and distributed among a pool of computation nodes for processing. The advantage here is that many nodes can run in parallel to solve a problem. This can be much quicker than a single machine, provided that the cost of communication (loading input data, relaying intermediate results between compute nodes, and writing final results to persistent storage) does not outweigh the computation time. Indeed, the cost of moving data between storage and computation nodes can diminish the advantage of distributed parallel computation if task payloads are too granular. On the other hand, task payloads must be granular enough to be distributed evenly (although achieving perfect distribution is nearly impossible if the exact computational complexity of each task cannot be known in advance).

A distributed MapReduce solution can be effective for processing large data sets, provided that each task is sufficiently long-running to offset communication costs. If the need for data transfer (between compute and storage nodes) could somehow be reduced or eliminated, the task size limitations would all but disappear, enabling a wider range of use cases. One way to achieve this is to perform the computation closer to the data, that is, on the same physical machine. Stored procedures in relational database systems, for example, are one way to achieve data-local computation. Simpler yet, computation could be run directly on the system where the data is stored in files. In both cases, the same problems are present: changes to the code require direct access to the system, and thus access needs to be carefully controlled (through group management, read/write permissions, and so on).

Scaling in these scenarios is also problematic: vertical scaling (beefing up a single server) is the only feasible way to gain performance while maintaining the efficiency of data-local computation. Achieving horizontal scalability by adding more machines is not feasible unless the storage system inherently supports this.

Enter Swift and ZeroVM

OpenStack Swift is one such example of a horizontally scalable data store. Swift can store petabytes of data in millions of objects across thousands of machines. Swift’s storage operations can also be extended by writing custom middleware applications, which makes it a good foundation for building a “smart” storage system capable of running computations directly on the data.

ZeroCloud is a middleware application for Swift which does just that. It embeds ZeroVM inside Swift and exposes a new API method for executing jobs. ZeroVM provides sufficient program isolation and a guarantee of security such that any arbitrary code can be run safely without compromising the storage system. Access to data is governed by Swift’s inherent access control, so there is no need to further lock down the storage system, even in a multi-tenant environment.

The combination of Swift and ZeroVM results in something like stored procedures, but much more powerful. First and foremost, user application code can be updated at any time without the need to worry about malicious or otherwise destructive side effects. (The worst thing that can happen is that a user can crash their own machines and delete their own data, but not another user’s.) Second, ZeroVM instances can start very quickly: in about five milliseconds. This means that to process one thousand objects, ZeroCloud can instantaneously spawn one thousand ZeroVM instances (one for each file), run the job, and destroy the instances very quickly. This also means that any job, big or small, can feasibly run on the system. Finally, the networking component of ZeroVM allows intercommunication between instances, enabling the creation of arbitrary job pipelines and multi-stage computations.

This converged compute and storage solution makes for a good alternative to popular MapReduce frameworks like Hadoop because little effort is required to set up and tear down computation nodes for a job. Also, once a job is complete, there is no need to extract result data, pipe it across the network, and then save it in a separate persistent data store (which can be an expensive operation if there is a large quantity of data to save). With ZeroCloud, the results can simply be saved back into Swift. The same principle applies to the setup of a job. There is no need to move or copy data to the computation cluster; the data is already where it needs to be.

Limitations

Currently, ZeroCloud uses a fairly naïve scheduling algorithm. To perform a data-local computation on an object, ZeroCloud determines which machines in the cluster contain a replicated copy of an object and then randomly chooses one to execute a ZeroVM process. While this functions well as a proof-of-concept for data-local computing, it is quite possible for jobs to be spread unevenly across a cluster, resulting in an inefficient use of resources. A research group at the University of Texas at San Antonio (UTSA) sponsored by Rackspace is currently working on developing better algorithms for scheduling workloads running on ZeroVM-based infrastructure. A short video of the research proposal can be found here: http://link.brightcove.com/services/player/bcpid3530100726001?bckey=AQ~~,AAACa24Pu2k~,Q186VLPcl3-oLBDP8npyqxjCNB5jgYcT.
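The random replica selection described under Limitations can be modeled in a few lines. The following is only a sketch of the idea, not ZeroCloud's scheduler: the function names, the node names, the cluster size of six, and the three replicas per object are assumptions made for illustration.

```python
import random
from collections import Counter


def schedule(replica_map, rng):
    """For each object, pick one of the nodes holding a replica at random."""
    return {obj: rng.choice(nodes) for obj, nodes in replica_map.items()}


def node_load(assignment):
    """Count how many jobs landed on each node."""
    return Counter(assignment.values())


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs are comparable
    nodes = [f"node-{i}" for i in range(6)]  # hypothetical cluster
    # Hypothetical placement: each object has replicas on three nodes.
    replica_map = {f"object-{i}": rng.sample(nodes, 3) for i in range(1000)}
    load = node_load(schedule(replica_map, rng))
    for node in nodes:
        print(node, load[node])
```

Because each choice ignores how busy a node already is, the per-node counts printed by a run like this will generally differ from one another, which is the uneven spread the section describes. A load-aware scheduler would consult the running counts before choosing.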
Further reading

Rackspace has launched a ZeroCloud playground service called Zebra for developers to try out the platform. At the time this was written, Zebra was still in a private beta and invitations were limited. But it is also possible to install and run your own copy of ZeroCloud for testing and development; basic installation instructions are here: https://github.com/zerovm/zerocloud/blob/icehouse/doc/Hacking.md. There are also some tutorials (including sample applications) for creating, packaging, deploying, and executing applications on ZeroCloud: http://zerovm.readthedocs.org/en/latest/zebra/tutorial.html. The tutorials are intended for use with the Zebra service, but can be run on any deployment of ZeroCloud.

About the author

Lars Butler is a software developer at Rackspace, the open cloud company. He has worked as a software developer in avionics, seismic hazard research, and most recently, on ZeroVM. He can be reached @larsbutler.

Packt
25 Jun 2014
10 min read

The Hunt for Data

Examining a JSON file with the aeson package

JavaScript Object Notation (JSON) is a way to represent key-value pairs in plain text. The format is described extensively in RFC 4627 (http://www.ietf.org/rfc/rfc4627). In this recipe, we will parse a JSON description about a person. We often encounter JSON in APIs from web applications.

Getting ready

Install the aeson library from Hackage using Cabal. Prepare an input.json file representing data about a mathematician, such as the one in the following code snippet:

    $ cat input.json
    {"name":"Gauss", "nationality":"German", "born":1777, "died":1855}

We will be parsing this JSON and representing it as a usable data type in Haskell.

How to do it...

1. Use the OverloadedStrings language extension so that string literals can take on the string types Aeson expects, as shown in the following line of code:

    {-# LANGUAGE OverloadedStrings #-}

2. Import aeson as well as some helper functions as follows:

    import Data.Aeson
    import Control.Applicative
    import qualified Data.ByteString.Lazy as B

3. Create the data type corresponding to the JSON structure, as shown in the following code:

    data Mathematician = Mathematician
      { name :: String
      , nationality :: String
      , born :: Int
      , died :: Maybe Int
      }

4. Provide an instance for the parseJSON function, as shown in the following code snippet:

    instance FromJSON Mathematician where
      parseJSON (Object v) = Mathematician
        <$> (v .: "name")
        <*> (v .: "nationality")
        <*> (v .: "born")
        <*> (v .:? "died")
      -- Fail the parse for any JSON value that is not an object
      parseJSON _ = empty

5. Define and implement main as follows:

    main :: IO ()
    main = do

6. Read the input and decode the JSON, as shown in the following code snippet:

      input <- B.readFile "input.json"
      let mm = decode input :: Maybe Mathematician
      case mm of
        Nothing -> print "error parsing JSON"
        Just m -> (putStrLn.greet) m

7. Now we will do something interesting with the data as follows:

    greet m = (show.name) m ++ " was born in the year " ++ (show.born) m

8. We can run the code to see the following output:

    $ runhaskell Main.hs
    "Gauss" was born in the year 1777

How it works...

Aeson takes care of the complications in representing JSON. It creates native usable data out of structured text. In this recipe, we use the .: and .:? functions provided by the Data.Aeson module.

As the Aeson package works with Text and ByteString values instead of plain Strings, it is very helpful to tell the compiler that characters between quotation marks should be treated as the proper data type. This is done in the first line of the code, which invokes the OverloadedStrings language extension.

We use the decode function provided by Aeson to transform a string into a data type. It has the type FromJSON a => B.ByteString -> Maybe a. Our Mathematician data type must implement an instance of the FromJSON typeclass to properly use this function. Fortunately, the only required function for implementing FromJSON is parseJSON. The syntax used in this recipe for implementing parseJSON is a little strange, but this is because we're leveraging the applicative operators <$> and <*>, which are a more advanced Haskell topic.

The .: function has two arguments, Object and Text, and returns a Parser a data type. As per the documentation, it retrieves the value associated with the given key of an object. This function is used if the key and the value exist in the JSON document. The .:? function also retrieves the associated value from the given key of an object, but the existence of the key and value are not mandatory. So, we use .:? for optional key-value pairs in a JSON document.

There's more…

If the implementation of the FromJSON typeclass is too involved, we can easily let GHC automatically fill it out using the DeriveGeneric language extension. The following is a simpler rewrite of the code:

    {-# LANGUAGE OverloadedStrings #-}
    {-# LANGUAGE DeriveGeneric #-}

    import Data.Aeson
    import qualified Data.ByteString.Lazy as B
    import GHC.Generics

    data Mathematician = Mathematician
      { name :: String
      , nationality :: String
      , born :: Int
      , died :: Maybe Int
      } deriving Generic

    instance FromJSON Mathematician

    main = do
      input <- B.readFile "input.json"
      let mm = decode input :: Maybe Mathematician
      case mm of
        Nothing -> print "error parsing JSON"
        Just m -> (putStrLn.greet) m

    greet m = (show.name) m ++ " was born in the year " ++ (show.born) m

Although Aeson is powerful and generalizable, it may be overkill for some simple JSON interactions. Alternatively, if we wish to use a very minimal JSON parser and printer, we can use Yocto, which can be downloaded from http://hackage.haskell.org/package/yocto.

Reading an XML file using the HXT package

Extensible Markup Language (XML) is an encoding of plain text to provide machine-readable annotations on a document. The standard is specified by W3C (http://www.w3.org/TR/2008/REC-xml-20081126/). In this recipe, we will parse an XML document representing an e-mail conversation and extract all the dates.
Getting ready We will first set up an XML file called input.xml with the following values, representing an e-mail thread between Databender and Princess on December 18, 2014 as follows: $ cat input.xml <thread> <email> <to>Databender</to> <from>Princess</from> <date>Thu Dec 18 15:03:23 EST 2014</date> <subject>Joke</subject> <body>Why did you divide sin by tan?</body> </email> <email> <to>Princess</to> <from>Databender</from> <date>Fri Dec 19 3:12:00 EST 2014</date> <subject>RE: Joke</subject> <body>Just cos.</body> </email> </thread> Using Cabal, install the HXT library which we use for manipulating XML documents: $ cabal install hxt How to do it... We only need one import, which will be for parsing XML, using the following line of code: import Text.XML.HXT.Core Define and implement main and specify the XML location. For this recipe, the file is retrieved from input.xml. Refer to the following code: main :: IO () main = do input <- readFile "input.xml" Apply the readString function to the input and extract all the date documents. We filter items with a specific name using the hasName :: String -> a XmlTree XmlTree function. Also, we extract the text using the getText :: a XmlTree String function, as shown in the following code snippet: dates <- runX $ readString [withValidate no] input //> hasName "date" //> getText We can now use the list of extracted dates as follows: print dates By running the code, we print the following output: $ runhaskell Main.hs ["Thu Dec 18 15:03:23 EST 2014", "Fri Dec 19 3:12:00 EST 2014"] How it works... The library function, runX, takes in an Arrow. Think of an Arrow as a more powerful version of a Monad. Arrows allow for stateful global XML processing. Specifically, the runX function in this recipe takes in IOSArrow XmlTree String and returns an IO action of the String type. We generate this IOSArrow object using the readString function, which performs a series of operations to the XML data. 
The //> operator searches the whole subtree below the current node, whereas /> only looks at its direct children. We use //> to find the date elements at any depth and then extract their text. As defined in the documentation, the hasName function tests whether a node has a specific name, and the getText function selects the text of a text node. Some other functions include the following:

isText: This is used to test for text nodes
isAttr: This is used to test for an attribute tree
hasAttr: This is used to test whether an element node has an attribute node with a specific name
getElemName: This is used to select the name of an element node

All the Arrow functions can be found in the Text.XML.HXT.Arrow.XmlArrow documentation at http://hackage.haskell.org/package/hxt/docs/Text-XML-HXT-Arrow-XmlArrow.html.

Capturing table rows from an HTML page

Mining Hypertext Markup Language (HTML) is often a matter of identifying and parsing only its structured segments. Not all text in an HTML file is useful, so we usually focus on a specific subset. For instance, HTML tables and lists provide a strong and commonly used structure to extract data from, whereas a paragraph in an article may be too unstructured and complicated to process. In this recipe, we will find a table on a web page and gather all of its rows for use in the program.
Getting ready

We will be extracting the values from an HTML table, so start by creating an input.html file containing a course listing table. The HTML behind this table is as follows:

    $ cat input.html
    <!DOCTYPE html>
    <html>
      <body>
        <h1>Course Listing</h1>
        <table>
          <tr> <th>Course</th> <th>Time</th> <th>Capacity</th> </tr>
          <tr> <td>CS 1501</td> <td>17:00</td> <td>60</td> </tr>
          <tr> <td>MATH 7600</td> <td>14:00</td> <td>25</td> </tr>
          <tr> <td>PHIL 1000</td> <td>9:30</td> <td>120</td> </tr>
        </table>
      </body>
    </html>

If not already installed, use Cabal to set up the HXT library and the split library, as shown in the following command lines:

    $ cabal install hxt
    $ cabal install split

How to do it...

We will need the hxt package for XML manipulations and the chunksOf function from the split package, as presented in the following code snippet:

    import Text.XML.HXT.Core
    import Data.List.Split (chunksOf)

Define and implement main to read the input.html file:

    main :: IO ()
    main = do
      input <- readFile "input.html"

Feed the HTML data into readString, setting withParseHTML to yes and optionally turning off warnings. Extract all the td tags and obtain the text they contain, as shown in the following code:

      texts <- runX $ readString [withParseHTML yes, withWarnings no] input
        //> hasName "td"
        //> getText

The data is now usable as a list of strings.
It can be converted into a list of lists, similar to how rows were represented in the earlier CSV recipe, as shown in the following code:

      let rows = chunksOf 3 texts
      print $ findBiggest rows

By folding through the data, identify the course with the largest capacity using the following code snippet:

    findBiggest :: [[String]] -> [String]
    findBiggest [] = []
    findBiggest items = foldl1 (\a x -> if capacity x > capacity a then x else a) items

    -- The capacity is the third column; malformed rows rank below every valid row.
    capacity :: [String] -> Int
    capacity [a,b,c] = toInt c
    capacity _ = -1

    toInt :: String -> Int
    toInt = read

Running the code will display the class with the largest capacity as follows:

    $ runhaskell Main.hs
    ["PHIL 1000","9:30","120"]

How it works...

This is very similar to XML parsing, except we adjust the options of readString to [withParseHTML yes, withWarnings no].
Packt
25 Jun 2014
7 min read

Exact Inference Using Graphical Models

Complexity of inference

A graphical model can be used to answer both probability queries and MAP queries. The most straightforward way to use the model is to generate the joint distribution and sum out all the variables except the ones we are interested in. However, the joint distribution is exactly where an exponential blowup happens: its size grows exponentially with the number of variables. In the worst case, exact inference is NP-hard. By the word exact, we mean specifying the probability values with a certain precision (say, five digits after the decimal point). Suppose we tone down our precision requirements (for example, only up to two digits after the decimal point). Now, is the (approximate) inference task any easier? Unfortunately not: even approximate inference is NP-hard. That is, getting values that are reliably better than random guessing (50 percent, or a probability of 0.5) can take exponential time in the worst case.

It might seem like inference is a hopeless task, but that is only in the worst case. In practice, we can use exact inference to solve certain classes of real-world problems (such as Bayesian networks that have a small number of discrete random variables). Of course, for larger problems, we have to resort to approximate inference.

Real-world issues

Since inference is NP-hard, inference engines are written in languages that are as close to bare metal as possible, usually C or C++. This leaves us with the following choices:

Use Python implementations of inference algorithms. Complete and mature packages for these are uncommon.
Use inference engines that have a Python interface, such as Stan (mc-stan.org). This choice strikes a good balance between writing Python code and running a fast inference implementation.
Use inference engines that do not have a Python interface, which is true for the majority of the inference engines out there. A fairly comprehensive list can be found at http://en.wikipedia.org/wiki/Bayesian_network#Software. The use of Python here is limited to creating a file that describes the model in a format that the inference engine can consume.

In the article on inference, we will stick to the first two choices in the list. We will use native Python implementations (of inference algorithms) to peek into the interiors of the inference algorithms while running toy-sized problems, and then use an external inference engine with a Python interface to try out a more real-world problem.

The tree algorithm

We will now look at another class of exact inference algorithms based on message passing. Message passing is a general mechanism, and there exist many variations of message passing algorithms. We shall look at a short snippet of the clique tree message passing algorithm (which is sometimes called the junction tree algorithm too). Other versions of the message passing algorithm are used in approximate inference as well.

We begin the discussion by clarifying some of the terms used. A cluster graph is an arrangement of a network in which groups of variables are placed in clusters. A cluster is similar to a factor in that each cluster has a set of variables in its scope. The message passing algorithm is all about passing messages between clusters. As an analogy, consider the gossip going on at a party, where Shelly and Clair are in a conversation. If Shelly knows B, C, and D, and she is chatting with Clair who knows D, E, and F (note that the only person they know in common is D), they can share information (or pass messages) about their common friend D.

In the message passing algorithm, two clusters are connected by a Separation Set (sepset), which contains the variables common to both clusters. Using the preceding example, the two clusters {B, C, D} and {D, E, F} are connected by the sepset {D}, which contains the only variable common to both clusters.

In the next section, we shall learn about the implementation details of the junction tree algorithm.
We will first understand the four stages of the algorithm and then use code snippets to learn about it from an implementation perspective.

The four stages of the junction tree algorithm

In this section, we will discuss the four stages of the junction tree algorithm.

In the first stage, the Bayes network is converted into a secondary structure called a join tree (alternate names for this structure in the literature are junction tree, cluster tree, or clique tree). The transformation from the Bayes network to the junction tree proceeds as per the following steps:

We construct a moral graph by changing all the directed edges to undirected edges. For every node at which a V-structure converges, we also connect its parents with an edge. This process, which we have already seen in the VE algorithm, is called moralization; the name presumably alludes to marrying the (apparently unmarried) parents of a common child node.
Then, we selectively add edges to the moral graph to create a triangulated graph. A triangulated graph is an undirected graph in which every cycle of four or more nodes has a chord (an edge joining two non-adjacent nodes of the cycle), so the longest chordless cycle has length 3.
From the triangulated graph, we identify the maximal fully connected subsets of nodes (called cliques).
Starting with the cliques as clusters, we arrange the clusters to form an undirected tree called the join tree, which satisfies the running intersection property. This property states that if a node appears in two cliques, it must also appear in all the cliques on the path that connects the two cliques.

In the second stage, the potentials at each cluster are initialized. The potentials are similar to a CPD or a table: they hold a value for each assignment to the variables in their scope. Both clusters and sepsets contain a set of potentials. The term potential is used as opposed to probability because in Markov networks, unlike probabilities, the values of the potentials are not obliged to sum to 1.
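Returning briefly to the first stage, two of its ideas are small enough to sketch directly: moralization and the running intersection property. The following is a minimal illustration in plain Python, not the implementation used by any particular inference package. The function names, the `parents` dictionary layout, and the toy network are all assumptions made for this example.

```python
from itertools import combinations


def moralize(parents):
    """Return the undirected edges of the moral graph.

    `parents` (a hypothetical layout) maps each node to a list of its
    parents in the directed graph.
    """
    edges = set()
    for child, pars in parents.items():
        # Drop the direction of every parent -> child edge.
        for p in pars:
            edges.add(frozenset((p, child)))
        # "Marry" the parents: connect every pair that shares this child.
        for p, q in combinations(pars, 2):
            edges.add(frozenset((p, q)))
    return edges


def has_running_intersection(clusters, tree_edges):
    """Check the running intersection property on a cluster tree.

    `clusters` is a list of sets of variables; `tree_edges` is a list of
    (i, j) index pairs that are assumed to form a tree over the clusters.
    """
    neighbours = {i: set() for i in range(len(clusters))}
    for i, j in tree_edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    for var in set().union(*clusters):
        holders = {i for i, c in enumerate(clusters) if var in c}
        # Walk the tree, stepping only through clusters that contain `var`.
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nb in neighbours[node] & holders:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        # If some holder was not reached, a cluster on the path lacks `var`.
        if seen != holders:
            return False
    return True


# Hypothetical toy network: A -> C <- B, C -> D
toy_parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
moral_edges = moralize(toy_parents)
```

In the toy network, A and B are both parents of C, so moralization adds an A-B edge on top of the three original links. The second function relies on the fact that paths in a tree are unique: a variable satisfies the property exactly when the clusters containing it form a connected piece of the tree.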
The third stage consists of message passing or belief propagation between neighboring clusters. Each message consists of a belief the cluster has about a particular variable. Messages can be passed asynchronously, but a cluster has to wait for information from its other neighbors before it collates that information and passes it on to the next cluster. It can be useful to think of a tree-structured cluster graph, where the message passing happens in two passes: an upward pass and a downward pass. Only after a node receives messages from all of its children will it send a message to its parent (in the "upward pass"), and only after the node receives a message from its parent will it send messages to its children (in the "downward pass").

The message passing stage completes when each cluster and its sepset have consistent beliefs. Recall that a cluster and the sepset connected to it have variables in common. For any such shared variable, the potential obtained from either the cluster or the sepset has the same value, which is why it is said that the cluster graph has consistent beliefs or that the cliques are calibrated.

Once the whole cluster graph has consistent beliefs, the fourth stage is marginalization, where we can query the marginal distribution for any variable in the graph.

Summary

We first explored the inference problem, where we studied the types of inference. We then learned that inference is NP-hard and understood that, for large networks, exact inference is infeasible.
Packt
23 Jun 2014
14 min read

Working with Pentaho Mobile BI

(For more resources related to this topic, see here.) We've always talked about using the Pentaho platform from a mobile device, trying to understand what there really is about it. On the Internet, there are some videos on it, but nothing can give you a clear idea of what it is and what can we do with it. We are proud to talk about it (maybe this is the first article that touches this topic), and we hope to clear any doubts regarding this platform. Pentaho Mobile is a web app available (see the previous screenshot for the web application's main screen) only with the Enterprise Edition Version of Pentaho, to let iPad users (and only the users on that device) have a wonderful experience with Pentaho on their mobile device. At the time of writing this article, no other mobile platform or devices were considered. It lets us interact with the Pentaho system more or less in the same way as we do with Pentaho User Console. These examples show what we can do with Pentaho Mobile and what we cannot do in a clear and detailed way to help understand if accessing Pentaho from a mobile platform could be helpful for our users. Only for this article, because we are on a mobile device, we will talk about touching (touch) instead of clicking as the action that activates something in the application. With this term, touch, we refer to the user's finger gesture instead of the normal mouse click. Different environments have different ways to interact! The examples in this article are based on the assumption that you have iPad device available to try each example and that you are able to successfully log in to Pentaho Mobile. In case we want to use demo users, remember that we can use the following logins to access our system: admin/password: This is the new Pentaho demo administrator after the famous user, joe (the Pentaho recognized administrator until Pentaho 4.8), was dismissed in this new version. suzy/password: This is another simple user we can use to access the system. 
Because suzy is not a member of the administrator role, it is useful to see the difference in case a user who is not an administrator tries to use the system. Accessing BA server from a mobile device Accessing Pentaho Mobile is as easy as accessing it from a Pentaho User Console. Just open our iPad browser (either Safari or Chrome) and point your browser to the Pentaho server. This example shows the basics of accessing and logging in to Pentaho from an iPad device through Pentaho Mobile. Remember that this example makes use of Pentaho Mobile, a web app that is available only for iPad and only in the EE Version of Pentaho. Getting ready To get ready for this example, the only thing we need is an iPad to connect to our Pentaho system. How to do it… The following steps detail how simply we access our Pentaho Mobile application: To connect to Pentaho Mobile, open either Safari or Chrome on the iPad device. As soon as the browser window is ready, type the complete URL to the Pentaho server in the following format: http://<ip_address>:<port>/pentaho Pentaho immediately detects that we are connecting from an iPad device, and the Pentaho Mobile login screen appears. Touch the Login button; the login dialog box appears as shown in the following screenshot. Enter your login credentials and press Login. The Login dialog box closes and we will be taken to Pentaho Mobile's home page. How it works… Pentaho Mobile has a slightly different look and feel with respect to Pentaho User Console in order to facilitate a mobile user's experience. The following screenshot shows the landing page we get after we have successfully logged in to Pentaho Mobile. To the left-hand side of the Pentaho Mobile's home page, we have the following set of buttons: Browse Files: This lets us start our navigation in the Pentaho Solution Repository. Create New Content: This lets us start the Pentaho Analyzer to create a new Analyser report from the mobile device. 
Analyser report content is the only kind of content we can create from our iPad. Dashboards and interactive reports can be created only from the Pentaho User Console. Startup Screen: This lets us change what we display as the default startup screen as soon as we log in to Pentaho Mobile. Settings: This changes the configuration settings for our Pentaho Mobile application. To the right-hand side of the button list (see the previous screenshot for details), we have three list boxes that display the Recent files we opened so far, the Favorites files, and the set of Open Files. The Open Files list box is more or less the same as the Opened perspective in Pentaho User Console—it collects all of the opened content in one single place for easy access. Look at the upper-right corner of Pentaho Mobile's user interface (see the previous screenshot for details); we have two icons: The folder icon gives access, from a different path, to the Pentaho Solution's folders The gear icon opens the Settings dialog box There's more… Now, let's see which settings we can either set or change from the mobile user interface by going to the Settings options. Changing the Settings configuration in Pentaho Mobile We can easily access the Settings dialog box either by pressing the Settings button in the left-hand side area of the Pentaho Mobile's home page or by pressing the gear icon in the upper-right corner of Pentaho Mobile. The Settings dialog box allows us to easily change the following configuration items (see the following screenshot for details): We can set Startup Screen by changing the referenced landing home page for our Pentaho Mobile application. In the Recent Files section of the Settings dialog, we can set the maximum number of items allowable in the Recent Files list. The default setting's value is 10, but we can alter this value by pressing the related icon buttons. Another button situated immediately below Recent Files, lets us easily empty the Recent Files list box. 
The next two buttons let us clear the Favorites items' list (Clear All Favorites) and reset the settings to the default values (Reset All Settings). Finally, we have a button to take us to a help guide and the application's Logout button. See also Look at the Accessing folders and files section to obtain details about how to browse the Pentaho Solution and start new content In the Changing the default startup Screen section, we will find details about how to change the default Pentaho Mobile session startup screen Accessing folders and files From our Pentaho Mobile, we can easily access and navigate the Pentaho Solution folders. This example will show how we can navigate the Pentaho Solution folders and start our content on the mobile device. Remember that this example makes use of Pentaho Mobile, a web app available only for iPad and only in the EE Version of Pentaho. How to do it… The following steps detail how simply we can access the Pentaho Solution folders and start an existing BI content: From the Pentaho Mobile home page, either touch on the Browse Files button located on the left-hand side of page, or touch on the Folder icon button located in the upper-right side of the home page. The Browse Files dialog opens to the right of the Pentaho Mobile user interface as shown in the following screenshot. Navigate the solution to the folder containing the content we want to start. As soon as we get to the content to start, touch on the content's icon to launch it. The content will be displayed in the entire Pentaho Mobile user interface screen. How it works… Accessing Pentaho objects from the Pentaho Mobile application is really intuitive. After you have successfully logged in, open the Browse Files dialog and navigate freely through the Pentaho Solution folder's structure to get to your content. To start the content, just touch the content icon and the report or the dashboard will display on your iPad. 
As we can see, at the time of writing this article, we cannot do any administrative tasks (share content, move content, schedule, and other tasks) from the Pentaho Mobile application. We can only navigate to the content, get it, and start it. There's more… As soon as we have some content items open, they are shown in the Opened list box. However, we would like to close them and free unused memory resources. Let's see how to do this. Closing opened content Pentaho Mobile continuously monitors the resource usage of our iPad and warns as soon as we have a lot of items open. As soon as we have a lot of opened items, a warning dialog box informs you about this, and it is a good opportunity to close some unused (and eventually forget the opened) content items. To do this, go to Pentaho Mobile's home page, look for items to close, and touch on the rounded x icon to the right of the content item's label (see the following screenshot for details). The content item will immediately close. Adding files to favorites As we saw in Pentaho User Console, even in the Pentaho Mobile application, we can set our favorites and start accessing content from the favorites list. This article will show how we can do this. Remember that this article makes use of Pentaho Mobile, a web app available only for iPad and only in the EE Version of Pentaho How to do it… The following steps detail how simply we can make a content item a favorite: From the Pentaho Mobile's home page, either touch on the Browse Files button located on the left-hand side of the page or touch on the Folder icon button located in the upper-right side of the home page. The Browse Files dialog opens to the right of the Pentaho Mobile user interface. Navigate the solution to the folder containing the content we want as a favorite. Touch on the star located to the right-hand side of the content item's label to mark that item a favorite. How it works… Usually, it would be useful to define some Pentaho objects as favorites. 
Favorite items help the user to quickly find the report or dashboard to start with. After we have successfully logged in, open the Browse Files dialog and navigate freely through the Pentaho Solution folders' structure to get to your content. To mark the content a favorite, just touch the star in the right-hand side of the content label and our report or dashboard will be marked as favorite (see the following screenshot for details). The favorite status of an item is identified by the following elements: The content item's star located to the right-hand side of the item's label becomes bold on the boundary to put in evidence that the content has been marked as a favorite The content will appear in the Favorites list box on the Pentaho Mobile home page There's more… What should we do if we want to remove the favorite status from our content items? Let's see how we can do this. Removing an item from the Favorites items list To remove an item from the Favorites list, we can follow two different approaches: Go to the Favorites items list on the Pentaho Mobile home page. Look for the item we want to un-favorite and touch on the star icon with the bold boundaries located on the right-hand side of the content item's label. The content item will be immediately removed from the Favorites items list. Navigate to the Pentaho Solution's folders to the location containing the item we want to un-favorite and touch on the star icon with the bold boundaries located to the right-hand side of the content item's label. The content item will be immediately removed from the Favorites items list. See also Take a look at the Accessing folders and files section in case you want to review how to access content in the Pentaho Solution to mark it as a favorite. Changing the default startup screen Imagine that we want to change the default startup screen with a specific content item we have in our Pentaho Solution. 
After the new startup screen has been set, after the login, the user will be able to immediately access this new content item opened as the startup screen for Pentaho Mobile instead of the default home page. It would be fine to let our CEO immediately have in front of them the company's main KPI dashboard and immediately react to it. This article will show you how to make a specific content item the default startup screen in Pentaho Mobile. Remember that this example makes use of Pentaho Mobile, a web app available only for iPad and only in the EE Version of Pentaho. How to do it… The following steps detail how simply we can define a new startup screen with an existing BI content: From the Pentaho Mobile home page, touch on the Startup Screen button located on the left-hand side of the home page. The Browse Files dialog opens to the right of the Pentaho Mobile user interface. Navigate the solution to the folder containing the content we want to use. Touch the content item we want to show as the default startup screen. The Browse Files dialog box immediately closes and the Settings dialog box opens. A reference to the new, selected item is shown as the default Startup Screen content item (see the following screenshot for details): Touch outside the Settings dialog to close this dialog. How it works… Changing the startup screen could be interesting to give your user access to important content any time immediately after a successful login. From the Pentaho Mobile's home page, touch on the Startup Screen button located on the left-hand side of the home page and open the Browse Files dialog. Navigate the solution to the folder containing the content we want and then touch the content item to show as the default startup screen. The Browse Files dialog box immediately closes and the Settings dialog box opens. The new selected item is shown as the default startup screen content item, referenced by Name, and the complete path to the Pentaho Solution folder is seen. 
We can change the startup screen at any time, and we can also reset it to the default Pentaho Mobile home page by touching on the Pentaho Default Home radio button. There's more… We have always showed pictures from Pentaho Mobile in landscape orientation, but the user interface has a responsive behavior, showing things organized differently depending on the orientation of the device. Pentaho Mobile's responsive behavior We always show pictures of Pentaho Mobile with a landscape orientation, but Pentaho Mobile has a responsive layout and changes the display of some of the items in the page we are looking at depending on the device's orientation. The following screenshot gives an idea about displaying a dashboard on Pentaho Mobile in portrait orientation: If we look at the home page with a device in the portrait mode, the Recent, Favorites, and Opened lists covers the available page's width, equally divided by each list; and all of the buttons we always saw on the left side of the user interface are now relocated to the bottom, below the three lists we talked about so far. This is another interesting layout; it is up to our taste or viewing needs to decide which of the two could be the best option for us. Summary In this article, we learned about accessing BA server from a mobile device, accessing files and folders, adding files to favorites, and changing the default startup screen from a mobile device. Resources for Article: Further resources on this subject: Getting Started with Pentaho Data Integration [article] Integrating Kettle and the Pentaho Suite [article] Installing Pentaho Data Integration with MySQL [article]
Packt
20 Jun 2014
11 min read

What is Quantitative Finance?

(For more resources related to this topic, see here.) Discipline 1 – finance (financial derivatives) In general, a financial derivative is a contract between two parties who agree to exchange one or more cash flows in the future. The value of these cash flows depends on some future event, for example, that the value of some stock index or interest rate being above or below some predefined level. The activation or triggering of this future event thus depends on the behavior of a variable quantity known as the underlying. Financial derivatives receive their name because they derive their value from the behavior of another financial instrument. As such, financial derivatives do not have an intrinsic value in themselves (in contrast to bonds or stocks); their price depends entirely on the underlying. A critical feature of derivative contracts is thus that their future cash flows are probabilistic and not deterministic. The future cash flows in a derivative contract are contingent on some future event. That is why derivatives are also known as contingent claims. This feature makes these types of contracts difficult to price. The following are the most common types of financial derivatives: Futures Forwards Options Swaps Futures and forwards are financial contracts between two parties. One party agrees to buy the underlying from the other party at some predetermined date (the maturity date) for some predetermined price (the delivery price). An example could be a one-month forward contract on one ounce of silver. The underlying is the price of one ounce of silver. No exchange of cash flows occur at inception (today, t=0), but it occurs only at maturity (t=T). Here t represents the variable time. Forwards are contracts negotiated privately between two parties (in other words, Over The Counter (OTC)), while futures are negotiated at an exchange. Options are financial contracts between two parties. 
One party (called the holder of the option) pays a premium to the other party (called the writer of the option) in order to have the right, but not the obligation, to buy some particular asset (the underlying) for some particular price (the strike price) at some particular date in the future (the maturity date). This type of contract is called a European Call contract. Example 1 Consider a one-month call contract on the S&P 500 index. The underlying in this case will be the value of the S&P 500 index. There are cash flows both at inception (today, t=0) and at maturity (t=T). At inception, (t=0) the premium is paid, while at maturity (t=T), the holder of the option will choose between the following two possible scenarios, depending on the value of the underlying at maturity S(T): Scenario A: To exercise his/her right and buy the underlying asset for K Scenario B: To do nothing if the value of the underlying at maturity is below the value of the strike, that is, S(T)<K The option holder will choose Scenario A if the value of the underlying at maturity is above the value of the strike, that is, S(T)>K. This will guarantee him/her a profit of S(T)-K. The option holder will choose Scenario B if the value of the underlying at maturity is below the value of the strike, that is, S(T)<K. This will guarantee him/her to limit his/her losses to zero. Example 2 An Interest Rate Swap (IRS) is a financial contract between two parties A and B who agree to exchange cash flows at regular intervals during a given period of time (the life of a contract). Typically, the cash flows from A to B are indexed to a fixed rate of interest, while the cash flows from B to A are indexed to a floating interest rate. The set of fixed cash flows is known as the fixed leg, while the set of floating cash flows is known as the floating leg. The cash flows occur at regular intervals during the life of the contract between inception (t=0) and maturity (t=T). 
An example could be a fixed-for-floating IRS in which one party pays a fixed rate of 5 percent on the agreed notional N every three months and receives EURIBOR3M on the agreed notional N every three months.

Example 3

A futures contract on a stock index also involves a single future cash flow (the delivery price) to be paid at the maturity of the contract. However, the payoff in this case is uncertain because how much profit I will get from this operation will depend on the value of the underlying at maturity. If the price of the underlying is above the delivery price, then the payoff I get (denoted by function H) is positive (indicating a profit) and corresponds to the difference between the value of the underlying at maturity S(T) and the delivery price K. If the price of the underlying is below the delivery price, then the payoff I get is negative (indicating a loss), and its size corresponds to the difference between the delivery price K and the value of the underlying at maturity S(T). This characteristic can be summarized in the following payoff formula:

$$H(S(T)) = S(T) - K \qquad \text{(Equation 1)}$$

Here, H(S(T)) is the payoff at maturity, which is a function of S(T).

Financial derivatives are very important to the modern financial markets. According to the Bank for International Settlements (BIS), as of December 2012, the amounts outstanding for OTC derivative contracts worldwide were as follows:

  • Foreign exchange derivatives: 67,358 billion USD
  • Interest rate derivatives: 489,703 billion USD
  • Equity-linked derivatives: 6,251 billion USD
  • Commodity derivatives: 2,587 billion USD
  • Credit default swaps: 25,069 billion USD

For more information, see http://www.bis.org/statistics/dt1920a.pdf.

Discipline 2 – mathematics

We need mathematical models to capture both the future evolution of the underlying and the probabilistic nature of the contingent cash flows we encounter in financial derivatives.
Regarding the contingent cash flows, these can be represented in terms of the payoff function H(S(T)) for the specific derivative we are considering. Because S(T) is a stochastic variable, the value of H(S(T)) ought to be computed as an expectation E[H(S(T))]. In order to compute this expectation, we need techniques that allow us to predict or simulate the behavior of the underlying into the future, so that we can compute the value of S(T) and, finally, the mean value of the payoff E[H(S(T))].

Regarding the behavior of the underlying, typically, this is formalized using Stochastic Differential Equations (SDEs), such as Geometric Brownian Motion (GBM), as follows:

$$dS = \mu S\,dt + \sigma S\,dW \qquad \text{(Equation 2)}$$

The previous equation fundamentally says that the change in a stock price (dS) can be understood as the sum of two effects: a deterministic effect (first term on the right-hand side) and a stochastic effect (second term on the right-hand side). The parameter \mu is called the drift, and the parameter \sigma is called the volatility. S is the stock price, dt is a small time interval, and dW is an increment in the Wiener process. This model is the most common model to describe the behavior of stocks, commodities, and foreign exchange. Other models exist, such as jump, local volatility, and stochastic volatility models, that enhance the description of the dynamics of the underlying.

Regarding the numerical methods, these correspond to ways in which the formal expression described in the mathematical model (usually in continuous time) is transformed into an approximate representation that can be used for calculation (usually in discrete time). This means that the SDE that describes the evolution of the price of some stock index into the future, such as the FTSE 100, is changed to describe the evolution at discrete intervals.
An approximate representation of an SDE can be calculated using the Euler approximation as follows:

$$S_{t+\Delta t} = S_t + \mu S_t\,\Delta t + \sigma S_t \sqrt{\Delta t}\,\varepsilon \qquad \text{(Equation 3)}$$

Here, \Delta t is the length of one time interval and \varepsilon is a draw from a standard normal distribution, so that \sqrt{\Delta t}\,\varepsilon plays the role of the Wiener increment dW in equation 2. The preceding equation needs to be solved in an iterative way for each time interval between now and the maturity of the contract. If these time intervals are days and the contract has a maturity of 30 days from now, then we compute tomorrow's price in terms of today's. Then we compute the price for the day after tomorrow as a function of tomorrow's price, and so on.

In order to price the derivative, we need to compute the expected payoff E[H(S(T))] at maturity and then discount it to the present. In this way, we are able to compute the fair premium associated with a European option contract with the help of the following equation, where r denotes the interest rate used for discounting (assumed constant here):

$$\text{premium} = e^{-rT}\,E[H(S(T))] \qquad \text{(Equation 4)}$$

Discipline 3 – informatics (C++ programming)

What is the role of C++ in pricing derivatives? Its role is fundamental. It allows us to implement the actual calculations that are required in order to solve the pricing problem. Using the preceding techniques to describe the dynamics of the underlying, we need to simulate many potential future scenarios describing its evolution. Say we want to price a futures contract on the EUR/USD exchange rate with one year maturity. We have to simulate the future evolution of EUR/USD for each day for the next year (using equation 3). We can then compute the payoff at maturity (using equation 1). However, in order to compute the expected payoff (using equation 4), we need to simulate thousands of such possible evolutions via a technique known as Monte Carlo simulation.

The set of steps required to complete this process is known as an algorithm. To price a derivative, we need to construct such an algorithm and then implement it in an advanced programming language such as C++. Of course, C++ is not the only possible choice; other languages include Java, VBA, C#, Mathworks Matlab, and Wolfram Mathematica.
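The simulate, compute payoff, average, and discount steps described above can be sketched compactly. The following is a minimal illustration in Python (used here only for brevity; it is not the article's C++ code) and rests on several assumptions that do not come from the article: the drift is set equal to the discount rate (risk-neutral pricing), all parameters are constant, the payoff is that of a European call, max(S(T) - K, 0), and every function name and parameter value is hypothetical. The closed-form Black-Scholes price is included purely as a sanity check on the simulation.

```python
import math
import random


def simulate_terminal_price(s0, mu, sigma, maturity, n_steps, rng):
    """Walk one Euler path of geometric Brownian motion and return S(T)."""
    dt = maturity / n_steps
    s = s0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)  # standard normal draw for the Wiener increment
        s += mu * s * dt + sigma * s * math.sqrt(dt) * z
    return s


def call_payoff(s_t, strike):
    """European call payoff at maturity: max(S(T) - K, 0)."""
    return max(s_t - strike, 0.0)


def monte_carlo_price(s0, strike, rate, sigma, maturity,
                      n_steps=30, n_paths=20000, seed=42):
    """Average the payoff over many simulated paths and discount it to today.

    Assumption: the drift equals the discount rate (risk-neutral pricing).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        s_t = simulate_terminal_price(s0, rate, sigma, maturity, n_steps, rng)
        total += call_payoff(s_t, strike)
    return math.exp(-rate * maturity) * total / n_paths


def black_scholes_call(s0, strike, rate, sigma, maturity):
    """Closed-form call price, used only to sanity-check the simulation."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    vol = sigma * math.sqrt(maturity)
    d1 = (math.log(s0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / vol
    d2 = d1 - vol
    return s0 * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)


if __name__ == "__main__":
    # Hypothetical contract: spot 100, strike 100, 5% rate, 20% volatility, 1 year.
    mc = monte_carlo_price(100.0, 100.0, 0.05, 0.2, 1.0)
    bs = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
    print(f"Monte Carlo: {mc:.3f}  closed form: {bs:.3f}")
```

The two printed numbers should be close but not identical: the Monte Carlo estimate carries sampling noise that shrinks as the number of paths grows, plus a small bias from the discrete Euler steps.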
However, C++ is an industry standard because it's flexible, fast, and portable. Also, through the years, several numerical libraries have been created to conduct complex numerical calculations in C++. Finally, C++ is a powerful modern object-oriented language.

It is always difficult to strike a balance between clarity and efficiency. We have aimed at making computer programs that are self-contained (not too object oriented) and self-explanatory. More advanced implementations are certainly possible, particularly in the context of larger financial pricing libraries in a corporate context. In this article, all the programs are implemented with the C++11 standard using Code::Blocks (http://www.codeblocks.org) and MinGW (http://www.mingw.org).

The Bento Box template

A Bento Box is a single-portion take-away meal common in Japanese cuisine. Usually, it has a rectangular form that is internally divided into compartments to accommodate the various types of portions that constitute a meal. In this article, we use the metaphor of the Bento Box to describe a visual template to facilitate, organize, and structure the solution of derivative problems. The Bento Box template is simply a form that we will fill sequentially with the different elements that we require to price derivatives in a logical, structured manner. The Bento Box template, when used to price a particular derivative, is divided into four areas or boxes, each containing information critical for the solution of the problem.

The following figure illustrates a generic template applicable to all derivatives:

[Figure: The Bento Box template – general case]

The following figure shows an example of the Bento Box template as applied to a simple European Call option:

[Figure: The Bento Box template – European Call option]

In the preceding figure, we have filled the various compartments, starting in the top-left box and proceeding clockwise.
Each compartment contains the details about our specific problem, taking us in sequence from the conceptual (box 1: derivative contract) to the practical (box 4: algorithm), passing through the quantitative aspects required for the solution (box 2: mathematical model and box 3: numerical method).

Summary

This article gave an overview of the main elements of Quantitative Finance as applied to pricing financial derivatives. The Bento Box template technique will be used to organize our approach to solving problems in pricing financial derivatives. We will assume that we are in possession of enough information to fill box 1 (derivative contract).

Packt
20 Jun 2014
8 min read

Machine Learning in Bioinformatics

Supervised learning for classification

Like clustering, classification is also about categorizing data instances, but in this case, the categories are known and are called class labels. Thus, it aims at identifying the category that a new data point belongs to. It uses a dataset where the class labels are known to find the pattern. Classification is an instance of supervised learning, where the learning algorithm takes a known set of input data and corresponding responses, called class labels, and builds a predictor model that generates reasonable predictions for the class labels in the unknown data.

To illustrate, let's imagine that we have gene expression data from cancer patients as well as healthy patients. The gene expression pattern in these samples can define whether the patient has cancer or not. In this case, if we have a set of samples for which we know the type of tumor, the data can be used to learn a model that can identify the type of tumor. In simple terms, it is a predictive function used to determine the tumor type. Later, this model can be applied to predict the type of tumor in unknown cases.

There are some do's and don'ts to keep in mind while learning a classifier. You need to make sure that you have enough data to learn the model. Learning with smaller datasets will not allow the model to learn the pattern in an unbiased manner, and you will end up with an inaccurate classification. Furthermore, the preprocessing steps (such as normalization) for the training and test data should be the same. Another important thing to take care of is to keep the training and test data distinct. Learning on the entire data and then using a part of this data for testing hides overfitting: the model is evaluated on samples it has already seen, so the measured accuracy is overly optimistic. It is always recommended that you take a look at the data manually and understand the question that you need to answer via your classifier.
There are several methods of classification. In this recipe, we will talk about some of these methods. We will discuss linear discriminant analysis (LDA), decision tree (DT), and support vector machine (SVM).

Getting ready

To perform the classification task, we need two preparations: first, a dataset with known class labels (the training set), and second, the data that the classifier has to be tested on (the test set). Besides this, we will use some R packages, which will be discussed when required. As a dataset, we will use the expression of approximately 2300 genes from tumor cells. The data has 83 data points with four different types of tumors. These will be used as our class labels. We will use 60 of the data points for training and the remaining 23 for testing. To find out more about the dataset, refer to the Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks article by Khan and others (http://research.nhgri.nih.gov/microarray/Supplement/). The set has been precompiled in a format that is readily usable in R and is available on the book's web page (code files) under the name cancer.rda.

How to do it…

To classify data points based on their features, perform the following steps:

1. First, load the MASS library, as it has some of the classification functions:

    > library(MASS)

2. Now, you need your data to learn and test the classifiers.
Load the data from the code files available on the book's web page (cancer.rda) as follows:

    > load("path/to/code/directory/cancer.rda") # located in the code file directory for the chapter, assign the path accordingly

3. Randomly sample 60 data points for the training set and keep the remaining 23 for the test set. Ensure that these two datasets do not overlap and are not biased towards any specific tumor type (random sampling):

    > train_row <- sample(1:nrow(mldata), 60) # sample 60 row indexes (this line is absent from the original listing and is assumed here)
    > train <- mldata[train_row,] # use sampled indexes to extract training data
    > test <- mldata[-train_row,] # test set is selected by taking all the other data points

4. For the training data, retain the class labels, which are in the tumor column here, and remove this information from the test data. However, store this information for comparison purposes:

    > testClass <- test$tumor
    > test$tumor <- NULL

5. Now, try the linear discriminant analysis classifier, as follows, to get the classifier model:

    > myLD <- lda(tumor ~ ., train) # might issue a warning

6. Test this classifier to predict the labels on your test set, as follows:

    > testRes_lda <- predict(myLD, test)

7. To check the number of correct and incorrect predictions, simply compare the predicted classes with the testClass object, which was created in step 4, as follows:

    > sum(testRes_lda$class == testClass) # correct prediction
    [1] 19
    > sum(testRes_lda$class != testClass) # incorrect prediction
    [1] 4

8. Now, try another simple classifier called DT.
For this, you need the rpart package:

    > library(rpart)

9. Create the decision tree based on your training data, as follows:

    > myDT <- rpart(tumor ~ ., data = train, control = rpart.control(minsplit = 10))

10. Plot your tree by typing the following commands:

    > plot(myDT)
    > text(myDT, use.n=T)

The resulting plot shows the cutoff for each feature (represented by the branches) used to differentiate between the classes:

[Figure: The tree for DT-based learning]

11. Now, test the decision tree classifier on your test data using the following prediction function:

    > testRes_dt <- predict(myDT, newdata = test)

12. Take a look at the class that each data instance is assigned to by the classifier, as follows (1 if predicted in the class, else 0):

    > classes <- round(testRes_dt)
    > head(classes)
       BL EW NB RM
    4   0  0  0  1
    10  0  0  0  1
    15  1  0  0  0
    16  0  0  1  0
    18  0  1  0  0
    21  0  1  0  0

13. Finally, you'll work with SVMs. To be able to use them, you need another R package named e1071, as follows:

    > library(e1071)

14. Create the svm classifier from the training data as follows:

    > mySVM <- svm(tumor ~ ., data = train)

15. Then, use the learned model (the mySVM object) to predict the labels for the test data. You will see the predicted labels for each instance as follows:

    > testRes_svm <- predict(mySVM, test)
    > testRes_svm

How it works…

We started our recipe by loading the input data on tumors. The supervised learning methods we saw in the recipe used two datasets: the training set and the test set. The training set carries the class label information. In the first part of most of the learning methods shown here, the training set is used to identify a pattern and to model it so as to distinguish between the classes. This model is then applied to the test set, which does not have the class label data, to predict the class labels. To identify the training and test sets, we first randomly sample 60 indexes out of the entire data and use the remaining 23 for testing purposes.
The supervised learning methods explained in this recipe each follow a different principle. LDA attempts to model the difference between classes based on a linear combination of the features. This combination function forms the model based on the training set and is used to predict the classes in the test set. The LDA model trained on 60 samples is then used to predict for the remaining 23 cases.

DT is, however, a different method. It builds a classification tree, that is, a set of rules that distinguishes one class from another. The tree learned on a training set is applied to predict classes in test sets or other similar datasets.

SVM is a relatively complex technique of classification. It aims to create one or more hyperplanes in the feature space, making the data points separable along these planes. This is done on a training set and is then used to assign classes to new data points. In general, LDA separates classes using a linear combination of features, while SVM separates them using hyperplanes in a multidimensional feature space. In this recipe, we used the svm functionality from the e1071 package, which has many other utilities for learning.

We can compare the results obtained by the models we used in this recipe (they can be computed using the provided code on the book's web page).

There's more...

One of the most popular classifier tools in the machine learning community is WEKA. It is a Java-based tool that implements many algorithms to perform classification tasks using DT, LDA, Random Forest, and so on. R supports an interface to WEKA with a library named RWeka. It is available on the CRAN repository at http://cran.r-project.org/web/packages/RWeka/. It relies on RWekajars, a separate package, for the Java libraries that implement the different classifiers.

See also

The Elements of Statistical Learning book by Hastie, Tibshirani, and Friedman at http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf, which provides more information on LDA, DT, and SVM
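The workflow that this recipe follows for every classifier (draw disjoint training and test rows, fit on the training rows, predict the held-out rows, count the matches) can also be sketched independently of R. The following Python sketch is only an illustration under loud assumptions: the data is synthetic (two made-up features, not the gene expression set), the nearest-centroid classifier is a stand-in and is not one of the three methods used in the recipe, and all function names are hypothetical. Only the class names and the 60/23 split are borrowed from the recipe.

```python
import random


def split_rows(n_rows, n_train, seed=0):
    """Return disjoint, randomly chosen training and test row indexes."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    return sorted(rows[:n_train]), sorted(rows[n_train:])


def fit_centroids(features, labels):
    """Average the feature vectors of each class (a stand-in classifier)."""
    sums, counts = {}, {}
    for row, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(centroids, key=distance)


def count_correct(predicted, actual):
    """Number of predictions that match the stored class labels."""
    return sum(p == a for p, a in zip(predicted, actual))


def make_toy_data(n_rows=83, seed=1):
    """Synthetic stand-in data: four well-separated classes, two features."""
    rng = random.Random(seed)
    centers = {"BL": (0.0, 0.0), "EW": (5.0, 0.0), "NB": (0.0, 5.0), "RM": (5.0, 5.0)}
    names = list(centers)
    features, labels = [], []
    for i in range(n_rows):
        label = names[i % len(names)]
        cx, cy = centers[label]
        features.append([cx + rng.gauss(0.0, 0.5), cy + rng.gauss(0.0, 0.5)])
        labels.append(label)
    return features, labels


if __name__ == "__main__":
    features, labels = make_toy_data()
    train_rows, test_rows = split_rows(len(features), 60)
    model = fit_centroids([features[i] for i in train_rows],
                          [labels[i] for i in train_rows])
    test_class = [labels[i] for i in test_rows]  # kept aside for comparison only
    predicted = [predict(model, features[i]) for i in test_rows]
    correct = count_correct(predicted, test_class)
    print("correct:", correct, "incorrect:", len(test_rows) - correct)
```

Because the test labels are set aside before prediction and the two index lists never overlap, the count of correct predictions is measured on rows the model has not seen, which is the property the recipe asks you to preserve.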

Packt
12 Jun 2014
6 min read

Signal Processing Techniques

Introducing the Sunspot data

Sunspots are dark spots visible on the Sun's surface. This phenomenon has been studied for many centuries by astronomers. Evidence has been found for periodic sunspot cycles. We can download up-to-date annual sunspot data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. This is provided by the Belgian Solar Influences Data Analysis Center. The data goes back to 1700 and contains more than 300 annual averages.

In order to determine sunspot cycles, scientists successfully used the Hilbert-Huang transform (refer to http://en.wikipedia.org/wiki/Hilbert%E2%80%93Huang_transform). A major part of this transform is the so-called Empirical Mode Decomposition (EMD) method. The entire algorithm contains many iterative steps, and we will cover only some of them here. EMD reduces data to a group of Intrinsic Mode Functions (IMF). You can compare this to the way the Fast Fourier Transform decomposes a signal into a superposition of sine and cosine terms.

Extracting IMFs is done via a sifting process. The sifting of a signal is related to separating out components of a signal one at a time. The first step of this process is identifying local extrema. We will perform the first step and plot the data with the extrema we found. Let's download the data in CSV format. We also need to reverse the array to have it in the correct chronological order. The following code snippet finds the indices of the local minima and maxima respectively:

    mins = signal.argrelmin(data)[0]
    maxs = signal.argrelmax(data)[0]

Now we need to concatenate these arrays and use the indices to select the corresponding values.
The following code accomplishes that and also plots the data:

    import numpy as np
    import sys
    import matplotlib.pyplot as plt
    from scipy import signal

    # The CSV file name is passed as the first command-line argument
    data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)
    # reverse order
    data = data[::-1]

    mins = signal.argrelmin(data)[0]
    maxs = signal.argrelmax(data)[0]
    extrema = np.concatenate((mins, maxs))

    year_range = np.arange(1700, 1700 + len(data))

    plt.plot(1700 + extrema, data[extrema], 'go')
    plt.plot(year_range, data)
    plt.show()

In the resulting chart, the extrema are indicated with dots.

Sifting continued

The next steps in the sifting process require us to interpolate the minima and the maxima with a cubic spline. This creates an upper envelope and a lower envelope, which should surround the data. The mean of the envelopes is needed for the next iteration of the EMD process. We can interpolate the minima with the following code snippet:

    spl_min = interpolate.interp1d(mins, data[mins], kind='cubic')
    min_rng = np.arange(mins.min(), mins.max())
    l_env = spl_min(min_rng)

Similar code can be used to interpolate the maxima. We need to be aware that the interpolation results are only valid within the range over which we are interpolating. This range starts at the first occurrence of a minimum/maximum and ends at the last occurrence of a minimum/maximum. Unfortunately, the interpolation ranges we can define in this way for the maxima and minima do not match perfectly. So, for the purpose of plotting, we need to extract a shorter range that lies within both the maxima and minima interpolation ranges.
Have a look at the following code:

    import numpy as np
    import sys
    import matplotlib.pyplot as plt
    from scipy import signal
    from scipy import interpolate

    data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)
    # reverse order
    data = data[::-1]

    mins = signal.argrelmin(data)[0]
    maxs = signal.argrelmax(data)[0]
    extrema = np.concatenate((mins, maxs))

    year_range = np.arange(1700, 1700 + len(data))

    spl_min = interpolate.interp1d(mins, data[mins], kind='cubic')
    min_rng = np.arange(mins.min(), mins.max())
    l_env = spl_min(min_rng)

    spl_max = interpolate.interp1d(maxs, data[maxs], kind='cubic')
    max_rng = np.arange(maxs.min(), maxs.max())
    u_env = spl_max(max_rng)

    # Range that lies within both interpolation ranges
    inclusive_rng = np.arange(max(min_rng[0], max_rng[0]), min(min_rng[-1], max_rng[-1]))
    mid = (spl_max(inclusive_rng) + spl_min(inclusive_rng)) / 2

    plt.plot(year_range, data)
    plt.plot(1700 + min_rng, l_env, '-x')
    plt.plot(1700 + max_rng, u_env, '-x')
    plt.plot(1700 + inclusive_rng, mid, '--')
    plt.show()

The chart produced by the code shows the observed data, with the computed envelopes and mid line. Obviously, negative values don't make any sense in this context. However, for the algorithm we only need to care about the mid line of the upper and lower envelopes. In these first two sections, we basically performed the first iteration of the EMD process. The algorithm is a bit more involved, so we will leave it up to you whether or not you want to continue with this analysis on your own.

Moving averages

Moving averages are tools commonly used to analyze time-series data. A moving average defines a window of previously seen data that is averaged each time the window slides forward one period. The different types of moving average differ essentially in the weights used for averaging. The exponential moving average, for instance, has exponentially decreasing weights with time. This means that older values have less influence than newer values, which is sometimes desirable.
We can express exponentially decreasing weights as follows in NumPy code, where N is the window length:

    weights = np.exp(np.linspace(-1., 0., N))
    weights /= weights.sum()

A simple moving average uses equal weights which, in code, looks as follows:

    def sma(arr, n):
        weights = np.ones(n) / n
        return np.convolve(weights, arr)[n-1:-n+1]

The following code plots the simple moving average for the 11- and 22-year sunspot cycle:

    import numpy as np
    import sys
    import matplotlib.pyplot as plt

    data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)
    # reverse order
    data = data[::-1]

    year_range = np.arange(1700, 1700 + len(data))

    def sma(arr, n):
        weights = np.ones(n) / n
        # keep only the positions where the window fully overlaps the data
        return np.convolve(weights, arr)[n-1:-n+1]

    sma11 = sma(data, 11)
    sma22 = sma(data, 22)

    plt.plot(year_range, data, label='Data')
    plt.plot(year_range[10:], sma11, '-x', label='SMA 11')
    plt.plot(year_range[21:], sma22, '--', label='SMA 22')
    plt.legend()
    plt.show()

In the resulting plot, we see the original data and the simple moving averages for 11- and 22-year periods. As you can see, moving averages are not a good fit for this data; this is generally the case for sinusoidal data.

Summary

This article gave us examples of signal processing and time series analysis. We looked at the sifting process and performed the first iteration of the EMD process. We also learned about moving averages, which are tools commonly used to analyze time-series data.
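To make the difference between the two weighting schemes from the Moving averages section concrete, here is a small comparison written in plain Python so that it runs without NumPy. It is only a sketch: the helper names are hypothetical, and it assumes the convention that the last weight in a window applies to the newest value, which is one reasonable reading of "older values have less influence than newer values".

```python
import math


def equal_weights(n):
    """Weights of a simple moving average: every value counts the same."""
    return [1.0 / n] * n


def exponential_weights(n):
    """Normalised exp(linspace(-1, 0, n)); the last (newest) weight is the largest."""
    if n == 1:
        return [1.0]
    raw = [math.exp(-1.0 + i / (n - 1)) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def moving_average(values, weights):
    """Weighted average over each full window of len(weights) values.

    weights[-1] multiplies the newest value in each window (assumed convention).
    """
    n = len(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + n]))
        for i in range(len(values) - n + 1)
    ]


if __name__ == "__main__":
    series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # made-up, steadily rising data
    print("simple:     ", moving_average(series, equal_weights(3)))
    print("exponential:", moving_average(series, exponential_weights(3)))
```

On a rising series, the exponentially weighted average sits above the simple one because it leans towards the most recent values. If you port this to np.convolve, keep in mind that convolution reverses the kernel, so the order of the weights matters as soon as they are unequal.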
Packt
23 May 2014
11 min read

Interacting with Data for Dashboards

Hierarchies for revealing the dashboard message

It can become difficult to manage data, particularly if you have many columns, and even more so if they are similarly named. As you'd expect, Tableau helps you to organize your data so that it is easier to navigate and keep track of everything. From the user perspective, hierarchies improve navigation and use by allowing the users to navigate from a headline down to a detailed level. From the Tableau perspective, hierarchies are groups of columns that are arranged in increasing levels of granularity. Each deeper level of the hierarchy refers to more specific details of the data.

Some hierarchies are natural hierarchies, such as date. For example, when Tableau works out that a column is a date, it automatically adds a hierarchy in this order: year, quarter, month, week, and date. You have seen this already: when you dragged a date across to the Columns shelf, Tableau automatically turned the date into a year.

Some hierarchies are not always immediately visible. These hierarchies need to be set up, and we will look at setting up a product hierarchy that spans different tables. This is a nice feature because it means that the hierarchy can reflect the users' understanding of the data and isn't determined only by the underlying data.

Getting ready

In this article, we will continue to use the workbook that you created earlier, with the same data. Let's take a copy of the existing worksheet and call it Hierarchies. To do this, right-click on the Worksheet tab and select the Duplicate Sheet option. You can then rename the sheet to Hierarchies.

How to do it...

Navigate to the DimProductCategory dimension and right-click on the EnglishProductCategoryName attribute. From the pop-up menu, select the Create Hierarchy feature.
You can see its location in the following illustration.

When you select the option, you will get a textbox entitled Create Hierarchy, which will ask you to specify the name of the hierarchy. We will call our hierarchy Product Category. Once you have entered this into the textbox, click on OK. Your hierarchy will now be created, and it will appear at the bottom of the Dimensions list on the left-hand side of Tableau's interface.

Next, go to the DimProductSubcategory dimension and look for the EnglishProductSubcategoryName attribute. Drag it to the Product Category hierarchy under EnglishProductCategoryName, which is already part of the Product Category hierarchy.

Now we will add the EnglishProductName attribute, which we will find under the DimProduct dimension. Drag-and-drop it under the EnglishProductSubcategoryName attribute that is already under the Product Category hierarchy. The Product Category hierarchy should now look as shown in the following illustration.

The Product Category hierarchy will be easier to understand if we rename the attributes. To do this, right-click on each attribute and choose Rename. Change EnglishProductCategoryName to Product Category, EnglishProductSubcategoryName to Product Subcategory, and EnglishProductName to Product. Once you have done this, the hierarchy should look as shown in the following illustration.

You can now use your hierarchy to change the details that you wish to see in the data visualization. Now, we will use the Product Category hierarchy in our data visualization rather than an individual dimension. Remove everything from the Rows shelf and drag the Product Category hierarchy to the Rows shelf. Then, click on the plus sign; it will open the hierarchy, and you will see data for the next level under Product Category, which is the subcategories. An example of the Tableau workbook is given in the following illustration.
You can see that the biggest differences occurred in the Bikes product category, and they occurred in the years 2006 and 2007 for the Mountain Bikes and Road Bikes subcategories.

To summarize, we have used the Hierarchy feature in Tableau to vary the degree of analysis we see in the dashboard.

How it works…

Tableau saves the additional information as part of the Tableau workbook. When you share the workbook, the hierarchies will be preserved. The Tableau workbook will need revisions if a hierarchy is changed, or if you add new dimensions that need to be included in it, so hierarchies may need some additional maintenance. However, they are very useful features and worth the little extra effort in order to help the dashboard user.

There's more...

Dashboarding data usually involves providing "at a glance" information for team members to clearly see the issues in the data and to make actionable decisions. Often, we don't need to provide further information unless we are asked for it, and hierarchies are a very useful feature for answering those more detailed questions when they arise. They also save us space on the page.

Let's take the example of a business meeting where the CEO wants to know more about the biggest differences or "swings" in the sales amount by category, and then wants more details. The Tableau analyst can quickly place a hierarchy in order to answer more detailed questions if required, and this is done quite simply as described here.

Hierarchies also allow us to encapsulate business rules into the dashboard. In this article, we used product hierarchies. We could also add in hierarchies for different calendars, for example, in order to reflect different reporting periods. This will allow the dashboard to be easily reused in order to reflect different reporting calendars, say, if you want to show data according to a fiscal year or a calendar year.
You could have two different hierarchies: one for the fiscal year and the other for the calendar year. The dashboard could contain the same measures but sliced by different calendars according to user requirements. The hierarchies feature fits nicely with the Golden Mantra of Information Visualization, since it allows us to summarize the data and then drill down into it as the next step.

See also

http://www.tableausoftware.com/about/blog/2013/4/lets-talk-about-sets-23043

Classifying your data for dashboards

Bins are a simple way of categorizing and bucketing values, depending on the measure value. So, for example, you could "bin" customers depending on their age group or the number of cars that they own. Bins are useful for dashboards because they offer a summary view of the data, which is essential for the "at a glance" function of dashboards. Tableau can create bins automatically, or we can set up bins manually using calculated fields. This article will show both versions in order to meet the business needs.

Getting ready

In this article, we will continue to use the workbook that you created earlier, with the same data. Let's take a copy of the Hierarchies worksheet: right-click on the Worksheet tab and select the Duplicate Sheet option. You can then rename the sheet to Bins.

How to do it...

Once you have your Bins worksheet in place, right-click on the SalesAmount measure and select the Create Bin option. You can see an example of this in the following screenshot.

We will change the bin size to 5. Once you've done this, press the Load button to reveal the Min, Max, and Diff values of the data, as shown in the following screenshot.

When you click on the OK button, you will see a bin appear under the Dimensions area.

Let's test out our bins! To do this, remove everything from the Rows shelf, leaving only the Product Category hierarchy.
Remove any filters from the worksheet and all of the calculations in the Marks shelf. Next, drag SalesAmount (bin) to the Marks area under the Detail and Tooltip buttons. Once again, take SalesAmount (bin) and drag it to the Color button on the Marks shelf. Now, we will change the size of the data points to reflect the size of the elements. To do this, drag SalesAmount (bin) to the Size button. You can vary the overall size of the elements by right-clicking on the Size button and moving the slider horizontally so that you can get your preferred size. To neaten the image, right-click on the Date column heading and select Hide Field Names for Columns from the list. The Tableau worksheet should now look as follows: This allows us to see some patterns in the data. We can also see more details if we click on the data points; you can see an illustration of the details in the data in the following screenshot: However, we might find that the automated bins are not very clear to business users. We can see in the previous screenshot that the SalesAmount(bin) value is £2,440.00. This may not be meaningful to business users. How can we set the bins so that they are meaningful to business users, rather than being automated by Tableau? For example, what if the business team wants to know about the proportion of their sales that fell into well-defined buckets, sliced by years? Fortunately, we can emulate the same behavior as in bins by simply using a calculated field. We can create a very simple IF… THEN ... ELSEIF formula that will place the sales amounts into buckets, depending on the value of the sales amount. These buckets are manually defined using a calculated field, and we will see how to do this now. Before we begin, take a copy of the existing worksheet called Bins and rename it to Bins Set Manually. To do this, right-click on the Sales Amount metric and choose the Create Calculated Field option. 
In the calculated field, enter the following formula: If [SalesAmount] <= 1000 THEN "1000" ELSEIF [SalesAmount] <= 2000 THEN "2000" ELSEIF [SalesAmount] <= 3000 THEN "3000" ELSEIF [SalesAmount] <= 4000 THEN "4000" ELSEIF [SalesAmount] <= 5000 THEN "5000" ELSEIF [SalesAmount] <= 6000 THEN "6000" ELSE "7000" END When this formula is entered into the Calculated Field window, it looks like what the following screenshot shows. Rename the calculated field to SalesAmount Buckets. Now that we have our calculated field in place, we can use it in our Tableau worksheet to create a dashboard component. On the Columns shelf, place the SalesAmount Buckets calculated field and the Year(Date) dimension attribute. On the Rows shelf, place Sum(SalesAmount) from the Measures section. Place the Product Category hierarchy on the Color button. Drag SalesAmount Buckets from the Dimensions pane to the Size button on the Marks shelf. Go to the Show Me panel and select the Circle View option. This will provide a dot plot feel to data visualization. You can resize the chart by hovering the mouse over the foot of the y axis where the £0.00 value is located. Once you're done with this, drag-and-drop the activities. The Tableau worksheet will look as it appears in the following screenshot: To summarize, we have created bins using Tableau's automatic bin feature. We have also looked at ways of manually creating bins using the Calculated Field feature. How it works... Bins are constructed using a default Bins feature in Tableau, and we can use Calculated Fields in order to make them more useful and complex. They are stored in the Tableau workbook, so you will be able to preserve your work if you send it to someone else. In this article, we have also looked at dot plot visualization, which is a very simple way of representing data that does not use a lot of "ink". The data/ink ratio is useful to simplify a data visualization in order to get the message of the data across very clearly. 
Dot plots might be considered old-fashioned, but they are very effective and are perhaps underused. We can see from the screenshot that the 3000 bucket contained the highest sales amount. We can also see that this figure peaks in the year 2007 and then falls in 2008. This is a dashboard element that could be used as a start for further analysis. For example, business users will want to know the reason for the fall in sales for the highest occurring "bin". See also Visual Display of Quantitative Information, Edward Tufte, Graphics Press USA
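The manual bucketing formula used in this recipe can also be sketched outside Tableau. The following Python snippet is only an illustration and not part of the Tableau workbook: the function name sales_amount_bucket and the sample amounts are hypothetical, and the thresholds simply mirror the IF ... ELSEIF calculated field.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the buckets, mirroring the calculated field's thresholds.
BUCKET_UPPER_BOUNDS = [1000, 2000, 3000, 4000, 5000, 6000]


def sales_amount_bucket(amount):
    """Return the bucket label for a sales amount (hypothetical helper)."""
    # bisect_left finds the first bound that is >= amount, i.e. amount <= bound.
    index = bisect_left(BUCKET_UPPER_BOUNDS, amount)
    if index < len(BUCKET_UPPER_BOUNDS):
        return str(BUCKET_UPPER_BOUNDS[index])
    return "7000"  # the ELSE branch: everything above 6000


# Made-up amounts, only to show the bucketing.
sample_amounts = [250.0, 1000.0, 2440.0, 5999.99, 8100.0]
bucket_counts = Counter(sales_amount_bucket(a) for a in sample_amounts)
print(bucket_counts)
```

Keeping the thresholds in one list makes it easy to agree the bucket boundaries with the business team before copying them into the calculated field.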

Packt
22 May 2014
13 min read

A/B Testing – Statistical Experiments for the Web

(For more resources related to this topic, see here.) Defining A/B testing At its most fundamental level, A/B testing just involves creating two different versions of a web page. Sometimes, the changes are major redesigns of the site or the user experience, but usually, the changes are as simple as changing the text on a button. Then, for a short period of time, new visitors are randomly shown one of the two versions of the page. The site tracks their behavior, and the experiment determines whether one version or the other increases the users' interaction with the site. This may mean more click-through, more purchases, or any other measurable behavior. This is similar to other methods in other domains that use different names. The basic framework randomly tests two or more groups simultaneously and is sometimes called random-controlled experiments or online-controlled experiments. It's also sometimes referred to as split testing, as the participants are split into two groups. These are all examples of between-subjects experiment design. Experiments that use these designs all split the participants into two groups. One group, the control group, gets the original environment. The other group, the test group, gets the modified environment that those conducting the experiment are interested in testing. Experiments of this sort can be single-blind or double-blind. In single-blind experiments, the subjects don't know which group they belong to. In double-blind experiments, those conducting the experiments also don't know which group the subjects they're interacting with belong to. This safeguards the experiments against biases that can be introduced by participants being aware of which group they belong to. For example, participants could get more engaged if they believe they're in the test group because this is newer in some way. Or, an experimenter could treat a subject differently in a subtle way because of the group that they belong to. 
As the computer is the one that directly conducts the experiment, and because those visiting your website aren't aware of which group they belong to, website A/B testing is generally an example of a double-blind experiment. Of course, this is an argument for only conducting the test on new visitors. Otherwise, the user might recognize that the design has changed and throw the experiment off. For example, the users may be more likely to click on a new button when they recognize that the button is, in fact, new. However, if they are new to the site as a whole, then the button itself may not stand out enough to warrant extra attention. In some cases, an experiment can test more than two variants of the site. This divides the test subjects into more groups. There need to be more subjects available in order to compensate for this. Otherwise, the experiment's statistical validity might be in jeopardy. If each group doesn't have enough subjects, and therefore observations, then there is a larger error rate for the test, and results will need to be more extreme to be significant. In general, though, you'll want to have as many subjects as you reasonably can. Of course, this is always a trade-off. Getting 500 or 1000 subjects may take a while, given the typical traffic of many websites, but you still need to take action within a reasonable amount of time and put the results of the experiment into effect. So we'll talk later about how to determine the number of subjects that you actually need to get a certain level of significance. Another wrinkle is that you'll want to know as soon as possible whether one option is clearly better so that you can begin to profit from it early. This is the trade-off between exploration and exploitation that is familiar from the multi-armed bandit problem. It refers to the tension in experiment design (and in other domains) between exploring the problem space and exploiting the resources you've found in the experiment so far.
We won't get into this further, but it is a factor to stay aware of as you perform A/B tests in the future. Because of the power and simplicity of A/B testing, it's being widely used in a variety of domains. For example, marketing and advertising make extensive use of it. Also, it has become a powerful way to test and improve measurable interactions between your website and those who visit it online. The primary requirement is that the interaction be somewhat limited and very measurable. Interesting would not make a good metric; the click-through rate or pages visited, however, would. Because of this, A/B tests validate changes in the placement or in the text of buttons that call for action from the users. For example, a test might compare the performance of Click for more! against Learn more now!. Another test may check whether a button placed in the upper-right section increases sales versus one in the center of the page. These changes are all incremental, and you probably don't want to break a large site redesign into pieces and test all of them individually. In a larger redesign, several changes may work together and reinforce each other. Testing them incrementally and only applying the ones that increase some metric can result in a design that's not aesthetically pleasing, is difficult to maintain, and costs you users in the long run. In these cases, A/B testing is not recommended. Some other things that are regularly tested in A/B tests include the following parts of a web page: The wording, size, and placement of a call-to-action button The headline and product description The length, layout, and fields in a form The overall layout and style of the website as a larger test, which is not broken down The pricing and promotional offers of products The images on the landing page The amount of text on a page Now that we have an understanding of what A/B testing is and what it can do for us, let's see what it will take to set up and perform an A/B test. 
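The random split of new visitors into a control group and a test group can be sketched in a few lines. This is a minimal illustration, assuming each new visitor already has some identifier. Hashing the identifier, rather than drawing a random number on every request, is a design choice of this sketch so that a returning visitor keeps seeing the same version. The function name assign_group and the experiment label are hypothetical.

```python
import hashlib


def assign_group(visitor_id, experiment="button-copy", groups=("control", "test")):
    """Deterministically map a visitor to one group (hypothetical helper)."""
    # Hash the experiment name together with the visitor id so that different
    # experiments split the same visitors independently.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the hash as an integer and reduce it to a group index.
    return groups[int(digest, 16) % len(groups)]


# Made-up visitor ids, only to show that the split is roughly even.
counts = {"control": 0, "test": 0}
for visitor_number in range(10_000):
    counts[assign_group(f"visitor-{visitor_number}")] += 1
print(counts)
```

Passing a longer groups tuple gives a split across more than two variants, at the cost of fewer subjects per group.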
Conducting an A/B test In creating an A/B test, we need to decide several things, and then we need to put our plan into action. We'll walk through those decisions here and create a simple set of web pages that will test the aspects of design that we are interested in changing, based upon the behavior of the user. Before we start building stuff, though, we need to think through our experiment and what we'll need to build. Planning the experiment For this article, we're going to pretend that we have a website for selling widgets (or rather, looking at the website Widgets!). The web page in this screenshot is the control page. Currently, we're getting 24 percent click-through on it from the Learn more! button. We're interested in the text of the button. If it read Order now! instead of Learn more!, it might generate more click-through. (Of course, actually explaining what the product is and what problems it solves might be more effective, but one can't have everything.) This will be the test page, and we're hoping that we can increase the click-through rate to 29 percent (an absolute increase of five percentage points). Now that we have two versions of the page to experiment with, we can frame the experiment statistically and figure out how many subjects we'll need for each version of the page in order to detect a statistically meaningful increase in the click-through rate on that button. Framing the statistics First, we need to frame our experiment in terms of the null-hypothesis test. In this case, the null hypothesis would look something like this: Changing the button copy from Learn more! to Order now! would not improve the click-through rate. Remember, this is the statement that we're hoping to disprove (although we may fail to disprove it) in the course of this experiment. Now we need to think about the sample size. This needs to be fixed in advance.
To find the sample size, we'll use the standard error formula, which will be solved to get the number of observations to make for about a 95 percent confidence interval in order to get us in the ballpark of how large our sample should be: n = 16σ²/δ². In this, n is the sample size for each version of the page, δ is the minimum effect to detect, and σ² is the sample variance. If we are testing for something like a percent increase in the click-through, the variance is σ² = p(1 – p), where p is the initial click-through rate with the control page. So for this experiment, the variance will be 0.24(1 – 0.24) or 0.1824. This would make the sample size for each variant 16(0.1824 / 0.05²) or almost 1170. The code to compute this in Clojure is fairly simple:

(defn get-target-sample [rate min-effect]
  (let [v (* rate (- 1.0 rate))]
    (* 16.0 (/ v (* min-effect min-effect)))))

Running the code from the prompt gives us the response that we expect:

user=> (get-target-sample 0.24 0.05)
1167.36

Part of the reason to calculate the number of participants needed is that monitoring the progress of the experiment and stopping it prematurely can invalidate the results of the test because it increases the risk of false positives where the experiment says it has disproved the null hypothesis when it really hasn't. This seems counterintuitive, doesn't it? Once we have significant results, we should be able to stop the test. Let's work through it. Let's say that in actuality, there's no difference between the control page and the test page. That is, both sets of copy for the button get approximately the same click-through rate. If we're attempting to get p ≤ 0.05, then it means that the test will return a false positive five percent of the time. It will incorrectly say that there is a significant difference between the click-through rates of the two buttons five percent of the time. Let's say that we're running the test and planning to get 3,000 subjects. We end up checking the results of every 1,000 participants.
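The scenario just described (3,000 planned subjects, a check after every 1,000, and no real difference between the pages) can be simulated to see how early stopping affects false positives. This is a rough sketch under stated assumptions: both pages share the 24 percent click-through rate, significance is judged with a two-sided two-proportion z-test at |z| > 1.96, and the 3,000 subjects are taken to be per page. The function names and the number of simulated experiments are arbitrary choices for illustration, not something taken from the article.

```python
import math
import random


def is_significant(clicks_a, clicks_b, n):
    """Two-sided two-proportion z-test at roughly the 5 percent level."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    std_err = math.sqrt(2 * pooled * (1 - pooled) / n)
    if std_err == 0:
        return False
    z = (clicks_a / n - clicks_b / n) / std_err
    return abs(z) > 1.96


def simulate(experiments=500, rate=0.24, checkpoints=(1000, 2000, 3000), seed=1):
    """Return false-positive rates without and with stopping at the first 'win'."""
    rng = random.Random(seed)
    final_hits = stopped_hits = 0
    for _experiment in range(experiments):
        clicks_a = clicks_b = seen = 0
        significant_at = []
        for checkpoint in checkpoints:
            # Add the visitors that arrived since the previous check.
            for _visitor in range(checkpoint - seen):
                clicks_a += rng.random() < rate
                clicks_b += rng.random() < rate
            seen = checkpoint
            significant_at.append(is_significant(clicks_a, clicks_b, seen))
        final_hits += significant_at[-1]     # look only once, at the end
        stopped_hits += any(significant_at)  # stop at the first significant check
    return final_hits / experiments, stopped_hits / experiments


final_rate, stopped_rate = simulate()
print(f"false positives, fixed sample size: {final_rate:.3f}")
print(f"false positives, stopping early:    {stopped_rate:.3f}")
```

Because an experiment that is significant at the final check also counts as significant under early stopping, the second rate can never be lower than the first; the simulation shows how much higher it gets.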
Let's break down what might happen:

Run      A    B    C    D    E    F    G    H
1000     No   No   No   No   Yes  Yes  Yes  Yes
2000     No   No   Yes  Yes  No   Yes  No   Yes
3000     No   Yes  No   Yes  No   No   Yes  Yes
Final    No   Yes  No   Yes  No   No   Yes  Yes
Stopped  No   Yes  Yes  Yes  Yes  Yes  Yes  Yes

Let's read this table. Each lettered column represents a scenario for how the significance of the results may change over the run of the test. The rows represent the number of observations that have been made. The row labeled Final represents the experiment's true finishing result, and the row labeled Stopped represents the result if the experiment is stopped as soon as a significant result is seen. The final results show us that out of eight different scenarios, the final result would be significant in four cases (B, D, G, and H). However, if the experiment is stopped prematurely, then it will be significant in seven cases (all but A). The test could drastically over-generate false positives. In fact, most statistical tests assume that the sample size is fixed before the test is run. It's exciting to get good results, so we'll design our system so that we can't easily stop it prematurely. We'll just take that temptation away. With this in mind, let's consider how we can implement this test. Building the experiment There are several options to actually implement the A/B test. We'll consider several of them and weigh their pros and cons. Ultimately, the option that works best for you really depends on your circumstances. However, we'll pick one for this article and use it to implement the test for it. Looking at options to build the site The first way to implement A/B testing is to use a server-side implementation. In this case, all of the processing and tracking is handled on the server, and visitors' actions would be tracked using GET or POST parameters on the URL for the resource that the experiment is attempting to drive traffic towards.
The steps for this process would go something like the following ones: A new user visits the site and requests for the page that contains the button or copy that is being tested. The server recognizes that this is a new user and assigns the user a tracking number. It assigns the user to one of the test groups. It adds a row in a database that contains the tracking number and the test group that the user is part of. It returns the page to the user with the copy, image, or design that is reflective of the control or test group. The user views the returned page and decides whether to click on the button or link or not. If the server receives a request for the button's or link's target, it updates the user's row in the tracking table to show us that the interaction was a success, that is, that the user did a click-through or made a purchase. This way of handling it keeps everything on the server, so it allows more control and configuration over exactly how you want to conduct your experiment. A second way of implementing this would be to do everything using JavaScript (or ClojureScript, https://github.com/clojure/clojurescript). In this scenario, the code on the page itself would randomly decide whether the user belonged to the control or the test group, and it would notify the server that a new observation in the experiment was beginning. It would then update the page with the appropriate copy or image. Most of the rest of this interaction is the same as the one in previous scenario. However, the complete steps are as follows: A new user visits the site and requests for the page that contains the button or copy being tested. The server inserts some JavaScript to handle the A/B test into the page. As the page is being rendered, the JavaScript library generates a new tracking number for the user. It assigns the user to one of the test groups. It renders that page for the group that the user belongs to, which is either the control group or the test group. 
It notifies the server of the user's tracking number and the group. The server takes this notification and adds a row for the observation in the database. The JavaScript in the browser tracks the user's next move either by directly notifying the server using an AJAX call or indirectly using a GET parameter in the URL for the next page. The server receives the notification whichever way it's sent and updates the row in the database. The downside of this is that having JavaScript take care of rendering the experiment might take slightly longer and may throw off the experiment. It's also slightly more complicated, because there are more parts that have to communicate. However, the benefit is that you can create a JavaScript library, easily throw a small script tag into the page, and immediately have a new A/B experiment running. In reality, though, you'll probably just use a service that handles this and more for you. However, it still makes sense to understand what these services provide, and that's what this article tries to do: by helping you understand how to perform an A/B test, it lets you make better use of A/B testing vendors and services.
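The server-side steps listed earlier can be sketched with an in-memory table standing in for the database. This is an illustration only: the class ExperimentTracker, its method names, and the button copy dictionary are hypothetical, and a real implementation would persist the rows in a database and sit behind a web framework.

```python
import itertools
import random


class ExperimentTracker:
    """In-memory stand-in for the tracking table (hypothetical sketch)."""

    COPY = {"control": "Learn more!", "test": "Order now!"}

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._next_id = itertools.count(1)
        self.rows = {}  # tracking number -> {"group": ..., "success": ...}

    def new_visitor(self):
        """Assign a tracking number and a group, and record the observation."""
        tracking_number = next(self._next_id)
        group = self._rng.choice(["control", "test"])
        self.rows[tracking_number] = {"group": group, "success": False}
        # The caller renders the page with this group's button copy.
        return tracking_number, self.COPY[group]

    def record_click(self, tracking_number):
        """Mark the visitor's row as a success when the target is requested."""
        self.rows[tracking_number]["success"] = True

    def click_through_rates(self):
        """Return the observed click-through rate per group."""
        totals = {"control": [0, 0], "test": [0, 0]}  # [successes, visitors]
        for row in self.rows.values():
            totals[row["group"]][0] += row["success"]
            totals[row["group"]][1] += 1
        return {g: (s / n if n else 0.0) for g, (s, n) in totals.items()}


tracker = ExperimentTracker(seed=42)
visitor, button_text = tracker.new_visitor()
tracker.record_click(visitor)
print(button_text, tracker.click_through_rates())
```

The client-side variant would move the group assignment into the browser and call something like record_click from an AJAX handler, but the rows being tracked stay the same.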

Packt
20 May 2014
14 min read

Data Warehouse Design

(For more resources related to this topic, see here.) Most companies are establishing or planning to establish a Business Intelligence system and a data warehouse (DW). Knowledge of BI and data warehousing is in great demand in the job market. This article gives you an understanding of what Business Intelligence and a data warehouse are, what the main components of a BI system are, and what the steps to create the data warehouse are. This article focuses on the design of the data warehouse, which is the core of a BI system. A data warehouse is a database designed for analysis, and this definition indicates that designing a data warehouse is different from modeling a transactional database. Designing the data warehouse is also called dimensional modeling. In this article, you will learn about the concepts of dimensional modeling. Understanding Business Intelligence Based on Gartner's definition (http://www.gartner.com/it-glossary/business-intelligence-bi/), Business Intelligence is defined as follows: Business Intelligence is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance. As the definition states, the main purpose of a BI system is to help decision makers to make proper decisions based on the results of data analysis provided by the BI system. Nowadays, there are many operational systems in each industry. Businesses use multiple operational systems to simplify, standardize, and automate their everyday jobs and requirements. Each of these systems may have its own database, some of which may work with SQL Server, some with Oracle. Some of the legacy systems may work with legacy databases or even file operations. There are also systems that work through the Web via web services and XML.
Operational systems are very useful in helping with day-to-day business operations such as the process of hiring a person in the human resources department, and sale operations through a retail store and handling financial transactions. The rising number of operational systems also adds another requirement, which is the integration of systems together. Business owners and decision makers not only need integrated data but also require an analysis of the integrated data. As an example, it is a common requirement for the decision makers of an organization to compare their hiring rate with the level of service provided by a business and the customer satisfaction based on that level of service. As you can see, this requirement deals with multiple operational systems such as CRM and human resources. The requirement might also need some data from sales and inventory if the decision makers want to bring sales and inventory factors into their decisions. As a supermarket owner or decision maker, it would be very important to understand what products in which branches were in higher demand. This kind of information helps you to provide enough products to cover demand, and you may even think about creating another branch in some regions. The requirement of integrating multiple operational systems together in order to create consolidated reports and dashboards that help decision makers to make a proper decision is the main directive for Business Intelligence. Some organizations and businesses use ERP systems that are integrated, so a question may appear in your mind that there won't be a requirement for integrating data because consolidated reports can be produced easily from these systems. So does that mean that these systems still require a BI solution? The answer in most cases is yes. The companies or businesses might not require a separate BI system for internal and parts of the operations that implemented it through ERP. 
However, they might require getting some data from outside, for example, getting some data from another vendor's web service or many other protocols and channels to send and receive information. This indicates that there would be a requirement for consolidated analysis for such information, which brings the BI requirement back to the table. The architecture and components of a BI system After understanding what the BI system is, it's time to discover more about its components and understand how these components work with each other. There are also some BI tools that help to implement one or more components. The following diagram shows an illustration of the architecture and main components of the Business Intelligence system: The BI architecture and components differ based on the tools, environment, and so on. The architecture shown in the preceding diagram contains components that are common in most of the BI systems. In the following sections, you will learn more about each component. The data warehouse The data warehouse is the core of the BI system. A data warehouse is a database built for the purpose of data analysis and reporting. This purpose changes the design of this database as well. As you know, operational databases are built on normalization standards, which are efficient for transactional systems, for example, to reduce redundancy. As you probably know, a 3NF-designed database for a sales system contains many tables related to each other. So, for example, a report on sales information may consume more than 10 joined conditions, which slows down the response time of the query and report. A data warehouse comes with a new design that reduces the response time and increases the performance of queries for reports and analytics. You will learn more about the design of a data warehouse (which is called dimensional modeling) later in this article. 
Extract Transform Load It is very likely that more than one system acts as the source of data required for the BI system. So there is a requirement for data consolidation that extracts data from different sources and transforms it into the shape that fits into the data warehouse, and finally, loads it into the data warehouse; this process is called Extract Transform Load (ETL). There are many challenges in the ETL process, out of which some will be revealed (conceptually) later in this article. As the definition states, ETL is not just a data integration phase. Let's discover more about it with an example; in an operational sales database, you may have dozens of tables that provide sales transaction data. When you design that sales data into your data warehouse, you can denormalize it and build one or two tables for it. So, the ETL process should extract data from the sales database and transform it (combine, match, and so on) to fit it into the model of data warehouse tables. There are some ETL tools in the market that perform the extract, transform, and load operations. The Microsoft solution for ETL is SQL Server Integration Services (SSIS), which is one of the best ETL tools in the market. SSIS can connect to multiple data sources such as Oracle, DB2, Text Files, XML, Web services, SQL Server, and so on. SSIS also has many built-in transformations to transform the data as required. Data model – BISM A data warehouse is designed to be the source of analysis and reports, so it works much faster than operational systems for producing reports. However, a DW is not fast enough to cover all requirements because it is still a relational database, and databases have many constraints that slow down the response time of a query. The requirement for faster processing and a lower response time on the one hand, and for aggregated information on the other, leads to the creation of another layer in BI systems.
This layer, which we call the data model, contains a file-based or memory-based model of the data for producing very quick responses to reports. Microsoft's solution for the data model is split into two technologies: the OLAP cube and the In-memory tabular model. The OLAP cube is a file-based data storage that loads data from a data warehouse into a cube model. The cube contains descriptive information as dimensions (for example, customer and product) and cells (for example, facts and measures, such as sales and discount). The following diagram shows a sample OLAP cube: In the preceding diagram, the illustrated cube has three dimensions: Product, Customer, and Time. Each cell in the cube shows a junction of these three dimensions. For example, if we store the sales amount in each cell, then the green cell shows that Devin paid 23$ for a Hat on June 5. Aggregated data can be fetched easily as well within the cube structure. For example, the orange set of cells shows how much Mark paid on June 1 for all products. As you can see, the cube structure makes it easier and faster to access the required information. Microsoft SQL Server Analysis Services 2012 comes with two different types of modeling: multidimensional and tabular. Multidimensional modeling is based on the OLAP cube and is fitted with measures and dimensions, as you can see in the preceding diagram. The tabular model is based on a new In-memory engine for tables. The In-memory engine loads all data rows from tables into the memory and responds to queries directly from the memory. This is very fast in terms of the response time. The BI semantic model (BISM) provided by Microsoft is a combination of SSAS Tabular and Multidimensional solutions. Data visualization The frontend of a BI system is data visualization. In other words, data visualization is a part of the BI system that users can see. 
There are different methods for visualizing information, such as strategic and tactical dashboards, Key Performance Indicators (KPIs), and detailed or consolidated reports. As you probably know, there are many reporting and visualizing tools on the market. Microsoft has provided a set of visualization tools to cover dashboards, KPIs, scorecards, and reports required in a BI application. PerformancePoint, as part of Microsoft SharePoint, is a dashboard tool that performs best when connected to SSAS Multidimensional OLAP cube. Microsoft's SQL Server Reporting Services (SSRS) is a great reporting tool for creating detailed and consolidated reports. Excel is also a great slicing and dicing tool especially for power users. There are also components in Excel such as Power View, which are designed to build performance dashboards. Master Data Management Every organization has a part of its business that is common between different systems. That part of the data in the business can be managed and maintained as master data. For example, an organization may receive customer information from an online web application form or from a retail store's spreadsheets, or based on a web service provided by other vendors. Master Data Management (MDM) is the process of maintaining the single version of truth for master data entities through multiple systems. Microsoft's solution for MDM is Master Data Services (MDS). Master data can be stored in the MDS entities and it can be maintained and changed through the MDS Web UI or Excel UI. Other systems such as CRM, AX, and even DW can be subscribers of the master data entities. Even if one or more systems are able to change the master data, they can write back their changes into MDS through the staging architecture. Data Quality Services The quality of data is different in each operational system, especially when we deal with legacy systems or systems that have a high dependence on user inputs. 
As the BI system is based on data, the better the quality of data, the better the output of the BI solution. Because of this fact, working on data quality is one of the components of the BI systems. As an example, Auckland might be written as "Auckland" in some Excel files or be typed as "Aukland" by the user in the input form. As a solution to improve the quality of data, Microsoft provided users with DQS. DQS works based on Knowledge Base domains, which means a Knowledge Base can be created for different domains, and the Knowledge Base will be maintained and improved by a data steward as time passes. There are also matching policies that can be used to apply standardization on the data. Building the data warehouse A data warehouse is a database built for analysis and reporting. In other words, a data warehouse is a database in which the only data entry point is through ETL, and its primary purpose is to cover reporting and data analysis requirements. This definition clarifies that a data warehouse is not like other transactional databases that operational systems write data into. When there is no operational system that works directly with a data warehouse, and when the main purpose of this database is for reporting, then the design of the data warehouse will be different from that of transactional databases. If you recall from the database normalization concepts, the main purpose of normalization is to reduce the redundancy and dependency. The following table shows customers' data with their geographical information:

Customer
First Name  Last Name  Suburb       City      State     Country
Devin       Batler     Remuera      Auckland  Auckland  New Zealand
Peter       Blade      Remuera      Auckland  Auckland  New Zealand
Lance       Martin     City Center  Sydney    NSW       Australia

Let's elaborate on this example. As you can see from the preceding table, the geographical information in the records is redundant. This redundancy makes it difficult to apply changes.
For example, in this structure, if Remuera is, for any reason, no longer part of Auckland city, then the change must be applied to every record that has Remuera as its suburb.

[Screenshot: the tables of geographical information]

So, a normalized approach is to extract the geographical information from the customer table and put it into another table. The customer table then only holds a key that points to that table. In this way, every time the value Remuera changes, only one record in the geographical region table changes and the key number remains unchanged. So, you can see that normalization is highly efficient in transactional systems.

This normalization approach is not that effective on analytical databases. If you consider a sales database with many tables related to each other and normalized at least up to the third normal form (3NF), then analytical queries on such a database may require more than 10 join conditions, which slows down the query response. In other words, from the point of view of reporting, it would be better to denormalize data and flatten it in order to make it as easy to query as possible. This means the first design in the preceding table might be better for reporting.

However, the query and reporting requirements are not that simple, and the business domains in the database are not as small as two or three tables. So real-world problems can be solved with a special design method for the data warehouse called dimensional modeling. There are two well-known methods for designing the data warehouse: the Kimball and Inmon methodologies, each named after its creator. Both of these methods are in use nowadays. The main difference between them is that Inmon is top-down and Kimball is bottom-up. In this article, we will explain the Kimball method. You can read more about the Inmon methodology in Building the Data Warehouse, William H. Inmon, Wiley (http://www.amazon.com/Building-Data-Warehouse-W-Inmon/dp/0764599445), and about the Kimball methodology in The Data Warehouse Toolkit, Ralph Kimball, Wiley (http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/0471200247). Both books are must-reads for BI and DW professionals and are recommended references for the bookshelf of every BI team. This article draws on The Data Warehouse Toolkit, so for a detailed discussion, read that book.

Dimensional modeling

To gain an understanding of data warehouse design and dimensional modeling, it's better to learn about the components and terminology of a DW. A DW consists of Fact tables and dimensions. The relationship between a Fact table and its dimensions is based on the foreign key and primary key (the primary key of the dimension table is referenced in the fact table as a foreign key).

Summary

This article explains the first steps in thinking about and designing a BI system. As the first step, a developer needs to design the data warehouse (DW) and needs an understanding of the key concepts and methodologies used to create it.
Packt
25 Apr 2014
11 min read

Backup and Restore Improvements

Database backups to a URL and Microsoft Azure Storage

The ability to back up to a URL was introduced in SQL Server 2012 Service Pack 1 cumulative update package 2. In SQL Server 2012, backing up to a URL required Transact-SQL or PowerShell; SQL Server 2014 has integrated this option into Management Studio too. The reason for allowing backups to a URL is to let you integrate your SQL Server backups with cloud-based storage and store them in Microsoft Azure. By creating a backup there, you can keep database backups of your on-premise database in Microsoft Azure. Because the backups are stored offsite, they remain protected if your main site is lost to a disaster. This can avoid the need for an actual disaster recovery site.

In order to create a backup to Microsoft Azure Storage, you need a storage account and a storage container. From a SQL Server perspective, you will require a URL, which specifies a Uniform Resource Identifier (URI) to a unique backup file in Microsoft Azure. It is the URL that provides the location for the backup and the backup filename. The URL needs to point to a blob, not just a container. If the blob does not exist, it is created. If a backup file already exists at that URL, the backup fails unless the WITH FORMAT option is specified, which, as in older versions of SQL Server, allows the new backup to overwrite the existing one.

You will also need to create a SQL Server credential to allow SQL Server to authenticate with Microsoft Azure Storage. This credential stores the name of the storage account and the access key. The WITH CREDENTIAL option must be used when issuing the backup or restore commands.
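The pieces described above (a URL that names a blob inside a container, a credential, and an optional FORMAT to overwrite) can be sketched as a small Python helper that only assembles the statement text. This is an illustrative sketch, not part of SQL Server or any Microsoft tool: the function names `build_backup_url` and `build_backup_statement` are hypothetical, and the URL pattern is assumed from the example URL used later in this article.

```python
def build_backup_url(storage_account: str, container: str, blob_name: str) -> str:
    """Compose a blob URL in the shape used by this article's examples.

    Assumption: the host follows <account>.blob.core.windows.net, as in the
    article's sample URL. The URL names a blob, not just the container.
    """
    return f"https://{storage_account}.blob.core.windows.net/{container}/{blob_name}"


def build_backup_statement(database: str, url: str, credential: str,
                           overwrite: bool = False) -> str:
    """Return the text of a BACKUP DATABASE ... TO URL statement.

    Hypothetical helper: it only builds a string and does not connect to
    SQL Server. FORMAT is added when an existing backup should be overwritten.
    """
    def quote(value: str) -> str:
        # Double any single quotes so the value is a valid T-SQL string literal.
        return "N'" + value.replace("'", "''") + "'"

    # Bracket-quote the database name, escaping any closing bracket.
    db = "[" + database.replace("]", "]]") + "]"

    options = [f"CREDENTIAL = {quote(credential)}"]
    if overwrite:
        options.append("FORMAT")
    options.append("STATS = 10")
    return f"BACKUP DATABASE {db} TO URL = {quote(url)} WITH " + ", ".join(options)


if __name__ == "__main__":
    url = build_backup_url("gresqlstorage", "sqlbackup", "t3.bak")
    print(build_backup_statement("T3", url, "AzureCredential"))
```

Keeping the URL and the statement in separate functions makes the one-backup-per-blob rule easy to respect: each backup gets its own blob name, and overwriting is an explicit choice rather than a default.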
There are some limitations you need to consider when backing up your database to a URL and using Microsoft Azure Storage to store your database backups:
  • Maximum backup size of 1 TB (terabyte).
  • Cannot be combined with backup devices.
  • Cannot append to existing backups. In SQL Server, you can have more than one backup stored in a file. When taking a backup to a URL, the ratio should be one backup to one file.
  • You cannot back up to multiple blobs. In a normal SQL Server backup, you can stripe it across multiple files. You cannot do this with a backup to a URL on Microsoft Azure.
You can find more information on these limitations at http://msdn.microsoft.com/en-us/library/dn435916(v=sql.120).aspx#backuptaskssms.

For the purposes of this exercise, I have created a new container called sqlbackup on my Microsoft Azure Storage account. With the storage account container in place, you will now take the backup to a URL. As part of this process, you will create a credential using your Microsoft Azure publishing profile. This is slightly different from the process we just discussed, but you can download this profile from Microsoft Azure. Once you have your publishing profile, you can follow the steps explained in the following section.

Backing up a SQL Server database to a URL

You can use Management Studio's backup task to initiate the backup. In order to do this, you need to start Management Studio and connect to your local SQL Server instance. You will notice that I have a database called T3, and it is this database that I will be backing up to the URL as follows:

Right-click on the database you want to back up and navigate to Tasks | Backup. This will start the backup task wizard for you.

On the General page, you should change the backup destination from Disk to URL. Making this change will enable all the other options needed for taking a backup to a URL.
You will need to provide a filename for your backup, then create the SQL Server credential you want to use to authenticate on the Windows Azure Storage container.

Click on the Create Credential button to open the Create credential dialog box. There is an option to use your publishing profile, so click on the Browse button and select the publishing profile that you downloaded from the Microsoft Azure web portal. Once you have selected your publishing profile, it will prepopulate the credential name, management certificate, and subscription ID fields for you. Choose the appropriate Storage Account for your backups.

Following this, you should then click on Create to create the credential.

You will need to specify the Windows Azure Storage container to use for the backup. In this case, I entered sqlbackup. When you have finished, your General page should look like what is shown in the following screenshot:

Following this, click on OK and the backup should run.

If you want to use Transact-SQL, instead of Management Studio, to take the backup, the code would look like this:

BACKUP DATABASE [T3]
TO URL = N'https://gresqlstorage.blob.core.windows.net/sqlbackup/t3.bak'
WITH CREDENTIAL = N'AzureCredential',
    NOFORMAT, NOINIT,
    NAME = N'T3-Full Database Backup',
    NOSKIP, NOREWIND, NOUNLOAD, STATS = 10
GO

This is a normal backup database statement, as it has always been, but it specifies a URL and a credential to use to take the backup as well.

Restoring a backup stored on Windows Azure Storage

In this section, you will learn how to restore a database using the backup you have stored on Windows Azure Storage:

To carry out the restore, connect to your local instance of SQL Server in Management Studio, right-click on the databases folder, and choose the Restore database option. This will open the database restore pages.

In the Source section of the General page, select the Device option, click on the dropdown and change the backup media type to URL, and click on Add.
In the next screen, you have to specify the Windows Azure Storage account connection information. You will need to choose the storage account to connect to and specify an access key to allow SQL Server to connect to Microsoft Azure. You can get this from the Storage section of the Microsoft Azure portal.

After this, you will need to specify a credential to use. In this case, I will use the credential that was created when I took the backup earlier.

Click on Connect to connect to Microsoft Azure. You will then need to choose the backup to restore from. In this case, I'll use the backup of the T3 database that was created in the preceding section.

You can then complete the restore options as you would with a local backup. In this case, the database has been called T3_cloud, mainly for reference so that it can be easily identified. If you want to restore over the existing database, you need to use the WITH REPLACE option in the restore statement. The restore statement would look like this:

RESTORE DATABASE t3
FROM URL = N'https://gresqlstorage.blob.core.windows.net/sqlbackup/t3.bak'
WITH CREDENTIAL = N'AzureCredential',
    REPLACE,
    STATS = 5

When the restore has been completed, you will have a new copy of the database on the local SQL Server instance.

SQL Server Managed Backup to Microsoft Azure

Building on the ability to take a backup of a SQL Server database to a URL and Microsoft Azure Storage, you can now set up Managed Backups of your SQL Server databases to Microsoft Azure. This feature allows you to automate your database backups to Microsoft Azure Storage. Database administrators appreciate automation because it frees their time to focus on other projects. The feature is customizable: you can build your backup strategy around the transaction workload of your database and set a retention policy.
Configuring SQL Server-managed backups to Microsoft Azure

In order to set up and configure Managed Backups in SQL Server 2014, a new stored procedure has been introduced to configure Managed Backups on a specific database. The stored procedure is called smart_admin.sp_set_db_backup. The syntax for the stored procedure is as follows:

EXEC smart_admin.sp_set_db_backup
    [@database_name = ] 'database name'
    ,[@enable_backup = ] { 0 | 1}
    ,[@storage_url = ] 'storage url'
    ,[@retention_days = ] 'retention_period_in_days'
    ,[@credential_name = ] 'sql_credential_name'
    ,[@encryption_algorithm] 'name of the encryption algorithm'
    ,[@encryptor_type] {'CERTIFICATE' | 'ASYMMETRIC_KEY'}
    ,[@encryptor_name] 'name of the certificate or asymmetric key'

This stored procedure will be used to set up Managed Backups on the T3 database. The SQL Server Agent will need to be running for this to work. In my case, I executed the following code to enable Managed Backups on my T3 database:

USE msdb;
GO
EXEC smart_admin.sp_set_db_backup
    @database_name = 'T3'
    ,@enable_backup = 1
    ,@storage_url = 'https://gresqlstorage.blob.core.windows.net/'
    ,@retention_days = 5
    ,@credential_name = 'AzureCredential'
    ,@encryption_algorithm = 'NO_ENCRYPTION'

To view the Managed Backup information, you can run the following query:

USE msdb
GO
SELECT * FROM smart_admin.fn_backup_db_config('T3')

The results should look like this:

To disable the Managed Backup, you can use the same smart_admin.sp_set_db_backup procedure:

USE msdb;
GO
EXEC smart_admin.sp_set_db_backup
    @database_name = 'T3'
    ,@enable_backup = 0

Encryption

For the first time in SQL Server, you can encrypt your backups using the native SQL Server backup tool. In SQL Server 2014, the backup tool supports several encryption algorithms, including AES 128, AES 192, AES 256, and Triple DES. You will need a certificate or an asymmetric key when taking encrypted backups.
There are a number of benefits to encrypting your SQL Server database backups, including securing the data in the database. This can also be very useful if you are using transparent data encryption (TDE) to protect your database's data files. Encryption is also supported when using SQL Server Managed Backup to Microsoft Azure.

Creating an encrypted backup

To create an encrypted SQL Server backup, there are a few prerequisites that you need to ensure are set up on the SQL Server.

Creating a database master key for the master database

Creating the database master key is important because it is used to protect the private keys of the certificates and the asymmetric keys that are stored in the master database, which will be used to encrypt the SQL Server backup. The following Transact-SQL will create a database master key for the master database:

USE master;
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'P@$$W0rd';
GO

In this example, a simple password has been used. In a production environment, it would be advisable to create a master key with a more secure password.

Creating a certificate or asymmetric key

The backup encryption process needs a certificate or asymmetric key to be able to take the backup. The following code creates a certificate that can be used to back up your databases using encryption:

USE master
GO
CREATE CERTIFICATE T3DBBackupCertificate
    WITH SUBJECT = 'T3 Backup Encryption Certificate';
GO

Creating an encrypted database backup

You can now take an encrypted backup of your databases. The following Transact-SQL statements back up the T3 database using the certificate you created in the preceding section:

BACKUP DATABASE t3
TO DISK = N'C:\Backup\t3_enc.bak'
WITH COMPRESSION,
    ENCRYPTION (
        ALGORITHM = AES_256,
        SERVER CERTIFICATE = T3DBBackupCertificate
    ),
    STATS = 10
GO

This is a local backup; it's located in the C:\Backup folder, and the encryption algorithm used is AES_256.
Summary

This article has shown some of the new backup features of SQL Server 2014. The ability to back up to Microsoft Azure Storage means that you can implement a robust backup and restore strategy at a relatively low cost.

Packt
22 Apr 2014
7 min read

Using cross-validation

Cross-validation is a common validation technique that can be used to evaluate machine learning models. It essentially measures how well the estimated model will generalize to some given data. This data is different from the training data supplied to our model, and is called the cross-validation set, or simply the validation set, of our model. Cross-validation of a given model is also called rotation estimation.

If an estimated model performs well during cross-validation, we can assume that the model has captured the relationship between its independent and dependent variables. The goal of cross-validation is to provide a test to determine whether a formulated model is overfit on the training data. From the perspective of implementation, cross-validation is a kind of unit test for a machine learning system.

A single round of cross-validation generally involves partitioning all the available sample data into two subsets and then performing training on one subset and validation and/or testing on the other subset. Several such rounds, or folds, of cross-validation must be performed using different sets of data to reduce the variance of the overall cross-validation error of the given model. Any particular measure of the cross-validation error should be calculated as the average of this error over the different folds.

There are several types of cross-validation we can implement as a diagnostic for a given machine learning model or system. Let's briefly explore a few of them as follows:

A common type is k-fold cross-validation, in which we partition the available sample data into k equal subsets. In each fold, the model is trained on k - 1 of the subsets and cross-validation is performed on the single remaining subset.

A simple variation of k-fold cross-validation is 2-fold cross-validation, which is also called the holdout method.
In 2-fold cross-validation, the training and cross-validation subsets of data will be almost equal in proportion.

Repeated random subsampling is another simple variant of cross-validation in which the sample data is first randomized or shuffled and then used as training and cross-validation data. This method is notably not dependent on the number of folds used for cross-validation.

Another form of k-fold cross-validation is leave-one-out cross-validation, in which only a single record from the available sample data is used for cross-validation. Leave-one-out cross-validation is essentially k-fold cross-validation in which k is equal to the number of samples or observations in the sample data.

Cross-validation basically treats the estimated model as a black box, that is, it makes no assumptions about the implementation of the model. We can also use cross-validation to select features in a given model by determining the feature set that produces the best-fit model over the given sample data. Of course, cross-validation has a couple of limitations, which can be summarized as follows:

If a given model needs to perform feature selection internally, we must perform cross-validation for each selected feature set in the given model. This can be computationally expensive depending on the amount of available sample data.

Cross-validation is not very useful if the sample data comprises exactly or nearly equal samples.

In summary, it's good practice to implement cross-validation for any machine learning system that we build. We can choose an appropriate cross-validation technique depending on the problem we are trying to model as well as the nature of the collected sample data.
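To make the partitioning concrete, here is a minimal sketch of k-fold splitting written in plain Python rather than Clojure, so that it does not depend on clj-ml. The function names `k_fold_splits` and `cross_validation_error` are hypothetical and are not part of any library used in this article; the sketch only produces index lists and averages per-fold errors.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Return a list of (training_indices, validation_indices), one per fold.

    Each sample index appears in exactly one validation subset. Setting
    k equal to n_samples gives leave-one-out cross-validation.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")

    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the partition is reproducible.
    random.Random(seed).shuffle(indices)

    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    splits = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        splits.append((training, validation))
        start += size
    return splits


def cross_validation_error(fold_errors):
    """Average the error measured on each fold, as described above."""
    return sum(fold_errors) / len(fold_errors)


if __name__ == "__main__":
    for training, validation in k_fold_splits(n_samples=10, k=5):
        print(len(training), "training samples; validation on", sorted(validation))
```

Because the splitter only deals in indices, it treats the model as a black box in the same way the text describes: any training and scoring routine can be plugged in per fold.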
For the example that will follow, the namespace declaration should look similar to the following declaration:

(ns my-namespace
  (:use [clj-ml classifiers data]))

We can use the clj-ml library to cross-validate the classifier we built for the fish packaging plant. Essentially, we built a classifier to determine whether a fish is a salmon or a sea bass using the clj-ml library. To recap, a fish is represented as a vector containing the category of the fish and values for the various features of the fish. The attributes of a fish are its length, width, and lightness of skin. We also described a template for a sample fish, which is defined as follows:

(def fish-template
  [{:category [:salmon :sea-bass]}
   :length :width :lightness])

The fish-template vector defined in the preceding code can be used to train a classifier with some sample data. For now, we will not bother about which classification algorithm we have used to model the given training data. We simply assume that the classifier was created using the make-classifier function from the clj-ml library. This classifier is stored in the *classifier* variable as follows:

(def *classifier* (make-classifier ...))

Suppose the classifier was trained with some sample data. We must now evaluate this trained classification model. To do this, we must first create some sample data to cross-validate. For the sake of simplicity, we will use randomly generated data in this example. We can generate this data using the make-sample-fish function. This function simply creates a new vector of some random values representing a fish.
Of course, we must not forget that the make-sample-fish function has an in-built partiality, so we create a meaningful pattern in a number of samples created using this function as follows:

(def fish-cv-data
  (for [i (range 3000)] (make-sample-fish)))

We will need to use a dataset from the clj-ml library, and we can create one using the make-dataset function, as shown in the following code:

(def fish-cv-dataset
  (make-dataset "fish-cv" fish-template fish-cv-data))

To cross-validate the classifier, we must use the classifier-evaluate function from the clj-ml.classifiers namespace. This function essentially performs k-fold cross-validation on the given data. Other than the classifier and the cross-validation dataset, this function requires the number of folds that we must perform on the data to be specified as the last parameter. Also, we will first need to set the class field of the records in fish-cv-dataset using the dataset-set-class function. We can define a single function to perform these operations as follows:

(defn cv-classifier [folds]
  (dataset-set-class fish-cv-dataset 0)
  (classifier-evaluate *classifier* :cross-validation
                       fish-cv-dataset folds))

We will use 10 folds of cross-validation on the classifier. Since the classifier-evaluate function returns a map, we bind this return value to a variable for further use, as follows:

user> (def cv (cv-classifier 10))
#'user/cv

We can fetch and print the summary of the preceding cross-validation using the :summary keyword as follows:

user> (print (:summary cv))

Correctly Classified Instances        2986               99.5333 %
Incorrectly Classified Instances        14                0.4667 %
Kappa statistic                          0.9888
Mean absolute error                      0.0093
Root mean squared error                  0.0681
Relative absolute error                  2.2248 %
Root relative squared error             14.9238 %
Total Number of Instances             3000
nil

As shown in the preceding code, we can view several statistical measures of performance for our trained classifier.
Apart from the correctly and incorrectly classified records, this summary also describes the Root Mean Squared Error (RMSE) and several other measures of error in our classifier. For a more detailed view of the correctly and incorrectly classified instances in the classifier, we can print the confusion matrix of the cross-validation using the :confusion-matrix keyword, as shown in the following code:

user> (print (:confusion-matrix cv))
=== Confusion Matrix ===

    a    b   <-- classified as
 2129    0 |    a = salmon
    9  862 |    b = sea-bass
nil

As shown in the preceding example, we can use the clj-ml library's classifier-evaluate function to perform a k-fold cross-validation on any given classifier. Although we are restricted to using classifiers from the clj-ml library when using the classifier-evaluate function, we must strive to implement similar diagnostics in any machine learning system we build.
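As a worked example of reading a confusion matrix, the short Python sketch below derives the overall accuracy and the per-class recall from the four counts printed above. The helper names `accuracy` and `recall_per_class` are hypothetical and are not part of clj-ml. Note that the matrix as printed sums to 2,991 correct instances, slightly different from the 2,986 in the summary listing, so the two listings appear to come from separate runs on randomly generated data.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal (predicted class == actual class)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """For each actual class (row), the fraction of its instances classified correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


# Rows are actual classes, columns are predicted classes, in the order
# (salmon, sea-bass), copied from the confusion matrix printed above.
confusion = [
    [2129, 0],
    [9, 862],
]

if __name__ == "__main__":
    print("accuracy:", accuracy(confusion))
    print("recall per class:", recall_per_class(confusion))
```

Per-class recall is worth checking alongside accuracy here because the classes are imbalanced: all nine errors in this matrix are sea bass classified as salmon.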