How-To Tutorials - Data

1229 Articles
Mohammad Pezeshki
27 Sep 2016
5 min read

Deep Learning and Image generation: Get Started with Generative Adversarial Networks

In machine learning, a generative model is one that captures the observable data distribution. The objective of deep neural generative models is to disentangle the different factors of variation in the data and to generate new, similar-looking samples. For example, an ideal generative model of face images disentangles factors of variation such as illumination, pose, gender, and skin color, and can also generate a new face by combining those factors in a highly non-linear way. Figure 1 shows a trained generative model that has learned different factors, including pose and the degree of smiling. Moving right along the x-axis changes the pose, and moving up the y-axis turns smiles into frowns. Usually these factors are orthogonal to one another, meaning that changing one while keeping the others fixed leads to a single change in data space; for example, in the first row of Figure 1, only the pose changes, with no change in the degree of smiling. The figure is adapted from here.

Based on the assumption that these underlying factors of variation have a very simple distribution (unlike the data itself), we can generate a new face by simply sampling a random vector from the assumed simple distribution (such as a Gaussian). In other words, if there are k different factors, we randomly sample from a k-dimensional Gaussian distribution (also known as noise).

In this post, we will take a look at one of the recent models in the area of deep learning and generative models, called the generative adversarial network (GAN). This model can be seen as a game between two agents: the generator and the discriminator. The generator generates images from noise, and the discriminator discriminates between real images and images produced by the generator. The objective is to train the model such that, while the discriminator tries to distinguish generated images from real images, the generator tries to fool the discriminator.
To train the model, we need to define a cost. In the case of a GAN, the errors made by the discriminator are considered as the cost. Consequently, the objective of the discriminator is to minimize the cost, while the objective of the generator is to fool the discriminator by maximizing the cost. A graphical illustration of the model is shown in Figure 2.

Formally, we define the discriminator as a binary classifier $D : \mathbb{R}^m \to \{0, 1\}$ and the generator as the mapping $G : \mathbb{R}^k \to \mathbb{R}^m$, in which $k$ is the dimension of the latent space that represents all of the factors of variation. Denoting the data by $x$ and a point in latent space by $z$, the model can be trained by playing the following minimax game:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$

Maximizing this value over $D$ corresponds to minimizing the discriminator's classification cost. Note that the first term encourages the discriminator to discriminate between generated images and real ones, while the second term encourages the generator to come up with images that would fool the discriminator. In practice, the log in the second term can saturate, which would hurt the flow of the gradient. As a result, the cost for the generator may be reformulated equivalently as:

$$\max_G \; \mathbb{E}_{z \sim p_z(z)}\big[\log D(G(z))\big]$$

At the time of generation, we can sample from a simple k-dimensional Gaussian distribution with zero mean and unit variance and pass the sample to the generator. Among the different models that can be used as the discriminator and generator, we use deep neural networks with parameters $\theta_D$ and $\theta_G$ for the discriminator and generator, respectively. Since the training boils down to updating the parameters using the backpropagation algorithm, the update rule is defined as follows, where $\eta$ is the learning rate:

$$\theta_D \leftarrow \theta_D + \eta \, \nabla_{\theta_D} \Big( \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] \Big)$$

$$\theta_G \leftarrow \theta_G + \eta \, \nabla_{\theta_G} \, \mathbb{E}_{z \sim p_z(z)}\big[\log D(G(z))\big]$$

If we use a convolutional network as the discriminator and another convolutional network with fractionally strided convolution layers as the generator, the model is called DCGAN (Deep Convolutional Generative Adversarial Network). Some samples of bedroom image generation from this model are shown in Figure 3.
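The saturation issue mentioned above can be made concrete with a small numerical sketch. It assumes the discriminator ends in a sigmoid applied to a logit, which the article does not state; the function names are illustrative and not taken from any GAN library. The sketch evaluates the batch estimate of the game value and compares the gradient that each generator objective receives when the discriminator confidently rejects a generated sample.

```python
import numpy as np


def sigmoid(a):
    """Logistic function mapping a discriminator logit to a probability."""
    return 1.0 / (1.0 + np.exp(-np.asarray(a, dtype=float)))


def game_value(real_logits, fake_logits):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    d_real = sigmoid(real_logits)
    d_fake = sigmoid(fake_logits)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))


def generator_signal(fake_logit):
    """Derivatives of both generator objectives w.r.t. the discriminator logit.

    saturating:      d/da log(1 - sigmoid(a)) = -sigmoid(a)
    non-saturating:  d/da log(sigmoid(a))     = 1 - sigmoid(a)
    """
    d = sigmoid(fake_logit)
    return -d, 1.0 - d


# An undecided discriminator (logit 0 means D = 0.5 on every sample).
undecided_value = game_value(np.zeros(4), np.zeros(4))

# A discriminator that confidently rejects a generated sample (logit -10).
saturating_grad, non_saturating_grad = generator_signal(-10.0)
```

With a logit of -10, the saturating objective passes almost no gradient back to the generator, while the reformulated objective passes a gradient close to 1. This is the practical reason for the reformulation.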
The generator can also be a sequential model, meaning that it can generate an image from a sequence of images with lower resolution or less detail. A few examples of images generated by such a model are shown in Figure 4. The GAN and later variants such as the DCGAN are currently considered to be among the best when it comes to the quality of the generated samples. The images look so realistic that you might assume that the model has simply memorized instances of the training set, but a quick k-nearest-neighbor (KNN) search reveals this not to be the case.

About the author

Mohammad Pezeshki is a master’s student in the LISA lab of Université de Montréal, working under the supervision of Yoshua Bengio and Aaron Courville. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014 and then started his master’s in September 2014. His research interests lie in the fields of artificial intelligence, machine learning, probabilistic models, and specifically deep learning.

Packt
19 Dec 2014
9 min read

Navigation Mesh Generation

In this article by Curtis Bennett and Dan Violet Sagmiller, authors of the book Unity AI Programming Essentials, we will learn about navigation meshes in Unity.

Navigation mesh generation controls how AI characters are able to travel around a game level and is one of the most important topics in game AI. In this article, we will provide an overview of navigation meshes and look at the algorithm for generating them. Then, we'll look at different options for customizing our navigation meshes. To do this, we will be using RAIN 2.1.5, a popular AI plugin for Unity by Rival Theory, available for free at http://rivaltheory.com/rain/download/. In this article, you will learn about:

  • How navigation mesh generation works and the algorithm behind it
  • Advanced options for customizing navigation meshes
  • Creating advanced navigation meshes with RAIN

(For more resources related to this topic, see here.)

An overview of a navigation mesh

To use navigation meshes, also referred to as NavMeshes, effectively, the first things we need to know are what exactly navigation meshes are and how they are created. A navigation mesh is a definition of the area an AI character could travel to in a level. It is a mesh, but it is not intended to be rendered or seen by the player; instead, it is used by the AI system. A NavMesh usually does not cover all the area in a level (if it did, we wouldn't need one), since it covers only the area a character can walk on. The mesh is also almost always a simplified version of the geometry. For instance, you could have a cave floor in a game with thousands of polygons along the bottom showing different details in the rock, but for the navigation mesh the area would just be a handful of very large polygons giving a simplified view of the level. The purpose of a navigation mesh is to provide this simplified representation to the rest of the AI system as a way to find a path between two points on a level for a character. With the purpose clear, let's discuss how navigation meshes are created.
It used to be common practice in the games industry to create navigation meshes manually. A designer or artist would take the completed level geometry, create a mesh using standard polygon mesh modelling tools, and save it out. As you might imagine, this allowed for nice, custom, efficient meshes, but it was also a big time sink, since every time the level changed the navigation mesh would need to be manually edited and updated. In recent years, there has been more research into automatic navigation mesh generation.

There are many approaches to automatic navigation mesh generation, but the most popular is Recast, originally developed and designed by Mikko Mononen. Recast takes in level geometry and a set of parameters defining the character, such as the size of the character and how big a step it can take, and then uses a multipass approach to filter and create the final NavMesh. The most important phase of this is voxelizing the level based on an input cell size. This means the level geometry is divided into voxels (cubes), creating a version of the level geometry where everything is partitioned into different boxes called cells. Then the geometry in each of these cells is analyzed and simplified based on its intersection with the sides of the boxes, and is culled based on things such as the slope of the geometry or how big a step height is between pieces of geometry. This simplified geometry is then merged and triangulated to make a final navigation mesh that can be used by the AI system. The source code and more information on the original C++ implementation of Recast are available at https://github.com/memononen/recastnavigation.

Advanced NavMesh parameters

Now that we understand how navigation mesh generation works, let's look at the different parameters you can set to generate them in more detail. We'll look at how to do this with RAIN. Open Unity and create a new scene with a floor and some blocks for walls.
Download RAIN from http://rivaltheory.com/rain/download/ and import it into your scene. Then go to RAIN | Create Navigation Mesh. Also right-click on the RAIN menu and choose Show Advanced Settings. The setup should look something like the following screenshot:

Now let's look at some of the important parameters:

  • Size: This is the overall size of the navigation mesh. You'll want the navigation mesh to cover your entire level, so use this parameter instead of trying to scale up the navigation mesh through the Scale transform in the Inspector window. For our demo here, set the Size parameter to 20.
  • Walkable Radius: This important parameter defines the character size the mesh is generated for. Remember, each mesh will be matched to the size of a particular character, and this is the radius of that character. You can visualize the radius for a character by adding a Unity Sphere Collider component to your object (by going to Component | Physics | Sphere Collider) and adjusting the radius of the collider.
  • Cell Size: This is also a very important parameter. During the voxel step of the Recast algorithm, this sets the size of the cubes used to inspect the geometry. The smaller the size, the finer and more detailed the mesh, but the longer the processing time for Recast. A large cell size makes computation fast but loses detail.

For example, here is a NavMesh from our demo with a cell size of 0.01:

You can see the finer detail here. Here is the navigation mesh generated with a cell size of 0.1:

Note the difference between the two screenshots. In the former, walking between the two walls lower down in our picture is possible, but in the latter, with a larger cell size, there is no path even though the character radius is the same. Problems like this become greater with larger cell sizes. The following is a navigation mesh with a cell size of 1:

As you can see, the detail becomes jumbled and the mesh itself becomes unusable.
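The effect of cell size on narrow passages can be reproduced with a toy one-dimensional sketch. This is not RAIN's or Recast's code: it assumes a conservative voxelization (a cell is blocked if any wall overlaps it) followed by shrinking the free area by the character radius rounded up to whole cells, and every name in it is hypothetical.

```python
import math


def walkable_cells(walls, length, cell_size, radius):
    """Return the indices of cells in which a character of `radius` can stand.

    walls:  list of (start, end) intervals occupied by geometry
    length: total length of the strip being voxelized
    """
    n = int(round(length / cell_size))

    # Conservative voxelization: any overlap with a wall blocks the cell.
    blocked = []
    for i in range(n):
        lo, hi = i * cell_size, (i + 1) * cell_size
        blocked.append(any(lo < end and hi > start for start, end in walls))

    # Keep a cell only if all cells within the character radius are free.
    # The radius is rounded up to whole cells; cells outside the strip
    # count as blocked.
    margin = math.ceil(radius / cell_size)
    result = []
    for i in range(n):
        neighbours = range(i - margin, i + margin + 1)
        if all(0 <= j < n and not blocked[j] for j in neighbours):
            result.append(i)
    return result


# Two walls with a 1-unit gap between them, and a character of radius 0.25.
walls = [(0.0, 4.0), (5.0, 10.0)]
fine = walkable_cells(walls, 10.0, cell_size=0.25, radius=0.25)
medium = walkable_cells(walls, 10.0, cell_size=0.5, radius=0.25)
coarse = walkable_cells(walls, 10.0, cell_size=1.0, radius=0.25)
```

With a cell size of 0.25, two cells in the middle of the gap remain walkable. With cell sizes of 0.5 and 1.0, the rounded-up radius consumes the whole gap and no walkable cell is left, even though a character 0.5 units wide would physically fit through a 1-unit opening.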
With such differing results, the big question is: how large should the cell size be for a level? The answer is that it depends on the required result. However, one important consideration is that the mesh is generated during development and not at runtime, so even if it takes several minutes to generate a good mesh, it can be worth it to get a good result in the game. Setting a small cell size on a large level can cause mesh processing to take a significant amount of time and consume a lot of memory. It is good practice to save the scene before attempting to generate a complex navigation mesh.

The Size, Walkable Radius, and Cell Size parameters are the most important parameters when generating the navigation mesh, but there are more that are used to customize the mesh further:

  • Max Slope: This is the largest slope that a character can walk on, that is, how far a piece of geometry can be tilted and still be walked on. If you take the wall and rotate it, you can see it is walkable:

    The preceding is a screenshot of a walkable object with slope.
  • Step Height: This is how high a character can step from one object to another. For example, if you have steps between two blocks, as shown in the following screenshot, this defines how far apart in height the blocks can be while the area is still considered walkable:

    This is a screenshot of the navigation mesh with step height set to connect adjacent blocks.
  • Walkable Height: This is the vertical height that is needed for the character to walk. For example, in the previous illustration, the second block is not walkable underneath because of the walkable height. If you raise it to at least one unit off the ground and set the walkable height to 1, the area underneath becomes walkable:

    You can see a screenshot of the navigation mesh with walkable height set to allow going under the higher block.

These are the most important parameters.
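As a rough illustration of how the three parameters above act as filters, here is a minimal sketch. The class and function names are hypothetical and are not part of RAIN or Recast; the real tools apply comparable tests per voxel span during generation.

```python
from dataclasses import dataclass


@dataclass
class AgentLimits:
    """Hypothetical container for the three filtering parameters."""
    max_slope_deg: float    # steepest surface angle the character can walk on
    step_height: float      # largest height difference it can step across
    walkable_height: float  # vertical clearance it needs to stand


def surface_is_walkable(slope_deg, clearance, limits):
    """A surface passes if it is flat enough and has enough headroom."""
    return (slope_deg <= limits.max_slope_deg
            and clearance >= limits.walkable_height)


def can_step_between(height_a, height_b, limits):
    """Two adjacent surfaces connect if their height difference is small enough."""
    return abs(height_a - height_b) <= limits.step_height


limits = AgentLimits(max_slope_deg=45.0, step_height=0.5, walkable_height=1.0)

ramp_ok = surface_is_walkable(slope_deg=30.0, clearance=3.0, limits=limits)
under_low_block = surface_is_walkable(slope_deg=0.0, clearance=0.6, limits=limits)
under_raised_block = surface_is_walkable(slope_deg=0.0, clearance=1.0, limits=limits)
small_step = can_step_between(0.0, 0.4, limits)
tall_step = can_step_between(0.0, 0.8, limits)
```

The low block fails the clearance test, while raising it so that the clearance matches the walkable height of 1 makes the area underneath pass. This mirrors the Walkable Height example above.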
There are some other parameters related to visualization and to culling objects. We will look at culling more in the next section.

Culling areas

Being able to set up areas as walkable or not is an important part of creating a level. To demo this, let's divide the level into two parts and create a bridge between the two. Take our demo, duplicate the floor, and pull it down. Then transform one of the walls into a bridge. Then, add two other pieces of geometry to mark areas that are dangerous to walk on, like lava. Here is an example setup:

This is a basic scene with a bridge to cross. If you recreate the navigation mesh now, all of the geometry will be covered and the bridge won't be recognized. To fix this, you can create a new tag called Lava and tag the geometry under the bridge with it. Then, in the navigation mesh's RAIN component, add Lava to the unwalkable tags. If you then regenerate the mesh, only the bridge is walkable. This is a screenshot of a navigation mesh with the areas under the bridge culled:

Using layers and the unwalkable tags, you can customize navigation meshes.

Summary

Navigation meshes are an important part of game AI. In this article, we looked at the different parameters to customize navigation meshes. We looked at things such as setting the character size and walkable slopes and discussed the importance of the cell size parameter. We then saw how to customize our mesh by tagging different areas as not walkable. This should be a good start for designing navigation meshes for your games.

Packt
15 Jul 2016
13 min read

MicroStrategy 10

In this article by Dmitry Anoshin, Himani Rana, and Ning Ma, the authors of the book Mastering Business Intelligence with MicroStrategy, we are going to talk about MicroStrategy 10, one of the leading platforms on the market, which can handle all data analytics demands and offers a powerful solution. We will discuss different aspects of MicroStrategy, such as its history, deployment, and so on.

(For more resources related to this topic, see here.)

Meet MicroStrategy 10

MicroStrategy is a market leader in Business Intelligence (BI) products. It has rich functionality to meet the requirements of modern businesses. In 2015, MicroStrategy released version 10. It offers both agility and governance like no other BI product. In addition, it is easy to use and enterprise ready. At the same time, it is great for both IT and business. In other words, MicroStrategy 10 offers an analytics platform that combines an easy and empowering user experience with enterprise-grade performance, management, and security capabilities. It is true bimodal BI and moves seamlessly between styles:

  • Data discovery and visualization
  • Enterprise reporting and dashboards
  • In-memory high performance BI
  • Scales from departments to enterprises
  • Administration and security

MicroStrategy 10 consists of three main products: MicroStrategy Desktop, MicroStrategy Mobile, and MicroStrategy Web. MicroStrategy Desktop lets users start discovering and visualizing data instantly. It is available for Mac and PC. It allows users to connect, prepare, discover, and visualize data. In addition, we can easily promote content to a MicroStrategy Server. Moreover, MicroStrategy Desktop has a brand new HTML5 interface and includes all connection drivers. It allows us to use data blending, data preparation, and data enrichment. Finally, it has powerful advanced analytics and can be integrated with R.
To cut a long story short, we want to note the main changes in the new BI platform. Developer keeps the same functionality and look, and Architect remains the same as well. All changes concern the Web interface and Intelligence Server. Let's look closer at what MicroStrategy 10 can show us. MicroStrategy 10 expands the analytical ecosystem by using third-party toolkits such as:

  • Data visualization libraries: We can easily plug in and use any visualization from the expanding range of Java libraries
  • Statistical toolkits: R, SAS, SPSS, KXEN, and others
  • Geolocation data visualization: Uses mapping capabilities to visualize and interact with location data

MicroStrategy 10 has more than 25 new data sources that we can connect to quickly and simply. In addition, it allows us to build reports on top of other BI tools, such as SAP Business Objects, Cognos, and Oracle BI. It has a new native connector to Hadoop. Moreover, it allows us to blend multiple data sources in-memory. MicroStrategy 10 also has rich functionality for working with data, such as:

  • Streamlined workflows to parse and prepare data
  • Multi-table in-memory support from different sources
  • Automatically parse and prepare data with every refresh
  • 100+ inbuilt functions to profile and clean data
  • Create custom groups on the fly without coding

In terms of connection to Hadoop, most BI products use Hive or Impala ODBC drivers in order to use SQL to get data from Hadoop. However, this method performs poorly. MicroStrategy 10 queries directly against Hadoop. As a result, it is up to 50 times faster than via ODBC.

Let's look at some of the main technical changes that have significantly improved MicroStrategy. The platform is now faster than ever before, because it doesn't have a two-billion-row limit on in-memory datasets and allows us to create analytical cubes up to 16 times bigger in size. It publishes cubes dramatically faster.
Moreover, MicroStrategy 10 has higher data throughput, and cubes can be loaded 4 times faster with multi-threaded parallel loading. In addition, the in-memory engine allows us to create cubes 80 times larger than before, and we can access data from cubes 50% faster by using up to 8 parallel threads. Look at the following table, where we compare in-memory cube functionality in version 9 versus version 10:

Feature          Ver. 9        Ver. 10
Data volume      100 GB        ~2 TB
Number of rows   2 billion     200 billion
Load rate        8 GB/hour     ~200 GB/hour
Data model       Star schema   Any schema, tabular or multiple sets

In order to make the administration of MicroStrategy more effective in the new version, MicroStrategy Operations Manager was released. It gives MicroStrategy administrators powerful development tools to monitor, automate, and control systems. Operations Manager gives us:

  • Centralized management in a web browser
  • Enterprise Manager Console within Tool
  • Triggers and 24/7 alerts
  • System health monitors
  • Server management
  • Multiple environment administration

MicroStrategy 10 education and certification

MicroStrategy 10 offers new training courses that can be conducted offline in a training center, or online at http://www.microstrategy.com/us/services/education. We believe that certification is a good thing on your journey. The following certifications now exist for version 10:

  • MicroStrategy 10 Certified Associated Analyst
  • MicroStrategy 10 Certified Application Designer
  • MicroStrategy 10 Certified Application Developer
  • MicroStrategy 10 Certified Administrator

After passing all of these exams, you will become a MicroStrategy 10 Application Engineer. More details can be found here: http://www.microstrategy.com/Strategy/media/downloads/training-events/MicroStrategy-certification-matrix_v10.pdf.
History of MicroStrategy

Let us briefly look at the history of MicroStrategy, which began in 1991:

  • 1991: Released first BI product, which allowed users to create graphical views and analyses of information data
  • 2000: Released MicroStrategy 7 with a web interface
  • 2003: First to release a fully integrated reporting tool, combining list reports, BI-style dashboards, and interface analyses in a single module
  • 2005: Released MicroStrategy 8, including one-click actions and drag-and-drop dashboard creation
  • 2009: Released MicroStrategy 9, delivering a seamless consolidated path from department to enterprise BI
  • 2010: Unveiled new mobile BI capabilities for iPad and iPhone, and was featured on the iTunes Bestseller List
  • 2011: Released MicroStrategy Cloud, the first SaaS offering from a major BI vendor
  • 2012: Released Visual Data Discovery and groundbreaking new security platform, Usher
  • 2013: Released expanded Analytics Platform and free Analytics Desktop client
  • 2014: Announced availability of MicroStrategy Analytics via Amazon Web Services (AWS)
  • 2015: MicroStrategy 10 was released, the first ever enterprise analytics solution for centralized and decentralized BI

Deploying MicroStrategy 10

We know only one way to master MicroStrategy: through practical exercises. Let's start by downloading and deploying MicroStrategy 10.2.

Overview of training architecture

In order to master MicroStrategy and learn about some BI considerations, we need to download the required software, deploy it, and connect it to a network. During the preparation of the training environment, we will cover the installation of MicroStrategy on a Linux operating system. This is very good practice, because many people work with Windows and are not familiar with Linux, so this chapter will provide additional knowledge of working with Linux, as well as installing MicroStrategy and a web server.
Look at the training architecture:

There are three main components:

  • Red Hat Linux 6.4: Used for deploying the web server and Intelligence Server.
  • Windows machine: Hosts the MicroStrategy client and the Oracle database.
  • Virtual machine with Hadoop: A ready-made virtual machine with Hadoop, which will connect to MicroStrategy using a brand new connection.

In the real world, we should use separate machines for every component, and sometimes several machines in order to run one component. This is called clustering. Let's create a virtual machine.

Creating a Red Hat Linux virtual machine

Let's create a virtual machine with Red Hat Linux, which will host our Intelligence Server:

1. Go to http://www.redhat.com/ and create an account
2. Go to the software download center: https://access.redhat.com/downloads
3. Download RHEL: https://access.redhat.com/downloads/content/69/ver=/rhel---7/7.2/x86_64/product-software
4. Choose Red Hat Enterprise Linux Server
5. Download Red Hat Enterprise Linux 6.4 x86_64
6. Choose Binary DVD

Now we can create a virtual machine with RHEL 6.4. We have several options when choosing the software for deploying the virtual machine. In our case, we will use VMware Workstation. Before starting to deploy a new VM, we should adjust the default settings, such as increasing RAM and HDD, and adding one more network card in order to connect the external environment with the MicroStrategy client and sample database. In addition, we should create a new network. When the deployment of the RHEL virtual machine is complete, we should activate a subscription in order to install the required packages. Let us do this with one command in the terminal:

    # subscription-manager register --username <username> --password <password> --auto-attach

Performing prerequisites for MicroStrategy 10

According to the installation and configuration guide, we should deploy all necessary packages.
In order to install them, we should run the following commands as root:

    # su
    # yum install compat-libstdc++-33.i686
    # yum install libXp.x86_64
    # yum install elfutils-devel.x86_64
    # yum install libstdc++-4.4.7-3.el6.i686
    # yum install krb5-libs.i686
    # yum install nss-pam-ldapd.i686
    # yum install ksh.x86_64

The project design process

Project design is not just about creating a project in MicroStrategy Architect; it involves several steps and thorough analysis, such as how data is stored in the data warehouse, what reports the user wants based on the data, and so on. The following are the steps involved in our project design process:

Logical data model design

Once the business requirements are documented, the user must create a fact qualifier matrix to identify the attributes, facts, and hierarchies, which are the building blocks of any logical data model. An example of a fact qualifier is as follows:

A logical data model is created based on the source systems and designed before defining a data warehouse. So, it is useful for seeing which objects the users want and checking whether those objects are in the source systems. It represents the definition, characteristics, and relationships of the data. This graphical representation of information is easily understandable by business users too. A logical data model graphically represents the following concepts:

  • Attributes: Provide a detailed description of the data
  • Facts: Provide numerical information about the data
  • Hierarchies: Provide relationships between data

Data warehouse schema design

Physical data warehouse design is based on the logical data model and represents the storage and retrieval of data from the data warehouse. Here, we determine the optimal schema design, which ensures reporting performance and maintainability. The key components of a physical data warehouse schema are columns and tables:

Columns: These store attribute and fact data.
The following are the three types of columns:

  • ID column: Stores the ID for an attribute
  • Description column: Stores the text description of the attribute
  • Fact column: Stores fact data

Tables: Physical grouping of related data. The following are the types of tables:

  • Lookup tables: Store information about attributes, such as IDs and descriptions
  • Relationship tables: Store information about the relationship between two or more attributes
  • Fact tables: Store factual data and the level of aggregation, which is defined based on the attributes of the fact table. They contain base fact columns or derived fact columns:
      • Base fact: Stores the data at the lowest possible level of detail.
      • Aggregate fact: Stores data at a higher or summarized level of detail.

Mobile server installation and configuration

While the mobile client is easy to install, the mobile server is not. Here we provide a step-by-step guide on how to install the mobile server:

1. Download MicroStrategyMobile.war. The mobile server is packed in a WAR file, just like Operations Manager or Web:
2. Copy MicroStrategyMobile.war from <Microstrategy Installation folder>/Mobile/MobileServer to /usr/local/tomcat7/webapps. Then restart Tomcat by issuing the ./shutdown.sh and ./startup.sh commands:
3. Connect to the mobile server. Go to http://192.168.81.134:8080/MicroStrategyMobile/servlet/mstrWebAdmin. Then add the server name localhost.localdomain and click connect:
4. Configure the mobile server. You can configure (1) authentication settings for the mobile server application; (2) privileges and permissions; (3) SSL encryption; (4) client authentication with a certificate server; (5) the destination folder for the photo uploader widget and signature capture input control.

Performing Pareto analysis

One good thing about data discovery tools is their agile approach to the data. We can connect any data source and easily slice and dice data. Let's try to use the Pareto principle in order to answer the question: How are sales distributed among the different products?
The Pareto principle states that, for many events, roughly 80% of results come from 20% of the causes. For example, 80% of profits come from 20% of the products offered. This type of analysis is very popular in product analytics. In MicroStrategy Desktop, we can use shortcut metrics in order to quickly make complex calculations such as running sums or a percent of the total. Let's build a visualization in order to see the 20% of products that bring us 80% of the money:

1. Choose Combo Chart.
2. Drag and drop Salesamount to the vertical and Englishproductname to the horizontal.
3. Add Orderdate to the filters and restrict to 60 days.
4. Right-click on Salesamount and choose Descending Sort.
5. Right-click on Salesamount | Shortcut Metrics | Percent Running Total.
6. Drag and drop Metric Names to the Color By.
7. Change the color of Salesamount and Percent Running Total.
8. Change the shape of Percent Running Total.

As a result, we get this chart:

From this chart we can quickly identify the top 20% of products, which bring us 80% of revenue.

Splunk and MicroStrategy

MicroStrategy 10 has announced a new connection to Splunk. Splunk is not very popular in the world of Business Intelligence. Most people who have heard about Splunk think that it is just a platform for processing logs. This is both true and false. The name Splunk is derived from spelunking, because searching for root causes in logs is a kind of spelunking without light, and Splunk solves this problem by indexing machine data from a tremendous number of data sources, including applications, hardware, sensors, and so on.

What is Splunk

Splunk's goal is making machine data accessible, usable, and valuable for everyone, and turning machine data into business value. It can:

  • Collect data from anywhere
  • Search and analyze everything
  • Gain real-time Operational Intelligence

In the BI world, everyone knows what a data warehouse is.
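Returning to the Pareto analysis described earlier in this article, the calculation behind the descending sort and the Percent Running Total shortcut metric can be sketched outside MicroStrategy in a few lines. This is an illustration of the arithmetic only, with made-up product names and amounts; it is not how MicroStrategy implements the metric.

```python
def percent_running_total(sales):
    """Sort products by sales, descending, and attach the running share in percent.

    sales: dict mapping product name to sales amount
    returns: list of (product, amount, running_percent) tuples
    """
    total = sum(sales.values())
    rows = []
    running = 0
    for product, amount in sorted(sales.items(), key=lambda kv: kv[1], reverse=True):
        running += amount
        rows.append((product, amount, 100.0 * running / total))
    return rows


def products_reaching(sales, threshold_pct=80.0):
    """Smallest set of top-selling products whose running share reaches the threshold."""
    selected = []
    for product, _, running_pct in percent_running_total(sales):
        selected.append(product)
        if running_pct >= threshold_pct:
            break
    return selected


# Made-up sample data for illustration.
sample_sales = {"Helmet": 300, "Road Bike": 500, "Gloves": 100, "Bottle": 60, "Socks": 40}
table = percent_running_total(sample_sales)
top_products = products_reaching(sample_sales)
```

In this sample, the running share is 50% after the first product and 80% after the second, so two of the five products account for 80% of sales. Real data will rarely split exactly 80/20, which is why the chart is useful for reading off the actual cut-off.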
Creating reports from Splunk

Now we are ready to build reports using MicroStrategy Desktop and Splunk. Let's do it:

1. Go to MicroStrategy Desktop, click add data, and choose Splunk
2. Create a connection using the existing DSN based on Splunk ODBC:
3. Choose one of the tables (Splunk reports):
4. Add other tables as new data sources.

Now we can build a dashboard using data from Splunk by dragging and dropping attributes and metrics:

Summary

In this article we looked at MicroStrategy 10 and its features. We learned about its history and deployment. We also learned about the project design process, Pareto analysis, and the connection between Splunk and MicroStrategy.

Packt
03 Jan 2013
4 min read

SAP HANA integration with Microsoft Excel

(For more resources related to this topic, see here.)

Once your application is finished inside SAP HANA, and you can see that it performs as expected inside the Studio, you need to be able to deploy it to your users. Asking them to use the Studio is not really practical, and you don’t necessarily want to put the modeling software in the hands of all your users. Reporting on SAP HANA can be done in most of SAP’s BusinessObjects suite of applications, or in tools which can create and consume MDX queries and data. The simplest of these tools to start with is probably Microsoft Excel. Excel can connect to SAP HANA using the MDX language (a kind of multidimensional SQL) in the form of pivot tables. These in turn allow users to “slice and dice” data as they require, to extract the metrics they need.

There are (at the time of writing) limitations to the integration of SAP HANA with external reporting tools. These limitations are due to the relative youth of the HANA product, and are being addressed with each successive update to the software. Those listed here are valid for SAP HANA SP04; they may or may not be valid for your version:

  • Hierarchies can only be visualized in Microsoft Excel, not in BusinessObjects
  • Prompts can only be used in BusinessObjects BI4
  • Views which use variables can be used in other tools, but only if the variable has a default value (if you don’t have a default value on the variable, then Excel, notably, will complain that the view has been “changed on the server”)

In order to make MDX connections to SAP HANA, the SAP HANA Client software is needed. This is separate from the Studio, and must be installed on the client workstation. Like the Studio itself, it can be found on the SAP HANA DVD set, or in the SWDC.
Additionally, like the Studio, SAP provides a developer download of the client software on SDN, at the following link:

http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/webcontent/uuid/402aa158-6a7a-2f10-0195-f43595f6fe5f

Download the appropriate version for your Microsoft Office installation. Even if your PC has a 64-bit installation of Windows, you most likely have a 32-bit installation of Office, and you’ll need the 32-bit version of the SAP HANA Client software. If you’re not sure, you can find the information in the Help | About dialog box. In Excel 2010, for example, click on the File tab, then the Help menu entry. The version is specified on the right of the page.

Install the client software as you installed the Studio, usually to the default location. Once the software is installed, no shortcut is created on your desktop and no entry is created in your “Start” menu, so don’t be surprised if there is nothing to run.

We’re going to incorporate our sales simulator in Microsoft Excel, so launch Excel now. Go to the Data tab, and click on From Other Sources, then From Data Connection Wizard.

Next, select Other/Advanced, then SAP HANA MDX provider, and then click Next. The SAP HANA Logon dialog will appear, so enter your Host, Instance, and login information (the same information you use to connect to SAP HANA with the Studio).

Click on Test Connection to validate the connection. If the test succeeds, click on OK to choose the cube to which you want to connect. In Excel, all your Analytic and Calculation Views are considered to be cubes.

Choose your Analytic or Calculation View and click Next. On this screen there’s a checkbox, Save password in file. Checking it avoids having to type in the SAP HANA password every time the Excel file is opened, but the password is then stored in the Excel file, which is a little less secure.

Click on the Finish button to create the connection to SAP HANA and your view.
On the next screen you’ll be asked where you want to insert the pivot table. Just click on OK to see the results.

Congratulations! You now have your reporting application available in Microsoft Excel, showing the same information you could see using the Data Preview feature of the SAP HANA Studio.

Packt
06 Oct 2015
10 min read

Creating Interactive Spreadsheets using Tables and Slicers
This article by Hernan D Rojas, author of the book Data Analysis and Business Modeling with Excel 2013, introduces additional material for the advanced Excel developer on the presentation stage of the data analysis life cycle. You will learn how to leverage Excel's new features to add interactivity to your spreadsheets.

What are slicers?

Slicers are essentially buttons that automatically filter your data. Excel has always been able to filter data, but slicers are more practical and visually appealing. Let's compare the two in the following steps:

1. First, fire up Excel 2013, and create a new spreadsheet. Manually enter the data, as shown in the following figure.
2. Highlight cells A1 through E11, and press Ctrl + T to convert our data to an Excel table. Converting your data to a table is the first step that you need to take in order to introduce slicers in your spreadsheet.
3. Let's filter our data using the default filtering capabilities that we are already familiar with. Filter the Type column and only select the rows that have the value equal to SUV, as shown in the following figure. Click on the OK button to apply the filter to the table.
4. You will now be left with four rows that have the Type column equal to SUV. Using a typical Excel filter, we were able to filter our data and show only the SUV cars. We can then continue to filter by other columns, such as MPG (miles per gallon) and Price.

How can we accomplish the same results using slicers? Continue reading this article to find out.

How to create slicers?

In this article, we will go through the simple but powerful steps required to build slicers. After we create our first slicer, make sure that you compare and contrast the old way of filtering with the new way of filtering data.
1. Remove the filter that we just applied to our table by clicking on the option named Clear Filter From "Type", as shown in the following figure.
2. With your Excel table selected, click on the TABLE TOOLS tab.
3. Click on the Insert Slicer button.
4. In the Insert Slicers dialog box, select the Type checkbox, and click on the OK button, as shown in the following screenshot.
5. You should now have a slicer that looks similar to the one in the following figure. Notice that you can resize and move the slicer anywhere you want in the spreadsheet.
6. Click on the Sedan filter in the slicer that we built in the previous step. The data is filtered, and only the rows where the Type column is equal to Sedan are shown in the results.
7. Click on the Sport filter and see what happens. The data is now filtered where the Type column is equal to Sport. Notice that the previous filter of Sedan was removed as soon as we clicked on the Sport filter.
8. What if we want to filter the data by both Sport and Sedan? We can just highlight both filters with the mouse, or click on Sedan, hold Ctrl, and then click on the Sport filter.
9. To clear the filter, just click on the Clear Filter button.

Do you see the advantage of slicers over filters? Filtering between Sedan, Sport, or SUV is easy and convenient. It takes fewer keystrokes, and the feedback is instant. Think about the end users interacting with your spreadsheet. At the touch of a button, they can answer questions that arise in their heads. This is what you call an interactive spreadsheet or an interactive dashboard.

Styling slicers

There are not many options to style slicers, but Excel does give you a decent number of color schemes that you can experiment with:

1. With the Type slicer selected, navigate to the SLICER TOOLS tab, as shown in the following figure.
2. Click on the various slicer styles available to get a feel for what Excel offers.
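The click behaviour described in the slicer steps (a plain click replaces the selection, Ctrl + click extends it, Clear Filter resets it) can be modelled in a few lines. This is only a sketch in Python, not anything Excel exposes: the Slicer class, its method names, and the sample rows are all hypothetical, since the article's table is only shown in a figure.

```python
class Slicer:
    """Toy model of one Excel slicer bound to a single table column."""

    def __init__(self, column):
        self.column = column
        self.selected = set()  # an empty set means "no filter applied"

    def click(self, value, ctrl=False):
        # A plain click replaces the selection; Ctrl + click adds to it.
        if ctrl:
            self.selected.add(value)
        else:
            self.selected = {value}

    def clear(self):
        # Equivalent of the Clear Filter button on the slicer.
        self.selected = set()

    def apply(self, rows):
        # With nothing selected, every row is shown.
        if not self.selected:
            return list(rows)
        return [row for row in rows if row[self.column] in self.selected]


# Hypothetical rows standing in for the table from the article's figure.
cars = [
    {"Company": "A", "Type": "SUV"},
    {"Company": "B", "Type": "Sedan"},
    {"Company": "C", "Type": "Sport"},
    {"Company": "D", "Type": "Sedan"},
]

type_slicer = Slicer("Type")
type_slicer.click("Sedan")
sedans = type_slicer.apply(cars)

type_slicer.click("Sport")             # replaces the Sedan selection
sports = type_slicer.apply(cars)

type_slicer.click("Sedan", ctrl=True)  # Sport and Sedan together
both = type_slicer.apply(cars)

type_slicer.clear()
everything = type_slicer.apply(cars)
```

The point of the model is the default: an empty selection shows all rows, which is why clearing a slicer restores the full table.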
Adding multiple slicers

You can add multiple slicers and multiple charts to one spreadsheet. Why would we do this? Well, this is the beginning of dashboard creation. Let's expand on the example we have just been working on, and see how we can turn raw data into an interactive dashboard:

1. Let's start by creating slicers for # of Passengers and MPG, as shown in the following figure.
2. Rename Sheet1 as Data, and create a new sheet called Dashboard.
3. Move the three slicers by cutting and pasting them from the Data sheet to the Dashboard sheet.
4. Create a line chart using the columns Company and MPG, as shown in the following figure.
5. Create a bar chart using the columns Type and MPG.
6. Create another bar chart with the columns Company and # of Passengers, as shown in the following figure. These types of charts are technically called column charts, but you can get away with calling them bar charts.
7. Now, move the three charts from the Data tab to the Dashboard tab. Right-click on the bar chart, and select the Move Chart… option. In the Move Chart dialog box, change the Object in: parameter from Data to Dashboard, and then click on the OK button.
8. Move the other two charts to the Dashboard tab so that there are no more charts in the Data tab.
9. Rearrange the charts and slicers so that they look as close as possible to the ones in the following figure. As you can see, this tab is starting to look like a dashboard.
10. The Type slicer will look better if Sedan, Sport, and SUV are laid out horizontally. Select the Type slicer, and click on the SLICER TOOLS menu option.
11. Change the Columns parameter from 1 to 3, as shown in the following figure. This is how we are able to change the layout or shape of the slicer.
12. Resize the Type slicer so that it looks like the one in the following figure.

Clearing filters

You can click on one or more filters in the dashboard that we just created.
Every time we select a filter, all three of the charts that we created get updated. This, again, is what adding interactivity to your spreadsheets means. It allows the end users of your dashboard to interact with your data and perform their own analysis.

Notice, however, that there is no good way of removing multiple filters at once. For example, if you select Sedans that have an MPG greater than or equal to 30, how would you remove all of the filters? You would have to clear the filters from the Type slicer and then from the MPG slicer. This can be a little tedious for your end user, and you will want to avoid it. The next steps will show you how to create a button using VBA that will clear all of our filters in a flash:

1. Press Alt + F11, and create a sub procedure called Clear_Slicer, as shown in the following figure. This code walks through all of the slicers in the workbook and clears their manual filters for you, one at a time. The next step is to bind this code to a button:

    Sub Clear_Slicer()
        ' Declare variables
        Dim cache As SlicerCache

        ' Loop through each slicer cache in the workbook
        For Each cache In ActiveWorkbook.SlicerCaches
            ' Clear the filter
            cache.ClearManualFilter
        Next cache
    End Sub

2. Select the DEVELOPER tab, and click on the Insert button. In the pop-up menu called Form Controls, select the Button option.
3. Now, click anywhere on the sheet, and you will get a dialog box that looks like the following figure. This is where we are going to assign a macro to the button. This means that whenever you click on the button we are creating, Excel will run the macro of our choice. Since we have already created a macro called Clear_Slicer, select this macro, and then click on the OK button.
4. Change the text of the button to Clear All Filters and resize it.
5. Adjust the properties of the button by right-clicking on the button and selecting the Format Control… option.
Here, you can change the font size and the color of your button label.

Now, select a bunch of filters, and click on our shiny new button. The most important part is that it is now even easier to "reset" your dashboard and start a brand new analysis.

What do I mean by starting a brand new analysis? In general, when users first start using your dashboard, they will click on the filters aimlessly, just to figure out how to use the dashboard mechanically. Then, after they get the hang of it, they want to start with a clean slate and perform some data analysis. If we did not have the Clear All Filters button, the users would have to figure out how to clear every slicer one at a time to start over. The worst-case scenario is when the user does not realize which filters are turned on and which are turned off. Do not laugh at this situation, or assume that your end user is not as smart as you are. It just means that you need to lower the learning curve of your dashboard and make it easy to use. With the addition of the clear button, the end user can think of a question, use the slicers to answer it, click on the clear button, and start the process all over again. These little details are what will separate you from the average Excel developer.

Summary

The aim of this article was to give you ideas and tools to present your data artistically. Whether you like it or not, a better-looking analysis will sometimes trump a better but less attractive one. Excel gives you the tools to avoid being on the short end of the stick and to present visually stunning analysis. You now have your Excel slicers, and you have learned how to bind them to your data. Users of your spreadsheet can now slice and dice your data to answer multiple questions. Executives like flashy visualizations, and when you combine them with a strong analysis, you have a very powerful combination.
In this article, we also went through a variety of strategies to customize slicers and chart elements. These little changes will make your dashboards stand out and help you get your message across. Excel has always been an invaluable tool that gives you what you need to overcome the data challenges you might come across. As I tell all my students, the key to becoming better is simply to practice, practice, and practice.
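For readers who think more easily in code than in dashboards, the way several slicers combine, and what the Clear All Filters button does, can be sketched as follows. This is a Python illustration under stated assumptions, not the VBA from the article: apply_slicers, clear_all, and the sample rows are hypothetical names and data, and the only behaviour carried over from the article is that every active slicer must match and that clearing visits each slicer in turn.

```python
def apply_slicers(rows, selections):
    """Keep the rows that satisfy every active slicer.

    `selections` maps a column name to the set of values ticked in that
    column's slicer; an empty set means the slicer is not filtering.
    """
    return [
        row
        for row in rows
        if all(not chosen or row[column] in chosen
               for column, chosen in selections.items())
    ]


def clear_all(selections):
    """Counterpart of the Clear_Slicer macro: reset each slicer in turn."""
    for chosen in selections.values():
        chosen.clear()


# Hypothetical data; the article's table is only shown in a figure.
cars = [
    {"Company": "A", "Type": "Sedan", "MPG": 30},
    {"Company": "B", "Type": "Sedan", "MPG": 25},
    {"Company": "C", "Type": "SUV", "MPG": 30},
    {"Company": "D", "Type": "Sport", "MPG": 35},
]

# Sedans with an MPG of 30 or more, expressed as ticked slicer values.
selections = {"Type": {"Sedan"}, "MPG": {30, 35}}
filtered = apply_slicers(cars, selections)

clear_all(selections)
after_reset = apply_slicers(cars, selections)
```

Because each slicer narrows the result of the others, forgetting to clear even one of them silently hides rows, which is the situation the single reset button is meant to prevent.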

Packt
17 Jan 2011
7 min read

Creating Line Graphs in R
Adding customized legends for multiple line graphs

Line graphs with more than one line, representing more than one variable, are quite common in any kind of data analysis. In this recipe we will learn how to create and customize legends for such graphs.

Getting ready

We will use the base graphics library for this recipe, so all you need to do is run the recipe at the R prompt. It is good practice to save your code as a script to use again later.

How to do it...

First we need to load the cityrain.csv example data file, which contains monthly rainfall data for four major cities across the world. You can download this file from here.

    # Load the monthly rainfall data
    rain<-read.csv("cityrain.csv")

    # Draw the first line and suppress the default x axis
    plot(rain$Tokyo,type="b",lwd=2,
         xaxt="n",ylim=c(0,300),col="black",
         xlab="Month",ylab="Rainfall (mm)",
         main="Monthly Rainfall in major cities")
    axis(1,at=1:length(rain$Month),labels=rain$Month)

    # Add one line per remaining city
    lines(rain$Berlin,col="red",type="b",lwd=2)
    lines(rain$NewYork,col="orange",type="b",lwd=2)
    lines(rain$London,col="purple",type="b",lwd=2)

    # Add the legend
    legend("topright",legend=c("Tokyo","Berlin","New York","London"),
           lty=1,lwd=2,pch=21,col=c("black","red","orange","purple"),
           ncol=2,bty="n",cex=0.8,
           text.col=c("black","red","orange","purple"),
           inset=0.01)

How it works...

We used the legend() function. It is quite a flexible function and allows us to adjust the placement and styling of the legend in many ways. The first argument we passed to legend() specifies the position of the legend within the plot region. We used "topright"; other possible values are "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "right", and "center". We can also specify the location of the legend with x and y co-ordinates, as we will soon see. The other important arguments specific to lines are lwd and lty, which specify the line width and type drawn in the legend box respectively. It is important to keep these the same as the corresponding values in the plot() and lines() commands.
We also set pch to 21 to replicate the type="b" argument in the plot() command. cex and text.col set the size and colors of the legend text. Note that we set the text colors to the same colors as the lines they represent. Setting bty (box type) to "n" ensures no box is drawn around the legend. This is good practice as it keeps the look of the graph clean. ncol sets the number of columns over which the legend labels are spread, and inset sets the inset distance from the margins as a fraction of the plot region.

There's more...

Let's experiment by changing some of the arguments discussed:

    legend(1,300,legend=c("Tokyo","Berlin","New York","London"),
           lty=1,lwd=2,pch=21,col=c("black","red","orange","purple"),
           horiz=TRUE,bty="n",bg="yellow",cex=1,
           text.col=c("black","red","orange","purple"))

This time we used x and y co-ordinates instead of a keyword to position the legend. We also set the horiz argument to TRUE. As the name suggests, horiz makes the legend labels horizontal instead of the default vertical. Specifying horiz overrides the ncol argument. Finally, we made the legend text bigger by setting cex to 1 and did not use the inset argument.

An alternative way of creating the previous plot without having to call plot() and lines() multiple times is to use the matplot() function. To see details on how to use this function, please see the help file by running ?matplot or help(matplot) at the R prompt.

Using margin labels instead of legends for multiple line graphs

While legends are the most commonly used method of providing a key to read multiple variable graphs, they are often not the easiest to read. Labelling lines directly is one way of getting around that problem.

Getting ready

We will use the base graphics library for this recipe, so all you need to do is run the recipe at the R prompt. It is good practice to save your code as a script to use again later.

How to do it...
Let's use the gdp.txt example dataset to look at the trends in the annual GDP of five countries:

    gdp<-read.table("gdp_long.txt",header=TRUE)

    # Five-color palette from RColorBrewer
    library(RColorBrewer)
    pal<-brewer.pal(5,"Set1")

    # Widen the right margin for the labels and use an L-shaped box
    par(mar=par()$mar+c(0,0,0,2),bty="l")

    plot(Canada~Year,data=gdp,type="l",lwd=2,lty=1,ylim=c(30,60),
         col=pal[1],main="Percentage change in GDP",ylab="")
    mtext(side=4,at=gdp$Canada[length(gdp$Canada)],text="Canada",
          col=pal[1],line=0.3,las=2)

    lines(gdp$France~gdp$Year,col=pal[2],lwd=2)
    mtext(side=4,at=gdp$France[length(gdp$France)],text="France",
          col=pal[2],line=0.3,las=2)

    lines(gdp$Germany~gdp$Year,col=pal[3],lwd=2)
    mtext(side=4,at=gdp$Germany[length(gdp$Germany)],text="Germany",
          col=pal[3],line=0.3,las=2)

    lines(gdp$Britain~gdp$Year,col=pal[4],lwd=2)
    mtext(side=4,at=gdp$Britain[length(gdp$Britain)],text="Britain",
          col=pal[4],line=0.3,las=2)

    # The USA label is shifted down by 2 to avoid overlapping Canada
    lines(gdp$USA~gdp$Year,col=pal[5],lwd=2)
    mtext(side=4,at=gdp$USA[length(gdp$USA)]-2,
          text="USA",col=pal[5],line=0.3,las=2)

How it works...

We first read the gdp.txt data file using the read.table() function. Next we loaded the RColorBrewer color palette library and set our color palette pal to "Set1" (with five colors).

Before drawing the graph, we used the par() command to add extra space to the right margin, so that we have enough space for the labels. Depending on the size of the text labels you may have to experiment with this margin until you get it right. Finally, we set the box type (bty) to an L-shape ("l") so that there is no line on the right margin. We can also set it to "c" if we want to keep the top line.

We used the mtext() function to label each of the lines individually in the right margin. The first argument we passed to the function is the side where we want the label to be placed. Sides (margins) are numbered starting from 1 for the bottom side and going round in a clockwise direction so that 2 is left, 3 is top, and 4 is right. The at argument was used to specify the Y co-ordinate of the label.
This is a bit tricky because we have to make sure we place the label as close to the corresponding line as possible. So, here we have used the last value of each line. For example, gdp$France[length(gdp$France)] picks the last value in the France vector by using its length as the index. Note that we had to adjust the value for USA by subtracting 2 from its last value so that it doesn't overlap the label for Canada.

We used the text argument to set the text of the labels as country names. We set the col argument to the appropriate element of the pal vector by using a number index. The line argument sets an offset in terms of margin lines, starting at 0 and counting outwards. Finally, setting las to 2 rotates the labels to be perpendicular to the axis, instead of the default value of 0, which makes them parallel to the axis.

Sometimes, simply using the last value of a set of values may not work because the value may be missing. In that case we can use the second last value or visually choose a value that places the label closest to the line. Also, the size of the plot window and the proximity of the final values may cause overlapping of labels. So, we may need to iterate a few times before we get the placement right. We can write functions to automate this process but it is still good to visually inspect the outcome.
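The closing remark about automating label placement can be made concrete with a small helper. The sketch below is written in Python rather than R and is only one possible approach, not the recipe's method: last_available and spread_labels are hypothetical names, the series values are invented, and overlapping labels are nudged upwards, whereas the recipe moved the USA label down by hand.

```python
import math


def last_available(values):
    """Return the last value that is not missing (None or NaN)."""
    for value in reversed(values):
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        return value
    raise ValueError("series has no usable values")


def spread_labels(positions, min_gap):
    """Nudge label positions upwards so neighbours are at least min_gap apart."""
    placed = {}
    previous = None
    # Work from the lowest label to the highest.
    for name in sorted(positions, key=positions.get):
        y = positions[name]
        if previous is not None and y - previous < min_gap:
            y = previous + min_gap
        placed[name] = y
        previous = y
    return placed


# Invented series; the USA series ends in a missing value.
series = {
    "Canada": [40.0, 45.0, 50.0],
    "USA": [42.0, 49.0, float("nan")],
    "France": [35.0, 38.0, 41.0],
}

anchors = {name: last_available(values) for name, values in series.items()}
labels = spread_labels(anchors, min_gap=2.0)
```

Each resulting value would be passed as the at argument of the corresponding mtext() call. A visual check of the plot is still worthwhile, because a sensible min_gap depends on the size of the plot window and the text.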

Packt
08 Mar 2013
6 min read

Working with Apps in Splunk
Defining an app

In the strictest sense, an app is a directory of configurations and, sometimes, code. The directories and files inside have a particular naming convention and structure. All configurations are in plain text, and can be edited using your choice of text editor. Apps generally serve one or more of the following purposes:

  • A container for searches, dashboards, and related configurations: This is what most users will do with apps. This is useful not only for logical grouping, but also for limiting which configurations are applied and at what time. This kind of app usually does not affect other apps.
  • Providing extra functionality: Many objects can be provided in an app for use by other apps. These include field extractions, lookups, external commands, saved searches, workflow actions, and even dashboards. These apps often have no user interface at all; instead they add functionality to other apps.
  • Configuring a Splunk installation for a specific purpose: In a distributed deployment, there are several different purposes that are served by the multiple installations of Splunk. The behavior of each installation is controlled by its configuration, and it is convenient to wrap those configurations into one or more apps. These apps completely change the behavior of a particular installation.

Included apps

Without apps, Splunk has no user interface, rendering it essentially useless. Luckily, Splunk comes with a few apps to get us started. Let's look at a few of these apps:

  • gettingstarted: This app provides the help screens that you can access from the launcher. There are no searches, only a single dashboard that simply includes an HTML page.
  • search: This is the app where users spend most of their time.
It contains the main search dashboard that can be used from any app, external search commands that can be used from any app, admin dashboards, custom navigation, custom CSS, a custom app icon, a custom app logo, and many other useful elements.
  • splunk_datapreview: This app provides the data preview functionality in the admin interface. It is built entirely using JavaScript and custom REST endpoints.
  • SplunkDeploymentMonitor: This app provides searches and dashboards to help you keep track of your data usage and the health of your Splunk deployment. It also defines indexes, saved searches, and summary indexes. It is a good source for more advanced search examples.
  • SplunkForwarder and SplunkLightForwarder: These apps, which are disabled by default, simply disable portions of a Splunk installation so that the installation is lighter in weight.

If you never create or install another app, and instead simply create saved searches and dashboards in the app search, you can still be quite successful with Splunk. Installing and creating more apps, however, allows you to take advantage of others' work, organize your own work, and ultimately share your work with others.

Installing apps

Apps can either be installed from Splunkbase or uploaded through the admin interface. To get started, let's navigate to Manager | Apps, or choose Manage apps... from the App menu.

Installing apps from Splunkbase

If your Splunk server has direct access to the Internet, you can install apps from Splunkbase with just a few clicks. Navigate to Manager | Apps and click on Find more apps online. The most popular apps will be listed.

Let's install a pair of apps and have a little fun. First, install Geo Location Lookup Script (powered by MAXMIND) by clicking on the Install free button. You will be prompted for your splunk.com login. This is the same login that you created when you downloaded Splunk. If you don't have an account, you will need to create one.
Next, install the Google Maps app. This app was built by a Splunk customer and contributed back to the Splunk community. This app will prompt you to restart Splunk. Once you have restarted and logged back in, check the App menu. Google Maps is now visible, but where is Geo Location Lookup Script? Remember that not all apps have dashboards; nor do they necessarily have any visible components at all.

Using Geo Location Lookup Script

Geo Location Lookup Script provides a lookup script to provide geolocation information for IP addresses. Looking at the documentation, we see this example:

    eventtype=firewall_event | lookup geoip clientip as src_ip

You can find the documentation for any Splunkbase app by searching for it at splunkbase.com, or by clicking on Read more next to any installed app by navigating to Manager | Apps | Browse more apps. Let's read through the arguments of the lookup command:

  • geoip: This is the name of the lookup provided by Geo Location Lookup Script. You can see the available lookups by going to Manager | Lookups | Lookup definitions.
  • clientip: This is the name of the field in the lookup that we are matching against.
  • as src_ip: This says to use the value of src_ip to populate the field before it; in this case, clientip. I personally find this wording confusing. In my mind, I read this as "using" instead of "as".

Included in the ImplementingSplunkDataGenerator app (available at http://packtpub.com/) is a sourcetype instance named impl_splunk_ips, which looks like this:

    2012-05-26T18:23:44 ip=64.134.155.137

The IP addresses in this fictitious log are from one of my websites. Let's see some information about these addresses:

    sourcetype="impl_splunk_ips" | lookup geoip clientip AS ip | top client_country

This gives us a table similar to the one shown in the following screenshot. That's interesting. I wonder who is visiting my site from Slovenia!

Using Google Maps

Now let's do a similar search in the Google Maps app.
Choose Google Maps from the App menu. The interface looks like the standard search interface, but with a map instead of an event listing. Let's try this remarkably similar (but not identical) query using a lookup provided in the Google Maps app:

    sourcetype="impl_splunk_ips" | lookup geo ip

Unsurprisingly, the generated map shows that most of the traffic to this little site came from my house in Austin, Texas.

Installing apps from a file

It is not uncommon for Splunk servers to not have access to the Internet, particularly in a datacenter. In this case, follow these steps:

1. Download the app from splunkbase.com. The file will have a .spl or .tgz extension.
2. Navigate to Manager | Apps.
3. Click on Install app from file.
4. Upload the downloaded file using the form provided.
5. Restart if the app requires it.
6. Configure the app if required.

That's it. Some apps have a configuration form. If this is the case, you will see a Set up link next to the app when you go to Manager | Apps. If something goes wrong, contact the author of the app.

If you have a distributed environment, in most cases the app only needs to be installed on your search head. The components that your indexers need will be distributed automatically by the search head. Check the documentation for the app.
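Since the wording of lookup geoip clientip AS ip is easy to misread, it may help to see the same matching written out procedurally. The following Python sketch only imitates the behaviour described in the lookup discussion and is not how Splunk implements it: the lookup and top functions, the two-row geoip table, and the sample events are all hypothetical, and the real top command also reports percentages.

```python
from collections import Counter


def lookup(events, table, lookup_field, event_field):
    """Imitate `lookup <table> <lookup_field> AS <event_field>`.

    For each event, the value of `event_field` is matched against
    `lookup_field` in the table, and the other columns of the matching
    row are added to the event. Events without a match are unchanged.
    """
    index = {row[lookup_field]: row for row in table}
    enriched = []
    for event in events:
        match = index.get(event.get(event_field), {})
        extra = {k: v for k, v in match.items() if k != lookup_field}
        enriched.append({**event, **extra})
    return enriched


def top(events, field):
    """Count the values of `field`, most frequent first."""
    return Counter(e[field] for e in events if field in e).most_common()


# Hypothetical lookup table and events (documentation-range addresses).
geoip = [
    {"clientip": "192.0.2.1", "client_country": "United States"},
    {"clientip": "198.51.100.7", "client_country": "Slovenia"},
]
events = [
    {"ip": "192.0.2.1"},
    {"ip": "192.0.2.1"},
    {"ip": "198.51.100.7"},
    {"ip": "203.0.113.9"},  # not present in the lookup table
]

enriched = lookup(events, geoip, lookup_field="clientip", event_field="ip")
country_counts = top(enriched, "client_country")
```

Read this way, the AS clause simply names the event field whose value is fed into the lookup's clientip column.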

Packt
24 Mar 2014
5 min read

Deploying Storm on Hadoop for Advertising Analysis
Establishing the architecture

The recent componentization within Hadoop allows any distributed system to use it for resource management. In Hadoop 1.0, resource management was embedded into the MapReduce framework. Hadoop 2.0 separates out resource management into YARN, allowing other distributed processing frameworks to run on the resources managed under the Hadoop umbrella. In our case, this allows us to run Storm on YARN.

In this arrangement, Storm fulfills the same function as MapReduce. It provides a framework for the distributed computation. In this specific use case, we use Pig scripts to articulate the ETL/analysis that we want to perform on the data. We will convert that script into a Storm topology that performs the same function, and then we will examine some of the intricacies involved in doing that transformation.

To understand this better, it is worth examining the nodes in a Hadoop cluster and the purpose of the processes running on those nodes. A cluster contains two different components/subsystems. The first is YARN, which is the new resource management layer introduced in Hadoop 2.0. The second is HDFS. Let's first delve into HDFS since that has not changed much since Hadoop 1.0.

Examining HDFS

HDFS is a distributed filesystem. It distributes blocks of data across a set of slave nodes. The NameNode is the catalog. It maintains the directory structure and the metadata indicating which nodes have what information. The NameNode does not store any data itself; it only coordinates create, read, update, and delete (CRUD) operations across the distributed filesystem. Storage takes place on each of the slave nodes that run DataNode processes. The DataNode processes are the workhorses in the system.
They communicate with each other to rebalance, replicate, move, and copy data. They react and respond to the CRUD operations of clients.

Examining YARN

YARN is the resource management system. It monitors the load on each of the nodes and coordinates the distribution of new jobs to the slaves in the cluster. The ResourceManager collects status information from the NodeManagers. The ResourceManager also services job submissions from clients.

One additional abstraction within YARN is the concept of an ApplicationMaster. An ApplicationMaster manages resource and container allocation for a specific application. The ApplicationMaster negotiates with the ResourceManager for the assignment of resources. Once the resources are assigned, the ApplicationMaster coordinates with the NodeManagers to instantiate containers. The containers are logical holders for the processes that actually perform the work.

The ApplicationMaster is a processing-framework-specific library. Storm-YARN provides the ApplicationMaster for running Storm processes on YARN. HDFS distributes the ApplicationMaster as well as the Storm framework itself. Presently, Storm-YARN expects an external ZooKeeper. Nimbus starts up and connects to the ZooKeeper when the application is deployed.

When the Hadoop infrastructure runs Storm via Storm-YARN, YARN is used to deploy the Storm application framework. At launch, the Storm Application Master is started within a YARN container. That, in turn, creates an instance of Storm Nimbus and the Storm UI. After that, Storm-YARN launches supervisors in separate YARN containers. Each of these supervisor processes can spawn workers within its container. Both the Application Master and the Storm framework are distributed via HDFS. Storm-YARN provides command-line utilities to start the Storm cluster, launch supervisors, and configure Storm for topology deployment. We will see these facilities later in this article.
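The negotiation between the ApplicationMaster, the ResourceManager, and the NodeManagers is easier to follow as a toy model. The Python sketch below is purely illustrative and bears no relation to the real YARN or Storm-YARN APIs: the class and method names, the slot counts, and the "most free slots first" policy are all assumptions made for the example.

```python
class NodeManager:
    """A worker node with a fixed number of free container slots."""

    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots
        self.containers = []

    def launch(self, process):
        # Instantiate a container holding the given process.
        self.free_slots -= 1
        self.containers.append(process)


class ResourceManager:
    """Tracks node status and hands out container slots."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, count):
        # Assumed policy: grant slots on the node with the most free
        # capacity, stopping early if the cluster is full.
        free = {node: node.free_slots for node in self.node_managers}
        grants = []
        for _ in range(count):
            node = max(free, key=free.get)
            if free[node] == 0:
                break
            free[node] -= 1
            grants.append(node)
        return grants


class ApplicationMaster:
    """Negotiates resources, then asks the nodes to start containers."""

    def __init__(self, resource_manager):
        self.resource_manager = resource_manager

    def start(self, processes):
        grants = self.resource_manager.allocate(len(processes))
        placements = []
        for node, process in zip(grants, processes):
            node.launch(process)
            placements.append((node.name, process))
        return placements


nodes = [NodeManager("node1", 2), NodeManager("node2", 1)]
master = ApplicationMaster(ResourceManager(nodes))
placements = master.start(["supervisor-1", "supervisor-2", "supervisor-3"])
```

The split of responsibilities is the part worth noticing: the ResourceManager only decides where capacity exists, while the ApplicationMaster decides what runs there, which is what lets a framework such as Storm supply its own ApplicationMaster.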
To complete the architectural picture, we need to layer in the batch and real-time processing mechanisms: Pig and Storm topologies, respectively. We also need to depict the actual data. Often a queuing mechanism such as Kafka is used to queue work for a Storm cluster. To simplify things, we will use data stored in HDFS. The following depicts our use of Pig, Storm, YARN, and HDFS for our use case, omitting elements of the infrastructure for clarity. To fully realize the value of converting from Pig to Storm, we would convert the topology to consume from Kafka instead of HDFS as shown in the following diagram:

As the preceding diagram depicts, our data will be stored in HDFS. The dashed lines depict the batch process for analysis, while the solid lines depict the real-time system. In each of the systems, the following steps take place:

Step | Purpose | Pig Equivalent | Storm-YARN Equivalent
1 | The processing frameworks are deployed | The MapReduce Application Master is deployed and started | Storm-YARN launches Application Master and distributes Storm framework
2 | The specific analytics are launched | The Pig script is compiled to MapReduce jobs and submitted as a job | Topologies are deployed to the cluster
3 | The resources are reserved | Map and reduce tasks are created in YARN containers | Supervisors are instantiated with workers
4 | The analysis reads the data from storage and performs the analysis | Pig reads the data out of HDFS | Storm reads the work, typically from Kafka; but in this case, the topology reads it from a flat file

Another analogy can be drawn between Pig and Trident. Pig scripts compile down into MapReduce jobs, while Trident topologies compile down into Storm topologies.

For more information on the Storm-YARN project, visit the following URL: https://github.com/yahoo/storm-yarn

Packt
18 Oct 2013
5 min read

Using the Spark Shell

Loading a simple text file

When running a Spark shell and connecting to an existing cluster, you should see something specifying the app ID like Connected to Spark cluster with app ID app-20130330015119-0001. The app ID will match the application entry as shown in the web UI under running applications (by default, it would be viewable on port 8080). You can start by downloading a dataset to use for some experimentation. There are a number of datasets that were put together for The Elements of Statistical Learning, which are in a very convenient form for use. Grab the spam dataset using the following command:

wget http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data

Now load it as a text file into Spark with the following command inside your Spark shell:

scala> val inFile = sc.textFile("./spam.data")

This loads the spam.data file into Spark with each line being a separate entry in the RDD (Resilient Distributed Dataset). Note that if you've connected to a Spark master, it's possible that it will attempt to load the file on one of the different machines in the cluster, so make sure it's available on all the cluster machines. In general, you will want to put your data in HDFS, S3, or a similar filesystem in the future to avoid this problem. In local mode, you can just load the file directly, for example, sc.textFile([filepath]). To make a file available across all the machines, you can also use the addFile function on the SparkContext by writing the following code:

scala> import spark.SparkFiles
scala> val file = sc.addFile("spam.data")
scala> val inFile = sc.textFile(SparkFiles.get("spam.data"))

Just like most shells, the Spark shell has a command history. You can press the up arrow key to get to the previous commands. Getting tired of typing or not sure what method you want to call on an object? Press Tab, and the Spark shell will autocomplete the line of code as best as it can.
For this example, the RDD with each line as an individual string isn't very useful, as our data input is actually represented as space-separated numerical information. Map over the RDD, and quickly convert it to a usable format (note that _.toDouble is the same as x => x.toDouble):

scala> val nums = inFile.map(x => x.split(' ').map(_.toDouble))

Verify that this is what we want by inspecting some elements in the nums RDD and comparing them against the original string RDD. Take a look at the first element of each RDD by calling .first() on the RDDs:

scala> inFile.first()
[...]
res2: String = 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.778 0 0 3.756 61 278 1
scala> nums.first()
[...]
res3: Array[Double] = Array(0.0, 0.64, 0.64, 0.0, 0.32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.64, 0.0, 0.0, 0.0, 0.32, 0.0, 1.29, 1.93, 0.0, 0.96, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.778, 0.0, 0.0, 3.756, 61.0, 278.0, 1.0)

Using the Spark shell to run logistic regression

When you run a command and have not specified a left-hand side (that is, leaving out the val x of val x = y), the Spark shell will print the value along with res[number]. The res[number] value can then be used as if we had written val res[number] = y. Now that you have the data in a more usable format, try to do something cool with it!
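As a point of comparison, the parse step above can be sketched in plain Python, outside Spark. This is only an illustration of what the map does to a single line: parse_line is a hypothetical helper name, and the sample string is a shortened stand-in for a row of spam.data, not the full record.

```python
from typing import List


def parse_line(line: str) -> List[float]:
    """Split one space-separated record and convert every field to a float."""
    return [float(field) for field in line.split(" ")]


# Shortened sample in the same format as a row of spam.data (assumption:
# the real rows are much longer, but follow the same layout).
sample = "0 0.64 0.64 0 0.32 61 278 1"
print(parse_line(sample))
```

In Spark the same function is applied to every line of the RDD in parallel; here it runs on one string only.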
Use Spark to run logistic regression over the dataset as follows:

scala> import spark.util.Vector
import spark.util.Vector
scala> case class DataPoint(x: Vector, y: Double)
defined class DataPoint
scala> def parsePoint(x: Array[Double]): DataPoint = {
         DataPoint(new Vector(x.slice(0,x.size-2)) , x(x.size-1))
       }
parsePoint: (x: Array[Double])this.DataPoint
scala> val points = nums.map(parsePoint(_))
points: spark.RDD[this.DataPoint] = MappedRDD[3] at map at <console>:24
scala> import java.util.Random
import java.util.Random
scala> val rand = new Random(53)
rand: java.util.Random = java.util.Random@3f4c24
scala> var w = Vector(nums.first.size-2, _ => rand.nextDouble)
13/03/31 00:57:30 INFO spark.SparkContext: Starting job: first at <console>:20
...
13/03/31 00:57:30 INFO spark.SparkContext: Job finished: first at <console>:20, took 0.01272858 s
w: spark.util.Vector = (0.7290865701603526, 0.8009687428076777,
0.6136632797111822, 0.9783178194773176, 0.3719683631485643,
0.46409291255379836, 0.5340172959927323, 0.04034252433669905,
0.3074428389716637, 0.8537414030626244, 0.8415816118493813,
0.719935849109521, 0.2431646830671812, 0.17139348575456848,
0.5005137792223062, 0.8915164469396641, 0.7679331873447098,
0.7887571495335223, 0.7263187438977023, 0.40877063468941244,
0.7794519914671199, 0.1651264689613885, 0.1807006937030201,
0.3227972103818231, 0.2777324549716147, 0.20466985600105037,
0.5823059390134582, 0.4489508737465665, 0.44030858771499415,
0.6419366305419459, 0.5191533842209496, 0.43170678028084863,
0.9237523536173182, 0.5175019655845213, 0.47999523211827544,
0.25862648071479444, 0.020548000101787922, 0.18555332739714137, 0....
scala> val iterations = 100
iterations: Int = 100
scala> import scala.math._
scala> for (i <- 1 to iterations) {
         val gradient = points.map(p =>
           (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
         ).reduce(_ + _)
         w -= gradient
       }
[....]
scala> w
res27: spark.util.Vector = (0.2912515190246098, 1.05257972144256,
1.1620192443948825, 0.764385365541841, 1.3340446477767611,
0.6142105091995632, 0.8561985593740342, 0.7221556020229336,
0.40692442223198366, 0.8025693176035453, 0.7013618380649754,
0.943828424041885, 0.4009868306348856, 0.6287356973527756,
0.3675755379524898, 1.2488466496117185, 0.8557220216380228,
0.7633511642942988, 6.389181646047163, 1.43344096405385,
1.729216408954399, 0.4079709812689015, 0.3706358251228279,
0.8683036382227542, 0.36992902312625897, 0.3918455398419239,
0.2840295056632881, 0.7757126171768894, 0.4564171647415838,
0.6960856181900357, 0.6556402580635656, 0.060307680034745986,
0.31278587054264356, 0.9273189009376189, 0.0538302050535121,
0.545536066902774, 0.9298009485403773, 0.922750704590723,
0.072339496591

If things went well, you just used Spark to run logistic regression. Awesome! We have just done a number of things: we have defined a class, we have created an RDD, and we have also created a function. As you can see, the Spark shell is quite powerful. Much of the power comes from it being based on the Scala REPL (Read-Evaluate-Print Loop), the Scala interactive shell, so it inherits all the power of the Scala REPL. That being said, most of the time you will probably want to work with more traditionally compiled code rather than working in the REPL environment.

Summary

In this article, you have learned how to load data and how to use Spark to run logistic regression.
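For readers who want to follow the arithmetic of the training loop without a cluster, the update it performs can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Spark code: gradient, train, and the toy points are hypothetical, labels are assumed to be -1 or +1 (the convention this gradient formula is usually written for), and the weights start at zero rather than at random values.

```python
import math
from typing import List, Tuple

Point = Tuple[List[float], float]  # (features x, label y in {-1.0, +1.0})


def gradient(w: List[float], points: List[Point]) -> List[float]:
    """Sum over all points of (1 / (1 + exp(-y * w.x)) - 1) * y * x."""
    total = [0.0] * len(w)
    for x, y in points:
        dot = sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
        for i, xi in enumerate(x):
            total[i] += scale * xi
    return total


def train(points: List[Point], iterations: int = 100) -> List[float]:
    """Repeat w -= gradient, mirroring the loop in the shell session."""
    w = [0.0] * len(points[0][0])
    for _ in range(iterations):
        g = gradient(w, points)
        w = [wi - gi for wi, gi in zip(w, g)]
    return w


# Hypothetical, linearly separable toy data (not taken from spam.data).
toy = [([1.0, 2.0], 1.0), ([2.0, 1.5], 1.0),
       ([-1.0, -2.0], -1.0), ([-2.0, -1.0], -1.0)]
w = train(toy)
print(w)
```

The map and reduce calls in the Scala version compute the same per-point term and sum; Spark simply distributes that sum across the cluster.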

Packt
19 Sep 2013
7 min read

Executing PDI jobs from a filesystem (Simple)

Getting ready

To get ready for this article, we first need to check that our Java environment is configured properly; to do this, check that the JAVA_HOME environment variable is set. Even though all the PDI scripts, when started, call other scripts that try to detect our Java execution environment and the value of the JAVA_HOME variable, it is always a good rule of thumb to have that variable set properly anytime we work with a Java application.

The Kitchen script is in the PDI home directory, so the easiest way to launch it is to add the path to the PDI home directory to the PATH variable. This gives you the ability to start the Kitchen script from any place without specifying the absolute path to the Kitchen file location. If you do not do this, you will always have to specify the complete path to the Kitchen script file.

To play with this article, we will use the samples in the directory <book_samples>/sample1; here, <book_samples> is the directory where you unpacked all the samples of the article.

How to do it…

For starting a PDI job in Linux or Mac, use the following steps:

Open the command-line terminal and go to the <book_samples>/sample1 directory.

Let's start the sample job. To identify which job file needs to be started by Kitchen, we need to use the -file argument with the following syntax:

-file:<complete_filename_to_job_file>

Remember to specify either an absolute path or a relative path by properly setting the correct path to the file.
The simplest way to start the job is with the following syntax:

$ kitchen.sh -file:./export-job.kjb

If you're not positioned locally in the directory where the job files are located, you must specify the complete path to the job file as follows:

$ kitchen.sh -file:/home/sramazzina/tmp/samples/export-job.kjb

Another option to start our job is to separately specify the name of the directory where the job file is located and then give the name of the job file. To do this, we need to use the -dir argument together with the -file argument. The -dir argument lets you specify the location of the job file directory using the following syntax:

-dir:<complete_path_to_job_file_directory>

So, if we're located in the same directory where the job resides, to start the job, we can use the following new syntax:

$ kitchen.sh -dir:. -file:export-job.kjb

If we're starting the job from a directory other than the one where the job resides, we can use the absolute path and the -dir argument to set the job's directory as follows:

$ kitchen.sh -dir:/home/sramazzina/tmp/samples -file:export-job.kjb

For starting a PDI job with parameters in Linux or Mac, perform the following steps:

Normally, PDI manages input parameters for the executing job. To set parameters using the command-line script, we need to use a proper argument. We use the -param argument to specify the parameters for the job we are going to launch. The syntax is as follows:

-param:<parameter_name>=<parameter_value>

Our sample job and transformation accept a sample parameter called p_country that specifies the name of the country whose customers we want to export to a file. Let's suppose we are positioned in the same directory where the job file resides and we want to call our job to extract all the customers for the country U.S.A.
In this case, we can call the Kitchen script using the following syntax:

$ kitchen.sh -param:p_country=USA -file:./export-job.kjb

Of course, you can apply the -param switch to all the other three cases we detailed previously.

For starting a PDI job in Windows, use the following steps:

In Windows, a PDI job from the filesystem can be started by following the same rules that we saw previously, using the same arguments in the same way. The only difference is in the way we specify the command-line arguments. Any time we start the PDI jobs from Windows, we need to specify the arguments using the / character instead of the - character we used for Linux or Mac. This means that:

-file:<complete_filename_to_job_file>

Will become:

/file:<complete_filename_to_job_file>

And:

-dir:<complete_path_to_job_file_directory>

Will become:

/dir:<complete_path_to_job_file_directory>

From the directory <book_samples>/sample1, if you want to start the job, you can run the Kitchen script using the following syntax:

C:\temp\samples>Kitchen.bat /file:./export-job.kjb

Regarding the use of PDI parameters in command-line arguments, the second important difference on Windows is that we need to substitute the = character in the parameter assignment syntax with the : character. This means that:

-param:<parameter_name>=<parameter_value>

Will become:

/param:<parameter_name>:<parameter_value>

From the directory <book_samples>/sample1, if you want to extract all the customers for the country U.S.A., you can start the job using the following syntax:

C:\temp\samples>Kitchen.bat /param:p_country:USA /file:./export-job.kjb

For starting the PDI transformations, perform the following steps:

The Pan script starts PDI transformations. On Linux or Mac, you can find the pan.sh script in the PDI home directory.
Assuming that you are in the same directory, <book_samples>/sample1, where the transformation is located, you can start a simple transformation with a command in the following way:

$ pan.sh -file:./read-customers.ktr

If you want to start a transformation by specifying some parameters, you can use the following command:

$ pan.sh -param:p_country=USA -file:./read-customers.ktr

In Windows, you can use the Pan.bat script, and the sample commands will be as follows:

C:\temp\samples>Pan.bat /file:./read-customers.ktr

Again, if you want to start a transformation by specifying some parameters, you can use the following command:

C:\temp\samples>Pan.bat /param:p_country=USA /file:./read-customers.ktr

Summary

In this article, you were guided through starting a PDI job using the Kitchen script. In this case, the PDI job we started was stored locally in the computer filesystem, but it could be anywhere in the network, in any place that is directly accessible. You learned how to start simple jobs both with and without a set of input parameters previously defined in the job. Using command-line scripts is a fast way to start batches, and it is also the easiest way to schedule our jobs using our operating system's scheduler. The script accepts a set of inline arguments to pass the proper options required by the program to run our job in any specific situation.
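As a recap of the argument rules described in this article, the following Python sketch assembles a Kitchen or Pan command line for either platform. It is an illustration only: build_pdi_command is a hypothetical helper, not part of PDI, and it simply encodes the conventions stated in the text (a - prefix and = in parameter assignments on Linux or Mac, a / prefix and : in parameter assignments on Windows).

```python
from typing import Dict, List, Optional


def build_pdi_command(script: str, job_file: str,
                      params: Optional[Dict[str, str]] = None,
                      directory: Optional[str] = None,
                      windows: bool = False) -> List[str]:
    """Return the argument list for a Kitchen or Pan invocation."""
    prefix = "/" if windows else "-"   # argument marker, per the article
    assign = ":" if windows else "="   # parameter assignment, per the article
    args = [script]
    for name, value in (params or {}).items():
        args.append(f"{prefix}param:{name}{assign}{value}")
    if directory is not None:
        args.append(f"{prefix}dir:{directory}")
    args.append(f"{prefix}file:{job_file}")
    return args


print(" ".join(build_pdi_command("kitchen.sh", "./export-job.kjb",
                                 {"p_country": "USA"})))
print(" ".join(build_pdi_command("Kitchen.bat", "./export-job.kjb",
                                 {"p_country": "USA"}, windows=True)))
```

Returning a list rather than a single string means the result can be handed to subprocess.run without extra shell quoting, which is convenient when the job is launched from a scheduler script.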
Packt
01 Oct 2009
9 min read

Faceting in Solr 1.4 Enterprise Search Server

Faceting, after searching, is arguably the second-most valuable feature in Solr. It is perhaps even the most fun you'll have, because you will learn more about your data than with any other feature. Faceting enhances search results with aggregated information over all of the documents found in the search to answer questions such as the following, given a search on MusicBrainz releases:

  • How many are official, bootleg, or promotional?
  • What were the top five most common countries in which the releases occurred?
  • Over the past ten years, how many were released in each year?
  • How many have names in these ranges: A-C, D-F, G-I, and so on?
  • Given a track search, how many are < 2 minutes long, 2-3, 3-4, or more?

In addition, it can power term-suggest (aka auto-complete) functionality, which enables your search application to suggest a completed word as the user is typing, based on the most commonly occurring words starting with what they have already typed. So if a user started typing siamese dr, then Solr might suggest that dreams is the most likely word, along with other alternatives.

Faceting, sometimes referred to as faceted navigation, is usually used to power user interfaces that display this summary information with clickable links that apply Solr filter queries to a subsequent search.

If we revisit the comparison of search technology to databases, then faceting is more or less analogous to SQL's group by feature on a column with count(*). However, in Solr, facet processing is performed subsequent to an existing search as part of a single request-response, with both the primary search results and the faceting results coming back together. In SQL, you would potentially need to perform a series of separate queries to get the same information.

A quick example: Faceting release types

Observe the following search results.
echoParams is set to explicit (defined in solrconfig.xml) so that the search parameters are seen here. This example is using the standard handler (though perhaps dismax is more typical). The query parameter q is *:*, which matches all documents. In this case, the index I'm using only has releases. If there were non-releases in the index, then I would add a filter fq=type%3ARelease to the URL or put this in the handler configuration, as that is the data set we'll be using for most of this article. I wanted to keep this example brief so I set rows to 2. Sometimes when using faceting, you only want the facet information and not the main search, in which case you would set rows to 0. It's important to understand that the faceting numbers are computed over the entire search result, which is all of the releases in this example, and not just the two rows being returned.

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">160</int>
  <lst name="params">
    <str name="wt">standard</str>
    <str name="rows">2</str>
    <str name="facet">true</str>
    <str name="q">*:*</str>
    <str name="fl">*,score</str>
    <str name="qt">standard</str>
    <str name="facet.field">r_official</str>
    <str name="f.r_official.facet.missing">true</str>
    <str name="f.r_official.facet.method">enum</str>
    <str name="indent">on</str>
  </lst>
</lst>
<result name="response" numFound="603090" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <str name="id">Release:136192</str>
    <str name="r_a_id">3143</str>
    <str name="r_a_name">Janis Joplin</str>
    <arr name="r_attributes"><int>0</int><int>9</int><int>100</int></arr>
    <str name="r_name">Texas International Pop Festival 11-30-69</str>
    <int name="r_tracks">7</int>
    <str name="type">Release</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="id">Release:133202</str>
    <str name="r_a_id">6774</str>
    <str name="r_a_name">The Dubliners</str>
    <arr name="r_attributes"><int>0</int></arr>
    <str name="r_lang">English</str>
    <str name="r_name">40 Jahre</str>
    <int name="r_tracks">20</int>
    <str name="type">Release</str>
  </doc>
</result>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="r_official">
      <int name="Official">519168</int>
      <int name="Bootleg">19559</int>
      <int name="Promotion">16562</int>
      <int name="Pseudo-Release">2819</int>
      <int>44982</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
</lst>
</response>

The facet-related search parameters appear at the top, in the params list. The facet.missing parameter was set using the field-specific syntax, which will be explained shortly. Notice that the facet results follow the main search result and are given the name facet_counts. In this example, we only faceted on one field, r_official, but you'll learn in a bit that you can facet on as many fields as you desire. The name attribute holds a facet value, which is simply an indexed term, and the integer following it is the number of documents in the search results containing that term, aka a facet count. The next section gives us an explanation of where r_official and r_type came from.

MusicBrainz schema changes

In order to get more self-explanatory faceting results out of the r_attributes field and to split its dual meaning, I modified the schema and added some text analysis. r_attributes is an array of numeric constants, which signify various types of releases and their official-ness, for lack of a better word. As it represents two different things, I created two new fields, r_type and r_official, with copyField directives to copy r_attributes into them:

<field name="r_attributes" type="integer" multiValued="true" indexed="false" /><!-- ex: 0, 1, 100 -->
<field name="r_type" type="rType" multiValued="true" stored="false" /><!-- Album | Single | EP |... etc. -->
<field name="r_official" type="rOfficial" multiValued="true" stored="false" /><!-- Official | Bootleg | Promotional -->

And:

<copyField source="r_attributes" dest="r_type" />
<copyField source="r_attributes" dest="r_official" />

In order to map the constants to human-readable definitions, I created two field types, rType and rOfficial, that use a regular expression to pull out the desired numbers and a synonym list to map from the constant to the human-readable definition. Conveniently, the constants for r_type are in the range 1-11, whereas those for r_official are 100-103. I removed the constant 0, as it seemed to be bogus.

<fieldType name="rType" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(0|1\d\d)$" replacement="" replace="first" />
    <filter class="solr.LengthFilterFactory" min="1" max="100" />
    <filter class="solr.SynonymFilterFactory" synonyms="mb_attributes.txt" ignoreCase="false" expand="false"/>
  </analyzer>
</fieldType>

The definition of the type rOfficial is the same as rType, except it has this regular expression: ^(0|\d\d?)$. The presence of LengthFilterFactory is to ensure that no zero-length (empty-string) terms get indexed. Otherwise, this would happen because the previous regular expression reduces text fitting unwanted patterns to empty strings.
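The effect of this analysis chain can be approximated in a few lines of Python. This is a rough sketch rather than Solr's implementation: analyze is a hypothetical helper, the two regular expressions are taken to be ^(0|1\d\d)$ and ^(0|\d\d?)$ (an assumption about the intended patterns), and the synonym table holds only a handful of the entries from mb_attributes.txt.

```python
import re
from typing import Dict, Iterable, List

# Assumed patterns: the constants each field type should discard.
R_TYPE_DROP = re.compile(r"^(0|1\d\d)$")      # drops 0 and 100-199
R_OFFICIAL_DROP = re.compile(r"^(0|\d\d?)$")  # drops 0-99

# A few illustrative constant-to-name mappings (subset of mb_attributes.txt).
SYNONYMS: Dict[str, str] = {
    "1": "Album", "9": "Live", "100": "Official", "102": "Bootleg",
}


def analyze(values: Iterable[object], drop) -> List[str]:
    """Mimic keyword tokenizer -> pattern replace -> length filter -> synonyms."""
    terms = []
    for value in values:
        token = str(value)                    # whole value is a single token
        token = drop.sub("", token, count=1)  # unwanted constants become ""
        if not 1 <= len(token) <= 100:        # empty tokens are not indexed
            continue
        terms.append(SYNONYMS.get(token, token))  # names pass through unchanged
    return terms


print(analyze([0, 9, 100], R_TYPE_DROP))      # ['Live']
print(analyze([0, 9, 100], R_OFFICIAL_DROP))  # ['Official']
```

Feeding the r_attributes values of the first document in the example response (0, 9, 100) through both chains yields one release type and one official-ness value, which is the split the two copied fields are meant to achieve.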
The content of mb_attributes.txt is as follows:

# from: http://bugs.musicbrainz.org/browser/mb_server/trunk/
# cgi-bin/MusicBrainz/Server/Release.pm#L48
#note: non-album track seems bogus; almost everything has it
0=>Non-Album\ Track
1=>Album
2=>Single
3=>EP
4=>Compilation
5=>Soundtrack
6=>Spokenword
7=>Interview
8=>Audiobook
9=>Live
10=>Remix
11=>Other
100=>Official
101=>Promotion
102=>Bootleg
103=>Pseudo-Release

It does not matter if the user interface uses the name (for example: Official) or constant (for example: 100) when applying filter queries when implementing faceted navigation, as the text analysis will let the names through and will map the constants to the names. This is not necessarily true in a general case, but it is for the text analysis as I've configured it above.

The approach I took was relatively simple, but it is not the only way to do it. Alternatively, I might have split the attributes and/or mapped them as part of the import process. This would allow me to remove the multiValued setting in r_official. Moreover, it wasn't truly necessary to map the numbers to their names, as a user interface that is going to present the data could very well map them on the fly.

Field requirements

The principal requirement of a field that will be faceted on is that it must be indexed. In addition, for all but the prefix faceting use case, you will also want to use text analysis that does not tokenize the text. For example, the value Non-Album Track is indexed the way it is in r_type. We need to be careful to escape the space where this appeared in mb_attributes.txt. Otherwise, faceting on this field would show tallies for Non-Album and Track separately. Depending on the type of faceting you want to do and other needs you have, like sorting, you will often find it necessary to have a copy of a field just for faceting.
Remember that with faceting, the facet values returned in search results are the actual terms indexed, and not the stored value, which isn't even used.
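To tie the request parameters from the quick example together, the following Python sketch builds a facet-only query URL. It is a hedged illustration: facet_query is a hypothetical helper, and the base URL is an assumed local Solr endpoint, not one given in this article. The parameter names (q, rows, facet, facet.field, and the field-specific facet.missing) are the ones echoed in the example response.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def facet_query(base_url: str, field: str, rows: int = 0) -> str:
    """Build a query URL that facets on one field over all documents."""
    params = [
        ("q", "*:*"),                 # match every document
        ("rows", str(rows)),          # 0 = facet counts only, no documents
        ("facet", "true"),
        ("facet.field", field),
        (f"f.{field}.facet.missing", "true"),  # also count docs with no value
    ]
    return f"{base_url}?{urlencode(params)}"


# Assumed endpoint for illustration only.
url = facet_query("http://localhost:8983/solr/select", "r_official")
print(url)
```

Setting rows to 0 follows the advice given earlier for the case where only the facet information is wanted.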

Packt
22 Apr 2016
12 min read

Getting Started with Apache Hadoop and Apache Spark

In this article by Venkat Ankam, author of the book, Big Data Analytics with Spark and Hadoop, we will understand the features of Hadoop and Spark and how we can combine them.

This article is divided into the following subtopics:

  • Introducing Apache Spark
  • Why Hadoop + Spark?

Introducing Apache Spark

Hadoop and MapReduce have been around for 10 years and have proven to be the best solution to process massive data with high performance. However, MapReduce lacked performance in iterative computing, where the output between multiple MapReduce jobs had to be written to the Hadoop Distributed File System (HDFS). Even within a single MapReduce job, it lacked performance because of the drawbacks of the MapReduce framework.

Let's take a look at the history of computing trends to understand how computing paradigms have changed over the last two decades. The trend was to Reference the URI when the network became cheaper (in 1990), Replicate when storage became cheaper (in 2000), and Recompute when memory became cheaper (in 2010), as shown in Figure 1:

Figure 1: Trends of computing

So, what really changed over a period of time? Tape is dead, disk has become tape, and SSD has almost become disk. Now, caching data in RAM is the current trend.

Let's understand why memory-based computing is important and how it provides significant performance benefits. Figure 2 shows the data transfer rates from various mediums to the CPU. Disk to CPU is 100 MB/s, SSD to CPU is 600 MB/s, and over a network to CPU is 1 MB/s to 1 GB/s. However, the RAM to CPU transfer speed is astonishingly fast at 10 GB/s. So, the idea is to cache all or part of the data in memory so that higher performance can be achieved.

Figure 2: Why memory?

Spark history

Spark started in 2009 as a research project in the UC Berkeley RAD Lab, which later became AMPLab.
The researchers in the lab had previously been working on Hadoop MapReduce and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas such as support for in-memory storage and efficient fault recovery.

In 2011, AMPLab started to develop high-level components in Spark, such as Shark and Spark Streaming. These components are sometimes referred to as the Berkeley Data Analytics Stack (BDAS).

Spark was first open sourced in March 2010 and transferred to the Apache Software Foundation in June 2013. In February 2014, it became a top-level project at the Apache Software Foundation. Spark has since become one of the largest open source communities in big data. Now, more than 250 contributors in more than 50 organizations are contributing to Spark development. The user base has increased tremendously, from small companies to Fortune 500 companies. Figure 3 shows the history of Apache Spark:

Figure 3: The history of Apache Spark

What is Apache Spark?

Let's understand what Apache Spark is and what makes it a force to reckon with in big data analytics:

  • Apache Spark is a fast, enterprise-grade, large-scale data processing engine, which is interoperable with Apache Hadoop.
  • It is written in Scala, which is both an object-oriented and functional programming language that runs in a JVM.
  • Spark enables applications to distribute data in-memory reliably during processing. This is the key to Spark's performance, as it allows applications to avoid expensive disk access and perform computations at memory speeds.
  • It is suitable for iterative algorithms by having every iteration access data through memory.
  • Spark programs perform up to 100 times faster than MapReduce in-memory or 10 times faster on the disk (http://spark.apache.org/).
  • It provides native support for Java, Scala, Python, and R languages with interactive shells for Scala, Python, and R. Applications can be developed easily and often 2 to 10 times less code is needed.
  • Spark powers a stack of libraries including Spark SQL and DataFrames for interactive analytics, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time analytics. You can combine these features seamlessly in the same application.
  • Spark runs on Hadoop, Mesos, standalone resource managers, on-premise hardware, or in the cloud.

What Apache Spark is not

Hadoop provides us with HDFS for storage and MapReduce for compute. However, Spark does not provide any specific storage medium. Spark is mainly a compute engine, but you can store data in-memory or on Tachyon to process it.

Spark has the ability to create distributed datasets from any file stored in HDFS or other storage systems supported by Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, Elasticsearch, and so on).

It's important to note that Spark is not Hadoop and does not require Hadoop to run. It simply has support for storage systems implementing Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.

Can Spark replace Hadoop?

Spark is designed to interoperate with Hadoop. It's not a replacement for Hadoop but for the MapReduce framework on Hadoop. Many of the Hadoop processing frameworks that use MapReduce as the engine (Sqoop, Hive, Pig, Mahout, Cascading, Crunch, and so on) now support, or are adding support for, Spark as an additional processing engine.

MapReduce issues

MapReduce developers faced challenges with respect to performance and converting every business problem to a MapReduce problem. Let's understand the issues related to MapReduce and how they are addressed in Apache Spark:

  • MapReduce (MR) creates separate JVMs for every Mapper and Reducer. Launching JVMs takes time.
  • MR code requires a significant amount of boilerplate coding. The programmer needs to think about and design every business problem in terms of Map and Reduce, which makes programming very difficult.
  • One MR job can rarely do a full computation. You need multiple MR jobs to finish the complete task, and the programmer needs to design and keep track of optimizations at all levels.
  • An MR job writes the data to the disk between each job and hence is not suitable for iterative processing.
  • A higher level of abstraction, such as Cascading and Scalding, provides better programming of MR jobs, but it does not provide any additional performance benefits.
  • MR does not provide great APIs either.

MapReduce is slow because every job in a MapReduce job flow stores data on the disk. Multiple queries on the same dataset will read the data separately and create high disk I/O, as shown in Figure 4:

Figure 4: MapReduce versus Apache Spark

Spark takes the concept of MapReduce to the next level by storing the intermediate data in-memory and reusing it, as needed, multiple times. This provides high performance at memory speeds, as shown in Figure 4.

If I have only one MapReduce job, does it perform the same as Spark?

No, the performance of the Spark job is superior to the MapReduce job because of in-memory computations and shuffle improvements. The performance of Spark is superior to MapReduce even when the memory cache is disabled. A new shuffle implementation (sort-based shuffle instead of hash-based shuffle), a new network module (based on netty instead of using the block manager to send shuffle data), and a new external shuffle service make Spark perform the fastest petabyte sort (on 190 nodes with 46 TB RAM) and terabyte sort. Spark sorted 100 TB of data using 206 EC2 i2.8xlarge machines in 23 minutes. The previous world record was 72 minutes, set by a Hadoop MapReduce cluster of 2,100 nodes. This means that Spark sorted the same data 3x faster using 10x fewer machines.
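The figures quoted so far can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in this article; the 100 GB workload and the helper name seconds_to_read are arbitrary choices for illustration, and decimal units (1 GB = 1000 MB) are assumed.

```python
# Sort benchmark figures quoted above.
mr_minutes, mr_nodes = 72, 2100
spark_minutes, spark_nodes = 23, 206

speedup = mr_minutes / spark_minutes   # how many times faster Spark finished
node_ratio = mr_nodes / spark_nodes    # how many times fewer machines it used

# Transfer rates to the CPU quoted for Figure 2, in MB/s.
rates_mb_per_s = {"disk": 100, "ssd": 600, "ram": 10_000}


def seconds_to_read(gigabytes: float, mb_per_s: float) -> float:
    """Time to stream a dataset at a given rate, assuming 1 GB = 1000 MB."""
    return gigabytes * 1000 / mb_per_s


print(f"speedup: {speedup:.1f}x, machines: {node_ratio:.1f}x fewer")
for medium, rate in rates_mb_per_s.items():
    print(f"{medium}: {seconds_to_read(100, rate):.1f} s to read 100 GB")
```

At these rates, the same 100 GB takes 1000 seconds from disk but 10 seconds from RAM, which is the gap that in-memory caching is meant to exploit.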
All the sorting took place on the disk (HDFS) without using Spark's in-memory cache (https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html).

To summarize, here are the differences between MapReduce and Spark:

Feature | MapReduce | Spark
Ease of use | Not easy to code and use | Provides a good API and is easy to code and use
Performance | Relatively poor when compared with Spark | In-memory performance
Iterative processing | Every MR job writes the data to the disk and the next iteration reads from the disk | Caches data in-memory
Fault tolerance | Achieved by replicating the data in HDFS | Achieved by resilient distributed dataset (RDD) lineage
Runtime architecture | Every Mapper and Reducer runs in a separate JVM | Tasks are run in a preallocated executor JVM
Shuffle | Stores data on the disk | Stores data in-memory and on the disk
Operations | Map and Reduce | Map, Reduce, Join, Cogroup, and many more
Execution model | Batch | Batch, interactive, and streaming
Natively supported programming languages | Java | Java, Scala, Python, and R

Spark's stack

Spark's stack components are Spark Core, Spark SQL and DataFrames, Spark Streaming, MLlib, and GraphX, as shown in Figure 5:

Figure 5: The Apache Spark ecosystem

Here is a comparison of Spark components versus Hadoop components:

Spark | Hadoop
Spark Core | MapReduce, Apache Tez
Spark SQL and DataFrames | Apache Hive, Impala, Apache Tez, Apache Drill
Spark Streaming | Apache Storm
Spark MLlib | Apache Mahout
Spark GraphX | Apache Giraph

To understand the framework at a higher level, let's take a look at these core components of Spark and their integrations:

Feature | Details
Programming languages | Java, Scala, Python, and R. Scala, Python, and R shell for quick development.
Core execution engine | Spark Core: Spark Core is the underlying general execution engine for the Spark platform and all the other functionality is built on top of it. It provides Java, Scala, Python, and R APIs for the ease of development.
Tungsten: This provides memory management and binary processing, cache-aware computation, and code generation.
Frameworks | Spark SQL and DataFrames: Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. Spark Streaming: Spark Streaming enables us to build scalable and fault-tolerant streaming applications. It integrates with a wide variety of data sources, including filesystems, HDFS, Flume, Kafka, and Twitter. MLlib: MLlib is a machine learning library to create data products or extract deep meaning from the data. MLlib provides high performance because of in-memory caching of data. GraphX: GraphX is a graph computation engine with graph algorithms to build graph applications.
Off-heap storage | Tachyon: This provides reliable data sharing at memory speed within and across cluster frameworks/jobs. Spark's default OFF_HEAP (experimental) storage is Tachyon.
Cluster resource managers | Standalone: By default, applications are submitted to the standalone mode cluster and each application will try to use all the available nodes and resources. YARN: YARN controls the resource allocation and provides dynamic resource allocation capabilities. Mesos: Mesos has two modes, coarse-grained and fine-grained. The coarse-grained approach has a static number of resources just like the standalone resource manager. The fine-grained approach has dynamic resource allocation just like YARN.
Storage | HDFS, S3, and other filesystems with the support of Hadoop InputFormat.
Database integrations | HBase, Cassandra, MongoDB, Neo4j, and RDBMS databases.
Integrations with streaming sources | Flume, Kafka and Kinesis, Twitter, ZeroMQ, and file streams.
Packages | http://spark-packages.org/ provides a list of third-party data source APIs and packages.
Distributions | Distributions from Cloudera, Hortonworks, MapR, and DataStax.
The Spark ecosystem is a unified stack that provides you with the power of combining SQL, streaming, and machine learning in one program. The advantages of unification are as follows:

  • No need to copy or ETL data between systems
  • Combines processing types in one program
  • Code reuse
  • One system to learn
  • One system to maintain

An example of unification is shown in Figure 6:

Figure 6: Unification of the Apache Spark ecosystem

Why Hadoop + Spark?

Apache Spark works best when it is combined with Hadoop. To understand this, let's take a look at Hadoop and Spark features.

Hadoop features

The Hadoop features are described as follows:

Feature | Details
Unlimited scalability | Stores unlimited data by scaling out HDFS; effectively manages the cluster resources with YARN; runs multiple applications along with Spark; thousands of simultaneous users
Enterprise grade | Provides security with Kerberos authentication and ACLs authorization; data encryption; high reliability and integrity; multitenancy
Wide range of applications | Files: structured, semi-structured, or unstructured; streaming sources: Flume and Kafka; databases: any RDBMS and NoSQL database

Spark features

The Spark features are described as follows:

Feature | Details
Easy development | No boilerplate coding; multiple native APIs: Java, Scala, Python, and R; REPL for Scala, Python, and R
In-memory performance | RDDs; Directed Acyclic Graph (DAG) to unify processing
Unification | Batch, SQL, machine learning, streaming, and graph processing

When both frameworks are combined, we get the power of enterprise-grade applications with in-memory performance, as shown in Figure 7:

Figure 7: Spark applications on the Hadoop platform

Frequently asked questions about Spark

The following are the frequent questions that practitioners raise about Spark:

My dataset does not fit in-memory. How can I use Spark?

Spark's operators spill data to the disk if it does not fit in-memory, allowing it to run well on data of any size.
Likewise, cached datasets that do not fit in-memory are either spilled to the disk or recomputed on the fly when needed, as determined by the RDD's storage level. By default, Spark will recompute the partitions that don't fit in-memory. The storage level can be changed to MEMORY_AND_DISK to spill partitions to the disk. Figure 8 shows the performance difference in fully cached versus on the disk:Figure 8: Spark performance: Fully cached versus on the disk How does fault recovery work in Spark? Spark's in-built fault tolerance based on RDD lineage will automatically recover from failures. Figure 9 shows the performance over failure in the 6th iteration in a k-means algorithm: Figure 9: Fault recovery performance Summary In this article, we saw an introduction to Apache Spark and the features of Hadoop and Spark and discussed how we can combine them together. Resources for Article: Further resources on this subject: Adding a Spark to R[article] Big Data Analytics[article] Big Data Analysis (R and Hadoop)[article]
Packt
17 Oct 2011
10 min read

Tips and Tricks: Report Page in IBM Cognos 8 Report Studio

(Read more interesting articles on IBM Cognos here.)

Showing images dynamically (Traffic Light report)

Let us suppose that we have a report which shows month-on-month difference in sales quantity.

Getting ready

Please note that you will need administrator rights on the Cognos server to complete this recipe. If the server is installed on your personal machine, you will have these rights by default.

How to do it...

1. For this recipe, we need to first create three icons or images for red, yellow, and green. They should be already available on the Cognos server under the <Cognos Installation>\webcontent\samples\images folder. If not, then create them using any image editor software or use the images supplied with this book.
2. Once you have the three images which you need to conditionally show on the report, place them on the Cognos server under the <Cognos Installation>\webcontent\samples\images folder. (If the folder is not there, create one.)
3. Change the IIS security to allow 'Anonymous Read and Browse' access.
4. Now open the report that shows the month-on-month running differences.
5. Insert a new 'image' from the insertable objects pane on the list report, as a new column.
6. Now go to Condition Explorer and create a new string variable. Define the expression as:
   if ([Query1].[Running Difference] > 0)
   then ('green')
   else if ([Query1].[Running Difference] < 0)
   then ('red')
   else ('yellow')
7. Call the variable Traffic and define three possible values for it (red, yellow, and green).
8. Now go back to the report page. Select the image. Open its URL Source Variable dialog. Choose the variable Traffic and click OK.
9. From Condition Explorer, choose the 'red' condition. Now click on the image again. It will allow you to define the image URL for this condition. Set the URL to: ../samples/images/Red.jpg
10. Similarly, define the URL for the 'yellow' and 'green' conditions as ../samples/images/yellow.jpg and ../samples/images/green.jpg respectively.
11. Run the report to test it.

How it works...
Cognos Report Studio allows you to put the images in the report by specifying the URL of the image. The images can be anywhere on the intranet or internet. They will be displayed properly as long as the URL is accessible from Cognos application server and gateway. In this recipe, we are using a report which already calculates the Running Difference. Hence, we just had to define conditional variable to trap different possible conditions. The Image component allows us to define the URL for different conditions by attaching it to the Traffic variable in step 8. There's more... In this case, though the URL of the image changes dynamically, it is not truly 100% dynamic. There are three static URLs already defined in the report and one is picked up depending on the condition. We can also use a data item or report expression as source of the URL value. In that case, it will be totally dynamic, and based on the values coming from database; Cognos will work out the URL of the image and display it correctly. This is useful when the image filenames and locations are stored in the database. For example, Product Catalogue kind of reports. More info This recipe works fine in HTML, PDF, and Excel formats.We have used relative URLs for the images, so that report can be easily deployed to other environments where Cognos installation might be in a different location. However, we need to ensure that the images are copied in all environments in the folder mentioned in step 2. Handling missing image issue In the previous recipe, we saw how to add images to the report. You will be using that technique in many cases, some involving hundreds of images (For example, Product Catalogue).There will often be a case in which database has a URL or image name, whereas the corresponding image is either missing or inaccessible. In such a case, the web browser shows an error symbol. This looks quite ugly and needs to be handled properly. 
In this recipe, we will see how to handle this problem gracefully.

Getting ready

We will use the report prepared in the previous recipe. We need to delete the Green.jpg file (or rename it to something else) from the server, in order to create the missing image scenario.

How to do it...

In the previous recipe, we added an image object and defined its conditional URLs. We need to replace that image with an HTML Item. For that, unlock the report objects and delete the image component. Add an HTML Item in the same column.

Select this HTML item and from the Properties pane, set its HTML Source Variable to 'Traffic'. (Please note that we already have this conditional variable from the last recipe.)

Now define the HTML for the different conditions. Start with 'red'. Choose 'red' from Condition Explorer and define the HTML as an <img> tag of the same form as the two that follow, pointing at the red image.

For 'yellow', define the HTML as:
<img src="../samples/images/yellow.jpg" alt="No Change" onError="img2txt(this)"/>

For 'green', define the HTML as:
<img src="../samples/images/green.jpg" alt="Upsell" onError="img2txt(this)"/>

Now go back to the No Variable state by double clicking on the green bar, and add another HTML item on the report. Put it just before the list. Define this HTML as:
<script>
function img2txt(img) {
  var txt = img.alt;
  img.parentNode.innerHTML = txt;
}
</script>

Now run the report to test it. As you can see, if the image is missing, the report will now handle it gracefully and show some text instead of an error image.

How it works...

Here we are using our custom code to display the image, instead of using CRS's in-built Image component. We have pulled an HTML item onto the report and defined it to display different images depending on the condition using the <img> tag. This tag allows us to define an alternative text and an onError event as well. We are using the onError event to call our custom-made JavaScript function called img2txt. This function replaces the HTML item with the text which was originally defined as 'alternative text'.
Hence, if green.jpg is missing, this function will replace it with a text item Upsell. There's more... As we are using HTML code and JavaScript in this technique, it works in HTML format only. This technique will be useful for a lot of graphical reports (dashboards, scorecards, online product catalogues, and so on). Dynamic links to external website (Google Map example) In this recipe, we will introduce you to the 'Hyperlink' component. A report shows retailers information by products. It shows various fields like Retailer name, Contact information, City, and Postal zone. Business wants to have a link to Google maps that will show retailer's place on the map using the Postal zone information. As the addresses might change in the backend, the technique needs to be dynamic to pick up the latest postal zone. Getting ready Create a simple list report that shows Retailers information by Product lines. How to do it... From the 'Insertable Objects' toolbox, drag a hyperlink object onto the report as a new column. Change its Text property to Map. Set the URL Source Type to Report Expression and define the report expression as follows: 'http://maps.google.com/maps?q=' + [Query1].[City] Run the report to test it. As you can see, there is a link for each retailer record. If you Shift+click on the link, it will open the Google map for corresponding postal zone in a new window. How it works... Here we are using the 'Hyperlink' component of CRS. We can define the URL as any static link. However, for our requirements, we have defined a report expression. This allows us to provide a dynamic link which picks up the latest postal zone from the database. We are passing the postal zone to Google Maps as part of a URL. The hyperlink component works in HTML as well as Excel and PDF formats of report. This object currently does not have the property to define whether the link target should open in a new window or the same window. 
Just clicking on the link, opens the target in same window; whereas Shift+click opens in a new window. There's more... You can use this technique to call any external website that accepts parameters within a URL. You can pass multiple parameters too.   Alternating drill link In this recipe, we will learn about a limitation of drill link and overcoming it using Render Variable. There is a crosstab report which shows sales quantity by month and order method. We need to provide drill-through facility from the intersection. However, the drill-through target needs to be different, depending on the order method. If order method is e-mail, the drill-through from intersection should go to a report called 'Alternating Drill Link—Drill Report 2'. For all other order methods, it should go to 'Alternating Drill Link—Drill Report 1'. Getting ready Create a crosstab report to serve as the main report. Drag Month (shipment) on rows, Order method on columns and Sales Quantity on the intersection. Create two list reports to serve as drill reports. In the sample provided with this book, we have used two list reports for this. One accepts the Order method and Month. The other accepts only month and is designed to work for the order method 'E-mail'. How to do it... Create a drill through to first drill the report from the crosstab intersection. Now make sure that the report objects are unlocked. Select the intersection text item (which now looks like hyperlink as there is already a drill-through defined). Hold the Ctrl key down and drag the text to its right within a cell. This should create a copy of the text item within that cell and will look like the following: Now select this copy of the text item. Hit the drill-through button to open definitions. Delete the existing drill-through to first report. Create a new drill to a second report. So, now we have two text items in the cell, each going to different drill reports. Create a string type of Conditional Variable. 
Define it as:
if ([Query1].[Order method] = 'E-mail') then ('E-mail')
else ('Other')
Call it OrderMethod and define the two values to be E-mail and Other. Now go back to the report page. Select the first text item from the intersection. Open its Render Variable property. Choose the OrderMethod variable and select to Render for: Other. Similarly, define Render Variable for the second text item, but choose to Render for: E-mail. Run the report to test it. You will see that clicking on the intersection numbers opens the first drill report for any order method other than E-mail, whereas for the numbers under E-mail, the second drill report opens.
Packt
15 Apr 2015
29 min read

Visualization

Humans are visual creatures and have evolved to quickly notice meaning when information is presented in ways that make the light bulb of insight turn on. This "aha" can often be reached very quickly, given the correct tools, instead of through tedious numerical analysis. Tools for data analysis, such as pandas, let the user quickly and iteratively take data, process it, and visualize its meaning. Often, much of what you will do with pandas is massaging your data to be able to visualize it in one or more visual patterns, in an attempt to get to "aha" by simply glancing at the visual representation of the information.

In this article by Michael Heydt, author of the book Learning pandas, we will cover common patterns in visualizing data with pandas. It is not meant to be exhaustive in coverage. The goal is to give you the required knowledge to create beautiful data visualizations on pandas data quickly and with very few lines of code.

(For more resources related to this topic, see here.)

This article is presented in three sections. The first introduces you to the general concepts of programming visualizations with pandas, emphasizing the process of creating time-series charts. We will also dive into techniques to label axes and create legends, colors, line styles, and markers. The second part of the article will then focus on the many types of data visualizations commonly used in pandas programs and data sciences, including:

  • Bar plots
  • Histograms
  • Box and whisker charts
  • Area plots
  • Scatter plots
  • Density plots
  • Scatter plot matrixes
  • Heatmaps

The final section will briefly look at creating composite plots by dividing plots into subparts and drawing multiple plots within a single graphical canvas.

Setting up the IPython notebook

The first step in plotting pandas data is to include the appropriate libraries, primarily matplotlib.
The examples in this article will all be based on the following imports, where the plotting capabilities are from matplotlib, which will be aliased with plt:

In [1]:
# import pandas, numpy and datetime
import numpy as np
import pandas as pd
# needed for representing dates and times
import datetime
from datetime import datetime
# Set some pandas options for controlling output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
# used for seeding random number sequences
seedval = 111111
# matplotlib
import matplotlib as mpl
# matplotlib plotting functions
import matplotlib.pyplot as plt
# we want our plots inline
%matplotlib inline

The %matplotlib inline line is the statement that tells matplotlib to produce inline graphics. This will make the resulting graphs appear either inside your IPython notebook or IPython session. All examples will seed the random number generator with 111111, so that the graphs remain the same every time they run.

Plotting basics with pandas

The pandas library itself performs data manipulation. It does not provide data visualization capabilities itself. The visualization of data in pandas data structures is handed off by pandas to other robust visualization libraries that are part of the Python ecosystem, most commonly matplotlib, which is what we will use in this article. All of the visualizations and techniques covered in this article can be performed without pandas. These techniques are all available independently in matplotlib. pandas tightly integrates with matplotlib, and by doing this, it is very simple to go directly from pandas data to a matplotlib visualization without having to work with intermediate forms of data. pandas does not draw the graphs, but it will tell matplotlib how to draw graphs using pandas data, taking care of many details on your behalf, such as automatically selecting Series for plots, labeling axes, creating legends, and defaulting color.
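As a rough illustration of that hand-off (a sketch, not code from the book), the object returned by .plot() is an ordinary matplotlib Axes, and the same line can be drawn by calling matplotlib directly. The Agg backend, the ten-point sample Series, and the variable names are assumptions chosen so that the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no notebook is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# a small Series with a date index (illustrative data only)
np.random.seed(111111)
s = pd.Series(np.random.randn(10).cumsum(),
              index=pd.date_range('2012-01-01', periods=10))

# pandas route: .plot() sets up the chart and hands back the matplotlib Axes
ax_pandas = s.plot()

# matplotlib route: the same data, passed explicitly as x and y values
fig, ax_mpl = plt.subplots()
ax_mpl.plot(s.index, s.values)

# both routes end up with a matplotlib Axes holding one line with the same y data
same_y = np.allclose(ax_pandas.lines[0].get_ydata(),
                     ax_mpl.lines[0].get_ydata())
```

The pandas call is shorter because it picks the x values from the index and formats the date axis for you; the matplotlib call leaves those choices to the caller.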
Therefore, you often have to write very little code to create stunning visualizations.

Creating time-series charts with .plot()

One of the most common data visualizations is of time-series data. Visualizing a time series in pandas is as simple as calling .plot() on a DataFrame or Series object. To demonstrate, the following creates a time series representing a random walk of values over time, akin to the movements in the price of a stock:

In [2]:
# generate a random walk time-series
np.random.seed(seedval)
s = pd.Series(np.random.randn(1096),
              index=pd.date_range('2012-01-01', '2014-12-31'))
walk_ts = s.cumsum()
# this plots the walk - just that easy :)
walk_ts.plot();

The ; character at the end suppresses the generation of an IPython out tag, as well as the trace information. It is a common practice to execute the following statement to produce plots that have a richer visual style. This sets a pandas option that makes resulting plots have a shaded background and what is considered a slightly more pleasing style:

In [3]:
# tells pandas plots to use a default style
# which has a background fill
# (this option was dropped in later pandas releases; there,
# plt.style.use('ggplot') gives a comparable look)
pd.options.display.mpl_style = 'default'
walk_ts.plot();

The .plot() method on pandas objects is a wrapper function around the matplotlib library's plot() function. It makes plots of pandas data very easy to create. It is coded to know how to use the data in the pandas objects to create the appropriate plots for the data, handling many of the details of plot generation, such as selecting series, labeling, and axes generation. In this situation, the .plot() method determines that, because the Series has dates for its index, the x axis should be formatted as dates, and it selects a default color for the data. This example used a single series, and the result would be the same using a DataFrame with a single column. As an example, the following produces the same graph with one small difference.
It adds a legend to the graph: charts generated from a DataFrame object have a legend by default, even if there is only one series of data:

In [4]:
# a DataFrame with a single column will produce
# the same plot as plotting the Series it is created from
walk_df = pd.DataFrame(walk_ts)
walk_df.plot();

The .plot() function is smart enough to know whether a DataFrame has multiple columns, in which case it creates multiple lines/series in the plot, includes a key for each, and selects a distinct color for each line. This is demonstrated with the following example:

In [5]:
# generate two random walks, one in each of
# two columns in a DataFrame
np.random.seed(seedval)
df = pd.DataFrame(np.random.randn(1096, 2),
                  index=walk_ts.index, columns=list('AB'))
walk_df = df.cumsum()
walk_df.head()

Out [5]:
                   A         B
2012-01-01 -1.878324  1.362367
2012-01-02 -2.804186  1.427261
2012-01-03 -3.241758  3.165368
2012-01-04 -2.750550  3.332685
2012-01-05 -1.620667  2.930017

In [6]:
# plot the DataFrame, which will plot a line
# for each column, with a legend
walk_df.plot();

If you want to use one column of the DataFrame as the labels on the x axis of the plot instead of the index labels, you can use the x and y parameters of the .plot() method, giving the x parameter the name of the column to use as the x axis and the y parameter the names of the columns to be used as data in the plot. The following recreates the random walks as columns 'A' and 'B', creates a column 'C' with sequential values starting with 0, and uses these values as the x axis labels and the 'A' and 'B' column values as the two plotted lines:

In [7]:
# copy the walk
df2 = walk_df.copy()
# add a column C which is 0 .. 1095
df2['C'] = pd.Series(np.arange(0, len(df2)), index=df2.index)
# instead of dates on the x axis, use the 'C' column,
# which will label the axis with 0..1000
df2.plot(x='C', y=['A', 'B']);

The .plot() functions, provided by pandas for the Series and DataFrame objects, take care of most of the details of generating plots.
However, if you want to modify characteristics of the generated plots beyond their capabilities, you can directly use the matplotlib functions or one or more of the many optional parameters of the .plot() method.

Adorning and styling your time-series plot

The built-in .plot() method has many options that you can use to change the content in the plot. We will cover several of the common options used in most plots.

Adding a title and changing axes labels

The title of the chart can be set using the title parameter of the .plot() method. Axes labels are not set with .plot(), but by directly using the plt.ylabel() and plt.xlabel() functions after calling .plot():

In [8]:
# create a time-series chart with a title and specific
# x and y axes labels
# the title is set in the .plot() method as a parameter
walk_df.plot(title='Title of the Chart')
# explicitly set the x and y axes labels after the .plot()
plt.xlabel('Time')
plt.ylabel('Money');

The labels in this plot were added after the call to .plot(). A natural question is: if the plot is generated in the call to .plot(), how can the labels still be changed? The answer is that plots in matplotlib are not displayed until either .show() is called on the plot or the code reaches the end of the execution and returns to the interactive prompt. At either of these points, any plot generated by plot commands will be flushed out to the display. In this example, although .plot() is called, the plot is not rendered until the IPython notebook code section finishes, so the changes for labels and title are added to the plot.

Specifying the legend content and position

To change the text used in the legend (the default is the column name from the DataFrame), you can use the ax object returned from the .plot() method to modify the text using its .legend() method.
The ax object is an AxesSubplot object, which is a representation of the elements of the plot, that can be used to change various aspects of the plot before it is generated:

In [9]:
# change the legend items to be different
# from the names of the columns in the DataFrame
ax = walk_df.plot(title='Title of the Chart')
# this sets the legend labels
ax.legend(['1', '2']);

The location of the legend can be set using the loc parameter of the .legend() method. By default, pandas sets the location to 'best', which tells matplotlib to examine the data and determine the best place to put the legend. However, you can also specify any of the following to position the legend more specifically (you can use either the string or the numeric code):

Text | Code
'best' | 0
'upper right' | 1
'upper left' | 2
'lower left' | 3
'lower right' | 4
'right' | 5
'center left' | 6
'center right' | 7
'lower center' | 8
'upper center' | 9
'center' | 10

In our last chart, the 'best' option actually had the legend overlap the line from one of the series. We can reposition the legend in the upper center of the chart, which will prevent this and create a better chart of this data:

In [10]:
# change the position of the legend
ax = walk_df.plot(title='Title of the Chart')
# put the legend in the upper center of the chart
ax.legend(['1', '2'], loc='upper center');

Legends can also be turned off with the legend parameter:

In [11]:
# omit the legend by using legend=False
walk_df.plot(title='Title of the Chart', legend=False);

There are more possibilities for locating and controlling the content of the legend, but we leave those for you to experiment with.

Specifying line colors, styles, thickness, and markers

pandas automatically sets the colors of each series on any chart. If you would like to specify your own color, you can do so by supplying a style code to the style parameter of the plot function.
A number of built-in single-character codes for colors are available, several of which are listed here:

  • b: Blue
  • g: Green
  • r: Red
  • c: Cyan
  • m: Magenta
  • y: Yellow
  • k: Black
  • w: White

It is also possible to specify the color using a hexadecimal RGB code of the #RRGGBB format. To demonstrate both options, the following example sets the color of the first series to green using a single-character code and the second series to red using the hexadecimal code:

In [12]:
# change the line colors on the plot
# use character code for the first line,
# hex RGB for the second
walk_df.plot(style=['g', '#FF0000']);

Line styles can be specified using a line style code. These can be used in combination with the color style codes, following the color code. The following are examples of several useful line style codes:

  • '-' = solid
  • '--' = dashed
  • ':' = dotted
  • '-.' = dot-dashed
  • '.' = points

The following plot demonstrates these five line styles by drawing five data series, each with one of these styles. Notice how each style item now consists of a color symbol and a line style code:

In [13]:
# show off different line styles
t = np.arange(0., 5., 0.2)
legend_labels = ['Solid', 'Dashed', 'Dotted', 'Dot-dashed', 'Points']
line_style = pd.DataFrame({0 : t,
                           1 : t**1.5,
                           2 : t**2.0,
                           3 : t**2.5,
                           4 : t**3.0})
# generate the plot, specifying color and line style for each line
ax = line_style.plot(style=['r-', 'g--', 'b:', 'm-.', 'k.'])
# set the legend
ax.legend(legend_labels, loc='upper left');

The thickness of lines can be specified using the lw parameter of .plot(). This can be passed a thickness for multiple lines, by passing a list of widths, or a single width that is applied to all lines.
The following redraws the graph with a line width of 3, making the lines a little more pronounced:

In [14]:
# regenerate the plot, specifying color and line style
# for each line and a line width of 3 for all lines
ax = line_style.plot(style=['r-', 'g--', 'b:', 'm-.', 'k.'], lw=3)
ax.legend(legend_labels, loc='upper left');

Markers on a line can also be specified using abbreviations in the style code. There are quite a few marker types provided and you can see them all at http://matplotlib.org/api/markers_api.html. We will examine five of them in the following chart by having each series use a different marker from the following: circles, triangles, stars, diamonds, and points. The type of marker is also specified using a code at the end of the style:

In [15]:
# redraw, adding markers to the lines
ax = line_style.plot(style=['r-o', 'g--^', 'b:*', 'm-.D', 'k:.'], lw=3)
ax.legend(legend_labels, loc='upper left');

Specifying tick mark locations and tick labels

Every plot we have seen to this point has used the default tick marks and tick labels that pandas decides are appropriate for the plot. These can also be customized using various matplotlib functions. We will demonstrate how ticks are handled by first examining a simple DataFrame. We can retrieve the locations of the ticks that were generated on the x axis using the plt.xticks() method. This method returns two values, the locations and the actual labels:

In [16]:
# a simple plot to use to examine ticks
ticks_data = pd.DataFrame(np.arange(0,5))
ticks_data.plot()
ticks, labels = plt.xticks()
ticks

Out [16]:
array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

This array contains the locations of the ticks in units of the values along the x axis. pandas has decided that a range of 0 through 4 (the min and max) and an interval of 0.5 is appropriate. If we want to use other locations, we can provide these by passing them to plt.xticks() as a list.
The following demonstrates this using the integers from -1 through 5, which both changes the extent of the axis and removes the non-integral labels:

In [17]:
# resize x axis to (-1, 5), and draw ticks
# only at integer values
ticks_data = pd.DataFrame(np.arange(0,5))
ticks_data.plot()
plt.xticks(np.arange(-1, 6));

We can also specify new labels at these locations by passing them as the second parameter. Just as an example, we can change the y axis ticks and labels to integral values and consecutive alpha characters using the following:

In [18]:
# rename y axis tick labels to A, B, C, D, and E
ticks_data = pd.DataFrame(np.arange(0,5))
ticks_data.plot()
plt.yticks(np.arange(0, 5), list("ABCDE"));

Formatting axes tick date labels using formatters

The formatting of axes labels whose underlying data type is datetime is performed using locators and formatters. Locators control the position of the ticks, and formatters control the formatting of the labels.

To help with locating ticks and formatting labels based on dates, matplotlib provides several classes in matplotlib.dates:

MinuteLocator, HourLocator, DayLocator, WeekdayLocator, MonthLocator, and YearLocator: These are specific locators coded to determine where ticks for each type of date field will be found on the axis
DateFormatter: This is a class that can be used to format date objects into labels on the axis

The default locator and formatter are AutoDateLocator and AutoDateFormatter, respectively. You can change these by passing different objects to the appropriate methods of the specific axis object.

To demonstrate, we will use a subset of the random walk data from earlier, which represents just the data from January through February of 2014.

Plotting this gives us the following output:

In [19]:
# plot January-February 2014 from the random walk
walk_df.loc['2014-01':'2014-02'].plot();

The x axis of this plot has two series of labels, the minor and the major. The minor labels in this plot contain the day of the month, and the major labels contain the year and month (the year only for the first month). We can set locators and formatters for each of the minor and major levels. This will be demonstrated by changing the minor labels to be located at the Monday of each week and to contain the date and day of the week (right now, the chart uses weekly labels showing only Friday's date, without the day name). On the major labels, we will use the monthly location and always include both the month name and the year:

In [20]:
# this import style helps us type less
from matplotlib.dates import WeekdayLocator, DateFormatter, MonthLocator
# plot Jan-Feb 2014
ax = walk_df.loc['2014-01':'2014-02'].plot()
# do the minor labels
weekday_locator = WeekdayLocator(byweekday=(0), interval=1)
ax.xaxis.set_minor_locator(weekday_locator)
ax.xaxis.set_minor_formatter(DateFormatter("%d\n%a"))
# do the major labels
ax.xaxis.set_major_locator(MonthLocator())
ax.xaxis.set_major_formatter(DateFormatter('\n\n\n%b\n%Y'));

This is almost what we wanted. However, note that the year is being reported as 45. This, unfortunately, seems to be an issue between pandas and the matplotlib representation of values for the year. The best reference I have on this is the following link from Stack Overflow (http://stackoverflow.com/questions/12945971/pandas-timeseries-plot-setting-x-axis-major-and-minor-ticks-and-labels).

So it appears that to create a plot with custom date-based labels, we need to avoid the pandas .plot() method and drop all the way down to matplotlib. Fortunately, this is not too hard.

The following changes the code slightly and renders what we wanted:

In [21]:
# this gets around the pandas / matplotlib year issue
# need to reference the subset twice, so let's make a variable
walk_subset = walk_df.loc['2014-01':'2014-02']
# this gets the plot so we can use it, we can ignore fig
fig, ax = plt.subplots()
# inform matplotlib that we will use the following as dates
# note we need to convert the index to a pydatetime series
ax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')
# do the minor labels
weekday_locator = WeekdayLocator(byweekday=(0), interval=1)
ax.xaxis.set_minor_locator(weekday_locator)
ax.xaxis.set_minor_formatter(DateFormatter('%d\n%a'))
# do the major labels
ax.xaxis.set_major_locator(MonthLocator())
ax.xaxis.set_major_formatter(DateFormatter('\n\n\n%b\n%Y'));

To add grid lines for the minor axes ticks, you can use the .grid() method of the x axis object of the plot, with the first parameter specifying whether the lines are shown and the second parameter specifying the minor or major set of ticks. The following replots this graph without the major grid lines and with the minor grid lines:

In [22]:
# this gets the plot so we can use it, we can ignore fig
fig, ax = plt.subplots()
# inform matplotlib that we will use the following as dates
# note we need to convert the index to a pydatetime series
ax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')
# do the minor labels
weekday_locator = WeekdayLocator(byweekday=(0), interval=1)
ax.xaxis.set_minor_locator(weekday_locator)
ax.xaxis.set_minor_formatter(DateFormatter('%d\n%a'))
ax.xaxis.grid(True, "minor") # turn on minor tick grid lines
ax.xaxis.grid(False, "major") # turn off major tick grid lines
# do the major labels
ax.xaxis.set_major_locator(MonthLocator())
ax.xaxis.set_major_formatter(DateFormatter('\n\n\n%b\n%Y'));

The last demonstration of formatting will use only the major labels, but on a weekly basis and using a YYYY-MM-DD format.

However, because these would overlap, we will specify that they should be rotated to prevent the overlap. This is done using the fig.autofmt_xdate() method:

In [23]:
# this gets the plot so we can use it, we can ignore fig
fig, ax = plt.subplots()
# inform matplotlib that we will use the following as dates
# note we need to convert the index to a pydatetime series
ax.plot_date(walk_subset.index.to_pydatetime(), walk_subset, '-')
ax.xaxis.grid(True, "major") # turn on major tick grid lines
# do the major labels
ax.xaxis.set_major_locator(weekday_locator)
ax.xaxis.set_major_formatter(DateFormatter('%Y-%m-%d'));
# rotate the date labels so that they do not overlap
fig.autofmt_xdate();

Common plots used in statistical analyses

Having seen how to create, lay out, and annotate time-series charts, we will now look at creating a number of charts, other than time series, that are commonplace in presenting statistical information.

Bar plots

Bar plots are useful for visualizing the relative differences in values of non-time-series data. Bar plots can be created using the kind='bar' parameter of the .plot() method:

In [24]:
# make a bar plot
# create a small series of 10 random values centered at 0.0
np.random.seed(seedval)
s = pd.Series(np.random.rand(10) - 0.5)
# plot the bar chart
s.plot(kind='bar');

If the data being plotted consists of multiple columns, a multiple series bar plot will be created:

In [25]:
# draw a multiple series bar chart
# generate 4 columns of 10 random values
np.random.seed(seedval)
df2 = pd.DataFrame(np.random.rand(10, 4),
                   columns=['a', 'b', 'c', 'd'])
# draw the multi-series bar chart
df2.plot(kind='bar');

If you would prefer stacked bars, you can use the stacked parameter, setting it to True:

In [26]:
# vertical stacked bar chart
df2.plot(kind='bar', stacked=True);

If you want the bars to be horizontally aligned, you can use kind='barh':

In [27]:
# horizontal stacked bar chart
df2.plot(kind='barh', stacked=True);

Histograms

Histograms are useful for visualizing distributions of data.

The following shows a histogram of 1000 values drawn from the normal distribution:

In [28]:
# create a histogram
np.random.seed(seedval)
# 1000 random numbers
dfh = pd.DataFrame(np.random.randn(1000))
# draw the histogram
dfh.hist();

The resolution of a histogram can be controlled by specifying the number of bins to allocate to the graph. The default is 10, and increasing the number of bins gives finer detail to the histogram. The following increases the number of bins to 100:

In [29]:
# histogram again, but with more bins
dfh.hist(bins = 100);

If the data has multiple series, the histogram function will automatically generate multiple histograms, one for each series:

In [30]:
# generate a multiple histogram plot
# create DataFrame with 4 columns of 1000 random values
np.random.seed(seedval)
dfh = pd.DataFrame(np.random.randn(1000, 4),
                   columns=['a', 'b', 'c', 'd'])
# draw the chart. There are four columns so pandas draws
# four histograms
dfh.hist();

If you want to overlay multiple histograms on the same graph (to give a quick visual comparison of the distributions), you can call the pyplot.hist() function multiple times before .show() is called to render the chart:

In [31]:
# directly use pyplot to overlay multiple histograms
# generate two distributions, each with a different
# mean and standard deviation
np.random.seed(seedval)
x = [np.random.normal(3,1) for _ in range(400)]
y = [np.random.normal(4,2) for _ in range(400)]
# specify the bins (-10 to 10 with 100 bins)
bins = np.linspace(-10, 10, 100)
# generate plot x using plt.hist, 50% transparent
plt.hist(x, bins, alpha=0.5, label='x')
# generate plot y using plt.hist, 50% transparent
plt.hist(y, bins, alpha=0.5, label='y')
plt.legend(loc='upper right');

Box and whisker charts

Box plots come from descriptive statistics and are a useful way of graphically depicting the distributions of categorical data using quartiles.

Each box represents the values between the first and third quartiles of the data, with a line across the box at the median. By default, each whisker reaches out to the most extreme value that lies within 1.5 interquartile ranges below the first quartile or above the third quartile:

In [32]:
# create a box plot
# generate the series
np.random.seed(seedval)
dfb = pd.DataFrame(np.random.randn(10,5))
# generate the plot
dfb.boxplot(return_type='axes');

There are ways to overlay dots and show outliers, but for brevity, they will not be covered in this text.

Area plots

Area plots are used to represent cumulative totals over time, to demonstrate the change in trends over time among related attributes. They can also be "stacked" to demonstrate representative totals across all variables.

Area plots are generated by specifying kind='area'. A stacked area chart is the default:

In [33]:
# create a stacked area plot
# generate a 4-column data frame of random data
np.random.seed(seedval)
dfa = pd.DataFrame(np.random.rand(10, 4),
                   columns=['a', 'b', 'c', 'd'])
# create the area plot
dfa.plot(kind='area');

To produce an unstacked plot, specify stacked=False:

In [34]:
# do not stack the area plot
dfa.plot(kind='area', stacked=False);

By default, unstacked plots have an alpha value of 0.5, so that it is possible to see how the data series overlap.

Scatter plots

A scatter plot displays the correlation between a pair of variables. A scatter plot can be created from a DataFrame using .plot() and specifying kind='scatter', as well as specifying the x and y columns from the DataFrame source:

In [35]:
# generate a scatter plot of two series of normally
# distributed random values
# we would expect this to cluster around 0,0
np.random.seed(111111)
sp_df = pd.DataFrame(np.random.randn(10000, 2),
                     columns=['a', 'b'])
sp_df.plot(kind='scatter', x='a', y='b')

We can easily create more elaborate scatter plots by dropping down a little lower into matplotlib.

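To give a feel for what dropping down into matplotlib involves, here is a minimal sketch that calls ax.scatter() directly with per-point sizes and colors. The data is synthetic, and the column names a, b, and weight are invented for this illustration, so the sketch runs without any external data source.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# hypothetical data: 200 random points plus a weight for each point
np.random.seed(111111)
bubble_df = pd.DataFrame(np.random.randn(200, 2), columns=['a', 'b'])
bubble_df['weight'] = np.random.rand(200)

# marker areas (in points squared) grow with the weight
sizes = 20 + 300 * bubble_df['weight']

fig, ax = plt.subplots()
# s sets the size of each bubble and c sets its color
points = ax.scatter(bubble_df['a'], bubble_df['b'],
                    s=sizes, c=bubble_df['weight'], alpha=0.5)
ax.set_xlabel('a')
ax.set_ylabel('b')
ax.set_title('Bubble size and color follow weight')
ax.grid(True)
# add a color scale for the weight values
fig.colorbar(points, ax=ax)
```

Passing arrays to s and c is what turns a plain scatter plot into a bubble chart: each point gets its own size and color.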
The following code gets Google stock data for the year 2011, calculates the change in the adjusted closing price from day to day, and plots each day's change against the following day's change as bubbles whose sizes are derived from the trading volume and whose colors are derived from the closing and opening prices:

In [36]:
# get Google stock data from 1/1/2011 to 12/31/2011
# note: later pandas releases moved this module to the separate
# pandas-datareader package (pandas_datareader.data.DataReader)
from pandas.io.data import DataReader
stock_data = DataReader("GOOGL", "yahoo",
                        datetime(2011, 1, 1),
                        datetime(2011, 12, 31))
# % change per day
delta = np.diff(stock_data["Adj Close"])/stock_data["Adj Close"][:-1]
# this calculates size of markers
volume = (15 * stock_data.Volume[:-2] / stock_data.Volume[0])**2
close = 0.003 * stock_data.Close[:-2] / 0.003 * stock_data.Open[:-2]
# generate scatter plot
fig, ax = plt.subplots()
ax.scatter(delta[:-1], delta[1:], c=close, s=volume, alpha=0.5)
# add some labels and style
ax.set_xlabel(r'$\Delta_i$', fontsize=20)
ax.set_ylabel(r'$\Delta_{i+1}$', fontsize=20)
ax.set_title('Volume and percent change')
ax.grid(True);

Note the nomenclature for the x and y axes labels, which creates a nice mathematical style for the labels.

Density plot

You can create kernel density estimation plots using the .plot() method and setting the kind='kde' parameter. A kernel density estimate plot, instead of being a pure empirical representation of the data, attempts to estimate the true distribution of the data, and hence smooths it into a continuous plot. The following generates a normally distributed set of numbers, displays it as a histogram, and overlays the kde plot:

In [37]:
# create a kde density plot
# generate a series of 1000 random numbers
np.random.seed(seedval)
s = pd.Series(np.random.randn(1000))
# generate the plot
# (density=True replaces normed=True, which older matplotlib versions used)
s.hist(density=True) # shows the bars
s.plot(kind='kde');

The scatter plot matrix

The final composite graph we'll look at in this article is one that is provided by pandas in its plotting tools subcomponent: the scatter plot matrix. A scatter plot matrix is a popular way of determining whether there is a linear correlation between multiple variables.

The following creates a scatter plot matrix with random values, which shows a scatter plot for each combination of variables, as well as a kde graph for each variable:

In [38]:
# create a scatter plot matrix
# import this function
# (pandas versions before 0.20 used pandas.tools.plotting)
from pandas.plotting import scatter_matrix
# generate DataFrame with 4 columns of 1000 random numbers
np.random.seed(111111)
df_spm = pd.DataFrame(np.random.randn(1000, 4),
                      columns=['a', 'b', 'c', 'd'])
# create the scatter matrix
scatter_matrix(df_spm, alpha=0.2, figsize=(6, 6), diagonal='kde');

Heatmaps

A heatmap is a graphical representation of data, where values within a matrix are represented by colors. This is an effective means of showing relationships between values that are measured at the intersection of two variables, at each intersection of the rows and the columns of the matrix. A common scenario is to have the values in the matrix normalized to 0.0 through 1.0 and have the intersection between a row and a column represent the correlation between the two variables. Values with the lowest correlation (0.0) are the darkest, and those with the highest correlation (1.0) are white.

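One way to obtain such a matrix is DataFrame.corr(). The following sketch, which uses synthetic data with invented column names, takes the absolute correlations so that every value falls between 0.0 and 1.0, and then draws them with the 'hot' colormap so that low values are dark and 1.0 is white. Taking the absolute value is a choice made for this illustration; it discards the sign of the correlation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# hypothetical data: b is built from a, c is independent noise
np.random.seed(111111)
base = np.random.randn(500)
corr_df = pd.DataFrame({
    'a': base,
    'b': base + 0.5 * np.random.randn(500),
    'c': np.random.randn(500),
})

# pairwise correlations, folded into the range 0.0 through 1.0
corr = corr_df.corr().abs()

fig, ax = plt.subplots()
# fix the color scale to [0, 1] so that 1.0 always maps to white
image = ax.imshow(corr.values, cmap='hot', vmin=0.0, vmax=1.0,
                  interpolation='none')
fig.colorbar(image, ax=ax)
# label the rows and columns with the variable names
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
```

The diagonal is always 1.0, because every variable is perfectly correlated with itself, so it shows up as a white stripe.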
Heatmaps are easily created with pandas and matplotlib using the .imshow() function:

In [39]:
# create a heatmap
# start with data for the heatmap
s = pd.Series([0.0, 0.1, 0.2, 0.3, 0.4],
              ['V', 'W', 'X', 'Y', 'Z'])
heatmap_data = pd.DataFrame({'A' : s + 0.0,
                             'B' : s + 0.1,
                             'C' : s + 0.2,
                             'D' : s + 0.3,
                             'E' : s + 0.4,
                             'F' : s + 0.5,
                             'G' : s + 0.6})
heatmap_data

Out [39]:
     A    B    C    D    E    F    G
V  0.0  0.1  0.2  0.3  0.4  0.5  0.6
W  0.1  0.2  0.3  0.4  0.5  0.6  0.7
X  0.2  0.3  0.4  0.5  0.6  0.7  0.8
Y  0.3  0.4  0.5  0.6  0.7  0.8  0.9
Z  0.4  0.5  0.6  0.7  0.8  0.9  1.0

In [40]:
# generate the heatmap
plt.imshow(heatmap_data, cmap='hot', interpolation='none')
plt.colorbar() # add the scale of colors bar
# set the labels
plt.xticks(range(len(heatmap_data.columns)), heatmap_data.columns)
plt.yticks(range(len(heatmap_data)), heatmap_data.index);

Multiple plots in a single chart

It is often useful to contrast data by displaying multiple plots next to each other. This is actually quite easy to do when using matplotlib.

To draw multiple subplots on a grid, we can make multiple calls to plt.subplot2grid(), each time passing the size of the grid the subplot is to be located on (shape=(height, width)) and the location on the grid of the upper-left section of the subplot (loc=(row, column)). Each call to plt.subplot2grid() returns a different AxesSubplot object that can be used to reference the specific subplot and direct the rendering into it.

The following demonstrates this by creating a plot with two subplots based on a two row by one column grid (shape=(2,1)). The first subplot, referred to by ax1, is located in the first row (loc=(0,0)), and the second, referred to as ax2, is in the second row (loc=(1,0)):

In [41]:
# create two sub plots on the new plot using a 2x1 grid
# ax1 is the upper row
ax1 = plt.subplot2grid(shape=(2,1), loc=(0,0))
# and ax2 is in the lower row
ax2 = plt.subplot2grid(shape=(2,1), loc=(1,0))

The subplots have been created, but we have not drawn into either yet.

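As a brief sketch of what drawing into those axes could look like, the snippet below recreates the 2x1 grid and calls the plotting methods of each returned axes object. The sine and cosine data are placeholders chosen for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# placeholder data for the two subplots
x = np.linspace(0, 2 * np.pi, 100)

# the same 2x1 layout: ax1 on the upper row, ax2 on the lower row
ax1 = plt.subplot2grid(shape=(2, 1), loc=(0, 0))
ax2 = plt.subplot2grid(shape=(2, 1), loc=(1, 0))

# each axes object has its own plotting and labeling methods
ax1.plot(x, np.sin(x), 'r-')
ax1.set_title('upper subplot')
ax2.plot(x, np.cos(x), 'b--')
ax2.set_title('lower subplot')

# leave room so the titles do not collide with neighboring axes
plt.tight_layout()
```

Calling methods on ax1 and ax2 directly, rather than relying on the pyplot functions, makes it explicit which subplot receives each drawing command.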
The size of any subplot can be specified using the rowspan and colspan parameters in each call to plt.subplot2grid(). This actually feels a lot like placing content in HTML tables. The following demonstrates a more complicated layout of five plots, specifying different row and column spans for each:

In [42]:
# layout sub plots on a 4x4 grid
# ax1 on top row, 4 columns wide
ax1 = plt.subplot2grid((4,4), (0,0), colspan=4)
# ax2 is row 2, leftmost and 2 columns wide
ax2 = plt.subplot2grid((4,4), (1,0), colspan=2)
# ax3 is 2 cols wide and 2 rows high, starting
# on second row and the third column
ax3 = plt.subplot2grid((4,4), (1,2), colspan=2, rowspan=2)
# ax4 is 1 high and 1 wide, in row 3, first column
ax4 = plt.subplot2grid((4,4), (2,0))
# ax5 is 1 high and 1 wide, in row 3, second column
ax5 = plt.subplot2grid((4,4), (2,1));

To draw into a specific subplot using the pandas .plot() method, you can pass the specific axes into the plot function via the ax parameter. The following demonstrates this by extracting each series from the random walk we created at the beginning of this article, and drawing each into a different subplot:

In [43]:
# demonstrating drawing into specific sub-plots
# generate a layout of 2 rows 1 column
# create the subplots, one on each row
ax5 = plt.subplot2grid((2,1), (0,0))
ax6 = plt.subplot2grid((2,1), (1,0))
# plot column 0 of walk_df into top row of the grid
walk_df[[0]].plot(ax = ax5)
# and column 1 of walk_df into bottom row
walk_df[[1]].plot(ax = ax6);

Using this technique, we can combine different series of data, such as in a stock close versus volume graph.

Given the data we read during a previous example for Google, the following will plot the volume versus the closing price:

In [44]:
# draw the close on the top chart
top = plt.subplot2grid((4,4), (0, 0), rowspan=3, colspan=4)
top.plot(stock_data.index, stock_data['Close'], label='Close')
plt.title('Google Closing Stock Price 2011')
# draw the volume chart on the bottom
bottom = plt.subplot2grid((4,4), (3,0), rowspan=1, colspan=4)
bottom.bar(stock_data.index, stock_data['Volume'])
plt.title('Google Trading Volume')
# set the size of the plot
plt.gcf().set_size_inches(15,8)

Summary

Visualizing your data is one of the best ways to quickly understand the story that is being told with the data. Python, pandas, and matplotlib (and a few other libraries) provide a means of getting the gist of what you are trying to discover, as well as the underlying message, very quickly and with a few lines of code (and displaying it beautifully too).

In this article, we examined many of the most common means of visualizing data from pandas. There are also a lot of interesting visualizations that were not covered, and indeed, the concept of data visualization with pandas and/or Python is the subject of entire texts, but I believe this article provides a much-needed reference to get up and going with the visualizations that provide most of what is needed.

Packt
26 Aug 2013
10 min read

Creating Multivariate Charts

With an increasing number of variables, any analysis can become challenging and observations harder to make; however, Tableau simplifies the process for the designer and uses effective layouts for the reader even in multivariate analysis. Using various combinations of colors and charts, we can create compelling graphics that generate critical insights from our data. Among the charts covered in this article, facets and area charts are easier to understand and easier to create compared to bullet graphs and dual axes charts.

Creating facets

Facets are one of the powerful features in Tableau. Edward Tufte, a pioneer in the field of information graphics, championed these types of charts, also called grid or panel charts; he called them small multiples. These charts show the same measure(s) across various values of one or two variables for easier comparison.

Getting ready

Let's use the sample file Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data file is loaded on the new worksheet, perform the following steps to create a simple faceted chart:

1. Drag-and-drop Market from Dimensions into the Columns shelf.
2. Drag-and-drop Product Type from Dimensions into the Rows shelf.
3. Drag-and-drop Profit from Measures into the Rows shelf next to Product Type.
4. Optionally, drag-and-drop Market into the Color Marks box to give color to the four bars of the different Market areas.

The chart should look like the one in the following screenshot:

How it works...

When there is one dimension on one of the shelves, either Columns or Rows, and one measure on the other shelf, Tableau creates a univariate bar chart. When we drop additional dimensions along with the measure, Tableau creates small charts, or facets, and displays univariate charts broken down by a dimension.

There's more...

A company named Juice Analytics has a great blog article on the topic of small multiples. This article lists the benefits of using small multiples as well as some examples of small multiples in practice. Find this blog at http://www.juiceanalytics.com/writing/better-know-visualization-small-multiples/.

Creating area charts

An area chart is an extension of a line chart. The area chart shows the line of the measure but fills the area below the line to emphasize the value of the measure. A special case of the area chart is a stacked area chart, which shows a line per measure, with the area between the lines filled. Tableau's implementation of area charts uses one date variable and one or more measures.

Getting ready

Let's use the sample file Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded on the new worksheet, perform the following steps to create an area chart:

1. Click on the Show Me button to bring the Show Me toolbar on the screen.
2. Select Order Date from Dimensions and Order Quantity from Measures by clicking on them while holding the Ctrl key.
3. Click on Area charts (continuous) from the Show Me toolbar.
4. Drag-and-drop Order Date into the Columns shelf next to YEAR(Order Date).
5. Expand YEAR(Order Date), seen on the right-hand side, by clicking on the plus sign.
6. Drag-and-drop Region from Dimensions into the Rows shelf to the left of SUM(Order Quantity).

The chart should look like the one in the following screenshot:

How it works...

When we added Order Date for the first time, Tableau, by default, aggregated the date field by year; therefore, we added Order Date again to create an aggregation by quarter of the Order Date. We also added Region to create facets on the regions, which show the trends of order quantity over time.

There's more...

A blog post by Visually, an information graphics company, discusses the key differences between line charts and area charts.
You can find this post at http://blog.visual.ly/line-vs-area-charts/.

Creating bullet graphs

Stephen Few, an information visualization consultant and author, designed this chart to solve some of the problems that gauge and meter charts pose. Gauges, although simple to understand, take a lot of space to show only one measure. Bullet graphs are a combination of the bar graph and thermometer types of charts, and they show a measure of interest in the form of a bar graph (which is the bullet) along with target variables.

Getting ready

Let's use the sample file Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data is loaded on the sheet, perform the following steps to create a bullet graph:

1. Click on the Show Me button to bring the Show Me toolbar on the screen.
2. While holding the Ctrl key, click on Type and Market from Dimensions and Budget Sales and Sales from Measures.
3. Click on the bullet graphs icon on the Show Me toolbar.
4. Right-click on the x axis (the Budget Sales axis) and click on Swap Reference Line Fields.

The final chart should look like the one in the following screenshot:

How it works...

Although bullet graphs make the most of the available space to show relevant information, readers require a detailed explanation of what all the components of the graphic are encoding. In this recipe, since we want to compare the budgeted sales with the actual sales, we had to swap the reference line from Sales to Budget Sales. The black bar on the graphic shows the budgeted sales and the blue bar shows the actual sales. The dark gray background color shows 60 percent of the actual sales and the lighter gray shows 80 percent of the actual sales. As we can see in this chart, the blue bars crossed all the black lines, which tells us that both the coffee types and all market regions exceeded the budgeted sales.

There's more...

A blog post by Data Pig Technologies discusses some of the problems with the bullet graph. The main problem is that the chart is hard to understand intuitively. You can read about this problem and the reply by Stephen Few at http://datapigtechnologies.com/blog/index.php/the-good-and-bad-of-bullet-graphs/.

Creating dual axes charts

Dual axes charts are useful for comparing two similar types of measures that may have different measurement units, such as pounds and dollars. In this recipe, we will look at the dual axes chart.

Getting ready

Let's use the same sample file, Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data is loaded on the sheet, perform the following steps to create a dual axes chart:

1. Click on the Show Me button to bring the Show Me toolbar on the screen.
2. While holding the Ctrl key, click on Date, Type, and Market from Dimensions and Sales and Budget Sales from Measures.
3. Click on the dual line graph icon on the Show Me toolbar.
4. Click-and-drag Market from the Rows shelf into the Columns shelf.
5. Right-click on the Sales vertical axis and click on Synchronize Axis.

The chart should look like the one shown in the following screenshot:

How it works...

Tableau will create two vertical axes and automatically place Sales on one vertical axis and Budget Sales on the other. The scales on the two vertical axes are different, however. By synchronizing the axes, we get the same scale on both axes for better comparison and an accurate representation of the patterns.

Creating Gantt charts

Gantt charts are most commonly used in project management, as these charts show various activities and tasks along with the time required to complete those tasks. Gantt charts are even more useful when they show dependencies among various tasks.
This type of chart is very helpful when the number of activities is low (around 20-30); otherwise, the chart becomes too big to be understood easily.

Getting ready

Let's use the sample file Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded, perform the following steps to create a Gantt chart:

1. Click on Analysis from the top menu toolbar, and if Aggregate Measures is checked, click on it again to uncheck that option.
2. Click on the Show Me button to bring the Show Me toolbar on the screen.
3. While holding the Ctrl key, click on Order Date and Category (under Products) from Dimensions and Time to Ship from Measures.
4. Click on the Gantt chart icon on the Show Me toolbar.
5. Drag-and-drop Order Date into the Filters pane.
6. Select Years from the Filter Field [Order Date] options dialog box and hit Next.
7. Check 2012 from the list and hit OK.
8. Right-click on YEAR(Order Date) on the Columns shelf and select the Day May 8, 2011 option.
9. Drag-and-drop Order Date into the Filters pane.
10. Select Months from the Filter Field [Order Date] options dialog box and hit Next.
11. Check December from the list and hit OK.
12. Drag-and-drop Region from Dimensions into the Color Marks input box.
13. Drag-and-drop Region from Dimensions into the Rows shelf before Category.

The generated Gantt chart should look like the one in the following screenshot:

How it works...

Representing time this way helps the reader to discern which activity took the longest amount of time. We added the Order Date field twice to the Filters pane, first to filter for the year 2012 and then for the month of December. In this recipe, out of all the products shipped in December of 2012, we can easily see that the red bars for the West region in the Office Supplies category are longer, suggesting that these products took the longest amount of time to ship.

There's more...

Andy Kriebel, a Tableau data visualization expert, has a great example of Gantt charts using US presidential data. The following link shows the lengths of terms in office of Presidents from various parties: http://vizwiz.blogspot.com/2010/09/tableau-tip-creating-waterfall-chart.html

Creating heat maps

A heat map is a visual representation of numbers in a table or a grid such that the bigger numbers are encoded by darker colors or bigger sizes and the smaller numbers by lighter colors or smaller sizes. This type of representation makes it easier for the reader to detect patterns in the data.

Getting ready

Let's use the same sample file, Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded, perform the following steps to create a heat map chart:

1. Click on the Show Me button to bring the Show Me toolbar on the screen.
2. While holding the Ctrl key, click on Sub-Category (under Products), Region, and Ship Mode from Dimensions and Profit from Measures.
3. Click on the heat maps chart icon on the Show Me toolbar.
4. Drag-and-drop Profit from Measures into the Color Marks box.

The generated chart should look like the one in the following screenshot:

Summary

When we created the chart for the first time, Tableau assigned various sizes to the square boxes, but when we placed Profit as a color mark, red was used for low amounts of profit and green was used for higher amounts of profit. This made spotting patterns very easy. Binders and Binder Accessories, shipped by Regular Air in the Central region, generated very high amounts of profit, and Tables, shipped by Delivery Trucks in the East region, generated very low amounts of profit (it actually created losses for the company).