
How-To Tutorials - Data

1210 Articles

Loading data, creating an app, and adding dashboards and reports in Splunk

Packt
31 Oct 2014
13 min read
In this article by Josh Diakun, Paul R Johnson, and Derek Mock, authors of Splunk Operational Intelligence Cookbook, we will take a look at how to load sample data into Splunk, how to create an application, and how to add dashboards and reports in Splunk.

Loading the sample data

While most of the data you will index with Splunk will be collected in real time, there might be instances where you have a set of data that you would like to put into Splunk, either to backfill some missing or incomplete data, or just to take advantage of its searching and reporting tools. This recipe will show you how to perform one-time bulk loads of data from files located on the Splunk server. We will also use this recipe to load the data samples that will be used as we build our Operational Intelligence app in Splunk.

There are two files that make up our sample data. The first is access_log, which represents data from our web layer and is modeled on an Apache web server. The second file is app_log, which represents data from our application layer and is modeled on the log4j application log data.

Getting ready

To step through this recipe, you will need a running Splunk server and should have a copy of the sample data generation app (OpsDataGen.spl). This file is part of the downloadable code bundle, which is available on the book's website.

How to do it...

Follow the given steps to load the sample data generator on your system:

1. Log in to your Splunk server using your credentials.
2. From the home launcher, select the Apps menu in the top-left corner and click on Manage Apps.
3. Select Install App from file.
4. Select the location of the OpsDataGen.spl file on your computer, and then click on the Upload button to install the application.
5. After installation, a message should appear in a blue bar at the top of the screen, letting you know that the app has installed successfully. You should also now see the OpsDataGen app in the list of apps.
6. By default, the app installs with the data-generation scripts disabled. In order to generate data, you will need to enable either a Windows or Linux script, depending on your Splunk operating system. To enable the script, select the Settings menu from the top-right corner of the screen, and then select Data inputs.
7. From the Data inputs screen that follows, select Scripts.
8. On the Scripts screen, locate the OpsDataGen script for your operating system and click on Enable.
   For Linux, it will be $SPLUNK_HOME/etc/apps/OpsDataGen/bin/AppGen.path
   For Windows, it will be $SPLUNK_HOME\etc\apps\OpsDataGen\bin\AppGen-win.path
   The following screenshot displays both the Windows and Linux inputs that are available after installing the OpsDataGen app. It also displays where to click to enable the correct one based on the operating system Splunk is installed on.
9. Select the Settings menu from the top-right corner of the screen, select Data inputs, and then select Files & directories.
10. On the Files & directories screen, locate the two OpsDataGen inputs for your operating system and for each click on Enable.
    For Linux, they will be:
    $SPLUNK_HOME/etc/apps/OpsDataGen/data/access_log
    $SPLUNK_HOME/etc/apps/OpsDataGen/data/app_log
    For Windows, they will be:
    $SPLUNK_HOME\etc\apps\OpsDataGen\data\access_log
    $SPLUNK_HOME\etc\apps\OpsDataGen\data\app_log
    The following screenshot displays both the Windows and Linux inputs that are available after installing the OpsDataGen app.
It also displays where to click to enable the correct one based on the operating system Splunk is installed on.

The data will now be generated in real time. You can test this by navigating to the Splunk search screen and running the following search over an All time (real-time) time range:

index=main sourcetype=log4j OR sourcetype=access_combined

After a short while, you should see data from both source types flowing into Splunk, and the data generation is now working as displayed in the following screenshot:

How it works...

In this case, you installed a Splunk application that leverages a scripted input. The script we wrote generates data for two source types. The access_combined source type contains sample web access logs, and the log4j source type contains application logs.

Creating an Operational Intelligence application

This recipe will show you how to create an empty Splunk app that we will use as the starting point in building our Operational Intelligence application.

Getting ready

To step through this recipe, you will need a running Splunk Enterprise server, with the sample data loaded from the previous recipe. You should be familiar with navigating the Splunk user interface.

How to do it...

Follow the given steps to create the Operational Intelligence application:

1. Log in to your Splunk server.
2. From the top menu, select Apps and then select Manage Apps.
3. Click on the Create app button.
4. Complete the fields in the box that follows. Name the app Operational Intelligence and give it a folder name of operational_intelligence. Add in a version number and provide an author name. Ensure that Visible is set to Yes, and the barebones template is selected.
5. When the form is completed, click on Save. This should be followed by a blue bar with the message, Successfully saved operational_intelligence.

Congratulations, you just created a Splunk application!

How it works...

When an app is created through the Splunk GUI, as in this recipe, Splunk essentially creates a new folder (or directory) named operational_intelligence within the $SPLUNK_HOME/etc/apps directory. Within the $SPLUNK_HOME/etc/apps/operational_intelligence directory, you will find four new subdirectories that contain all the configuration files needed for the barebones Operational Intelligence app that we just created.

The eagle-eyed among you will have noticed that there were two templates, barebones and sample_app, either of which could have been selected when creating the app. The barebones template creates an application with very little inside it, and the sample_app template creates an application populated with sample dashboards, searches, views, menus, and reports. If you create lots of apps, you can also develop your own custom template, which might enforce certain color schemes, for example.

There's more...

As Splunk apps are just a collection of directories and files, there are other methods to add apps to your Splunk Enterprise deployment.

Creating an application from another application

It is relatively simple to create a new app from an existing app without going through the Splunk GUI, should you wish to do so. This approach can be very useful when we are creating multiple apps with different inputs.conf files for deployment to Splunk Universal Forwarders. Taking the app we just created as an example, copy the entire directory structure of the operational_intelligence app and name it copied_app:
cp -r $SPLUNK_HOME/etc/apps/operational_intelligence/* $SPLUNK_HOME/etc/apps/copied_app

Within the directory structure of copied_app, we must now edit the app.conf file in the default directory. Open $SPLUNK_HOME/etc/apps/copied_app/default/app.conf and change the label field to My Copied App, provide a new description, and then save the conf file:

#
# Splunk app configuration file
#

[install]
is_configured = 0

[ui]
is_visible = 1
label = My Copied App

[launcher]
author = John Smith
description = My Copied application
version = 1.0

Now, restart Splunk, and the new My Copied App application should be seen in the application menu:

$SPLUNK_HOME/bin/splunk restart

Downloading and installing a Splunk app

Splunk has an entire application website with hundreds of applications created by Splunk, other vendors, and even users of Splunk. These are great ways to get started with a base application, which you can then modify to meet your needs.

If the Splunk server that you are logged in to has access to the Internet, you can click on the Apps menu as you did earlier and then select the Find More Apps button. From here, you can search for apps and install them directly. An alternative way to install a Splunk app is to visit http://apps.splunk.com and search for the app. You will then need to download the application locally. From your Splunk server, click on the Apps menu and then on the Manage Apps button. After that, click on the Install App from File button and upload the app you just downloaded, in order to install it.

Once the app has been installed, go and look at the directory structure that the installed application just created. Familiarize yourself with some of the key files and where they are located.

When downloading applications from the Splunk apps site, it is best practice to test and verify them in a nonproduction environment first. The Splunk apps site is community driven and, as a result, quality checks and/or technical support for some of the apps might be limited.

Adding dashboards and reports

Dashboards are a great way to present many different pieces of information. Rather than having lots of disparate dashboards across your Splunk environment, it makes a lot of sense to group related dashboards into a common Splunk application, for example, putting operational intelligence dashboards into a common Operational Intelligence application. In this recipe, you will learn how to move the dashboards and associated reports into our new Operational Intelligence application.

Getting ready

To step through this recipe, you will need a running Splunk Enterprise server, with the sample data loaded from the Loading the sample data recipe. You should be familiar with navigating the Splunk user interface.

How to do it...

Follow these steps to move your dashboards into the new application:

1. Log in to your Splunk server.
2. Select the newly created Operational Intelligence application.
3. From the top menu, select Settings and then select the User interface menu item.
4. Click on the Views section.
5. In the App Context dropdown, select Searching & Reporting (search) or whatever application you were in when creating the dashboards.
6. Locate the website_monitoring dashboard row in the list of views and click on the Move link to the right of the row.
7. In the Move Object pop up, select the Operational Intelligence (operational_intelligence) application that was created earlier and then click on the Move button.
8. A message bar will then be displayed at the top of the screen to confirm that the dashboard was moved successfully. Repeat from step 6 to move the product_monitoring dashboard as well.
9. After the Website Monitoring and Product Monitoring dashboards have been moved, we now want to move all the reports that were created, as these power the dashboards and provide operational intelligence insight. From the top menu, select Settings and this time select Searches, reports, and alerts.
10. Select the Search & Reporting (search) context and filter by cp0* to view the searches (reports) that are created.
11. Click on the Move link of the first cp0* search in the list.
12. Select to move the object to the Operational Intelligence (operational_intelligence) application and click on the Move button.
13. A message bar will then be displayed at the top of the screen to confirm that the search was moved successfully.
14. Select the Search & Reporting (search) context and repeat from step 11 to move all the other searches over to the new Operational Intelligence application—this seems like a lot but will not take you long!

All of the dashboards and reports are now moved over to your new Operational Intelligence application.

How it works...

In the previous recipe, we revealed how Splunk apps are essentially just collections of directories and files. Dashboards are XML files found within the $SPLUNK_HOME/etc/apps directory structure. When moving a dashboard from one app to another, Splunk is essentially just moving the underlying file from a directory inside one app to a directory in the other app. In this recipe, you moved the dashboards from the Search & Reporting app to the Operational Intelligence app, as represented in the following screenshot:

As visualizations on the dashboards leverage the underlying saved searches (or reports), you also moved these reports to the new app so that the dashboards maintain permissions to access them. Rather than moving the saved searches, you could have changed the permissions of each search to Global such that they could be seen from all the other apps in Splunk. However, the other reason you moved the reports was to keep everything contained within a single Operational Intelligence application, which you will continue to build on going forward.

It is best practice to avoid setting permissions to Global for reports and dashboards, as this makes them available to all the other applications when they most likely do not need to be. Additionally, setting global permissions can make things a little messy from a housekeeping perspective and crowd the lists of reports and views that belong to specific applications. The exception to this rule might be for knowledge objects such as tags, event types, macros, and lookups, which often have advantages to being available across all applications.

There's more...

As you went through this recipe, you likely noticed that the dashboards had application-level permissions, but the reports had private-level permissions. The reports are private as this is the default setting in Splunk when they are created. This private-level permission restricts access to only your user account and admin users. In order to make the reports available to other users of your application, you will need to change the permissions of the reports to Shared in App, as described in the next section.
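If you have many reports to update, the same sharing change can also be scripted against Splunk's REST API rather than made one report at a time in Splunk Web. The following is only an illustrative sketch, not code from the book: it assumes the Python requests package, a local Splunk instance on the default management port 8089, and placeholder credentials and report name.

# Illustrative sketch (not from the book): set a saved report's sharing to
# app level through Splunk's REST ACL endpoint. Host, credentials, and the
# report name below are placeholders.
import requests

SPLUNK_HOST = "https://localhost:8089"   # default management port (assumption)
AUTH = ("admin", "changeme")             # placeholder credentials
APP = "operational_intelligence"
REPORT = "cp01_example_report"           # hypothetical report name

# Posting sharing=app to the object's acl endpoint makes the report visible
# to every user of the application; the owner must be supplied as well.
response = requests.post(
    f"{SPLUNK_HOST}/servicesNS/admin/{APP}/saved/searches/{REPORT}/acl",
    auth=AUTH,
    data={"sharing": "app", "owner": "admin", "output_mode": "json"},
    verify=False,  # Splunk ships with a self-signed certificate by default
)
response.raise_for_status()
print(f"{REPORT} is now shared at app level")

Looping over the reports returned by the saved/searches endpoint would let you apply the same change to every cp0* report in one pass; the next section shows the equivalent steps in Splunk Web.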
Changing the permissions of saved reports

Changing the sharing permission levels of your reports from the default Private to App is relatively straightforward:

1. Ensure that you are in your newly created Operational Intelligence application.
2. Select the Reports menu item to see the list of reports.
3. Click on Edit next to the report you wish to change the permissions for. Then, click on Edit Permissions from the drop-down list.
4. An Edit Permissions pop-up box will appear. In the Display for section, change from Owner to App, and then click on Save.
5. The box will close, and you will see that the Sharing permissions in the table will now display App for the specific report. This report will now be available to all the users of your application.

Summary

In this article, we loaded the sample data into Splunk. We also saw how to organize dashboards and knowledge into a custom Splunk app.

Resources for Article:

Further resources on this subject:

Working with Pentaho Mobile BI [Article]
Visualization of Big Data [Article]
Highlights of Greenplum [Article]


Theming with Highcharts

Packt
30 Oct 2014
10 min read
Besides the charting capabilities offered by Highcharts, theming is yet another strong feature of Highcharts. With its extensive theming API, charts can be customized completely to match the branding of a website or an app. Almost all of the chart elements are customizable through this API. In this article by Bilal Shahid, author of Highcharts Essentials, we will do the following things:

Use different fill types and fonts
Create a global theme for our charts
Use jQuery easing for animations

Using Google Fonts with Highcharts

Google provides an easy way to include hundreds of high-quality web fonts in web pages. These fonts work in all major browsers and are served by the Google CDN for lightning fast delivery. These fonts can also be used with Highcharts to further polish the appearance of our charts.

This section assumes that you know the basics of using Google Web Fonts. If you are not familiar with them, visit https://developers.google.com/fonts/docs/getting_started.

We will style the following example with Google Fonts. We will use the Merriweather family from Google Fonts and link to its style sheet from our web page inside the <head> tag:

<link href='http://fonts.googleapis.com/css?family=Merriweather:400italic,700italic' rel='stylesheet' type='text/css'>

Having included the style sheet, we can actually use the font family in our code for the labels in yAxis:

yAxis: [{
    ...
    labels: {
        style: {
            fontFamily: 'Merriweather, sans-serif',
            fontWeight: 400,
            fontStyle: 'italic',
            fontSize: '14px',
            color: '#ffffff'
        }
    }
}, {
    ...
    labels: {
        style: {
            fontFamily: 'Merriweather, sans-serif',
            fontWeight: 700,
            fontStyle: 'italic',
            fontSize: '21px',
            color: '#ffffff'
        },
        ...
    }
}]

For the outer axis, we used a font size of 21px with a font weight of 700. For the inner axis, we lowered the font size to 14px and used a font weight of 400 to compensate for the smaller font size. The following is the modified speedometer:

In the next section, we will continue with the same example to include jQuery UI easing in chart animations.

Using jQuery UI easing for series animation

Animations occurring at the point of initialization of charts can be disabled or customized. The customization requires modifying two properties: animation.duration and animation.easing. The duration property accepts the number of milliseconds for the duration of the animation. The easing property can have various values depending on the framework currently being used. For a standalone jQuery framework, the values can be either linear or swing. Using the jQuery UI framework adds a couple more options for the easing property to choose from.

In order to follow this example, you must include the jQuery UI framework in the page. You can also grab the standalone easing plugin from http://gsgd.co.uk/sandbox/jquery/easing/ and include it inside your <head> tag.

We can now modify the series to have a modified animation:

plotOptions: {
    ...
    series: {
        animation: {
            duration: 1000,
            easing: 'easeOutBounce'
        }
    }
}

The preceding code will modify the animation property for all the series in the chart to have duration set to 1000 milliseconds and easing to easeOutBounce. Each series can have its own different animation by defining the animation property separately for each series as follows:

series: [{
    ...
    animation: {
        duration: 500,
        easing: 'easeOutBounce'
    }
}, {
    ...
    animation: {
        duration: 1500,
        easing: 'easeOutBounce'
    }
}, {
    ...
    animation: {
        duration: 2500,
        easing: 'easeOutBounce'
    }
}]

Different animation properties for different series can pair nicely with column and bar charts to produce visually appealing effects.

Creating a global theme for our charts

A Highcharts theme is a collection of predefined styles that are applied before a chart is instantiated. A theme will be applied to all the charts on the page after the point of its inclusion, given that the styling options have not been modified within the chart instantiation. This provides us with an easy way to apply custom branding to charts without the need to define styles over and over again.

In the following example, we will create a basic global theme for our charts. This way, we will get familiar with the fundamentals of Highcharts theming and some API methods. We will define our theme inside a separate JavaScript file to make the code reusable and keep things clean. Our theme will be contained in an options object that will, in turn, contain styling for different Highcharts components.

Consider the following code placed in a file named custom-theme.js. This is a basic implementation of a Highcharts custom theme that includes colors and basic font styles along with some other modifications for axes:

Highcharts.customTheme = {
    colors: ['#1BA6A6', '#12734F', '#F2E85C', '#F27329', '#D95D30', '#2C3949', '#3E7C9B', '#9578BE'],
    chart: {
        backgroundColor: {
            radialGradient: {cx: 0, cy: 1, r: 1},
            stops: [
                [0, '#ffffff'],
                [1, '#f2f2ff']
            ]
        },
        style: {
            fontFamily: 'arial, sans-serif',
            color: '#333'
        }
    },
    title: {
        style: {
            color: '#222',
            fontSize: '21px',
            fontWeight: 'bold'
        }
    },
    subtitle: {
        style: {
            fontSize: '16px',
            fontWeight: 'bold'
        }
    },
    xAxis: {
        lineWidth: 1,
        lineColor: '#cccccc',
        tickWidth: 1,
        tickColor: '#cccccc',
        labels: {
            style: {
                fontSize: '12px'
            }
        }
    },
    yAxis: {
        gridLineWidth: 1,
        gridLineColor: '#d9d9d9',
        labels: {
            style: {
                fontSize: '12px'
            }
        }
    },
    legend: {
        itemStyle: {
            color: '#666',
            fontSize: '9px'
        },
        itemHoverStyle: {
            color: '#222'
        }
    }
};

Highcharts.setOptions( Highcharts.customTheme );

We start off by modifying the Highcharts object to include an object literal named customTheme that contains styles for our charts. Inside customTheme, the first option we defined is for series colors. We passed an array containing eight colors to be applied to series. In the next part, we defined a radial gradient as a background for our charts and also defined the default font family and text color. The next two object literals contain basic font styles for the title and subtitle components.

Then come the styles for the x and y axes. For the xAxis, we define lineColor and tickColor to be #cccccc with a lineWidth value of 1. The xAxis component also contains the font style for its labels. The y axis gridlines, which appear parallel to the x axis, have been modified to have a width of 1 and a color of #d9d9d9. Inside the legend component, we defined styles for the normal and mouse hover states.
These two states are styled by itemStyle and itemHoverStyle respectively. In the normal state, the legend will have a color of #666 and a font size of 9px. When hovered over, the color will change to #222.

In the final part, we set our theme as the default Highcharts theme by using the API method Highcharts.setOptions(), which takes a settings object to be applied to Highcharts; in our case, it is customTheme. The styles that have not been defined in our custom theme will remain the same as the default theme. This allows us to partially customize a predefined theme by introducing another theme containing different styles.

In order to make this theme work, include the file custom-theme.js after the highcharts.js file:

<script src="js/highcharts.js"></script>
<script src="js/custom-theme.js"></script>

The output of our custom theme is as follows:

We can also tell our theme to include a web font from Google without needing to include the style sheet manually in the header, as we did in a previous section. For that purpose, Highcharts provides a utility method named Highcharts.createElement(). We can use it as follows by placing the code inside the custom-theme.js file:

Highcharts.createElement( 'link', {
    href: 'http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,700italic,400,300,700',
    rel: 'stylesheet',
    type: 'text/css'
}, null, document.getElementsByTagName( 'head' )[0], null );

The first argument is the name of the tag to be created. The second argument takes an object as tag attributes. The third argument is for CSS styles to be applied to this element. Since there is no need for CSS styles on a link element, we passed null as its value. The final two arguments are for the parent node and padding, respectively.

We can now change the default font family for our charts to 'Open Sans':

chart: {
    ...
    style: {
        fontFamily: "'Open Sans', sans-serif",
        ...
    }
}

The specified Google web font will now be loaded every time a chart with our custom theme is initialized, hence eliminating the need to manually insert the required font style sheet inside the <head> tag. This screenshot shows a chart with the 'Open Sans' Google web font.

Summary

In this article, you learned about incorporating Google fonts and jQuery UI easing into your charts for enhanced styling.

Resources for Article:

Further resources on this subject:

Integrating with other Frameworks [Article]
Highcharts [Article]
More Line Charts, Area Charts, and Scatter Plots [Article]


Hosting the service in IIS using the TCP protocol

Packt
30 Oct 2014
8 min read
In this article by Mike Liu, the author of WCF Multi-layer Services Development with Entity Framework, Fourth Edition, we will learn how to create and host a service in IIS using the TCP protocol.

Hosting WCF services in IIS using the HTTP protocol gives the best interoperability to the service, because the HTTP protocol is supported everywhere today. However, sometimes interoperability might not be an issue. For example, the service may be invoked only within your network with all Microsoft clients only. In this case, hosting the service by using the TCP protocol might be a better solution.

Benefits of hosting a WCF service using the TCP protocol

Compared to HTTP, there are a few benefits in hosting a WCF service using the TCP protocol:

It supports connection-based, stream-oriented delivery services with end-to-end error detection and correction
It is the fastest WCF binding for scenarios that involve communication between different machines
It supports duplex communication, so it can be used to implement duplex contracts
It has a reliable data delivery capability (this is applied between two TCP/IP nodes and is not the same thing as WS-ReliableMessaging, which applies between endpoints)

Preparing the folders and files

First, we need to prepare the folders and files for the host application, just as we did for hosting the service using the HTTP protocol. We will use the previous HTTP hosting application as the base to create the new TCP hosting application:

1. Create the folders: In Windows Explorer, create a new folder called HostIISTcp under C:\SOAwithWCFandEF\Projects\HelloWorld and a new subfolder called bin under the HostIISTcp folder. You should now have the following new folders: C:\SOAwithWCFandEF\Projects\HelloWorld\HostIISTcp and a bin folder inside the HostIISTcp folder.
2. Copy the files: Now, copy all the files from the HostIIS hosting application folder at C:\SOAwithWCFandEF\Projects\HelloWorld\HostIIS to the new folder that we created at C:\SOAwithWCFandEF\Projects\HelloWorld\HostIISTcp.
3. Create the Visual Studio solution folder: To make it easier to view and manage from the Visual Studio Solution Explorer, you can add a new solution folder, HostIISTcp, to the solution and add the Web.config file to this folder. Add another new solution folder, bin, under HostIISTcp and add the HelloWorldService.dll and HelloWorldService.pdb files under this bin folder. Add the following post-build events to the HelloWorldService project, so that next time, all the files will be copied automatically when the service project is built:

xcopy "$(AssemblyName).dll" "C:\SOAwithWCFandEF\Projects\HelloWorld\HostIISTcp\bin" /Y
xcopy "$(AssemblyName).pdb" "C:\SOAwithWCFandEF\Projects\HelloWorld\HostIISTcp\bin" /Y

4. Modify the Web.config file: The Web.config file that we have copied from HostIIS is using the default basicHttpBinding as the service binding. To make our service use the TCP binding, we need to change the binding to TCP and add a TCP base address. Open the Web.config file and add the following node to it under the <system.serviceModel> node:

<services>
  <service name="HelloWorldService.HelloWorldService">
    <endpoint address="" binding="netTcpBinding"
      contract="HelloWorldService.IHelloWorldService"/>
    <host>
      <baseAddresses>
        <add baseAddress="net.tcp://localhost/HelloWorldServiceTcp/"/>
      </baseAddresses>
    </host>
  </service>
</services>

In this new services node, we have defined one service called HelloWorldService.HelloWorldService.
The base address of this service is net.tcp://localhost/HelloWorldServiceTcp/. Remember, we have defined the host activation relative address as ./HelloWorldService.svc, so we can invoke this service from the client application with the following URL: http://localhost/HelloWorldServiceTcp/HelloWorldService.svc.

For the file-less WCF activation, if no endpoint is defined explicitly, HTTP and HTTPS endpoints will be defined by default. In this example, we would like to expose only one TCP endpoint, so we have added an endpoint explicitly (as soon as this endpoint is added explicitly, the default endpoints will not be added). If you don't add this TCP endpoint explicitly here, the TCP client that we will create in the next section will still work, but on the client config file you will see three endpoints instead of one and you will have to specify which endpoint you are using in the client program.

The following is the full content of the Web.config file:

<?xml version="1.0"?>
<!--
For more information on how to configure your ASP.NET application, please visit
http://go.microsoft.com/fwlink/?LinkId=169433
-->
<configuration>
  <system.web>
    <compilation debug="true" targetFramework="4.5"/>
    <httpRuntime targetFramework="4.5" />
  </system.web>
  <system.serviceModel>
    <serviceHostingEnvironment>
      <serviceActivations>
        <add factory="System.ServiceModel.Activation.ServiceHostFactory"
          relativeAddress="./HelloWorldService.svc"
          service="HelloWorldService.HelloWorldService"/>
      </serviceActivations>
    </serviceHostingEnvironment>
    <behaviors>
      <serviceBehaviors>
        <behavior>
          <serviceMetadata httpGetEnabled="true"/>
        </behavior>
      </serviceBehaviors>
    </behaviors>
    <services>
      <service name="HelloWorldService.HelloWorldService">
        <endpoint address="" binding="netTcpBinding"
          contract="HelloWorldService.IHelloWorldService"/>
        <host>
          <baseAddresses>
            <add baseAddress="net.tcp://localhost/HelloWorldServiceTcp/"/>
          </baseAddresses>
        </host>
      </service>
    </services>
  </system.serviceModel>
</configuration>

Enabling the TCP WCF activation for the host machine

By default, the TCP WCF activation service is not enabled on your machine. This means your IIS server won't be able to host a WCF service with the TCP protocol. You can follow these steps to enable the TCP activation for WCF services:

1. Go to Control Panel | Programs | Turn Windows features on or off.
2. Expand the Microsoft .Net Framework 3.5.1 node on Windows 7 or .Net Framework 4.5 Advanced Services on Windows 8.
3. Check the checkbox for Windows Communication Foundation Non-HTTP Activation on Windows 7 or TCP Activation on Windows 8.

The following screenshot depicts the options required to enable WCF activation on Windows 7:

The following screenshot depicts the options required to enable TCP WCF activation on Windows 8:

4. Repair the .NET Framework: After you have turned on the TCP WCF activation, you have to repair .NET. Just go to Control Panel, click on Uninstall a Program, select Microsoft .NET Framework 4.5.1, and then click on Repair.

Creating the IIS application

Next, we need to create an IIS application named HelloWorldServiceTcp to host the WCF service, using the TCP protocol. Follow these steps to create this application in IIS:

1. Open IIS Manager.
2. Add a new IIS application, HelloWorldServiceTcp, pointing to the HostIISTcp physical folder under your project's folder.
3. Choose DefaultAppPool as the application pool for the new application. Again, make sure your default app pool is a .NET 4.0.30319 application pool.
4. Enable the TCP protocol for the application. Right-click on HelloWorldServiceTcp, select Manage Application | Advanced Settings, and then add net.tcp to Enabled Protocols. Make sure you use all lowercase letters and separate it from the existing HTTP protocol with a comma.

Now the service is hosted in IIS using the TCP protocol. To view the WSDL of the service, browse to http://localhost/HelloWorldServiceTcp/HelloWorldService.svc and you should see the service description and a link to the WSDL of the service.

Testing the WCF service hosted in IIS using the TCP protocol

Now that we have the service hosted in IIS using the TCP protocol, let's create a new test client to test it:

1. Add a new console application project to the solution, named HelloWorldClientTcp.
2. Add a reference to System.ServiceModel in the new project.
3. Add a service reference to the WCF service in the new project, naming the reference HelloWorldServiceRef and using the URL http://localhost/HelloWorldServiceTcp/HelloWorldService.svc?wsdl. You can still use the SvcUtil.exe command-line tool to generate the proxy and config files for the service hosted with TCP, just as we did in previous sections. Actually, behind the scenes Visual Studio is also calling SvcUtil.exe to generate the proxy and config files.
4. Add the following code to the Main method of the new project:

var client = new HelloWorldServiceRef.HelloWorldServiceClient();
Console.WriteLine(client.GetMessage("Mike Liu"));

5. Finally, set the new project as the startup project.

Now, if you run the program, you will get the same result as before; however, this time the service is hosted in IIS using the TCP protocol.

Summary

In this article, we created and tested an IIS application to host the service with the TCP protocol.

Resources for Article:

Further resources on this subject:

Microsoft WCF Hosting and Configuration [Article]
Testing and Debugging Windows Workflow Foundation 4.0 (WF) Program [Article]
Applying LINQ to Entities to a WCF Service [Article]


Data visualization

Packt
27 Oct 2014
8 min read
Data visualization is one of the most important tasks in the data science workflow. Through effective visualization, we can easily uncover underlying patterns among variables without doing any sophisticated statistical analysis. In this cookbook, we have focused on graphical analysis using R in a very simple way, with each example kept independent. We have covered default R functionality along with more advanced visualization techniques such as lattice, ggplot2, and three-dimensional plots. Readers will not only learn the code to produce a graph but also, through specific examples, why certain code has been written.

R Graphs Cookbook, Second Edition, written by Jaynal Abedin and Hrishi V. Mittal, is a book in which the user will learn how to produce various graphs using R, how to customize them, and finally how to make them ready for publication. This practical recipe book starts with a very brief description of the R graphics system and then gradually works through basic to advanced plots with examples. Besides the default R graphics, this recipe book introduces advanced graphics systems such as lattice and ggplot2, the grammar of graphics. We have also provided examples of how to inspect large datasets using advanced visualizations such as tableplots and three-dimensional plots. We also cover the following topics:

How to create various types of bar charts using default R functions, lattice, and ggplot2
How to produce density plots along with histograms using lattice and ggplot2 and customize them for publication
How to produce graphs of frequency-tabulated data
How to inspect a large dataset by simultaneously visualizing numeric and categorical variables in a single plot
How to annotate graphs using ggplot2

This recipe book is targeted at readers who are already exposed to R programming and want to learn effective graphics with the power of R and its various libraries. This hands-on guide starts with a very short introduction to the R graphics system and then gets straight to the point – actually creating graphs, instead of just theoretical learning. Each recipe is specifically tailored to fulfill the reader's appetite for visually representing data in the best way possible. Now, we will present a few examples so that you can get an idea of the content of this recipe book.

The ggplot2 R package is based on The Grammar of Graphics (Leland Wilkinson, Springer). Using this package, we can produce a variety of traditional graphics, and the user can produce customized graphs as well. The beauty of this package is in its layered graphics facilities; through the use of layered graphics utilities, we can produce almost any kind of data visualization. Recently, ggplot2 has been among the most searched keywords in the R community, including on the most popular R blog (www.r-bloggers.com). The comprehensive theme system allows the user to produce publication-quality graphs with a variety of themes of choice. If we want to explain this package in a single sentence, we can say that if whatever we can think about in data visualization can be structured in a data frame, the visualization is a matter of a few seconds. In the specific chapter on ggplot2, we will see different examples and use themes to produce publication-quality graphs. However, in this introductory chapter, we will show you one of the important features of the ggplot2 package that produces various types of graphs.
The main function is ggplot(), but with the help of a different geom function, we can easily produce different types of graphs, such as the following:

geom_point(): This will create a scatter plot
geom_line(): This will create a line chart
geom_bar(): This will create a bar chart
geom_boxplot(): This will create a box plot
geom_text(): This will write certain text inside the plot area

Now, we will see a simple example of the use of different geom functions with the default R mtcars dataset:

# loading ggplot2 library
library(ggplot2)
# creating a basic ggplot object
p <- ggplot(data=mtcars)
# Creating scatter plot of mpg and disp variable
p1 <- p+geom_point(aes(x=disp,y=mpg))
# creating line chart from the same ggplot object but different geom function
p2 <- p+geom_line(aes(x=disp,y=mpg))
# creating bar chart of mpg variable
p3 <- p+geom_bar(aes(x=mpg))
# creating boxplot of mpg over gear
p4 <- p+geom_boxplot(aes(x=factor(gear),y=mpg))
# writing certain text into the scatter plot
p5 <- p1+geom_text(x=200,y=25,label="Scatter plot")

The visualization of the preceding five plots will look like the following figure:

Visualizing an empirical Cumulative Distribution function

The empirical Cumulative Distribution function (CDF) is the non-parametric maximum-likelihood estimation of the CDF. In this recipe, we will see how the empirical CDF can be produced.

Getting ready

To produce this plot, we need to use the latticeExtra library. We will use the simulated dataset as shown in the following code:

# Set a seed value to make the data reproducible
set.seed(12345)
qqdata <- data.frame(disA=rnorm(n=100,mean=20,sd=3),
               disB=rnorm(n=100,mean=25,sd=4),
               disC=rnorm(n=100,mean=15,sd=1.5),
               age=sample((c(1,2,3,4)),size=100,replace=T),
               sex=sample(c("Male","Female"),size=100,replace=T),
               econ_status=sample(c("Poor","Middle","Rich"),size=100,replace=T))

How to do it...

To plot an empirical CDF, we first need to call the latticeExtra library (note that this library has a dependency on RColorBrewer). Now, to plot the empirical CDF, we can use the following simple code:

library(latticeExtra)
ecdfplot(~disA|sex,data=qqdata)

Graph annotation with ggplot

To produce publication-quality data visualization, we often need to annotate the graph with various texts, symbols, or even shapes. In this recipe, we will see how we can easily annotate an existing graph.

Getting ready

In this recipe, we will use the disA and disD variables from ggplotdata. Let's call ggplotdata for this recipe. We also need to call the grid and gridExtra libraries for this recipe.

How to do it...

In this recipe, we will execute the following annotation on an existing scatter plot.
So, the whole procedure will be as follows:

1. Create a scatter plot
2. Add customized text within the plot
3. Highlight a certain region to indicate extreme values
4. Draw a line segment with an arrow within the scatter plot to indicate a single extreme observation

Now, we will implement each of the steps one by one:

library(grid)
library(gridExtra)

# creating scatter plot and printing it
annotation_obj <- ggplot(data=ggplotdata,aes(x=disA,y=disD))+geom_point()
annotation_obj

# Adding custom text at (18,29) position
annotation_obj1 <- annotation_obj + annotate(geom="text",x=18,y=29,label="Extreme value",size=3)
annotation_obj1

# Highlighting certain regions with a box
annotation_obj2 <- annotation_obj1 + annotate("rect", xmin = 24, xmax = 27, ymin=17, ymax=22, alpha = .2)
annotation_obj2

# Drawing a line segment with an arrow
annotation_obj3 <- annotation_obj2 + annotate("segment", x = 16, xend=17.5, y=25, yend=27.5, colour="red", arrow = arrow(length = unit(0.5, "cm")), size=2)
annotation_obj3

The preceding four steps are displayed in the following single graph:

How it works...

The annotate() function takes a geom such as "segment" or "text" as its input, along with arguments that specify where that geom should be drawn or placed. In this particular recipe, we used three geom instances: text to write customized text within the plot, rect to highlight a certain region in the plot, and segment to draw an arrow. The alpha argument represents the transparency of the region, and the size argument represents the size of the text and the line width of the line segment.

Summary

This article gives a sample of the kinds of recipes included in the book and shows how each recipe is structured.

Resources for Article:

Further resources on this subject:

Using R for Statistics, Research, and Graphics [Article]
First steps with R [Article]
Aspects of Data Manipulation in R [Article]


Clustering with K-Means

Packt
27 Oct 2014
9 min read
In this article by Gavin Hackeling, the author of Mastering Machine Learning with scikit-learn, we will discuss an unsupervised learning task called clustering. Clustering is used to find groups of similar observations within a set of unlabeled data. We will discuss the K-Means clustering algorithm, apply it to an image compression problem, and learn to measure its performance. Finally, we will work through a semi-supervised learning problem that combines clustering with classification.

Clustering, or cluster analysis, is the task of grouping observations such that members of the same group, or cluster, are more similar to each other by some metric than they are to the members of the other clusters. As with supervised learning, we will represent an observation as an n-dimensional vector. For example, assume that your training data consists of the samples plotted in the following figure:

Clustering might reveal the following two groups, indicated by squares and circles:

Clustering could also reveal the following four groups:

Clustering is commonly used to explore a data set. Social networks can be clustered to identify communities, and to suggest missing connections between people. In biology, clustering is used to find groups of genes with similar expression patterns. Recommendation systems sometimes employ clustering to identify products or media that might appeal to a user. In marketing, clustering is used to find segments of similar consumers. In the following sections, we will work through an example of using the K-Means algorithm to cluster a data set.

Clustering with the K-Means Algorithm

The K-Means algorithm is a clustering method that is popular because of its speed and scalability. K-Means is an iterative process of moving the centers of the clusters, or the centroids, to the mean position of their constituent points, and re-assigning instances to their closest clusters. The titular K is a hyperparameter that specifies the number of clusters that should be created; K-Means automatically assigns observations to clusters but cannot determine the appropriate number of clusters. K must be a positive integer that is less than the number of instances in the training set. Sometimes the number of clusters is specified by the clustering problem's context. For example, a company that manufactures shoes might know that it is able to support manufacturing three new models. To understand what groups of customers to target with each model, it surveys customers and creates three clusters from the results. That is, the value of K was specified by the problem's context. Other problems may not require a specific number of clusters, and the optimal number of clusters may be ambiguous. We will discuss a heuristic for estimating the optimal number of clusters called the elbow method later in this article.

The parameters of K-Means are the positions of the clusters' centroids and the observations that are assigned to each cluster. Like generalized linear models and decision trees, the optimal values of K-Means' parameters are found by minimizing a cost function. The cost function for K-Means is given by the following equation:

J = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2

where \mu_k is the centroid for cluster k and C_k is the set of instances assigned to that cluster. The cost function sums the distortions of the clusters. Each cluster's distortion is equal to the sum of the squared distances between its centroid and its constituent instances. The distortion is small for compact clusters, and large for clusters that contain scattered instances.
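As a quick illustration of this cost function, the short sketch below computes the total distortion for a toy set of points, assignments, and centroids in NumPy. The points, cluster assignments, and centroid positions are made-up values for the example only; they are not taken from the article.

# Illustrative sketch: the K-Means cost (total distortion) is the sum of
# squared distances between each point and the centroid of its cluster.
# All values below are arbitrary example data.
import numpy as np

points = np.array([[7, 5], [5, 7], [4, 6], [1, 4], [8, 7], [5, 5]], dtype=float)
assignments = np.array([1, 0, 0, 0, 1, 1])          # cluster index for each point
centroids = np.array([[3.33, 5.67], [6.67, 5.67]])  # one row per cluster centroid

# Distortion of each cluster: sum over its members of ||x_i - mu_k||^2
total_cost = 0.0
for k, mu in enumerate(centroids):
    members = points[assignments == k]
    total_cost += np.sum(np.linalg.norm(members - mu, axis=1) ** 2)

print(f"Total distortion J = {total_cost:.3f}")

Recomputing this quantity after each assignment-and-update pass is one way to watch the cost fall as the algorithm described next iterates.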
The parameters that minimize the cost function are learned through an iterative process of assigning observations to clusters and then moving the clusters. First, the clusters' centroids are initialized to random positions. In practice, setting the centroids' positions equal to the positions of randomly selected observations yields the best results. During each iteration, K-Means assigns observations to the cluster that they are closest to, and then moves the centroids to their assigned observations' mean location.

Let's work through an example by hand using the training data shown in the following table:

Instance  X0  X1
1         7   5
2         5   7
3         7   7
4         3   3
5         4   6
6         1   4
7         0   0
8         2   2
9         8   7
10        6   8
11        5   5
12        3   7

There are two explanatory variables; each instance has two features. The instances are plotted in the following figure.

Assume that K-Means initializes the centroid for the first cluster to the fifth instance and the centroid for the second cluster to the eleventh instance. For each instance, we will calculate its distance to both centroids, and assign it to the cluster with the closest centroid. The initial assignments are shown in the "Cluster" column of the following table:

Instance     X0  X1  C1 distance  C2 distance  Last Cluster  Cluster  Changed?
1            7   5   3.16228      2            None          C2       Yes
2            5   7   1.41421      2            None          C1       Yes
3            7   7   3.16228      2.82843      None          C2       Yes
4            3   3   3.16228      2.82843      None          C2       Yes
5            4   6   0            1.41421      None          C1       Yes
6            1   4   3.60555      4.12311      None          C1       Yes
7            0   0   7.21110      7.07107      None          C2       Yes
8            2   2   4.47214      4.24264      None          C2       Yes
9            8   7   4.12311      3.60555      None          C2       Yes
10           6   8   2.82843      3.16228      None          C1       Yes
11           5   5   1.41421      0            None          C2       Yes
12           3   7   1.41421      2.82843      None          C1       Yes
C1 centroid  4   6
C2 centroid  5   5

The plotted centroids and the initial cluster assignments are shown in the following graph. Instances assigned to the first cluster are marked with "Xs", and instances assigned to the second cluster are marked with dots. The markers for the centroids are larger than the markers for the instances.

Now we will move both centroids to the means of their constituent instances, re-calculate the distances of the training instances to the centroids, and re-assign the instances to the closest centroids:

Instance     X0        X1        C1 distance  C2 distance  Last Cluster  New Cluster  Changed?
1            7         5         3.492850     2.575394     C2            C2           No
2            5         7         1.341641     2.889107     C1            C1           No
3            7         7         3.255764     3.749830     C2            C1           Yes
4            3         3         3.492850     1.943067     C2            C2           No
5            4         6         0.447214     1.943067     C1            C1           No
6            1         4         3.687818     3.574285     C1            C2           Yes
7            0         0         7.443118     6.169378     C2            C2           No
8            2         2         4.753946     3.347250     C2            C2           No
9            8         7         4.242641     4.463000     C2            C1           Yes
10           6         8         2.720294     4.113194     C1            C1           No
11           5         5         1.843909     0.958315     C2            C2           No
12           3         7         1            3.260775     C1            C1           No
C1 centroid  3.8       6.4
C2 centroid  4.571429  4.142857

The new clusters are plotted in the following graph. Note that the centroids are diverging, and several instances have changed their assignments.

Now we will move the centroids to the means of their constituents' locations again, and re-assign the instances to their nearest centroids. The centroids continue to diverge, as shown in the following figure.

None of the instances' centroid assignments will change in the next iteration; K-Means will continue iterating until some stopping criterion is satisfied. Usually, this criterion is either a threshold for the difference between the values of the cost function for subsequent iterations, or a threshold for the change in the positions of the centroids between subsequent iterations. If these stopping criteria are small enough, K-Means will converge on an optimum.
This optimum will not necessarily be the global optimum.

Local Optima

Recall that K-Means initially sets the positions of the clusters' centroids to the positions of randomly selected observations. Sometimes the random initialization is unlucky, and the centroids are set to positions that cause K-Means to converge to a local optimum. For example, assume that K-Means randomly initializes two cluster centroids to the following positions:

K-Means will eventually converge on a local optimum like that shown in the following figure. These clusters may be informative, but it is more likely that the top and bottom groups of observations are more informative clusters. To avoid local optima, K-Means is often repeated dozens or hundreds of times. In each iteration, it is randomly initialized to different starting cluster positions. The initialization that minimizes the cost function best is selected.

The Elbow Method

If K is not specified by the problem's context, the optimal number of clusters can be estimated using a technique called the elbow method. The elbow method plots the value of the cost function produced by different values of K. As K increases, the average distortion will decrease; each cluster will have fewer constituent instances, and the instances will be closer to their respective centroids. However, the improvements to the average distortion will decline as K increases. The value of K at which the improvement to the distortion declines the most is called the elbow.

Let's use the elbow method to choose the number of clusters for a data set. The following scatter plot visualizes a data set with two obvious clusters. We will calculate and plot the mean distortion of the clusters for each value of K from one to ten with the following:

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from scipy.spatial.distance import cdist
>>> import matplotlib.pyplot as plt
>>> cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
>>> cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
>>> X = np.hstack((cluster1, cluster2)).T
>>> K = range(1, 10)
>>> meandistortions = []
>>> for k in K:
...     kmeans = KMeans(n_clusters=k)
...     kmeans.fit(X)
...     meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
>>> plt.plot(K, meandistortions, 'bx-')
>>> plt.xlabel('k')
>>> plt.ylabel('Average distortion')
>>> plt.title('Selecting k with the Elbow Method')
>>> plt.show()

The average distortion improves rapidly as we increase K from one to two. There is little improvement for values of K greater than two. Now let's use the elbow method on the following data set with three clusters:

The following is the elbow plot for the data set. From this we can see that the rate of improvement to the average distortion declines the most when adding a fourth cluster. That is, the elbow method confirms that K should be set to three for this data set.

Summary

In this article, we explained what clustering is and walked through the K-Means clustering algorithm.

Resources for Article:

Further resources on this subject:

Machine Learning in IPython with scikit-learn [Article]
Machine Learning in Bioinformatics [Article]
Specialized Machine Learning Topics [Article]
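As a closing, optional illustration added here (it is not part of the original article), the snippet below fits scikit-learn's KMeans with K=2 to the same 12 instances used in the hand-worked example above and prints the fitted centroids, assignments, and total distortion. The exact label numbering can vary between runs, so only the grouping, not the label values, should be compared with the tables.

# Illustrative check of the worked example using scikit-learn's KMeans (K=2).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [7, 5], [5, 7], [7, 7], [3, 3], [4, 6], [1, 4],
    [0, 0], [2, 2], [8, 7], [6, 8], [5, 5], [3, 7],
], dtype=float)

# n_init repeats the random initialization to reduce the risk of a local optimum.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("Cluster assignments:", labels)
print("Total distortion (inertia):", kmeans.inertia_)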


The EMR Architecture

Packt
27 Oct 2014
6 min read
This article is written by Amarkant Singh and Vijay Rayapati, the authors of Learning Big Data with Amazon Elastic MapReduce. The goal of this article is to introduce you to the EMR architecture and EMR use cases.

Traditionally, very few companies had access to large-scale infrastructure to build Big Data applications. However, cloud computing has democratized access to infrastructure, allowing developers and companies to quickly perform new experiments without worrying about the need for setting up or scaling infrastructure. A cloud provides an infrastructure-as-a-service platform to allow businesses to build applications and host them reliably with scalable infrastructure. It includes a variety of application-level services to help developers accelerate their development and deployment times. Amazon EMR is one of the hosted services provided by AWS and is built on top of a scalable AWS infrastructure to build Big Data applications.

The EMR architecture

Let's get familiar with EMR. This section outlines the key concepts of EMR.

Hadoop offers distributed processing by using the MapReduce framework for execution of tasks on a set of servers or compute nodes (also known as a cluster). One of the nodes in the Hadoop cluster controls the distribution of tasks to the other nodes and is called the Master Node. The nodes executing the tasks using MapReduce are called Slave Nodes.

Amazon EMR is designed to work with many other AWS services such as S3 for input/output data storage, DynamoDB, and Redshift for output data. EMR uses AWS CloudWatch metrics to monitor the cluster performance and raise notifications for user-specified alarms. We can create on-demand Hadoop clusters using EMR while storing the input and output data in S3, without worrying about managing a 24*7 cluster or HDFS for data storage. The Amazon EMR job flow is shown in the following diagram:

Types of nodes

Amazon EMR provides three different roles for the servers or nodes in the cluster, and they map to the Hadoop roles of master and slave nodes. When you create an EMR cluster, it's called a Job Flow, which is created to execute a set of jobs or job steps one after the other:

Master node: This node controls and manages the cluster. It distributes the MapReduce tasks to nodes in the cluster and monitors the status of task execution. Every EMR cluster will have only one master node in a master instance group.

Core nodes: These nodes will execute MapReduce tasks and provide HDFS for storing the data related to task execution. The EMR cluster will have core nodes as part of it in a core instance group. The core node is related to the slave node in Hadoop. So, basically, these nodes have a two-fold responsibility: the first is to execute the map and reduce tasks allocated by the master, and the second is to hold the data blocks.

Task nodes: These nodes are used only for MapReduce task execution and they are optional while launching the EMR cluster. The task node is related to the slave node in Hadoop and is part of a task instance group in EMR.

When you scale down your clusters, you cannot remove any core nodes. This is because EMR doesn't want to let you lose your data blocks. You can remove nodes from a task group while scaling down your cluster. You should also be using only task instance groups for spot instances, as spot instances can be taken away as per your bid price and you would not want to lose your data blocks.
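To make the master, core, and task instance groups concrete, the following sketch launches an EMR cluster with boto3 and requests the task group as spot capacity, reflecting the advice above. This is an illustration added here rather than code from the article; the release label, instance types, counts, bid price, and IAM role names are placeholders you would replace with values appropriate to your account.

# Hypothetical sketch: create an EMR cluster whose master, core, and task
# roles map to separate instance groups, with the task group on spot capacity.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

response = emr.run_job_flow(
    Name="operational-analytics-cluster",
    ReleaseLabel="emr-5.36.0",                 # placeholder release label
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes hold no HDFS blocks, so spot interruptions are safer here.
            {"Name": "Task", "InstanceRole": "TASK",
             "Market": "SPOT", "BidPrice": "0.10",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # placeholder instance profile
    ServiceRole="EMR_DefaultRole",       # placeholder service role
)

print("Started cluster:", response["JobFlowId"])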
EMR use cases

Amazon EMR can be used to build a variety of applications, such as recommendation engines, data analysis, log processing, event/clickstream analysis, data transformations (ETL), fraud detection, scientific simulations, genomics, financial analysis, or data correlation in various industries. The following sections outline some of these use cases in detail.

Web log processing

We can use EMR to process logs to understand the usage of content such as videos, file downloads, the top web URLs accessed by end users, user consumption from different parts of the world, and much more. We can process any web or mobile application logs using EMR to extract the business insights that matter to your business. We can move all our web or mobile application logs to Amazon S3 for analysis using EMR, even if we are not using AWS to run our production applications.

Clickstream analysis

By using clickstream analysis, we can segment users into different groups and understand their behavior with respect to advertisements or application usage. Ad networks or advertisers can perform clickstream analysis on ad-impression logs to deliver more effective campaigns or advertisements to end users. Reports generated from this analysis can include various metrics, such as source traffic distribution, purchase funnel, lead source ROI, and abandoned carts, among others.

Product recommendation engine

Recommendation engines can be built using EMR for e-commerce, retail, or web businesses. Many e-commerce businesses have a large inventory of products across different categories and regularly add new products or categories, which makes it very difficult for end users to search for and identify products quickly. With recommendation engines, we can help end users quickly find relevant products, or suggest products based on what they are viewing, and so on. We may also want to notify users via e-mail based on their past purchase behavior.

Scientific simulations

When you need distributed processing on large-scale infrastructure for scientific or research simulations, EMR can be of great help. We can launch large clusters in a matter of minutes and install specific MapReduce programs for analysis using EMR. AWS also offers genomics datasets for free on S3.

Data transformations

We can perform complex extract, transform, and load (ETL) processes using EMR for either data analysis or data warehousing needs. It can be as simple as transforming XML data into JSON data for further usage, or moving all financial transaction records of a bank into a common date-time format for archiving purposes. You can also use EMR to move data between different systems in AWS, such as DynamoDB, Redshift, S3, and many more. (A minimal sketch of submitting such a transformation step to a running cluster follows this article's resource list.)

Summary

In this article, we learned about the EMR architecture and understood the concepts related to its various node types in detail.

Resources for Article:

Further resources on this subject:

Introduction to MapReduce [Article]
Understanding MapReduce [Article]
HDFS and MapReduce [Article]
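To make the data transformation use case above a little more concrete, here is the sketch referred to in that section: it submits a single Hadoop streaming step to an already running cluster using boto3. The cluster ID, script location, and bucket paths are hypothetical, and the mapper script is assumed to exist in S3.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

step = {
    "Name": "XML-to-JSON transform",                # hypothetical step name
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hadoop-streaming",
            "-files", "s3://my-example-bucket/scripts/xml_to_json.py",
            "-mapper", "xml_to_json.py",
            "-numReduceTasks", "0",                 # map-only transformation
            "-input", "s3://my-example-bucket/raw-xml/",
            "-output", "s3://my-example-bucket/json-output/",
        ],
    },
}

response = emr.add_job_flow_steps(JobFlowId="j-EXAMPLE12345", Steps=[step])
print("Submitted step IDs: %s" % response["StepIds"])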
Packt Editorial Staff
12 Oct 2014
7 min read

Installing NumPy, SciPy, matplotlib, and IPython

This article written by Ivan Idris, author of the book Python Data Analysis, will guide you through installing NumPy, SciPy, matplotlib, and IPython. A mind map describing software that can be used for data analysis can be found at https://www.xmind.net/m/WvfC/. Obviously, we can't install all of this software in this article. We will install NumPy, SciPy, matplotlib, and IPython on different operating systems.

Packt has the following books that are focused on NumPy:

NumPy Beginner's Guide Second Edition, Ivan Idris
NumPy Cookbook, Ivan Idris
Learning NumPy Array, Ivan Idris

SciPy is a scientific Python library, which supplements and slightly overlaps NumPy. NumPy and SciPy historically shared their codebase, but were later separated. matplotlib is a plotting library based on NumPy. IPython provides an architecture for interactive computing; the most notable part of this project is the IPython shell.

Software used

The software used in this article is based on Python, so it is required to have Python installed. On some operating systems, Python is already installed. You, however, need to check whether the Python version is compatible with the software version you want to install. There are many implementations of Python, including commercial implementations and distributions.

You can download Python from https://www.python.org/download/. On this website, we can find installers for Windows and Mac OS X, as well as source archives for Linux, Unix, and Mac OS X.

The software we will install has binary installers for Windows, various Linux distributions, and Mac OS X. There are also source distributions, if you prefer that. You need to have Python 2.4.x or above installed on your system. Python 2.7.x is currently the best Python version to have because most scientific Python libraries support it. Python 2.7 will be supported and maintained until 2020. After that, we will have to switch to Python 3.

Installing software and setup on Windows

Installing on Windows is, fortunately, a straightforward task that we will cover in detail. You only need to download an installer, and a wizard will guide you through the installation steps. We will give the steps to install NumPy here; the steps to install the other libraries are similar. The actions we will take are as follows:

1. Download installers for Windows from the SourceForge website (refer to the following list). The latest release versions may change, so just choose the one that fits your setup best.

NumPy: http://sourceforge.net/projects/numpy/files/ (latest version 1.8.1)
SciPy: http://sourceforge.net/projects/scipy/files/ (latest version 0.14.0)
matplotlib: http://sourceforge.net/projects/matplotlib/files/ (latest version 1.3.1)
IPython: http://archive.ipython.org/release/ (latest version 2.0.0)

2. Choose the appropriate version. In this example, we chose numpy-1.8.1-win32-superpack-python2.7.exe.
3. Open the EXE installer by double-clicking on it.
4. Now, we can see a description of NumPy and its features. Click on the Next button.
5. If you have Python installed, it should automatically be detected. If it is not detected, maybe your path settings are wrong. Click on the Next button if Python is found; otherwise, click on the Cancel button and install Python (NumPy cannot be installed without Python). Click on the Next button.
6. This is the point of no return. Well, kind of, but it is best to make sure that you are installing to the proper directory and so on and so forth. Now the real installation starts.
This may take a while.

The situation around installers is rapidly evolving. Other alternatives exist in various stages of maturity (see https://www.scipy.org/install.html). It might be necessary to put the msvcp71.dll file in your C:\Windows\system32 directory. You can get it from http://www.dll-files.com/dllindex/dll-files.shtml?msvcp71.

Installing software and setup on Linux

Installing the recommended software on Linux depends on the distribution you have. We will discuss how you would install NumPy from the command line, although you could probably use graphical installers; it depends on your distribution (distro). The commands to install matplotlib, SciPy, and IPython are the same – only the package names are different. Installing matplotlib, SciPy, and IPython is recommended, but optional. Most Linux distributions have NumPy packages. We will go through the necessary steps for some of the popular Linux distros:

Run the following instruction from the command line to install NumPy on Red Hat:
$ yum install python-numpy

To install NumPy on Mandriva, run the following command-line instruction:
$ urpmi python-numpy

To install NumPy on Gentoo, run the following command-line instruction:
$ sudo emerge numpy

To install NumPy on Debian or Ubuntu, we need to type the following:
$ sudo apt-get install python-numpy

The following overview lists the Linux distributions and the corresponding package names for NumPy, SciPy, matplotlib, and IPython:

Arch Linux: python-numpy, python-scipy, python-matplotlib, ipython
Debian: python-numpy, python-scipy, python-matplotlib, ipython
Fedora: numpy, python-scipy, python-matplotlib, ipython
Gentoo: dev-python/numpy, scipy, matplotlib, ipython
OpenSUSE: python-numpy and python-numpy-devel, python-scipy, python-matplotlib, ipython
Slackware: numpy, scipy, matplotlib, ipython

Installing software and setup on Mac OS X

You can install NumPy, matplotlib, and SciPy on the Mac with a graphical installer or from the command line with a port manager such as MacPorts, depending on your preference. A prerequisite is to install Xcode, as it is not part of OS X releases. We will install NumPy with a GUI installer using the following steps:

1. We can get a NumPy installer from the SourceForge website http://sourceforge.net/projects/numpy/files/. Similar files exist for matplotlib and SciPy; just change numpy in the previous URL to scipy or matplotlib. IPython didn't have a GUI installer at the time of writing.
2. Download the appropriate DMG file; usually the latest one is the best. Another alternative is the SciPy Superpack (https://github.com/fonnesbeck/ScipySuperpack). Whichever option you choose, it is important to make sure that updates which impact the system Python library don't negatively influence already installed software, by not building against the Python library provided by Apple.
3. Open the DMG file (in this example, numpy-1.8.1-py2.7-python.org-macosx10.6.dmg).
4. Double-click on the icon of the opened box, the one having a subscript that ends with .mpkg. We will be presented with the welcome screen of the installer.
5. Click on the Continue button to go to the Read Me screen, where we will be presented with a short description of NumPy.
6. Click on the Continue button to go to the License screen.
7. Read the license, click on the Continue button, and then on the Accept button when prompted to accept the license. Continue through the next screens and click on the Finish button at the end.

A quick way to verify the result is shown in the snippet that follows; command-line alternatives such as MacPorts and pip are covered after it.
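Whichever installation route you take, the following short Python snippet (our own quick check, not part of the book's text) confirms that the four packages can be imported and prints their versions:

# Import each library and print its version to confirm the installation.
import numpy
import scipy
import matplotlib
import IPython

for module in (numpy, scipy, matplotlib, IPython):
    print("%s %s" % (module.__name__, module.__version__))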
Alternatively, we can install NumPy, SciPy, matplotlib, and IPython through the MacPorts route, or with Fink or Homebrew. The following installation step installs all these packages at once.

For installing with MacPorts, type the following command:
sudo port install py-numpy py-scipy py-matplotlib py-ipython

Installing with setuptools

If you have pip, you can install NumPy, SciPy, matplotlib, and IPython with the following commands:

pip install numpy
pip install scipy
pip install matplotlib
pip install ipython

It may be necessary to prepend sudo to these commands if your current user doesn't have sufficient rights on your system.

Summary

In this article, we installed NumPy, SciPy, matplotlib, and IPython on Windows, Mac OS X, and Linux.

Resources for Article:

Further resources on this subject:

Plotting Charts with Images and Maps
Importing Dynamic Data
Python 3: Designing a Tasklist Application

Packt
10 Oct 2014
40 min read

Indexing and Performance Tuning

In this article by Hans-Jürgen Schönig, author of the book PostgreSQL Administration Essentials, you will be guided through PostgreSQL indexing, and you will learn how to fix performance issues and find performance bottlenecks. Understanding indexing will be vital to your success as a DBA—you cannot count on software engineers to get this right straightaway. It will be you, the DBA, who will face problems caused by bad indexing in the field. For the sake of your beloved sleep at night, this article is about PostgreSQL indexing. (For more resources related to this topic, see here.) Using simple binary trees In this section, you will learn about simple binary trees and how the PostgreSQL optimizer treats the trees. Once you understand the basic decisions taken by the optimizer, you can move on to more complex index types. Preparing the data Indexing does not change user experience too much, unless you have a reasonable amount of data in your database—the more data you have, the more indexing can help to boost things. Therefore, we have to create some simple sets of data to get us started. Here is a simple way to populate a table: test=# CREATE TABLE t_test (id serial, name text);CREATE TABLEtest=# INSERT INTO t_test (name) SELECT 'hans' FROM   generate_series(1, 2000000);INSERT 0 2000000test=# INSERT INTO t_test (name) SELECT 'paul' FROM   generate_series(1, 2000000);INSERT 0 2000000 In our example, we created a table consisting of two columns. The first column is simply an automatically created integer value. The second column contains the name. Once the table is created, we start to populate it. It's nice and easy to generate a set of numbers using the generate_series function. In our example, we simply generate two million numbers. Note that these numbers will not be put into the table; we will still fetch the numbers from the sequence using generate_series to create two million hans and rows featuring paul, shown as follows: test=# SELECT * FROM t_test LIMIT 3;id | name----+------1 | hans2 | hans3 | hans(3 rows) Once we create a sufficient amount of data, we can run a simple test. The goal is to simply count the rows we have inserted. The main issue here is: how can we find out how long it takes to execute this type of query? The timing command will do the job for you: test=# timingTiming is on. As you can see, timing will add the total runtime to the result. This makes it quite easy for you to see if a query turns out to be a problem or not: test=# SELECT count(*) FROM t_test;count---------4000000(1 row)Time: 316.628 ms As you can see in the preceding code, the time required is approximately 300 milliseconds. This might not sound like a lot, but it actually is. 300 ms means that we can roughly execute three queries per CPU per second. On an 8-Core box, this would translate to roughly 25 queries per second. For many applications, this will be enough; but do you really want to buy an 8-Core box to handle just 25 concurrent users, and do you want your entire box to work just on this simple query? Probably not! Understanding the concept of execution plans It is impossible to understand the use of indexes without understanding the concept of execution plans. Whenever you execute a query in PostgreSQL, it generally goes through four central steps, described as follows: Parser: PostgreSQL will check the syntax of the statement. Rewrite system: PostgreSQL will rewrite the query (for example, rules and views are handled by the rewrite system). 
Optimizer or planner: PostgreSQL will come up with a smart plan to execute the query as efficiently as possible. At this step, the system will decide whether or not to use indexes. Executor: Finally, the execution plan is taken by the executor and the result is generated. Being able to understand and read execution plans is an essential task of every DBA. To extract the plan from the system, all you need to do is use the explain command, shown as follows: test=# explain SELECT count(*) FROM t_test;                             QUERY PLAN                            ------------------------------------------------------Aggregate (cost=71622.00..71622.01 rows=1 width=0)   -> Seq Scan on t_test (cost=0.00..61622.00                         rows=4000000 width=0)(2 rows)Time: 0.370 ms In our case, it took us less than a millisecond to calculate the execution plan. Once you have the plan, you can read it from right to left. In our case, PostgreSQL will perform a sequential scan and aggregate the data returned by the sequential scan. It is important to mention that each step is assigned to a certain number of costs. The total cost for the sequential scan is 61,622 penalty points (more details about penalty points will be outlined a little later). The overall cost of the query is 71,622.01. What are costs? Well, costs are just an arbitrary number calculated by the system based on some rules. The higher the costs, the slower a query is expected to be. Always keep in mind that these costs are just a way for PostgreSQL to estimate things—they are in no way a reliable number related to anything in the real world (such as time or amount of I/O needed). In addition to the costs, PostgreSQL estimates that the sequential scan will yield around four million rows. It also expects the aggregation to return just a single row. These two estimates happen to be precise, but it is not always so. Calculating costs When in training, people often ask how PostgreSQL does its cost calculations. Consider a simple example like the one we have next. It works in a pretty simple way. Generally, there are two types of costs: I/O costs and CPU costs. To come up with I/O costs, we have to figure out the size of the table we are dealing with first: test=# SELECT pg_relation_size('t_test'),   pg_size_pretty(pg_relation_size('t_test'));pg_relation_size | pg_size_pretty------------------+----------------       177127424 | 169 MB(1 row) The pg_relation_size command is a fast way to see how large a table is. Of course, reading a large number (many digits) is somewhat hard, so it is possible to fetch the size of the table in a much prettier format. In our example, the size is roughly 170 MB. Let's move on now. In PostgreSQL, a table consists of 8,000 blocks. If we divide the size of the table by 8,192 bytes, we will end up with exactly 21,622 blocks. This is how PostgreSQL estimates I/O costs of a sequential scan. If a table is read completely, each block will receive exactly one penalty point, or any number defined by seq_page_cost: test=# SHOW seq_page_cost;seq_page_cost---------------1(1 row) To count this number, we have to send four million rows through the CPU (cpu_tuple_cost), and we also have to count these 4 million rows (cpu_operator_cost). So, the calculation looks like this: For the sequential scan: 21622*1 + 4000000*0.01 (cpu_tuple_cost) = 61622 For the aggregation: 61622 + 4000000*0.0025 (cpu_operator_cost) = 71622 This is exactly the number that we see in the plan. 
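The same arithmetic can be written down in a few lines of Python, which makes it easy to experiment with the planner constants. This is just a back-of-the-envelope sketch reproducing the numbers quoted above (8,192-byte blocks and the default cost settings); it is not how PostgreSQL itself is implemented.

# Reproduce the planner's cost estimate for the sequential scan and the aggregate.
relation_size = 177127424      # pg_relation_size('t_test') in bytes
block_size = 8192              # PostgreSQL block size in bytes
rows = 4000000                 # estimated rows in t_test

seq_page_cost = 1.0            # default planner cost settings
cpu_tuple_cost = 0.01
cpu_operator_cost = 0.0025

blocks = relation_size // block_size                        # 21622 blocks
seq_scan_cost = blocks * seq_page_cost + rows * cpu_tuple_cost
aggregate_cost = seq_scan_cost + rows * cpu_operator_cost

print("sequential scan cost: %.2f" % seq_scan_cost)         # 61622.00
print("aggregate total cost: %.2f" % aggregate_cost)        # 71622.00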
Drawing important conclusions Of course, you will never do this by hand. However, there are some important conclusions to be drawn: The cost model in PostgreSQL is a simplification of the real world The costs can hardly be translated to real execution times The cost of reading from a slow disk is the same as the cost of reading from a fast disk It is hard to take caching into account If the optimizer comes up with a bad plan, it is possible to adapt the costs either globally in postgresql.conf, or by changing the session variables, shown as follows: test=# SET seq_page_cost TO 10;SET This statement inflated the costs at will. It can be a handy way to fix the missed estimates, leading to bad performance and, therefore, to poor execution times. This is what the query plan will look like using the inflated costs: test=# explain SELECT count(*) FROM t_test;                     QUERY PLAN                              -------------------------------------------------------Aggregate (cost=266220.00..266220.01 rows=1 width=0)   -> Seq Scan on t_test (cost=0.00..256220.00          rows=4000000 width=0)(2 rows) It is important to understand the PostgreSQL code model in detail because many people have completely wrong ideas about what is going on inside the PostgreSQL optimizer. Offering a basic explanation will hopefully shed some light on this important topic and allow administrators a deeper understanding of the system. Creating indexes After this introduction, we can deploy our first index. As we stated before, runtimes of several hundred milliseconds for simple queries are not acceptable. To fight these unusually high execution times, we can turn to CREATE INDEX, shown as follows: test=# h CREATE INDEXCommand:     CREATE INDEXDescription: define a new indexSyntax:CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ]ON table_name [ USING method ]   ( { column_name | ( expression ) }[ COLLATE collation ] [ opclass ][ ASC | DESC ] [ NULLS { FIRST | LAST } ]   [, ...] )   [ WITH ( storage_parameter = value [, ... ] ) ]   [ TABLESPACE tablespace_name ]   [ WHERE predicate ] In the most simplistic case, we can create a normal B-tree index on the ID column and see what happens: test=# CREATE INDEX idx_id ON t_test (id);CREATE INDEXTime: 3996.909 ms B-tree indexes are the default index structure in PostgreSQL. Internally, they are also called B+ tree, as described by Lehman-Yao. On this box (AMD, 4 Ghz), we can build the B-tree index in around 4 seconds, without any database side tweaks. Once the index is in place, the SELECT command will be executed at lightning speed: test=# SELECT * FROM t_test WHERE id = 423423;   id   | name--------+------423423 | hans(1 row)Time: 0.384 ms The query executes in less than a millisecond. Keep in mind that this already includes displaying the data, and the query is a lot faster internally. Analyzing the performance of a query How do we know that the query is actually a lot faster? In the previous section, you saw EXPLAIN in action already. However, there is a little more to know about this command. You can add some instructions to EXPLAIN to make it a lot more verbose, as shown here: test=# h EXPLAINCommand:     EXPLAINDescription: show the execution plan of a statementSyntax:EXPLAIN [ ( option [, ...] 
) ] statementEXPLAIN [ ANALYZE ] [ VERBOSE ] statement In the preceding code, the term option can be one of the following:    ANALYZE [ boolean ]   VERBOSE [ boolean ]   COSTS [ boolean ]   BUFFERS [ boolean ]   TIMING [ boolean ]   FORMAT { TEXT | XML | JSON | YAML } Consider the following example: test=# EXPLAIN (ANALYZE TRUE, VERBOSE true, COSTS TRUE,   TIMING true) SELECT * FROM t_test WHERE id = 423423;               QUERY PLAN        ------------------------------------------------------Index Scan using idx_id on public.t_test(cost=0.43..8.45 rows=1 width=9)(actual time=0.016..0.018 rows=1 loops=1)   Output: id, name   Index Cond: (t_test.id = 423423)Total runtime: 0.042 ms(4 rows)Time: 0.536 ms The ANALYZE function does a special form of execution. It is a good way to figure out which part of the query burned most of the time. Again, we can read things inside out. In addition to the estimated costs of the query, we can also see the real execution time. In our case, the index scan takes 0.018 milliseconds. Fast, isn't it? Given these timings, you can see that displaying the result actually takes a huge fraction of the time. The beauty of EXPLAIN ANALYZE is that it shows costs and execution times for every step of the process. This is important for you to familiarize yourself with this kind of output because when a programmer hits your desk complaining about bad performance, it is necessary to dig into this kind of stuff quickly. In many cases, the secret to performance is hidden in the execution plan, revealing a missing index or so. It is recommended to pay special attention to situations where the number of expected rows seriously differs from the number of rows really processed. Keep in mind that the planner is usually right, but not always. Be cautious in case of large differences (especially if this input is fed into a nested loop). Whenever a query feels slow, we always recommend to take a look at the plan first. In many cases, you will find missing indexes. The internal structure of a B-tree index Before we dig further into the B-tree indexes, we can briefly discuss what an index actually looks like under the hood. Understanding the B-tree internals Consider the following image that shows how things work: In PostgreSQL, we use the so-called Lehman-Yao B-trees (check out http://www.cs.cmu.edu/~dga/15-712/F07/papers/Lehman81.pdf). The main advantage of the B-trees is that they can handle concurrency very nicely. It is possible that hundreds or thousands of concurrent users modify the tree at the same time. Unfortunately, there is not enough room in this book to explain precisely how this works. The two most important issues of this tree are the facts that I/O is done in 8,000 chunks and that the tree is actually a sorted structure. This allows PostgreSQL to apply a ton of optimizations. Providing a sorted order As we stated before, a B-tree provides the system with sorted output. This can come in quite handy. Here is a simple query to make use of the fact that a B-tree provides the system with sorted output: test=# explain SELECT * FROM t_test ORDER BY id LIMIT 3;                   QUERY PLAN                                    ------------------------------------------------------Limit (cost=0.43..0.67 rows=3 width=9)   -> Index Scan using idx_id on t_test(cost=0.43..320094.43 rows=4000000 width=9)(2 rows) In this case, we are looking for the three smallest values. PostgreSQL will read the index from left to right and stop as soon as enough rows have been returned. 
This is a very common scenario. Many people think that indexes are only about searching, but this is not true. B-trees are also present to help out with sorting. Why do you, the DBA, care about this stuff? Remember that this is a typical use case where a software developer comes to your desk, pounds on the table, and complains. A simple index can fix the problem. Combined indexes Combined indexes are one more source of trouble if they are not used properly. A combined index is an index covering more than one column. Let's drop the existing index and create a combined index (make sure your seq_page_cost variable is set back to default to make the following examples work): test=# DROP INDEX idx_combined;DROP INDEXtest=# CREATE INDEX idx_combined ON t_test (name, id);CREATE INDEX We defined a composite index consisting of two columns. Remember that we put the name before the ID. A simple query will return the following execution plan: test=# explain analyze SELECT * FROM t_test   WHERE id = 10;               QUERY PLAN                                              -------------------------------------------------Seq Scan on t_test (cost=0.00..71622.00 rows=1   width=9)(actual time=181.502..351.439 rows=1 loops=1)   Filter: (id = 10)   Rows Removed by Filter: 3999999Total runtime: 351.481 ms(4 rows) There is no proper index for this, so the system will fall back to a sequential scan. Why is there no proper index? Well, try to look up for first names only in the telephone book. This is not going to work because a telephone book is sorted by location, last name, and first name. The same applies to our index. A B-tree works basically on the same principles as an ordinary paper phone book. It is only useful if you look up the first couple of values, or simply all of them. Here is an example: test=# explain analyze SELECT * FROM t_test   WHERE id = 10 AND name = 'joe';     QUERY PLAN                                                    ------------------------------------------------------Index Only Scan using idx_combined on t_test   (cost=0.43..6.20 rows=1 width=9)(actual time=0.068..0.068 rows=0 loops=1)   Index Cond: ((name = 'joe'::text) AND (id = 10))   Heap Fetches: 0Total runtime: 0.108 ms(4 rows) In this case, the combined index comes up with a high speed result of 0.1 ms, which is not bad. After this small example, we can turn to an issue that's a little bit more complex. Let's change the costs of a sequential scan to 100-times normal: test=# SET seq_page_cost TO 100;SET Don't let yourself be fooled into believing that an index is always good: test=# explain analyze SELECT * FROM t_testWHERE id = 10;                   QUERY PLAN                ------------------------------------------------------Index Only Scan using idx_combined on t_test(cost=0.43..91620.44 rows=1 width=9)(actual time=0.362..177.952 rows=1 loops=1)   Index Cond: (id = 10)   Heap Fetches: 1Total runtime: 177.983 ms(4 rows) Just look at the execution times. We are almost as slow as a sequential scan here. Why does PostgreSQL use the index at all? Well, let's assume we have a very broad table. In this case, sequentially scanning the table is expensive. Even if we have to read the entire index, it can be cheaper than having to read the entire table, at least if there is enough hope to reduce the amount of data by using the index somehow. So, in case you see an index scan, also take a look at the execution times and the number of rows used. 
The index might not be perfect, but it's just an attempt by PostgreSQL to avoid the worse to come. Keep in mind that there is no general rule (for example, more than 25 percent of data will result in a sequential scan) for sequential scans. The plans depend on a couple of internal issues, such as physical disk layout (correlation) and so on. Partial indexes Up to now, an index covered the entire table. This is not always necessarily the case. There are also partial indexes. When is a partial index useful? Consider the following example: test=# CREATE TABLE t_invoice (   id     serial,   d     date,   amount   numeric,   paid     boolean);CREATE TABLEtest=# CREATE INDEX idx_partial   ON   t_invoice (paid)   WHERE   paid = false;CREATE INDEX In our case, we create a table storing invoices. We can safely assume that the majority of the invoices are nicely paid. However, we expect a minority to be pending, so we want to search for them. A partial index will do the job in a highly space efficient way. Space is important because saving on space has a couple of nice side effects, such as cache efficiency and so on. Dealing with different types of indexes Let's move on to an important issue: not everything can be sorted easily and in a useful way. Have you ever tried to sort circles? If the question seems odd, just try to do it. It will not be easy and will be highly controversial, so how do we do it best? Would we sort by size or coordinates? Under any circumstances, using a B-tree to store circles, points, or polygons might not be a good idea at all. A B-tree does not do what you want it to do because a B-tree depends on some kind of sorting order. To provide end users with maximum flexibility and power, PostgreSQL provides more than just one index type. Each index type supports certain algorithms used for different purposes. The following list of index types is available in PostgreSQL (as of Version 9.4.1): btree: These are the high-concurrency B-trees gist: This is an index type for geometric searches (GIS data) and for KNN-search gin: This is an index type optimized for Full-Text Search (FTS) sp-gist: This is a space-partitioned gist As we mentioned before, each type of index serves different purposes. We highly encourage you to dig into this extremely important topic to make sure that you can help software developers whenever necessary. Unfortunately, we don't have enough room in this book to discuss all the index types in greater depth. If you are interested in finding out more, we recommend checking out information on my website at http://www.postgresql-support.de/slides/2013_dublin_indexing.pdf. Alternatively, you can look up the official PostgreSQL documentation, which can be found at http://www.postgresql.org/docs/9.4/static/indexes.html. Detecting missing indexes Now that we have covered the basics and some selected advanced topics of indexing, we want to shift our attention to a major and highly important administrative task: hunting down missing indexes. When talking about missing indexes, there is one essential query I have found to be highly valuable. The query is given as follows: test=# xExpanded display (expanded) is on.test=# SELECT   relname, seq_scan, seq_tup_read,     idx_scan, idx_tup_fetch,     seq_tup_read / seq_scanFROM   pg_stat_user_tablesWHERE   seq_scan > 0ORDER BY seq_tup_read DESC;-[ RECORD 1 ]-+---------relname       | t_user  seq_scan     | 824350      seq_tup_read | 2970269443530idx_scan     | 0      idx_tup_fetch | 0      ?column?     
| 3603165 The pg_stat_user_tables option contains statistical information about tables and their access patterns. In this example, we found a classic problem. The t_user table has been scanned close to 1 million times. During these sequential scans, we processed close to 3 trillion rows. Do you think this is unusual? It's not nearly as unusual as you might think. In the last column, we divided seq_tup_read through seq_scan. Basically, this is a simple way to figure out how many rows a typical sequential scan has used to finish. In our case, 3.6 million rows had to be read. Do you remember our initial example? We managed to read 4 million rows in a couple of hundred milliseconds. So, it is absolutely realistic that nobody noticed the performance bottleneck before. However, just consider burning, say, 300 ms for every query thousands of times. This can easily create a heavy load on a totally unnecessary scale. In fact, a missing index is the key factor when it comes to bad performance. Let's take a look at the table description now: test=# d t_user                         Table "public.t_user"Column | Type   |       Modifiers                    ----------+---------+-------------------------------id      | integer | not null default               nextval('t_user_id_seq'::regclass)email   | text   |passwd   | text   |Indexes:   "t_user_pkey" PRIMARY KEY, btree (id) This is really a classic example. It is hard to tell how often I have seen this kind of example in the field. The table was probably called customer or userbase. The basic principle of the problem was always the same: we got an index on the primary key, but the primary key was never checked during the authentication process. When you log in to Facebook, Amazon, Google, and so on, you will not use your internal ID, you will rather use your e-mail address. Therefore, it should be indexed. The rules here are simple: we are searching for queries that needed many expensive scans. We don't mind sequential scans as long as they only read a handful of rows or as long as they show up rarely (caused by backups, for example). We need to keep expensive scans in mind, however ("expensive" in terms of "many rows needed"). Here is an example code snippet that should not bother us at all: -[ RECORD 1 ]-+---------relname       | t_province  seq_scan     | 8345345      seq_tup_read | 100144140idx_scan     | 0      idx_tup_fetch | 0      ?column?     | 12 The table has been read 8 million times, but in an average, only 12 rows have been returned. Even if we have 1 million indexes defined, PostgreSQL will not use them because the table is simply too small. It is pretty hard to tell which columns might need an index from inside PostgreSQL. However, taking a look at the tables and thinking about them for a minute will, in most cases, solve the riddle. In many cases, things are pretty obvious anyway, and developers will be able to provide you with a reasonable answer. As you can see, finding missing indexes is not hard, and we strongly recommend checking this system table once in a while to figure out whether your system works nicely. There are a couple of tools, such as pgbadger, out there that can help us to monitor systems. It is recommended that you make use of such tools. There is not only light, there is also some shadow. Indexes are not always good. They can also cause considerable overhead during writes. Keep in mind that when you insert, modify, or delete data, you have to touch the indexes as well. 
The overhead of useless indexes should never be underestimated. Therefore, it makes sense to not just look for missing indexes, but also for spare indexes that don't serve a purpose anymore. Detecting slow queries Now that we have seen how to hunt down tables that might need an index, we can move on to the next example and try to figure out the queries that cause most of the load on your system. Sometimes, the slowest query is not the one causing a problem; it is a bunch of small queries, which are executed over and over again. In this section, you will learn how to track down such queries. To track down slow operations, we can rely on a module called pg_stat_statements. This module is available as part of the PostgreSQL contrib section. Installing a module from this section is really easy. Connect to PostgreSQL as a superuser, and execute the following instruction (if contrib packages have been installed): test=# CREATE EXTENSION pg_stat_statements;CREATE EXTENSION This module will install a system view that will contain all the relevant information we need to find expensive operations: test=# d pg_stat_statements         View "public.pg_stat_statements"       Column       |       Type       | Modifiers---------------------+------------------+-----------userid             | oid             |dbid               | oid             |queryid             | bigint          |query               | text             |calls               | bigint           |total_time         | double precision |rows               | bigint           |shared_blks_hit     | bigint           |shared_blks_read   | bigint           |shared_blks_dirtied | bigint           |shared_blks_written | bigint           |local_blks_hit     | bigint           |local_blks_read     | bigint           |local_blks_dirtied | bigint           |local_blks_written | bigint           |temp_blks_read     | bigint           |temp_blks_written   | bigint           |blk_read_time       | double precision |blk_write_time     | double precision | In this view, we can see the queries we are interested in, the total execution time (total_time), the number of calls, and the number of rows returned. Then, we will get some information about the I/O behavior (more on caching later) of the query as well as information about temporary data being read and written. Finally, the last two columns will tell us how much time we actually spent on I/O. The final two fields are active when track_timing in postgresql.conf has been enabled and will give vital insights into potential reasons for disk wait and disk-related speed problems. The blk_* prefix will tell us how much time a certain query has spent reading and writing to the operating system. Let's see what happens when we want to query the view: test=# SELECT * FROM pg_stat_statements;ERROR: pg_stat_statements must be loaded via   shared_preload_libraries The system will tell us that we have to enable this module; otherwise, data won't be collected. All we have to do to make this work is to add the following line to postgresql.conf: shared_preload_libraries = 'pg_stat_statements' Then, we have to restart the server to enable it. We highly recommend adding this module to the configuration straightaway to make sure that a restart can be avoided and that this data is always around. Don't worry too much about the performance overhead of this module. Tests have shown that the impact on performance is so low that it is even too hard to measure. 
Therefore, it might be a good idea to have this module activated all the time. If you have configured things properly, finding the most time-consuming queries should be simple: SELECT *FROM   pg_stat_statementsORDER   BY total_time DESC; The important part here is that PostgreSQL can nicely group queries. For instance: SELECT * FROM foo WHERE bar = 1;SELECT * FROM foo WHERE bar = 2; PostgreSQL will detect that this is just one type of query and replace the two numbers in the WHERE clause with a placeholder indicating that a parameter was used here. Of course, you can also sort by any other criteria: highest I/O time, highest number of calls, or whatever. The pg_stat_statement function has it all, and things are available in a way that makes the data very easy and efficient to use. How to reset statistics Sometimes, it is necessary to reset the statistics. If you are about to track down a problem, resetting can be very beneficial. Here is how it works: test=# SELECT pg_stat_reset();pg_stat_reset---------------(1 row)test=# SELECT pg_stat_statements_reset();pg_stat_statements_reset--------------------------(1 row) The pg_stat_reset command will reset the entire system statistics (for example, pg_stat_user_tables). The second call will wipe out pg_stat_statements. Adjusting memory parameters After we find the slow queries, we can do something about them. The first step is always to fix indexing and make sure that sane requests are sent to the database. If you are requesting stupid things from PostgreSQL, you can expect trouble. Once the basic steps have been performed, we can move on to the PostgreSQL memory parameters, which need some tuning. Optimizing shared buffers One of the most essential memory parameters is shared_buffers. What are shared buffers? Let's assume we are about to read a table consisting of 8,000 blocks. PostgreSQL will check if the buffer is already in cache (shared_buffers), and if it is not, it will ask the underlying operating system to provide the database with the missing 8,000 blocks. If we are lucky, the operating system has a cached copy of the block. If we are not so lucky, the operating system has to go to the disk system and fetch the data (worst case). So, the more data we have in cache, the more efficient we will be. Setting shared_buffers to the right value is more art than science. The general guideline is that shared_buffers should consume 25 percent of memory, but not more than 16 GB. Very large shared buffer settings are known to cause suboptimal performance in some cases. It is also not recommended to starve the filesystem cache too much on behalf of the database system. Mentioning the guidelines does not mean that it is eternal law—you really have to see this as a guideline you can use to get started. Different settings might be better for your workload. Remember, if there was an eternal law, there would be no setting, but some autotuning magic. However, a contrib module called pg_buffercache can give some insights into what is in cache at the moment. It can be used as a basis to get started on understanding what is going on inside the PostgreSQL shared buffer. Changing shared_buffers can be done in postgresql.conf, shown as follows: shared_buffers = 4GB In our example, shared buffers have been set to 4GB. A database restart is needed to activate the new value. In PostgreSQL 9.4, some changes were introduced. Traditionally, PostgreSQL used a classical System V shared memory to handle the shared buffers. 
Starting with PostgreSQL 9.3, mapped memory was added, and finally, it was in PostgreSQL 9.4 that a config variable was introduced to configure the memory technique PostgreSQL will use, shown as follows: dynamic_shared_memory_type = posix# the default is the first option     # supported by the operating system:     #   posix     #   sysv     #   windows     #   mmap     # use none to disable dynamic shared memory The default value on the most common operating systems is basically fine. However, feel free to experiment with the settings and see what happens performance wise. Considering huge pages When a process uses RAM, the CPU marks this memory as used by this process. For efficiency reasons, the CPU usually allocates RAM by chunks of 4,000 bytes. These chunks are called pages. The process address space is virtual, and the CPU and operating system have to remember which process belongs to which page. The more pages you have, the more time it takes to find where the memory is mapped. When a process uses 1 GB of memory, it means that 262.144 blocks have to be looked up. Most modern CPU architectures support bigger pages, and these pages are called huge pages (on Linux). To tell PostgreSQL that this mechanism can be used, the following config variable can be changed in postgresql.conf: huge_pages = try                     # on, off, or try Of course, your Linux system has to know about the use of huge pages. Therefore, you can do some tweaking, as follows: grep Hugepagesize /proc/meminfoHugepagesize:     2048 kB In our case, the size of the huge pages is 2 MB. So, if there is 1 GB of memory, 512 huge pages are needed. The number of huge pages can be configured and activated by setting nr_hugepages in the proc filesystem. Consider the following example: echo 512 > /proc/sys/vm/nr_hugepages Alternatively, we can use the sysctl command or change things in /etc/sysctl.conf: sysctl -w vm.nr_hugepages=512 Huge pages can have a significant impact on performance. Tweaking work_mem There is more to PostgreSQL memory configuration than just shared buffers. The work_mem parameter is widely used for operations such as sorting, aggregating, and so on. Let's illustrate the way work_mem works with a short, easy-to-understand example. Let's assume it is an election day and three parties have taken part in the elections. The data is as follows: test=# CREATE TABLE t_election (id serial, party text);test=# INSERT INTO t_election (party)SELECT 'socialists'   FROM generate_series(1, 439784);test=# INSERT INTO t_election (party)SELECT 'conservatives'   FROM generate_series(1, 802132);test=# INSERT INTO t_election (party)SELECT 'liberals'   FROM generate_series(1, 654033); We add some data to the table and try to count how many votes each party has: test=# explain analyze SELECT party, count(*)   FROM   t_election   GROUP BY 1;       QUERY PLAN                                                        ------------------------------------------------------HashAggregate (cost=39461.24..39461.26 rows=3     width=11) (actual time=609.456..609.456   rows=3 loops=1)     Group Key: party   -> Seq Scan on t_election (cost=0.00..29981.49     rows=1895949 width=11)   (actual time=0.007..192.934 rows=1895949   loops=1)Planning time: 0.058 msExecution time: 609.481 ms(5 rows) First of all, the system will perform a sequential scan and read all the data. This data is passed on to a so-called HashAggregate. For each party, PostgreSQL will calculate a hash key and increment counters as the query moves through the tables. 
At the end of the operation, we will have a chunk of memory with three values and three counters. Very nice! As you can see, the explain analyze statement does not take more than 600 ms. Note that the real execution time of the query will be a lot faster. The explain analyze statement does have some serious overhead. Still, it will give you valuable insights into the inner workings of the query. Let's try to repeat this same example, but this time, we want to group by the ID. Here is the execution plan: test=# explain analyze SELECT id, count(*)   FROM   t_election     GROUP BY 1;       QUERY PLAN                                                          ------------------------------------------------------GroupAggregate (cost=253601.23..286780.33 rows=1895949     width=4) (actual time=1073.769..1811.619     rows=1895949 loops=1)     Group Key: id   -> Sort (cost=253601.23..258341.10 rows=1895949   width=4) (actual time=1073.763..1288.432   rows=1895949 loops=1)         Sort Key: id       Sort Method: external sort Disk: 25960kB         -> Seq Scan on t_election         (cost=0.00..29981.49 rows=1895949 width=4)     (actual time=0.013..235.046 rows=1895949     loops=1)Planning time: 0.086 msExecution time: 1928.573 ms(8 rows) The execution time rises by almost 2 seconds and, more importantly, the plan changes. In this scenario, there is no way to stuff all the 1.9 million hash keys into a chunk of memory because we are limited by work_mem. Therefore, PostgreSQL has to find an alternative plan. It will sort the data and run GroupAggregate. How does it work? If you have a sorted list of data, you can count all equal values, send them off to the client, and move on to the next value. The main advantage is that we don't have to keep the entire result set in memory at once. With GroupAggregate, we can basically return aggregations of infinite sizes. The downside is that large aggregates exceeding memory will create temporary files leading to potential disk I/O. Keep in mind that we are talking about the size of the result set and not about the size of the underlying data. Let's try the same thing with more work_mem: test=# SET work_mem TO '1 GB';SETtest=# explain analyze SELECT id, count(*)   FROM t_election   GROUP BY 1;         QUERY PLAN                                                        ------------------------------------------------------HashAggregate (cost=39461.24..58420.73 rows=1895949     width=4) (actual time=857.554..1343.375   rows=1895949 loops=1)   Group Key: id   -> Seq Scan on t_election (cost=0.00..29981.49   rows=1895949 width=4)   (actual time=0.010..201.012   rows=1895949 loops=1)Planning time: 0.113 msExecution time: 1478.820 ms(5 rows) In this case, we adapted work_mem for the current session. Don't worry, changing work_mem locally does not change the parameter for other database connections. If you want to change things globally, you have to do so by changing things in postgresql.conf. Alternatively, 9.4 offers a command called ALTER SYSTEM SET work_mem TO '1 GB'. Once SELECT pg_reload_conf() has been called, the config parameter is changed as well. What you see in this example is that the execution time is around half a second lower than before. PostgreSQL switches back to the more efficient plan. 
However, there is more; work_mem is also in charge of efficient sorting: test=# explain analyze SELECT * FROM t_election ORDER BY id DESC;     QUERY PLAN                                                          ------------------------------------------------------Sort (cost=227676.73..232416.60 rows=1895949 width=15)   (actual time=695.004..872.698 rows=1895949   loops=1)   Sort Key: id   Sort Method: quicksort Memory: 163092kB   -> Seq Scan on t_election (cost=0.00..29981.49   rows=1895949 width=15) (actual time=0.013..188.876rows=1895949 loops=1)Planning time: 0.042 msExecution time: 995.327 ms(6 rows) In our example, PostgreSQL can sort the entire dataset in memory. Earlier, we had to perform a so-called "external sort Disk", which is way slower because temporary results have to be written to disk. The work_mem command is used for some other operations as well. However, sorting and aggregation are the most common use cases. Keep in mind that work_mem should not be abused, and work_mem can be allocated to every sorting or grouping operation. So, more than just one work_mem amount of memory might be allocated by a single query. Improving maintenance_work_mem To control the memory consumption of administrative tasks, PostgreSQL offers a parameter called maintenance_work_mem. It is used to handle index creations as well as VACUUM. Usually, creating an index (B-tree) is mostly related to sorting, and the idea of maintenance_work_mem is to speed things up. However, things are not as simple as they might seem. People might assume that increasing the parameter will always speed things up, but this is not necessarily true; in fact, smaller values might even be beneficial. We conducted some research to solve this riddle. The in-depth results of this research can be found at http://www.cybertec.at/adjusting-maintenance_work_mem/. However, indexes are not the only beneficiaries. The maintenance_work_mem command is also here to help VACUUM clean out indexes. If maintenance_work_mem is too low, you might see VACUUM scanning tables repeatedly because dead items cannot be stored in memory during VACUUM. This is something that should basically be avoided. Just like all other memory parameters, maintenance_work_mem can be set per session, or it can be set globally in postgresql.conf. Adjusting effective_cache_size The number of shared_buffers assigned to PostgreSQL is not the only cache in the system. The operating system will also cache data and do a great job of improving speed. To make sure that the PostgreSQL optimizer knows what to expect from the operation system, effective_cache_size has been introduced. The idea is to tell PostgreSQL how much cache there is going to be around (shared buffers + operating system side cache). The optimizer can then adjust its costs and estimates to reflect this knowledge. It is recommended to always set this parameter; otherwise, the planner might come up with suboptimal plans. Summary In this article, you learned how to detect basic performance bottlenecks. In addition to this, we covered the very basics of the PostgreSQL optimizer and indexes. At the end of the article, some important memory parameters were presented. Resources for Article: Further resources on this subject: PostgreSQL 9: Reliable Controller and Disk Setup [article] Running a PostgreSQL Database Server [article] PostgreSQL: Tips and Tricks [article]
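As a small follow-up to the missing-index discussion earlier in this article, the following Python sketch automates the pg_stat_user_tables check using the psycopg2 driver. The connection settings and the 10,000-row threshold are placeholder assumptions of ours, not values recommended by the author.

import psycopg2

# Placeholder connection parameters; adjust for your environment.
conn = psycopg2.connect(host="localhost", dbname="test",
                        user="postgres", password="secret")
cur = conn.cursor()

# Same idea as the query in the article: how many rows does a typical
# sequential scan read for each table?
cur.execute("""
    SELECT relname, seq_scan, seq_tup_read,
           seq_tup_read / seq_scan AS avg_rows_per_scan
    FROM pg_stat_user_tables
    WHERE seq_scan > 0
    ORDER BY seq_tup_read DESC
    LIMIT 10
""")

for relname, seq_scan, seq_tup_read, avg_rows in cur.fetchall():
    # A large average hints at expensive scans that may need an index.
    if avg_rows > 10000:
        print("%s: %s sequential scans, %s rows per scan on average"
              % (relname, seq_scan, avg_rows))

cur.close()
conn.close()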

Packt
25 Sep 2014
13 min read

Machine Learning in IPython with scikit-learn

This article written by Daniele Teti, the author of Ipython Interactive Computing and Visualization Cookbook, the basics of the machine learning scikit-learn package (http://scikit-learn.org) is introduced. Its clean API makes it really easy to define, train, and test models. Plus, scikit-learn is specifically designed for speed and (relatively) big data. (For more resources related to this topic, see here.) A very basic example of linear regression in the context of curve fitting is shown here. This toy example will allow to illustrate key concepts such as linear models, overfitting, underfitting, regularization, and cross-validation. Getting ready You can find all instructions to install scikit-learn on the main documentation. For more information, refer to http://scikit-learn.org/stable/install.html. With anaconda, you can type conda install scikit-learn in a terminal. How to do it... We will generate a one-dimensional dataset with a simple model (including some noise), and we will try to fit a function to this data. With this function, we can predict values on new data points. This is a curve fitting regression problem. First, let's make all the necessary imports: In [1]: import numpy as np        import scipy.stats as st        import sklearn.linear_model as lm        import matplotlib.pyplot as plt        %matplotlib inline We now define a deterministic nonlinear function underlying our generative model: In [2]: f = lambda x: np.exp(3 * x) We generate the values along the curve on [0,2]: In [3]: x_tr = np.linspace(0., 2, 200)        y_tr = f(x_tr) Now, let's generate data points within [0,1]. We use the function f and we add some Gaussian noise: In [4]: x = np.array([0, .1, .2, .5, .8, .9, 1])        y = f(x) + np.random.randn(len(x)) Let's plot our data points on [0,1]: In [5]: plt.plot(x_tr[:100], y_tr[:100], '--k')        plt.plot(x, y, 'ok', ms=10) In the image, the dotted curve represents the generative model. Now, we use scikit-learn to fit a linear model to the data. There are three steps. First, we create the model (an instance of the LinearRegression class). Then, we fit the model to our data. Finally, we predict values from our trained model. In [6]: # We create the model.        lr = lm.LinearRegression()        # We train the model on our training dataset.        lr.fit(x[:, np.newaxis], y)        # Now, we predict points with our trained model.        y_lr = lr.predict(x_tr[:, np.newaxis]) We need to convert x and x_tr to column vectors, as it is a general convention in scikit-learn that observations are rows, while features are columns. Here, we have seven observations with one feature. We now plot the result of the trained linear model. We obtain a regression line in green here: In [7]: plt.plot(x_tr, y_tr, '--k')        plt.plot(x_tr, y_lr, 'g')        plt.plot(x, y, 'ok', ms=10)        plt.xlim(0, 1)        plt.ylim(y.min()-1, y.max()+1)        plt.title("Linear regression") The linear fit is not well-adapted here, as the data points are generated according to a nonlinear model (an exponential curve). Therefore, we are now going to fit a nonlinear model. More precisely, we will fit a polynomial function to our data points. We can still use linear regression for this, by precomputing the exponents of our data points. This is done by generating a Vandermonde matrix, using the np.vander function. We will explain this trick in How it works…. 
In the following code, we perform and plot the fit:

In [8]: lrp = lm.LinearRegression()
        plt.plot(x_tr, y_tr, '--k')
        for deg in [2, 5]:
            lrp.fit(np.vander(x, deg + 1), y)
            y_lrp = lrp.predict(np.vander(x_tr, deg + 1))
            plt.plot(x_tr, y_lrp,
                     label='degree ' + str(deg))
            plt.legend(loc=2)
            plt.xlim(0, 1.4)
            plt.ylim(-10, 40)
            # Print the model's coefficients.
            print(' '.join(['%.2f' % c for c in lrp.coef_]))
        plt.plot(x, y, 'ok', ms=10)
        plt.title("Linear regression")
25.00 -8.57 0.00
-132.71 296.80 -211.76 72.80 -8.68 0.00

We have fitted two polynomial models, of degree 2 and degree 5. The degree 2 polynomial appears to fit the data points less precisely than the degree 5 polynomial. However, it seems more robust; the degree 5 polynomial seems really bad at predicting values outside the data points (look, for example, at the x > 1 portion). This is what we call overfitting; by using a model that is too complex, we obtain a better fit on the training dataset, but a less robust model outside this set. Note the large coefficients of the degree 5 polynomial; this is generally a sign of overfitting.

We will now use a different learning model called ridge regression. It works like linear regression, except that it prevents the polynomial's coefficients from becoming too big, which is what happened in the previous example. By adding a regularization term in the loss function, ridge regression imposes some structure on the underlying model. We will see more details in the next section.

The ridge regression model has a meta-parameter that represents the weight of the regularization term. We could try different values by trial and error, using the Ridge class. However, scikit-learn includes another model called RidgeCV, which includes a parameter search with cross-validation. In practice, this means that we don't have to tweak the parameter by hand—scikit-learn does it for us. As the models of scikit-learn always follow the fit-predict API, all we have to do is replace lm.LinearRegression() with lm.RidgeCV() in the previous code. We will give more details in the next section.

In [9]: ridge = lm.RidgeCV()
        plt.plot(x_tr, y_tr, '--k')
        for deg in [2, 5]:
            ridge.fit(np.vander(x, deg + 1), y)
            y_ridge = ridge.predict(np.vander(x_tr, deg + 1))
            plt.plot(x_tr, y_ridge,
                     label='degree ' + str(deg))
            plt.legend(loc=2)
            plt.xlim(0, 1.5)
            plt.ylim(-5, 80)
            # Print the model's coefficients.
            print(' '.join(['%.2f' % c for c in ridge.coef_]))
        plt.plot(x, y, 'ok', ms=10)
        plt.title("Ridge regression")
11.36 4.61 0.00
2.84 3.54 4.09 4.14 2.67 0.00

This time, the degree 5 polynomial seems more precise than the simpler degree 2 polynomial (which now causes underfitting). Ridge regression reduces the overfitting issue here. Observe how the degree 5 polynomial's coefficients are much smaller than in the previous example.

How it works...

In this section, we explain all the aspects covered in this article.

The scikit-learn API

scikit-learn implements a clean and coherent API for supervised and unsupervised learning. Our data points should be stored in an (N, D) matrix X, where N is the number of observations and D is the number of features. In other words, each row is an observation.
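To see what RidgeCV is automating, here is a small, self-contained sketch of a manual grid search with leave-one-out cross-validation over the regularization weight. The alpha grid is arbitrary and the procedure is deliberately naive; RidgeCV uses its own default grid and a more efficient cross-validation scheme, so treat this only as an illustration of the idea.

import numpy as np
import sklearn.linear_model as lm

x = np.array([0, .1, .2, .5, .8, .9, 1])
y = np.exp(3 * x) + np.random.randn(len(x))
X = np.vander(x, 6)                 # degree-5 polynomial features

alphas = [0.01, 0.1, 1.0, 10.0]     # candidate regularization weights
errors = []
for alpha in alphas:
    model = lm.Ridge(alpha=alpha)
    sq_errs = []
    for i in range(len(x)):         # leave sample i out, train on the rest
        train = np.arange(len(x)) != i
        model.fit(X[train], y[train])
        pred = model.predict(X[i:i + 1])[0]
        sq_errs.append((pred - y[i]) ** 2)
    errors.append(np.mean(sq_errs))

best_alpha = alphas[int(np.argmin(errors))]
print(errors, best_alpha)

The concepts used here, the regularization weight, grid search, and cross-validation, are explained in the How it works... section that follows.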
The first step in a machine learning task is to define what the matrix X is exactly. In a supervised learning setup, we also have a target: an N-long vector y with a scalar value for each observation. This value is continuous or discrete, depending on whether we have a regression or classification problem, respectively.

In scikit-learn, models are implemented in classes that have the fit() and predict() methods. The fit() method accepts the data matrix X as input, and y as well for supervised learning models. This method trains the model on the given data. The predict() method also takes data points as input (as an (M, D) matrix). It returns the labels or transformed points as predicted by the trained model.

Ordinary least squares regression

Ordinary least squares regression is one of the simplest regression methods. It consists of approximating the output values yi with a linear combination of the Xij values:

ŷi = w1 Xi1 + w2 Xi2 + ... + wD XiD

Here, w = (w1, ..., wD) is the (unknown) parameter vector, and ŷ = (ŷ1, ..., ŷN) represents the model's output. We want this vector to match the data points y as closely as possible. Of course, the exact equality cannot hold in general (there is always some noise and uncertainty—models are always idealizations of reality). Therefore, we want to minimize the difference between these two vectors. The ordinary least squares regression method consists of minimizing the following loss function:

L(w) = Σi (yi - ŷi)^2 = ||y - Xw||^2

This sum of squared components is the squared L2 norm of the residual. It is convenient because it leads to a differentiable loss function, so that gradients can be computed and common optimization procedures can be performed.

Polynomial interpolation with linear regression

Ordinary least squares regression fits a linear model to the data. The model is linear both in the data points Xi and in the parameters wj. In our example, we obtain a poor fit because the data points were generated according to a nonlinear generative model (an exponential function). However, we can still use the linear regression method with a model that is linear in wj, but nonlinear in xi. To do this, we need to increase the number of dimensions in our dataset by using a basis of polynomial functions. In other words, we consider the following data points:

xi → (xi^D, xi^(D-1), ..., xi, 1)

Here, D is the maximum degree. The input matrix X is therefore the Vandermonde matrix associated with the original data points xi. For more information on the Vandermonde matrix, refer to http://en.wikipedia.org/wiki/Vandermonde_matrix. Here, it is easy to see that training a linear model on these new data points is equivalent to training a polynomial model on the original data points.

Ridge regression

Polynomial interpolation with linear regression can lead to overfitting if the degree of the polynomial is too large. By capturing the random fluctuations (noise) instead of the general trend of the data, the model loses some of its predictive power. This corresponds to a divergence of the polynomial's coefficients wj. A solution to this problem is to prevent these coefficients from growing unboundedly. With ridge regression (also known as Tikhonov regularization), this is done by adding a regularization term to the loss function. For more details on Tikhonov regularization, refer to http://en.wikipedia.org/wiki/Tikhonov_regularization. The regularized loss function is:

L(w) = ||y - Xw||^2 + α ||w||^2

Here, α is a positive hyperparameter. By minimizing this loss function, we not only minimize the error between the model and the data (the first term, related to the bias), but also the size of the model's coefficients (the second term, related to the variance).
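To make the two loss functions concrete, here is a minimal NumPy sketch of their closed-form solutions. This is not scikit-learn's implementation (for instance, it penalizes the constant term along with the other coefficients and ignores intercept handling), but it shows how the extra α term shrinks the coefficients.

import numpy as np

x = np.array([0, .1, .2, .5, .8, .9, 1])
y = np.exp(3 * x) + np.random.randn(len(x))
X = np.vander(x, 6)          # degree-5 polynomial features

# Ordinary least squares: minimize ||y - Xw||^2.
w_ols = np.linalg.lstsq(X, y)[0]

# Ridge regression: minimize ||y - Xw||^2 + alpha * ||w||^2, whose
# closed-form solution is w = (X'X + alpha * I)^-1 X'y.
alpha = 0.1
A = X.T.dot(X) + alpha * np.eye(X.shape[1])
w_ridge = np.linalg.solve(A, X.T.dot(y))

print(np.abs(w_ols).max())     # large coefficients (overfitting)
print(np.abs(w_ridge).max())   # noticeably smaller coefficients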
The bias-variance trade-off is quantified by the hyperparameter α, which specifies the relative weight between the two terms in the loss function. Here, ridge regression led to a polynomial with smaller coefficients, and thus a better fit.

Cross-validation and grid search

A drawback of the ridge regression model compared to the ordinary least squares model is the presence of an extra hyperparameter α. The quality of the prediction depends on the choice of this parameter. One possibility would be to fine-tune it manually, but this procedure can be tedious and can also lead to overfitting problems. To solve this problem, we can use a grid search; we loop over many possible values for α, and we evaluate the performance of the model for each value. Then, we choose the parameter that yields the best performance.

How can we assess the performance of a model for a given value of α? A common solution is to use cross-validation. This procedure consists of splitting the dataset into a train set and a test set. We fit the model on the train set, and we test its predictive performance on the test set. By testing the model on a different dataset than the one used for training, we reduce overfitting.

There are many ways to split the initial dataset into two parts like this. One possibility is to remove a single sample from the dataset to form the test set, and train the model on all the remaining samples. This is called Leave-One-Out cross-validation. With N samples, we obtain N pairs of train and test sets. The cross-validated performance is the average performance over all these splits.

As we will see later, scikit-learn implements several easy-to-use functions to do cross-validation and grid search. In this article, we used a special estimator called RidgeCV that implements a cross-validation and grid search procedure specific to the ridge regression model. Using this model ensures that the best hyperparameter is found automatically for us.

There's more…

Here are a few references about least squares:

Ordinary least squares on Wikipedia, available at http://en.wikipedia.org/wiki/Ordinary_least_squares
Linear least squares on Wikipedia, available at http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)

Here are a few references about cross-validation and grid search:

Cross-validation in scikit-learn's documentation, available at http://scikit-learn.org/stable/modules/cross_validation.html
Grid search in scikit-learn's documentation, available at http://scikit-learn.org/stable/modules/grid_search.html
Cross-validation on Wikipedia, available at http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29

Here are a few references about scikit-learn:

The scikit-learn basic tutorial, available at http://scikit-learn.org/stable/tutorial/basic/tutorial.html
The scikit-learn tutorial given at the SciPy 2013 conference, available at https://github.com/jakevdp/sklearn_scipy2013

Summary

Using the scikit-learn Python package, this article illustrates fundamental data mining and machine learning concepts such as supervised and unsupervised learning, classification, regression, feature selection, feature extraction, overfitting, regularization, cross-validation, and grid search.

Resources for Article:

Further resources on this subject:

Driving Visual Analyses with Automobile Data (Python) [Article]
Fast Array Operations with NumPy [Article]
Python 3: Designing a Tasklist Application [Article]

How to Build a Recommender by Running Mahout on Spark

Pat Ferrel
24 Sep 2014
7 min read
Mahout on Spark: Recommenders

There are big changes happening in Apache Mahout. For several years it was the go-to machine learning library for Hadoop. It contained most of the best-in-class algorithms for scalable machine learning, which means clustering, classification, and recommendations. But it was written for Hadoop and MapReduce. Today, a number of new parallel execution engines show great promise in speeding up calculations by 10-100x (Spark, H2O, Flink). That means that instead of buying 10 computers for a cluster, one will do. That should get your manager's attention. After releasing Mahout 0.9, the team decided to begin an aggressive retool using Spark, building in the flexibility to support other engines, and both H2O and Flink have shown active interest. This post is about moving the heart of Mahout's item-based collaborative filtering recommender to Spark.

Where we are

Mahout is currently on the 1.0 snapshot version, meaning we are working on what will be released as 1.0. For the past year or so, some of the team has been working on a Scala-based DSL (Domain Specific Language), which looks like Scala with R-like algebraic expressions. Since Scala supports not only operator overloading but functional programming, it is a natural choice for building distributed code with rich linear algebra expressions. Currently we have an interactive shell that runs Scala with all of the R-like expression support. Think of it as R but supporting truly huge data in a completely distributed way. Many algorithms—the ones that can be expressed as simple linear algebra equations—are implemented with relative ease (SSVD, PCA). Scala also has lazy evaluation, which allows Mahout to slide a modern optimizer underneath the DSL. When an end product of a calculation is needed, the optimizer figures out the best path to follow and spins off the most efficient Spark jobs to accomplish the whole computation.

Recommenders

One of the first things we want to implement is the popular item-based recommender. But here, too, we'll introduce many innovations. It still starts from some linear algebra. Let's take the case of recommending purchases on an e-commerce site. The problem can be defined as follows:

r_p = recommendations for purchases for a given user. This is a vector of item IDs and strengths of recommendation.
h_p = history of purchases for a given user
A = the matrix of all purchases by all users. Rows are users and columns are items; for now, we will just flag a purchase, so the matrix is all ones and zeros.

r_p = [A'A]h_p

A'A is the matrix A transposed and then multiplied by A. This is the core cooccurrence or indicator matrix used in this style of recommender. Using the Mahout Scala DSL, we could write the recommender as:

val recs = (A.t %*% A) * userHist

This would produce a reasonable recommendation, but from experience we know that A'A is better calculated using a method called the Log Likelihood Ratio (LLR), which is a probabilistic measure of the importance of a cooccurrence (http://en.wikipedia.org/wiki/Likelihood-ratio_test). In general, when you see something like A'A, it can be replaced with a similarity comparison of each row with every other row. This will produce a matrix whose rows are items and whose columns are the same items. The magnitude of the value in the matrix determines the strength of similarity of the row item to the column item. In recommenders, the more similar two items are, the more they were purchased by similar people.
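To make the r_p = [A'A]h_p expression concrete before moving on, here is a tiny NumPy sketch of the same algebra on made-up data. This is only an illustration; the real implementation is the Mahout Scala DSL shown above and the LLR-weighted version described next.

import numpy as np

# A: the users x items purchase matrix (4 users, 3 items), ones and zeros.
A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 0],
              [1, 1, 1]])

# The cooccurrence (indicator) matrix: items x items.
AtA = A.T.dot(A)

# h_p: the current user's purchase history (this user bought item 0 only).
h_p = np.array([1, 0, 0])

# r_p: a recommendation strength for every item for this user.
r_p = AtA.dot(h_p)
print(AtA)
print(r_p)   # items frequently co-purchased with item 0 score highest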
The Mahout DSL line shown previously is replaced by the following:

val recs = CooccurrenceAnalysis.cooccurence(A) * userHist

However, this would take time to execute for each user as they visit the e-commerce site, so we'll handle that part outside of Mahout. First, let's talk about data preparation.

Item Similarity Driver

Creating the indicator matrix (A'A) is the core of this type of recommender. We have a quick, flexible way to create this using text log files, producing output that is in an easy form to digest. The job of data prep is greatly streamlined in the Mahout 1.0 snapshot. In the past, a user would have to do all the data prep themselves. This required translating their own user and item IDs into Mahout IDs, putting the data into text tuple files, and feeding them to the recommender. Out the other end you'd get a Hadoop binary file called a sequence file, and you'd have to translate the Mahout IDs into something your application could understand. This is no longer required.

To make this process much simpler, we created the spark-itemsimilarity command line tool. After installing Mahout, Hadoop, and Spark, and assuming you have logged user purchases in some directories in HDFS, we can probably read them in, calculate the indicator matrix, and write it out with no other prep required. The spark-itemsimilarity command line tool takes in text-delimited files, extracts the user ID and item ID, runs the cooccurrence analysis, and outputs a text file with your application's user and item IDs restored.

Here is the sample input file, where we've specified a simple comma-delimited format whose fields hold a timestamp, the user ID, the filter (purchase), and the item ID:

Thu Jul 10 10:52:10.996,u1,purchase,iphone
Fri Jul 11 13:52:51.018,u1,purchase,ipad
Fri Jul 11 21:42:26.232,u2,purchase,nexus
Sat Jul 12 09:43:12.434,u2,purchase,galaxy
Sat Jul 12 19:58:09.975,u3,purchase,surface
Sat Jul 12 23:08:03.457,u4,purchase,iphone
Sun Jul 13 14:43:38.363,u4,purchase,galaxy

spark-itemsimilarity will create a Spark distributed dataset (RDD) to back the Mahout DRM (distributed row matrix) that holds this data:

User/item   iPhone   iPad   Nexus   Galaxy   Surface
u1          1        1      0       0        0
u2          0        0      1       1        0
u3          0        0      0       0        1
u4          1        0      0       1        0

The output of the job is the LLR-computed "indicator matrix", and it will contain this data:

Item/item   iPhone        iPad          Nexus         Galaxy        Surface
iPhone      0             1.726092435   0             0             0
iPad        1.726092435   0             0             0             0
Nexus       0             0             0             1.726092435   0
Galaxy      0             0             1.726092435   0             0
Surface     0             0             0             0             0

Reading this, we see that self-similarities have been removed, so the diagonal is all zeros. The iPhone is similar to the iPad, and the Galaxy is similar to the Nexus. The output of the spark-itemsimilarity job can be formatted in various ways, but by default it looks like this:

galaxy<tab>nexus:1.7260924347106847
ipad<tab>iphone:1.7260924347106847
nexus<tab>galaxy:1.7260924347106847
iphone<tab>ipad:1.7260924347106847
surface

On the e-commerce site, on the page displaying the Nexus, we can show that the Galaxy was purchased by the same people. Notice that application-specific IDs are preserved here and that the text file is very easy to parse in text-delimited format. The numeric values are the strength of similarity, and for the cases where there are many similar products, you can sort on that value if you want to show only the highest weighted recommendations.

Still, this is only part of the way to an individual recommender. We have done the [A'A] part, but now we need to do the [A'A]h_p part. Using the current user's purchase history will personalize the recommendations.
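The 1.7260924347106847 value that appears in the tables above comes from the log-likelihood ratio (G-test) applied to a 2x2 contingency table over the four users. As a rough, language-neutral illustration (this is plain Python following the standard LLR formulation, not Mahout's Scala code), the iPad/iPhone score can be reproduced like this:

from math import log

def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy of a list of counts, as used by the G-test.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # k11: users who bought both items, k12/k21: users who bought only one
    # of them, k22: users who bought neither.
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# iPad vs. iPhone over users u1..u4: u1 bought both, u4 bought only the
# iPhone, and u2 and u3 bought neither.
print(llr(1, 0, 1, 2))   # ~1.7260924347106847

A high LLR score means that the cooccurrence is unlikely to be due to chance, which is why it is a better indicator than raw cooccurrence counts.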
The next post in this series will talk about using a search engine to take this last step.

About the author

Pat is a serial entrepreneur, consultant, and Apache Mahout committer working on the next generation of Spark-based recommenders. He lives in Seattle and can be contacted through his site https://finderbots.com or @occam on Twitter.
Creating Our First Universe

Packt
22 Sep 2014
18 min read
In this article by Taha M. Mahmoud, the author of the book Creating Universes with SAP BusinessObjects, we will learn how to run the SAP BO Information Design Tool (IDT), and we will have an overview of the different views in the main IDT window. This will help us understand the main function and purpose of each part of the IDT main window. Then, we will use SAP BO IDT to create our first Universe.

In this article, we will create a local project to contain our Universe and the other resources related to it. After that, we will use an ODBC connection. Then, we will create a simple Data Foundation layer that will contain only one table (Customers). After that, we will create the corresponding Business layer by creating the associated business objects. The main goal of this article is to make you familiar with the Universe creation process from start to finish. Then, we will detail each part of the Universe creation process as well as other Universe features. At the end, we will talk about how to get help while creating a new Universe, using the Universe creation wizard or Cheat Sheets.

In this article, we will cover the following topics:

Running the IDT
Getting familiar with SAP BO IDT's interface and views
Creating a local project and setting up a relational connection
Creating a simple Data Foundation layer
Creating a simple Business layer
Publishing our first Universe
Getting help using the Universe wizard and Cheat Sheets

(For more resources related to this topic, see here.)

Information Design Tool

The Information Design Tool is a client tool that is used to develop BO Universes. It is a new tool released by SAP in BO release 4. There are many SAP BO tools that we can use to create our Universe, such as the SAP BO Universe Designer Tool (UDT), SAP BO Universe Builder, and SAP BO IDT. SAP BO Universe Designer has been the main tool for creating Universes since the release of BO 6.x. This tool is still supported in the current SAP BI 4.x release, and you can still use it to create UNV Universes. You need to plan which tool you will use to build your Universe based on the target solution. For example, if you need to connect to a BEx query, you should use the UDT, as the IDT can't do this. On the other hand, if you want to create a Universe query from SAP Dashboard Designer, then you should use the IDT. The BO Universe Builder is used to build a Universe from a supported XML metadata file.

You can use the Universe conversion wizard to convert a UNV Universe created by the UDT to a UNX Universe created by the IDT. Sometimes, you might get errors or warnings while converting a Universe from .unv to .unx. You need to resolve these manually. It is preferred that you convert a Universe from the previous SAP BO release XI 3.x instead of converting a Universe from an earlier release such as BI XI R2 or BO 6.5. There will always be complete support for the previous release.

The main features of the IDT

The IDT is one of the major new features introduced in SAP BI 4.0. We can now build a Universe that combines data from multiple data sources, and we can also build a dimensional Universe on top of an OLAP connection. We can also see a major enhancement in the design environment, which now supports multiuser development. This will help designers work in teams, share Universe resources, and maintain Universe version control.
For more information on the new features introduced in the IDT, refer to the SAP community network at http://wiki.scn.sap.com/ and search for SAP BI 4.0 new features and changes. The Information Design Tool interface We need to cover the following requirements before we create our first Universe: BO client tools are installed on your machine, or you have access to a PC with client tools already installed We have access to a SAP BO server We have a valid username and password to connect to this server We have created an ODBC connection for the Northwind Microsoft Access database Now, to run the IDT, perform the following steps: Click on the Start menu and navigate to All Programs. Click on the SAP BusinessObjects BI platform 4 folder to expand it. Click on the Information Design Tool icon, as shown in the following screenshot: The IDT will open and then we can move on and create our new Universe. In this section, we will get to know the different views that we have in the IDT. We can show or hide any view from the Window menu, as shown in the following screenshot: You can also access the same views from the main window toolbar, as displayed in the following screenshot: Local Projects The Local Projects view is used to navigate to and maintain local project resources, so you can edit and update any project resource, such as the relation connection, Data Foundation, and Business layers from this view. A project is a new concept introduced in the IDT, and there is no equivalent for it in the UDT. We can see the Local Projects main window in the following screenshot: Repository Resources You can access more than one repository using the IDT. However, usually, we work with only one repository at a time. This view will help you initiate a session with the required repository and will keep a list of all the available repositories. You can use repository resources to access and modify the secured connection stored on the BO server. You can also manage and organize published Universes. We can see the Repository Resources main window in the following screenshot: Security Editor Security Editor is used to create data and business security profiles. This can be used to add some security restrictions to be applied on BO users and groups. Security Editor is equivalent to Manage Security under Tools in the UDT. We can see the main Security Editor window in the following screenshot: Project Synchronization The Project Synchronization view is used to synchronize shared projects stored on the repository with your local projects. From this view, you will be able to see the differences between your local projects and shared projects, such as added, deleted, or updated project resources. Project Synchronization is one of the major enhancements introduced in the IDT to overcome the lack of the multiuser development environment in the UDT. We can see the Project Synchronization window in the following screenshot: Check Integrity Problems The Check Integrity Problems view is used to check the Universe's integrity. Check Integrity Problems is equivalent to Check Integrity under Tools in the UDT. Check Integrity Problems is an automatic test for your foundation layer as well as Business layer that will check the Universe's integrity. This wizard will display errors or warnings discovered during the test, and we need to fix them to avoid having any wrong data or errors in our reports. 
Check Integrity Problems is part of the BO best practices to always check and correct the integrity problems before publishing the Universe. We can see the Check Integrity window in the following screenshot: Creating your first Universe step by step After we've opened the IDT, we want to start creating our NorthWind Universe. We need to create the following three main resources to build a Universe: Data connection: This resource is used to establish a connection with the data source. There are two main types of connections that we can create: relational connection and OLAP connection. Data Foundation: This resource will store the metadata, such as tables, joins, and cardinalities, for the physical layer. The Business layer: This resource will store the metadata for the business model. Here, we will create our business objects such as dimensions, measures, attributes, and filters. This layer is our Universe's interface and end users should be able to access it to build their own reports and analytics by dragging-and-dropping the required objects. We need to create a local project to hold all the preceding Universe resources. The local project is just a container that will store the Universe's contents locally on your machine. Finally, we need to publish our Universe to make it ready to be used. Creating a new project You can think about a project such as a folder that will contain all the resources required by your Universe. Normally, we will start any Universe by creating a local project. Then, later on, we might need to share the entire project and make it available for the other Universe designers and developers as well. This is a folder that will be stored locally on your machine, and you can access it any time from the IDT Local Projects window or using the Open option from the File menu. The resources inside this project will be available only for the local machine users. Let's try to create our first local project using the following steps: Go to the File menu and select New Project, or click on the New icon on the toolbar. Select Project, as shown in the following screenshot: The New Project creation wizard will open. Enter NorthWind in the Project Name field, and leave the Project Location field as default. Note that your project will be stored locally in this folder. Click on Finish, as shown in the following screenshot: Now, you can see the NorthWind empty project in the Local Projects window. You can add resources to your local project by performing the following actions: Creating new resources Converting a .unv Universe Importing a published Universe Creating a new data connection Data connection will store all the required information such as IP address, username, and password to access a specific data source. A data connection will connect to a specific type of data source, and you can use the same data connection to create multiple Data Foundation layers. There are two types of data connection: relational data connection, which is used to connect to the relational database such as Teradata and Oracle, and OLAP connection, which is used to connect to an OLAP cube. To create a data connection, we need to do the following: Right-click on the NorthWind Universe. Select a new Relational Data Connection. Enter NorthWind as the connection name, and write a brief description about this connection. The best practice is to always add a description for each created object. 
For example, code comments will help others understand why this object has been created, how to use it, and for which purpose they should use it. We can see the first page of the New Relational Connection wizard in the following screenshot: On the second page, expand the MS Access 2007 driver and select ODBC Drivers. Use the NorthWind ODBC connection. Click on Test Connection to make sure that the connection to the data source is successfully established. Click on Next to edit the connection's advanced options or click on Finish to use the default settings, as shown in the following screenshot: We can see the first parameters page of the MS Access 2007 connection in the following screenshot: You can now see the NorthWind connection under the NorthWind project in the Local Projects window. The local relational connection is stored as the .cnx file, while the shared secured connection is stored as a shortcut with the .cns extension. The local connection can be used in your local projects only, and you need to publish it to the BO repository to share it with other Universe designers. Creating a new Data Foundation After we successfully create a relation connection to the Northwind Microsoft Access database, we can now start creating our foundation. Data Foundation is a physical model that will store tables as well as the relations between them (joins). Data Foundation in the IDT is equivalent to the physical data layer in the UDT. To create a new Data Foundation, right-click on the NorthWind project in the Local Projects window, and then select New Data Foundation and perform the following steps: Enter NorthWind as a resource name, and enter a brief description on the NorthWind Data Foundation. Select the Single Source Data Foundation. Select the NorthWind.cnx connection. After that, expand the NorthWind connection, navigate to NorthWind.accdb, and perform the following steps: Navigate to the Customers table and then drag it to an empty area in the Master view window on the right-hand side. Save your Data Foundation. An asterisk (*) will be displayed beside the resource name to indicate that it was modified but not saved. We can see the Connection panel in the NorthWind.dfx Universe resource in the following screenshot: Creating a new Business layer Now, we will create a simple Business layer based on the Customer table that we already added to the NorthWind Data Foundation. Each Business layer should map to one Data Foundation at the end. The Business layer in the IDT is equivalent to the business model in the UDT. To create a new Business layer, right-click on the NorthWind project and then select New Business Layer from the menu. Then, we need to perform the following steps: The first step to create a Business layer is to select the type of the data source that we will use. In our case, select Relational Data Foundation as shown in the following screenshot: Enter NorthWind as the resource name and a brief description for our Business layer. In the next Select Data Foundation window, select the NorthWind Data Foundation from the list. Make sure that the Automatically create folders and objects option is selected, as shown in the following screenshot: Now, you should be able to see the Customer folder under the NorthWind Business layer. If not, just drag it from the NorthWind Data Foundation and drop it under the NorthWind Business layer. Then, save the NorthWind Business Layer, as shown in the following screenshot: A new folder will be created automatically for the Customers table. 
This folder is also populated with the corresponding dimensions. The Business layer now needs to be published to the BO server, and then, the end users will be able to access it and build their own reports on top of our Universe. If you successfully completed all the steps from the previous sections, the project folder should contain the relational data connection (NorthWind.cnx), the Data Foundation layer (NorthWind.dfx), and the Business layer (NorthWind.blx). The project should appear as displayed in the following screenshot: Saving and publishing the NorthWind Universe We need to perform one last step before we publish our first simple Universe and make it available for the other Universe designers. We need to publish our relational data connection and save it on the repository instead of on our local machine. Publishing a connection will make it available for everyone on the server. Before publishing the Universe, we will replace the NorthWind.cnx resource in our project with a shortcut to the NorthWind secured connection stored on the SAP BO server. After publishing a Universe, other developers as well as business users will be able to see and access it from the SAP BO repository. Publishing a Universe from the IDT is equivalent to exporting a Universe from the UDT (navigate to File | Export). To publish the NorthWind connection, we need to right-click on the NorthWind.cnx resource in the Local Projects window. Then, select Publish Connection to a Repository. As we don't have an active session with the BO server, you will need to initiate one by performing the following steps: Create a new session. Type your <system name: port number> in the System field. Select the Authentication type. Enter your username and password. We have many authentication types such as Enterprise, LDAP, and Windows Active Directory (AD). Enterprise authentication will store user security information inside the BO server. The user credential can only be used to log in to BO, while on the other hand, LDAP will store user security information in the LDAP server, and the user credential can be used to log in to multiple systems in this case. The BO server will send user information to the LDAP server to authenticate the user, and then, it will allow them to access the system in case of successful authentication. The last authentication type is Windows AD, which can also authenticate users using the security information stored inside. There are many authentication types such as Enterprise, LDAP, Windows AD, and SAP. We can see the Open Session window in the following screenshot: The default port number is 6400. A pop-up window will inform you about the connection status (successful here), and it will ask you whether you want to create a shortcut for this connection in the same project folder or not. We should select Yes in our case, because we need to link to the secured published connection instead of the local one. We will not be able to publish our Universe to the BO repository with a local connection. We can see the Publish Connection window in the following screenshot: Finally, we need to link our Data Foundation layer with the secured connection instead of the local connection. To do this, you need to open NorthWind.dfx and replace NorthWind.cnx with the NorthWind.cnc connection. Then, save your Data Foundation resource and right-click on NorthWind.blx. After that, navigate to Publish | To a Repository.... The Check Integrity window will be displayed. Just select Finish. 
We can see how to change connection in NorthWind.dfx in the following screenshot: After redirecting our Data Foundation layer to the newly created shortcut connection, we need to go to the Local Projects window again, right-click on NorthWind.blx, and publish it to the repository. Our Universe will be saved on the repository with the same name assigned to the Business layer. Congratulations! We have created our first Universe. Finding help while creating a Universe In most cases, you will use the step-by-step approach to create a Universe. However, we have two other ways that we can use to create a universe. In this section, we will try to create the NorthWind Universe again, but using the Universe wizard and Cheat Sheets. The Universe wizard The Universe wizard is just a wizard that will launch the project, connection, Data Foundation, and Business layer wizards in a sequence. We already explained each wizard individually in an earlier section. Each wizard will collect the required information to create the associated Universe resource. For example, the project wizard will end after collecting the required information to create a project, and the project folder will be created as an output. The Universe wizard will launch all the mentioned wizards, and it will end after collecting all the information required to create the Universe. A Universe with all the required resources will be created after finishing this wizard. The Universe wizard is equivalent to the Quick Design wizard in the UDT. You can open the Universe wizard from the welcome screen or from the File menu. As a practice, we can create the NorthWind2 Universe using the Universe wizard: The Universe wizard and welcome screen are new features in SAP BO 4.1. Cheat Sheets Cheat Sheets is another way of getting help while you are building your Universe. They provide step-by-step guidance and detailed descriptions that will help you create your relational Universe. We need to perform the following steps to use Cheat Sheets to build the NorthWind3 Universe, which is exactly the same as the NorthWind Universe that we created earlier in the step-by-step approach: Go to the Help menu and select Cheat Sheets. Follow the steps in the Cheat Sheets window to create the NorthWind3 Universe using the same information that we used to complete the NorthWind Universe. If you face any difficulties in completing any steps, just click on the Click to perform button to guide you. Click on the Click when completed link to move to the next step. Cheat Sheets is a new help method introduced in the IDT, and there is no equivalent for it in the UDT. We can see the Cheat Sheets window in the following screenshot: Summary In this article, we discussed the difference between IDT views, and we tried to get familiar with the IDT user interface. Then, we had an overview of the Universe creation process from start to end. In real-life project environments, the first step is to create a local project to hold all the related Universe resources. Then, we initiated the project by adding the main three resources that are required by each universe. These resources are data connection, Data Foundation, and Business layer. After that, we published our Universe to make it available to other Universe designers and users. This is done by publishing our data connection first and then by redirecting our foundation layer to refer to a shortcut for the shared secured published connection. At this point, we will be able to publish and share our Universe. 
We also learned how to use the Universe wizard and Cheat Sheets to create a Universe. Resources for Article: Further resources on this subject: Report Data Filtering [Article] Exporting SAP BusinessObjects Dashboards into Different Environments [Article] SAP BusinessObjects: Customizing the Dashboard [Article]

Visualization as a Tool to Understand Data

Packt
22 Sep 2014
23 min read
In this article by Nazmus Saquib, the author of Mathematica Data Visualization, we will look at a few simple examples that demonstrate the importance of data visualization. We will then discuss the types of datasets that we will encounter over the course of this book, and learn about the Mathematica interface to get ourselves warmed up for coding. (For more resources related to this topic, see here.) In the last few decades, the quick growth in the volume of information we produce and the capacity of digital information storage have opened a new door for data analytics. We have moved on from the age of terabytes to that of petabytes and exabytes. Traditional data analysis is now augmented with the term big data analysis, and computer scientists are pushing the bounds for analyzing this huge sea of data using statistical, computational, and algorithmic techniques. Along with the size, the types and categories of data have also evolved. Along with the typical and popular data domain in Computer Science (text, image, and video), graphs and various categorical data that arise from Internet interactions have become increasingly interesting to analyze. With the advances in computational methods and computing speed, scientists nowadays produce an enormous amount of numerical simulation data that has opened up new challenges in the field of Computer Science. Simulation data tends to be structured and clean, whereas data collected or scraped from websites can be quite unstructured and hard to make sense of. For example, let's say we want to analyze some blog entries in order to find out which blogger gets more follows and referrals from other bloggers. This is not as straightforward as getting some friends' information from social networking sites. Blog entries consist of text and HTML tags; thus, a combination of text analytics and tag parsing, coupled with a careful observation of the results would give us our desired outcome. Regardless of whether the data is simulated or empirical, the key word here is observation. In order to make intelligent observations, data scientists tend to follow a certain pipeline. The data needs to be acquired and cleaned to make sure that it is ready to be analyzed using existing tools. Analysis may take the route of visualization, statistics, and algorithms, or a combination of any of the three. Inference and refining the analysis methods based on the inference is an iterative process that needs to be carried out several times until we think that a set of hypotheses is formed, or a clear question is asked for further analysis, or a question is answered with enough evidence. Visualization is a very effective and perceptive method to make sense of our data. While statistics and algorithmic techniques provide good insights about data, an effective visualization makes it easy for anyone with little training to gain beautiful insights about their datasets. The power of visualization resides not only in the ease of interpretation, but it also reveals visual trends and patterns in data, which are often hard to find using statistical or algorithmic techniques. It can be used during any step of the data analysis pipeline—validation, verification, analysis, and inference—to aid the data scientist. How have you visualized your data recently? If you still have not, it is okay, as this book will teach you exactly that. 
However, if you had the opportunity to play with any kind of data already, I want you to take a moment and think about the techniques you used to visualize your data so far. Make a list of them. Done? Do you have 2D and 3D plots, histograms, bar charts, and pie charts in the list? If yes, excellent! We will learn how to style your plots and make them more interactive using Mathematica. Do you have chord diagrams, graph layouts, word cloud, parallel coordinates, isosurfaces, and maps somewhere in that list? If yes, then you are already familiar with some modern visualization techniques, but if you have not had the chance to use Mathematica as a data visualization language before, we will explore how visualization prototypes can be built seamlessly in this software using very little code. The aim of this book is to teach a Mathematica beginner the data-analysis and visualization powerhouse built into Mathematica, and at the same time, familiarize the reader with some of the modern visualization techniques that can be easily built with Mathematica. We will learn how to load, clean, and dissect different types of data, visualize the data using Mathematica's built-in tools, and then use the Mathematica graphics language and interactivity functions to build prototypes of a modern visualization. The importance of visualization Visualization has a broad definition, and so does data. The cave paintings drawn by our ancestors can be argued as visualizations as they convey historical data through a visual medium. Map visualizations were commonly used in wars since ancient times to discuss the past, present, and future states of a war, and to come up with new strategies. Astronomers in the 17th century were believed to have built the first visualization of their statistical data. In the 18th century, William Playfair invented many of the popular graphs we use today (line, bar, circle, and pie charts). Therefore, it appears as if many, since ancient times, have recognized the importance of visualization in giving some meaning to data. To demonstrate the importance of visualization in a simple mathematical setting, consider fitting a line to a given set of points. Without looking at the data points, it would be unwise to try to fit them with a model that seemingly lowers the error bound. It should also be noted that sometimes, the data needs to be changed or transformed to the correct form that allows us to use a particular tool. Visualizing the data points ensures that we do not fall into any trap. The following screenshot shows the visualization of a polynomial as a circle: Figure1.1 Fitting a polynomial In figure 1.1, the points are distributed around a circle. Imagine we are given these points in a Cartesian space (orthogonal x and y coordinates), and we are asked to fit a simple linear model. There is not much benefit if we try to fit these points to any polynomial in a Cartesian space; what we really need to do is change the parameter space to polar coordinates. A 1-degree polynomial in polar coordinate space (essentially a circle) would nicely fit these points when they are converted to polar coordinates, as shown in figure 1.1. Visualizing the data points in more complicated but similar situations can save us a lot of trouble. The following is a screenshot of Anscombe's quartet: Figure1.2 Anscombe's quartet, generated using Mathematica Downloading the color images of this book We also provide you a PDF file that has color images of the screenshots/diagrams used in this book. 
The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/2999OT_coloredimages.PDF. Anscombe's quartet (figure 1.2), named after the statistician Francis Anscombe, is a classic example of how simple data visualization like plotting can save us from making wrong statistical inferences. The quartet consists of four datasets that have nearly identical statistical properties (such as mean, variance, and correlation), and gives rise to the same linear model when a regression routine is run on these datasets. However, the second dataset does not really constitute a linear relationship; a spline would fit the points better. The third dataset (at the bottom-left corner of figure 1.2) actually has a different regression line, but the outlier exerts enough influence to force the same regression line on the data. The fourth dataset is not even a linear relationship, but the outlier enforces the same regression line again. These two examples demonstrate the importance of "seeing" our data before we blindly run algorithms and statistics. Fortunately, for visualization scientists like us, the world of data types is quite vast. Every now and then, this gives us the opportunity to create new visual tools other than the traditional graphs, plots, and histograms. These visual signatures and tools serve the same purpose that the graph plotting examples previously just did—spy and investigate data to infer valuable insights—but on different types of datasets other than just point clouds. Another important use of visualization is to enable the data scientist to interactively explore the data. Two features make today's visualization tools very attractive—the ability to view data from different perspectives (viewing angles) and at different resolutions. These features facilitate the investigator in understanding both the micro- and macro-level behavior of their dataset. Types of datasets There are many different types of datasets that a visualization scientist encounters in their work. This book's aim is to prepare an enthusiastic beginner to delve into the world of data visualization. Certainly, we will not comprehensively cover each and every visualization technique out there. Our aim is to learn to use Mathematica as a tool to create interactive visualizations. To achieve that, we will focus on a general classification of datasets that will determine which Mathematica functions and programming constructs we should learn in order to visualize the broad class of data covered in this book. Tables The table is one of the most common data structures in Computer Science. You might have already encountered this in a computer science, database, or even statistics course, but for the sake of completeness, we will describe the ways in which one could use this structure to represent different kinds of data. Consider the following table as an example:   Attribute 1 Attribute 2 … Item 1       Item 2       Item 3       When storing datasets in tables, each row in the table represents an instance of the dataset, and each column represents an attribute of that data point. For example, a set of two-dimensional Cartesian vectors can be represented as a table with two attributes, where each row represents a vector, and the attributes are the x and y coordinates relative to an origin. For three-dimensional vectors or more, we could just increase the number of attributes accordingly. 
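Before moving on to the types of datasets, note that the Anscombe claim is easy to verify numerically. The short sketch below uses Python and NumPy rather than Mathematica, purely as a language-neutral illustration, and the x and y values are transcribed from the first two of Anscombe's published datasets, so treat the exact digits as illustrative.

import numpy as np

# The first two of Anscombe's four datasets (values as commonly published).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for y in (y1, y2):
    slope, intercept = np.polyfit(x, y, 1)
    print(round(y.mean(), 2),                        # ~7.50 for both
          round(y.var(ddof=1), 2),                   # ~4.1 for both
          round(float(np.corrcoef(x, y)[0, 1]), 3),  # ~0.816 for both
          round(slope, 2), round(intercept, 2))      # ~0.5 and ~3.0 for both

# The summary statistics are nearly identical, yet a plot shows that the
# second dataset follows a curve rather than a line, which is Anscombe's point.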
Tables can be used to store more advanced forms of scientific, time series, and graph data. We will cover some of these datasets over the course of this book, so it is a good idea for us to get introduced to them now. Here, we explain the general concepts. Scalar fields There are many kinds of scientific dataset out there. In order to aid their investigations, scientists have created their own data formats and mathematical tools to analyze the data. Engineers have also developed their own visualization language in order to convey ideas in their community. In this book, we will cover a few typical datasets that are widely used by scientists and engineers. We will eventually learn how to create molecular visualizations and biomedical dataset exploration tools when we feel comfortable manipulating these datasets. In practice, multidimensional data (just like vectors in the previous example) is usually augmented with one or more characteristic variable values. As an example, let's think about how a physicist or an engineer would keep track of the temperature of a room. In order to tackle the problem, they would begin by measuring the geometry and the shape of the room, and put temperature sensors at certain places to measure the temperature. They will note the exact positions of those sensors relative to the room's coordinate system, and then, they will be all set to start measuring the temperature. Thus, the temperature of a room can be represented, in a discrete sense, by using a set of points that represent the temperature sensor locations and the actual temperature at those points. We immediately notice that the data is multidimensional in nature (the location of a sensor can be considered as a vector), and each data point has a scalar value associated with it (temperature). Such a discrete representation of multidimensional data is quite widely used in the scientific community. It is called a scalar field. The following screenshot shows the representation of a scalar field in 2D and 3D: Figure1.3 In practice, scalar fields are discrete and ordered Figure 1.3 depicts how one would represent an ordered scalar field in 2D or 3D. Each point in the 2D field has a well-defined x and y location, and a single temperature value gets associated with it. To represent a 3D scalar field, we can think of it as a set of 2D scalar field slices placed at a regular interval along the third dimension. Each point in the 3D field is a point that has {x, y, z} values, along with a temperature value. A scalar field can be represented using a table. We will denote each {x, y} point (for 2D) or {x, y, z} point values (for 3D) as a row, but this time, an additional attribute for the scalar value will be created in the table. Thus, a row will have the attributes {x, y, z, T}, where T is the temperature associated with the point defined by the x, y, and z coordinates. This is the most common representation of scalar fields. A widely used visualization technique to analyze scalar fields is to find out the isocontours or isosurfaces of interest. However, for now, let's take a look at the kind of application areas such analysis will enable one to pursue. Instead of temperature, one could think of associating regularly spaced points with any relevant scalar value to form problem-specific scalar fields. In an electrochemical simulation, it is important to keep track of the charge density in the simulation space. Thus, the chemist would create a scalar field with charge values at specific points. 
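Regardless of the physical quantity being measured, the tabular encoding of a scalar field is the same. The following sketch is written in Python and NumPy purely to make the {x, y, T} layout concrete (the book itself works in Mathematica), and the temperature values are made up for illustration.

import numpy as np

# A 4 x 4 grid of sensor positions covering the unit square.
xs, ys = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4))

# A made-up temperature reading at each sensor location.
temperature = 20 + 5 * np.exp(-((xs - 0.5) ** 2 + (ys - 0.5) ** 2))

# Flatten into a table: one row per point, with columns {x, y, T}.
field_table = np.column_stack([xs.ravel(), ys.ravel(), temperature.ravel()])
print(field_table.shape)   # (16, 3)
print(field_table[:3])

# A 3D scalar field would simply add a z attribute, giving rows of {x, y, z, T}.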
For an aerospace engineer, it is quite important to understand how air pressure varies across airplane wings; they would keep track of the pressure by forming a scalar field of pressure values. Scalar field visualization is very important in many other significant areas, ranging from from biomedical analysis to particle physics. In this book, we will cover how to visualize this type of data using Mathematica. Time series Another widely used data type is the time series. A time series is a sequence of data points that are measured usually over a uniform interval of time. Time series arise in many fields, but in today's world, they are mostly known for their applications in Economics and Finance. Other than these, they are frequently used in statistics, weather prediction, signal processing, astronomy, and so on. It is not the purpose of this book to describe the theory and mathematics of time series data. However, we will cover some of Mathematica's excellent capabilities for visualizing time series, and in the course of this book, we will construct our own visualization tool to view time series data. Time series can be easily represented using tables. Each row of the time series table will represent one point in the series, with one attribute denoting the time stamp—the time at which the data point was recorded, and the other attribute storing the actual data value. If the starting time and the time interval are known, then we can get rid of the time attribute and simply store the data value in each row. The actual timestamp of each value can be calculated using the initial time and time interval. Images and videos can be represented as tables too, with pixel-intensity values occupying each entry of the table. As we focus on visualization and not image processing, we will skip those types of data. Graphs Nowadays, graphs arise in all contexts of computer science and social science. This particular data structure provides a way to convert real-world problems into a set of entities and relationships. Once we have a graph, we can use a plethora of graph algorithms to find beautiful insights about the dataset. Technically, a graph can be stored as a table. However, Mathematica has its own graph data structure, so we will stick to its norm. Sometimes, visualizing the graph structure reveals quite a lot of hidden information. Graph visualization itself is a challenging problem, and is an active research area in computer science. A proper visualization layout, along with proper color maps and size distribution, can produce very useful outputs. Text The most common form of data that we encounter everywhere is text. Mathematica does not provide any specific visualization package for state-of-the-art text visualization methods. Cartographic data As mentioned before, map visualization is one of the ancient forms of visualization known to us. Nowadays, with the advent of GPS, smartphones, and publicly available country-based data repositories, maps are providing an excellent way to contrast and compare different countries, cities, or even communities. Cartographic data comes in various forms. A common form of a single data item is one that includes latitude, longitude, location name, and an attribute (usually numerical) that records a relevant quantity. However, instead of a latitude and longitude coordinate, we may be given a set of polygons that describe the geographical shape of the place. The attributable quantity may not be numerical, but rather something qualitative, like text. 
Thus, there is really no standard form that one can expect when dealing with cartographic data. Fortunately, Mathematica provides us with excellent data-mining and dissecting capabilities to build custom formats out of the data available to us. . Mathematica as a tool for visualization At this point, you might be wondering why Mathematica is suited for visualizing all the kinds of datasets that we have mentioned in the preceding examples. There are many excellent tools and packages out there to visualize data. Mathematica is quite different from other languages and packages because of the unique set of capabilities it presents to its user. Mathematica has its own graphics language, with which graphics primitives can be interactively rendered inside the worksheet. This makes Mathematica's capability similar to many widely used visualization languages. Mathematica provides a plethora of functions to combine these primitives and make them interactive. Speaking of interactivity, Mathematica provides a suite of functions to interactively display any of its process. Not only visualization, but any function or code evaluation can be interactively visualized. This is particularly helpful when managing and visualizing big datasets. Mathematica provides many packages and functions to visualize the kinds of datasets we have mentioned so far. We will learn to use the built-in functions to visualize structured and unstructured data. These functions include point, line, and surface plots; histograms; standard statistical charts; and so on. Other than these, we will learn to use the advanced functions that will let us build our own visualization tools. Another interesting feature is the built-in datasets that this software provides to its users. This feature provides a nice playground for the user to experiment with different datasets and visualization functions. From our discussion so far, we have learned that visualization tools are used to analyze very large datasets. While Mathematica is not really suited for dealing with petabytes or exabytes of data (and many other popularly used visualization tools are not suited for that either), often, one needs to build quick prototypes of such visualization tools using smaller sample datasets. Mathematica is very well suited to prototype such tools because of its efficient and fast data-handling capabilities, along with its loads of convenient functions and user-friendly interface. It also supports GPU and other high-performance computing platforms. Although it is not within the scope of this book, a user who knows how to harness the computing power of Mathematica can couple that knowledge with visualization techniques to build custom big data visualization solutions. Another feature that Mathematica presents to a data scientist is the ability to keep the workflow within one worksheet. In practice, many data scientists tend to do their data analysis with one package, visualize their data with another, and export and present their findings using something else. Mathematica provides a complete suite of a core language, mathematical and statistical functions, a visualization platform, and versatile data import and export features inside a single worksheet. This helps the user focus on the data instead of irrelevant details. By now, I hope you are convinced that Mathematica is worth learning for your data-visualization needs. 
If you still do not believe me, I hope I will be able to convince you again at the end of the book, when we will be done developing several visualization prototypes, each requiring only few lines of code! Getting started with Mathematica We will need to know a few basic Mathematica notebook essentials. Assuming you already have Mathematica installed on your computer, let's open a new notebook by navigating to File|New|Notebook, and do the following experiments. Creating and selecting cells In Mathematica, a chunk of code or any number of mathematical expressions can be written within a cell. Each cell in the notebook can be evaluated to see the output immediately below it. To start a new cell, simply start typing at the position of the blinking cursor. Each cell can be selected by clicking on the respective rightmost bracket. To select multiple cells, press Ctrl + right-mouse button in Windows or Linux (or cmd + right-mouse button on a Mac) on each of the cells. The following screenshot shows several cells selected together, along with the output from each cell: Figure1.4 Selecting and evaluating cells in Mathematica We can place a new cell in between any set of cells in order to change the sequence of instruction execution. Use the mouse to place the cursor in between two cells, and start typing your commands to create a new cell. We can also cut, copy, and paste cells by selecting them and applying the usual shortcuts (for example, Ctrl + C, Ctrl + X, and Ctrl + V in Windows/Linux, or cmd + C, cmd + X, and cmd + V in Mac) or using the Edit menu bar. In order to delete cell(s), select the cell(s) and press the Delete key. Evaluating a cell A cell can be evaluated by pressing Shift + Enter. Multiple cells can be selected and evaluated in the same way. To evaluate the full notebook, press Ctrl + A (to select all the cells) and then press Shift + Enter. In this case, the cells will be evaluated one after the other in the sequence in which they appear in the notebook. To see examples of notebooks filled with commands, code, and mathematical expressions, you can open the notebooks supplied with this article, which are the polar coordinates fitting and Anscombe's quartet examples, and select each cell (or all of them) and evaluate them. If we evaluate a cell that uses variables declared in a previous cell, and the previous cell was not already evaluated, then we may get errors. It is possible that Mathematica will treat the unevaluated variables as a symbolic expression; in that case, no error will be displayed, but the results will not be numeric anymore. Suppressing output from a cell If we don't wish to see the intermediate output as we load data or assign values to variables, we can add a semicolon (;) at the end of each line that we want to leave out from the output. Cell formatting Mathematica input cells treat everything inside them as mathematical and/or symbolic expressions. By default, every new cell you create by typing at the horizontal cursor will be an input expression cell. However, you can convert the cell to other formats for convenient typesetting. In order to change the format of cell(s), select the cell(s) and navigate to Format|Style from the menu bar, and choose a text format style from the available options. You can add mathematical symbols to your text by selecting Palettes|Basic Math Assistant. Note that evaluating a text cell will have no effect/output. Commenting We can write any comment in a text cell as it will be ignored during the evaluation of our code. 
However, if we would like to write a comment inside an input cell, we use the (* operator to open a comment and the *) operator to close it, as shown in the following code snippet: (* This is a comment *) The shortcut Ctrl + / (cmd + / in Mac) is used to comment/uncomment a chunk of code too. This operation is also available in the menu bar. Downloading the example code You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. Aborting evaluation We can abort the currently running evaluation of a cell by navigating to Evaluation|Abort Evaluation in the menu bar, or simply by pressing Alt + . (period). This is useful when you want to end a time-consuming process that you suddenly realize will not give you the correct results at the end of the evaluation, or end a process that might use up the available memory and shut down the Mathematica kernel. Further reading The history of visualization deserves a separate book, as it is really fascinating how the field has matured over the centuries, and it is still growing very strongly. Michael Friendly, from York University, published a historical development paper that is freely available online, titled Milestones in History of Data Visualization: A Case Study in Statistical Historiography. This is an entertaining compilation of the history of visualization methods. The book The Visual Display of Quantitative Information by Edward R. Tufte published by Graphics Press USA, is an excellent resource and a must-read for every data visualization practitioner. This is a classic book on the theory and practice of data graphics and visualization. Since we will not have the space to discuss the theory of visualization, the interested reader can consider reading this book for deeper insights. Summary In this article, we discussed the importance of data visualization in different contexts. We also introduced the types of dataset that will be visualized over the course of this book. The flexibility and power of Mathematica as a visualization package was discussed, and we will see the demonstration of these properties throughout the book with beautiful visualizations. Finally, we have taken the first step to writing code in Mathematica. Resources for Article: Further resources on this subject: Driving Visual Analyses with Automobile Data (Python) [article] Importing Dynamic Data [article] Interacting with Data for Dashboards [article]

Driving Visual Analyses with Automobile Data (Python)

This article, written by Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, and Abhijit Dasgupta, authors of the book Practical Data Science Cookbook, covers the following topics:
Getting started with IPython
Exploring IPython Notebook
Preparing to analyze automobile fuel efficiencies
Exploring and describing the fuel efficiency data with Python
(For more resources related to this topic, see here.)
The dataset, available at http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip, contains fuel efficiency performance metrics over time for all makes and models of automobiles in the United States of America. It also contains numerous other features and attributes of the automobile models besides fuel economy, providing an opportunity to summarize and group the data so that we can identify interesting trends and relationships.
We will perform the entire analysis using Python. However, we will ask the same questions and follow the same sequence of steps as before, again following the data science pipeline. With study, this will allow you to see the similarities and differences between the two languages for a mostly identical analysis.
In this article, we will take a very different approach, using Python as a scripting language in an interactive fashion that is more similar to R. We will introduce the reader to the unofficial interactive environment of Python, IPython, and the IPython Notebook, showing how to produce readable and well-documented analysis scripts. Further, we will leverage the data analysis capabilities of the relatively new but powerful pandas library and the invaluable data frame data type that it offers. pandas often allows us to complete complex tasks with fewer lines of code. The drawback to this approach is that while you don't have to reinvent the wheel for common data manipulation tasks, you do have to learn the API of a completely different package, namely pandas.
The goal of this article is not to guide you through an analysis project that you have already completed, but to show you how that project can be completed in another language. More importantly, we want to get you, the reader, to become more introspective with your own code and analysis. Think not only about how something is done but why something is done that way in that particular language. How does the language shape the analysis?
Getting started with IPython
IPython is the interactive computing shell for Python that will change the way you think about interactive shells. It brings to the table a host of very useful functionalities that will most likely become part of your default toolbox, including magic functions, tab completion, easy access to command-line tools, and much more. We will only scratch the surface here and strongly recommend that you keep exploring what can be done with IPython.
Getting ready
If you have completed the installation, you should be ready to tackle the following recipes. Note that IPython 2.0, which is a major release, was launched in 2014.
How to do it…
The following steps will get you up and running with the IPython environment:
Open up a terminal window on your computer and type ipython. You should be immediately presented with the following text:
    Python 2.7.5 (default, Mar 9 2014, 22:15:05)
    Type "copyright", "credits" or "license" for more information.

    IPython 2.1.0 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.

    In [1]:
Note that your version might be slightly different than what is shown in the preceding command-line output.
Just to show you how great IPython is, type in ls, and you should be greeted with the directory listing! Yes, you have access to common Unix commands straight from your Python prompt inside the Python interpreter.
Now, let's try changing directories. Type cd at the prompt, hit space, and now hit Tab. You should be presented with a list of directories available from within the current directory. Start typing the first few letters of the target directory, and then hit Tab again. If there is only one option that matches, hitting the Tab key will automatically insert that name. Otherwise, the list of possibilities will show only those names that match the letters that you have already typed. Each letter that is entered acts as a filter when you press Tab.
Now, type ?, and you will get a quick introduction to and overview of IPython's features.
Let's take a look at the magic functions. These are special functions that IPython understands and will always start with the % symbol. The %paste function is one such example and is amazing for copying and pasting Python code into IPython without losing proper indentation.
We will try the %timeit magic function, which intelligently benchmarks Python code. Enter the following commands:
    n = 100000
    %timeit range(n)
    %timeit xrange(n)
We should get an output like this:
    1000 loops, best of 3: 1.22 ms per loop
    1000000 loops, best of 3: 258 ns per loop
This shows you how much faster xrange is than range (1.22 milliseconds versus 258 nanoseconds!) and helps show you the utility of generators in Python.
You can also easily run system commands by prefacing the command with an exclamation mark. Try the following command:
    !ping www.google.com
You should see the following output:
    PING google.com (74.125.22.101): 56 data bytes
    64 bytes from 74.125.22.101: icmp_seq=0 ttl=38 time=40.733 ms
    64 bytes from 74.125.22.101: icmp_seq=1 ttl=38 time=40.183 ms
    64 bytes from 74.125.22.101: icmp_seq=2 ttl=38 time=37.635 ms
Finally, IPython provides an excellent command history. Simply press the up arrow key to access the previously entered command. Continue to press the up arrow key to walk backwards through the command list of your session and the down arrow key to come forward. Also, the magic %history command allows you to jump to a particular command number in the session. Type the following command to see the first command that you entered:
    %history 1
Now, type exit to drop out of IPython and back to your system command prompt.
How it works…
There isn't much to explain here, and we have just scratched the surface of what IPython can do. Hopefully, we have gotten you interested in diving deeper, especially with the wealth of new features offered by IPython 2.0, including dynamic and user-controllable data visualizations.
See also
IPython at http://ipython.org/
The IPython Cookbook at https://github.com/ipython/ipython/wiki?path=Cookbook
IPython: A System for Interactive Scientific Computing at http://fperez.org/papers/ipython07_pe-gr_cise.pdf
Learning IPython for Interactive Computing and Data Visualization, Cyrille Rossant, Packt Publishing, available at http://www.packtpub.com/learning-ipython-for-interactive-computing-and-data-visualization/book
The future of IPython at http://www.infoworld.com/print/236429
Exploring IPython Notebook
IPython Notebook is the perfect complement to IPython.
As per the IPython website: "The IPython Notebook is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document." While this is a bit of a mouthful, it is actually a pretty accurate description. In practice, IPython Notebook allows you to intersperse your code with comments and images and anything else that might be useful. You can use IPython Notebooks for everything from presentations (a great replacement for PowerPoint) to an electronic laboratory notebook or a textbook. Getting ready If you have completed the installation, you should be ready to tackle the following recipes. How to do it… These steps will get you started with exploring the incredibly powerful IPython Notebook environment. We urge you to go beyond this simple set of steps to understand the true power of the tool. Type ipython notebook --pylab=inline in the command prompt. The --pylab=inline option should allow your plots to appear inline in your notebook. You should see some text quickly scroll by in the terminal window, and then, the following screen should load in the default browser (for me, this is Chrome). Note that the URL should be http://127.0.0.1:8888/, indicating that the browser is connected to a server running on the local machine at port 8888. You should not see any notebooks listed in the browser (note that IPython Notebook files have a .ipynb extension) as IPython Notebook searches the directory you launched it from for notebook files. Let's create a notebook now. Click on the New Notebook button in the upper right-hand side of the page. A new browser tab or window should open up, showing you something similar to the following screenshot: From the top down, you can see the text-based menu followed by the toolbar for issuing common commands, and then, your very first cell, which should resemble the command prompt in IPython. Place the mouse cursor in the first cell and type 5+5. Next, either navigate to Cell | Run or press Shift + Enter as a keyboard shortcut to cause the contents of the cell to be interpreted. You should now see something similar to the following screenshot. Basically, we just executed a simple Python statement within the first cell of our first IPython Notebook. Click on the second cell, and then, navigate to Cell | Cell Type | Markdown. Now, you can easily write markdown in the cell for documentation purposes. Close the two browser windows or tabs (the notebook and the notebook browser). Go back to the terminal in which you typed ipython notebook, hit Ctrl + C, then hit Y, and press Enter. This will shut down the IPython Notebook server. How it works… For those of you coming from either more traditional statistical software packages, such as Stata, SPSS, or SAS, or more traditional mathematical software packages, such as MATLAB, Mathematica, or Maple, you are probably used to the very graphical and feature-rich interactive environments provided by the respective companies. From this background, IPython Notebook might seem a bit foreign but hopefully much more user friendly and less intimidating than the traditional Python prompt. Further, IPython Notebook offers an interesting combination of interactivity and sequential workflow that is particularly well suited for data analysis, especially during the prototyping phases. R has a library called Knitr (http://yihui.name/knitr/) that offers the report-generating capabilities of IPython Notebook. 
When you type in ipython notebook, you are launching a server running on your local machine, and IPython Notebook itself is really a web application that uses a server-client architecture. The IPython Notebook server, as per ipython.org, uses a two-process kernel architecture with ZeroMQ (http://zeromq.org/) and Tornado. ZeroMQ is an intelligent socket library for high-performance messaging, helping IPython manage distributed compute clusters among other tasks. Tornado is a Python web framework and asynchronous networking module that serves IPython Notebook's HTTP requests. The project is open source and you can contribute to the source code if you are so inclined. IPython Notebook also allows you to export your notebooks, which are actually just text files filled with JSON, to a large number of alternative formats using the command-line tool called nbconvert (http://ipython.org/ipython-doc/rel-1.0.0/interactive/nbconvert.html). Available export formats include HTML, LaTex, reveal.js HTML slideshows, Markdown, simple Python scripts, and reStructuredText for the Sphinx documentation. Finally, there is IPython Notebook Viewer (nbviewer), which is a free web service where you can both post and go through static, HTML versions of notebook files hosted on remote servers (these servers are currently donated by Rackspace). Thus, if you create an amazing .ipynb file that you want to share, you can upload it to http://nbviewer.ipython.org/ and let the world see your efforts. There's more… We will try not to sing too loudly the praises of Markdown, but if you are unfamiliar with the tool, we strongly suggest that you try it out. Markdown is actually two different things: a syntax for formatting plain text in a way that can be easily converted to a structured document and a software tool that converts said text into HTML and other languages. Basically, Markdown enables the author to use any desired simple text editor (VI, VIM, Emacs, Sublime editor, TextWrangler, Crimson Editor, or Notepad) that can capture plain text yet still describe relatively complex structures such as different levels of headers, ordered and unordered lists, and block quotes as well as some formatting such as bold and italics. Markdown basically offers a very human-readable version of HTML that is similar to JSON and offers a very human-readable data format. See also IPython Notebook at http://ipython.org/notebook.html The IPython Notebook documentation at http://ipython.org/ipython-doc/stable/interactive/notebook.html An interesting IPython Notebook collection at https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks The IPython Notebook development retrospective at http://blog.fperez.org/2012/01/ipython-notebook-historical.html Setting up a remote IPython Notebook server at http://nbviewer.ipython.org/github/Unidata/tds-python-workshop/blob/master/ipython-notebook-server.ipynb The Markdown home page at https://daringfireball.net/projects/markdown/basics Preparing to analyze automobile fuel efficiencies In this recipe, we are going to start our Python-based analysis of the automobile fuel efficiencies data. Getting ready If you completed the first installation successfully, you should be ready to get started. How to do it… The following steps will see you through setting up your working directory and IPython for the analysis for this article: Create a project directory called fuel_efficiency_python. 
Download the automobile fuel efficiency dataset from http://fueleconomy.gov/feg/epadata/vehicles.csv.zip and store it in the preceding directory.
Extract the vehicles.csv file from the zip file into the same directory.
Open a terminal window and change the current directory (cd) to the fuel_efficiency_python directory.
At the terminal, type the following command:
    ipython notebook
Once the new page has loaded in your web browser, click on New Notebook.
Click on the current name of the notebook, which is untitled0, and enter a new name for this analysis (mine is fuel_efficiency_python).
Let's use the top-most cell for import statements. Type in the following commands:
    import pandas as pd
    import numpy as np
    from ggplot import *
    %matplotlib inline
Then, hit Shift + Enter to execute the cell. This imports both the pandas and numpy libraries, assigning them local names to save a few characters while typing commands. It also imports the ggplot library. Please note that using the from ggplot import * command line is not a best practice in Python, as it pours the ggplot package contents into our default namespace. However, we are doing this so that our ggplot syntax most closely resembles the R ggplot2 syntax, which is strongly not Pythonic. Finally, we use a magic command to tell IPython Notebook that we want matplotlib graphs to render in the notebook.
In the next cell, let's import the data and look at the first few records:
    vehicles = pd.read_csv("vehicles.csv")
    vehicles.head()
Then, press Shift + Enter. The first few records of the data frame should be shown. However, notice that a red warning message appears as follows:
    /Library/Python/2.7/site-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (22,23,70,71,72,73) have mixed types. Specify dtype option on import or set low_memory=False.
      data = self._reader.read(nrows)
This tells us that columns 22, 23, 70, 71, 72, and 73 contain mixed data types (as the message suggests, passing an explicit dtype or low_memory=False to read_csv would silence it). Let's find the corresponding names using the following commands:
    column_names = vehicles.columns.values
    column_names[[22, 23, 70, 71, 72, 73]]

    array([cylinders, displ, fuelType2, rangeA, evMotor, mfrCode], dtype=object)
Mixed data types sound like they could be problematic, so make a mental note of these column names. Remember, data cleaning and wrangling often consume 90 percent of project time.
How it works…
With this recipe, we are simply setting up our working directory and creating a new IPython Notebook that we will use for the analysis. We have imported the pandas library and very quickly read the vehicles.csv data file directly into a data frame. Speaking from experience, pandas' robust data import capabilities will save you a lot of time. Although we imported data directly from a comma-separated value file into a data frame, pandas is capable of handling many other formats, including Excel, HDF, SQL, JSON, Stata, and even the clipboard, using the reader functions. We can also write out the data from data frames in just as many formats using writer functions accessed from the data frame object. Using the bound method head that is part of the DataFrame class in pandas, we have received a very informative summary of the data frame, including a per-column count of non-null values and a count of the various data types across the columns.
There's more…
The data frame is an incredibly powerful concept and data structure.
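To make that claim a little more concrete before moving on, here is a minimal, hedged sketch (not part of the recipe) of the kind of row selection discussed next; it only uses the year and city08 (city MPG) columns of the vehicles data we have just loaded.
    # A hedged sketch, not part of the original recipe: boolean selection of
    # observations and a simple summary of one column of the subgroup.
    import pandas as pd

    vehicles = pd.read_csv("vehicles.csv")    # or reuse the data frame loaded above

    # Keep only the rows (observations) for model year 2014
    recent = vehicles[vehicles["year"] == 2014]

    print(len(recent))                        # number of 2014 models
    print(recent["city08"].mean())            # their average city fuel economy
Each comparison produces a boolean Series, and indexing with it keeps only the matching rows.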
Thinking in data frames is critical for many data analyses yet also very different from thinking in array or matrix operations (say, if you are coming from MATLAB or C as your primary development languages). With the data frame, each column represents a different variable or characteristic and can be a different data type, such as floats, integers, or strings. Each row of the data frame is a separate observation or instance with its own set of values. For example, if each row represents a person, the columns could be age (an integer) and gender (a category or string). Often, we will want to select the set of observations (rows) that match a particular characteristic (say, all males) and examine this subgroup. The data frame is conceptually very similar to a table in a relational database.
See also
Data structures in pandas at http://pandas.pydata.org/pandas-docs/stable/dsintro.html
Data frames in R at http://www.r-tutor.com/r-introduction/data-frame
Exploring and describing the fuel efficiency data with Python
Now that we have imported the automobile fuel efficiency dataset into IPython and witnessed the power of pandas, the next step is to replicate the preliminary analysis performed in R, getting your feet wet with some basic pandas functionality.
Getting ready
We will continue to grow and develop the IPython Notebook that we started in the previous recipe. If you've completed the previous recipe, you should have everything you need to continue.
How to do it…
First, let's find out how many observations (rows) are in our data using the following command:
    len(vehicles)

    34287
If you switch back and forth between R and Python, remember that in R, the function is length and in Python, it is len.
Next, let's find out how many variables (columns) are in our data using the following command:
    len(vehicles.columns)

    74
Let's get a list of the names of the columns using the following command:
    print(vehicles.columns)

    Index([u'barrels08', u'barrelsA08', u'charge120', u'charge240', u'city08',
    u'city08U', u'cityA08', u'cityA08U', u'cityCD', u'cityE', u'cityUF', u'co2',
    u'co2A', u'co2TailpipeAGpm', u'co2TailpipeGpm', u'comb08', u'comb08U',
    u'combA08', u'combA08U', u'combE', u'combinedCD', u'combinedUF',
    u'cylinders', u'displ', u'drive', u'engId', u'eng_dscr', u'feScore',
    u'fuelCost08', u'fuelCostA08', u'fuelType', u'fuelType1', u'ghgScore',
    u'ghgScoreA', u'highway08', u'highway08U', u'highwayA08', u'highwayA08U',
    u'highwayCD', u'highwayE', u'highwayUF', u'hlv', u'hpv', u'id', u'lv2',
    u'lv4', u'make', u'model', u'mpgData', u'phevBlended', u'pv2', u'pv4',
    u'range', u'rangeCity', u'rangeCityA', u'rangeHwy', u'rangeHwyA', u'trany',
    u'UCity', u'UCityA', u'UHighway', u'UHighwayA', u'VClass', u'year',
    u'youSaveSpend', u'guzzler', u'trans_dscr', u'tCharger', u'sCharger',
    u'atvType', u'fuelType2', u'rangeA', u'evMotor', u'mfrCode'], dtype=object)
The u letter in front of each string indicates that the strings are represented in Unicode (http://docs.python.org/2/howto/unicode.html).
Let's find out how many unique years of data are included in this dataset and what the first and last years are using the following commands:
    len(pd.unique(vehicles.year))

    31

    min(vehicles.year)

    1984

    max(vehicles["year"])

    2014
Note that again, we have used two different syntaxes to reference individual columns within the vehicles data frame.
Next, let's find out what types of fuel are used as the automobiles' primary fuel types. In R, we have the table function that will return a count of the occurrences of a variable's various values.
In pandas, we use the following:
    pd.value_counts(vehicles.fuelType1)

    Regular Gasoline     24587
    Premium Gasoline      8521
    Diesel                1025
    Natural Gas             57
    Electricity             56
    Midgrade Gasoline       41
    dtype: int64
Now, if we want to explore what types of transmissions these automobiles have, we immediately try the following command:
    pd.value_counts(vehicles.trany)
However, this results in a bit of unexpected and lengthy output. What we really want to know is the number of cars with automatic and manual transmissions. We notice that the trany variable always starts with the letter A when it represents an automatic transmission and M for a manual transmission. Thus, we create a new variable, trany2, that contains the first character of the trany variable, which is a string:
    vehicles["trany2"] = vehicles.trany.str[0]
    pd.value_counts(vehicles.trany2)
The preceding command yields the answer that we wanted, that is, roughly twice as many automatics as manuals:
    A    22451
    M    11825
    dtype: int64
How it works…
In this recipe, we looked at some basic functionality in Python and pandas. We have used two different syntaxes (vehicles['trany'] and vehicles.trany) to access variables within the data frame. We have also used some of the core pandas functions to explore the data, such as the incredibly useful unique and value_counts functions.
There's more…
In terms of the data science pipeline, we have touched on two stages in a single recipe: data cleaning and data exploration. Often, when working with smaller datasets, where the time to complete a particular action is quite short and can be completed on our laptop, we will very quickly go through multiple stages of the pipeline and then loop back, depending on the results. In general, the data science pipeline is a highly iterative process. The faster we can accomplish steps, the more iterations we can fit into a fixed time, and often, we can create a better final analysis.
See also
The pandas API overview at http://pandas.pydata.org/pandas-docs/stable/api.html
Summary
This article took you through the process of analyzing and visualizing automobile data to identify trends and patterns in fuel efficiency over time using the powerful programming language, Python.
Resources for Article:
Further resources on this subject:
Importing Dynamic Data [article]
MongoDB data modeling [article]
Report Data Filtering [article]

Caches

In this article, by Federico Razzoli, author of the book Mastering MariaDB, we will see that how in order to avoid accessing disks, MariaDB and storage engines have several caches that a DBA should know about. (For more resources related to this topic, see here.) InnoDB caches Since InnoDB is the recommended engine for most use cases, configuring it is very important. The InnoDB buffer pool is a cache that should speed up most read and write operations. Thus, every DBA should know how it works. The doublewrite buffer is an important mechanism that guarantees that a row is never half-written to a file. For heavy-write workloads, we may want to disable it to obtain more speed. InnoDB pages Tables, data, and indexes are organized in pages, both in the caches and in the files. A page is a package of data that contains one or two rows and usually some empty space. The ratio between the used space and the total size of pages is called the fill factor. By changing the page size, the fill factor changes inevitably. InnoDB tries to keep the pages 15/16 full. If a page's fill factor is lower than 1/2, InnoDB merges it with another page. If the rows are written sequentially, the fill factor should be about 15/16. If the rows are written randomly, the fill factor is between 1/2 and 15/16. A low fill factor represents a memory waste. With a very high fill factor, when pages are updated and their content grows, they often need to be reorganized, which negatively affects the performance. The columns with a variable length type (TEXT, BLOB, VARCHAR, or VARBIT) are written into separate data structures called overlow pages. Such columns are called off-page columns. They are better handled by the DYNAMIC row format, which can be used for most tables when backward compatibility is not a concern. A page never changes its size, and the size is the same for all pages. The page size, however, is configurable: it can be 4 KB, 8 KB, or 16 KB. The default size is 16 KB, which is appropriate for many workloads and optimizes full table scans. However, smaller sizes can improve the performance of some OLTP workloads involving many small insertions because of lower memory allocation, or storage devices with smaller blocks (old SSD devices). Another reason to change the page size is that this can greatly affect the InnoDB compression. The page size can be changed by setting the innodb_page_size variable in the configuration file and restarting the server. The InnoDB buffer pool On servers that mainly use InnoDB tables (the most common case), the buffer pool is the most important cache to consider. Ideally, it should contain all the InnoDB data and indexes to allow MariaDB to execute queries without accessing the disks. Changes to data are written into the buffer pool first. They are flushed to the disks later to reduce the number of I/O operations. Of course, if the data does not fit the server's memory, only a subset of them can be in the buffer pool. In this case, that subset should be the so-called working set: the most frequently accessed data. The default size of the buffer pool is 128 MB and should always be changed. On production servers, this value is too low. On a developer's computer, usually, there is no need to dedicate so much memory to InnoDB. The minimum size, 5 MB, is usually more than enough when developing a simple application. Old and new pages We can think of the buffer pool as a list of data pages that are sorted with a variation of the classic Last Recently Used (LRU) algorithm. 
The list is split into two sublists: the new list contains the most used pages, and the old list contains the less used pages. The first page in each sublist is called the head. The head of the old list is called the midpoint. When a page is accessed that is not in the buffer pool, it is inserted into the midpoint. The other pages in the old list shift by one position, and the last one is evicted. When a page from the old list is accessed, it is moved from the old list to the head of the new list. When a page in the new list is accessed, it goes to the head of the list. The following variables affect the previously described algorithm: innodb_old_blocks_pct: This variable defines the percentage of the buffer pool reserved to the old list. The allowed range is 5 to 95, and it is 37 (3/5) by default. innodb_old_blocks_time: If this value is not 0, it represents the minimum age (in milliseconds) the old pages must reach before they can be moved into the new list. If an old page is accessed that did not reach this age, it goes to the head of the old list. innodb_max_dirty_pages_pct: This variable defines the maximum percentage of pages that were modified in-memory. This mechanism will be discussed in the Dirty pages section later in this article. This value is not a hard limit, but InnoDB tries not to exceed it. The allowed range is 0 to 100, and the default is 75. Increasing this value can reduce the rate of writes, but the shutdown will take longer (because dirty pages need to be written onto the disk before the server can be stopped in a clean way). innodb_flush_neighbors: If set to 1, when a dirty page is flushed from memory to a disk, even the contiguous pages are flushed. If set to 2, all dirty pages from the same extent (the portion of memory whose size is 1 MB) are flushed. With 0, only dirty pages are flushed when their number exceeds innodb_max_dirty_pages_pct or when they are evicted from the buffer pool. The default is 1. This optimization is only useful for spinning disks. Write-incentive workloads may need an aggressive flushing strategy; however, if the pages are written too often, they degrade the performance. Buffer pool instances On MariaDB versions older than 5.5, InnoDB creates only one instance of the buffer pool. However, concurrent threads are blocked by a mutex, and this may become a bottleneck. This is particularly true if the concurrency level is high and the buffer pool is very big. Splitting the buffer pool into multiple instances can solve the problem. Multiple instances represent an advantage only if the buffer pool size is at least 2 GB. Each instance should be of size 1 GB. InnoDB will ignore the configuration and will maintain only one instance if the buffer pool size is less than 1 GB. Furthermore, this feature is more useful on 64-bit systems. The following variables control the instances and their size: innodb_buffer_pool_size: This variable defines the total size of the buffer pool (no single instances). Note that the real size will be about 10 percent bigger than this value. A percentage of this amount of memory is dedicated to the change buffer. innodb_buffer_pool_instances: This variable defines the number of instances. If the value is -1, InnoDB will automatically decide the number of instances. The maximum value is 64. The default value is 8 on Unix and depends on the innodb_buffer_pool_size variable on Windows. Dirty pages When a user executes a statement that modifies data in the buffer pool, InnoDB initially modifies the data that is only in memory. 
The pages that are only modified in the buffer pool are called dirty pages. Pages that have not been modified or whose changes have been written on the disk are called as clean pages. Note that changes to data are also written to the redo log. If a crash occurs before those changes are applied to data files, InnoDB is usually able to recover the data, including the last modifications, by reading the redo log and the doublewrite buffer. The doublewrite buffer will be discussed later, in the Explaining the doublewrite buffer section. At some point, the data needs to be flushed to the InnoDB data files (the .ibd files). In MariaDB 10.0, this is done by a dedicated thread called the page cleaner. In older versions, this was done by the master thread, which executes several InnoDB maintenance operations. The flushing is not only concerned with the buffer pool, but also with the InnoDB redo and undo log. The list of dirty pages is frequently updated when transactions write data at the physical level. It has its own mutex that does not lock the whole buffer pool. The maximum number of dirty pages is determined by innodb_max_dirty_pages_pct as a percentage. When this maximum limit is reached, dirty pages are flushed. The innodb_flush_neighbor_pages value determines how InnoDB selects the pages to flush. If it is set to none, only selected pages are written. If it is set to area, even the neighboring dirty pages are written. If it is set to cont, all contiguous blocks of the dirty pages are flushed. On shutdown, a complete page flushing is only done if innodb_fast_shutdown is 0. Normally, this method should be preferred, because it leaves data in a consistent state. However, if many changes have been requested but still not written to disk, this process could be very slow. It is possible to speed up the shutdown by specifying a higher value for innodb_fast_shutdown. In this case, a crash recovery will be performed on the next restart. The read ahead optimization The read ahead feature is designed to reduce the number of read operations from the disks. It tries to guess which data will be needed in the near future and reads it with one operation. Two algorithms are available to choose the pages to read in advance: linear read ahead random read ahead The linear read ahead is used by default. It counts the pages in the buffer pool that are read sequentially. If their number is greater than or equal to innodb_read_ahead_threshold, InnoDB will read all data from the same extent (a portion of data whose size is always 1 MB). The innodb_read_ahead_threshold value must be a number from 0 to 64. The value 0 disables the linear read ahead but does not enable the random read ahead. The default value is 56. The random read ahead is only used if the innodb_random_read_ahead server variable is set to ON. By default, it is set to OFF. This algorithm checks whether at least 13 pages in the buffer pool have been read to the same extent. In this case, it does not matter whether they were read sequentially. With this variable enabled, the full extent will be read. The 13-page threshold is not configurable. If innodb_read_ahead_threshold is set to 0 and innodb_random_read_ahead is set to OFF, the read ahead optimization is completely turned off. Diagnosing the buffer pool performance MariaDB provides some tools to monitor the activities of the buffer pool and the InnoDB main thread. By inspecting these activities, a DBA can tune the relevant server variables to improve the performance. 
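As a first, very rough example of such an inspection, the dirty-page ratio discussed above can be read from two status counters. The sketch below is not from the book; it assumes the PyMySQL package and placeholder credentials, and the same query can of course be run directly from any SQL client.
    # Hedged sketch (not from the book): check the dirty page ratio via the
    # Innodb_buffer_pool_pages_* status counters. Credentials are placeholders.
    import pymysql

    conn = pymysql.connect(host="localhost", user="root", password="secret")
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_%'")
        status = dict(cur.fetchall())
    conn.close()

    dirty = int(status["Innodb_buffer_pool_pages_dirty"])
    total = int(status["Innodb_buffer_pool_pages_total"])
    print("Dirty pages: %d of %d (%.1f%%)" % (dirty, total, 100.0 * dirty / total))
A ratio that stays close to the innodb_max_dirty_pages_pct limit for long periods suggests that flushing cannot keep up with the write load. The dedicated statements and tables described next give a much more detailed picture.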
In this section, we will discuss the SHOW ENGINE INNODB STATUS SQL statement and the INNODB_BUFFER_POOL_STATS table in the information_schema database. While the latter provides more information about the buffer pool, the SHOW ENGINE INNODB STATUS output is easier to read.
The INNODB_BUFFER_POOL_STATS table contains the following columns:
POOL_ID: Each InnoDB buffer pool instance has a different ID.
POOL_SIZE: Size (in pages) of the instance.
FREE_BUFFERS: Number of free pages.
DATABASE_PAGES: Total number of data pages.
OLD_DATABASE_PAGES: Pages in the old list.
MODIFIED_DATABASE_PAGES: Dirty pages.
PENDING_DECOMPRESS: Number of pages that need to be decompressed.
PENDING_READS: Pending read operations.
PENDING_FLUSH_LRU: Pages in the old or new lists that need to be flushed.
PENDING_FLUSH_LIST: Pages in the flush list that need to be flushed.
PAGES_MADE_YOUNG: Number of pages moved into the new list.
PAGES_NOT_MADE_YOUNG: Old pages that did not become young.
PAGES_MADE_YOUNG_RATE: Pages made young per second. This value is reset each time it is shown.
PAGES_MADE_NOT_YOUNG_RATE: Pages read but not made young (this happens because they do not reach the minimum age) per second. This value is reset each time it is shown.
NUMBER_PAGES_READ: Number of pages read from disk.
NUMBER_PAGES_CREATED: Number of pages created in the buffer pool.
NUMBER_PAGES_WRITTEN: Number of pages written to disk.
PAGES_READ_RATE: Pages read from disk per second.
PAGES_CREATE_RATE: Pages created in the buffer pool per second.
PAGES_WRITTEN_RATE: Pages written to disk per second.
NUMBER_PAGES_GET: Requests for pages that are not in the buffer pool.
HIT_RATE: Rate of page hits.
YOUNG_MAKE_PER_THOUSAND_GETS: Pages made young per thousand physical reads.
NOT_YOUNG_MAKE_PER_THOUSAND_GETS: Pages that remain in the old list per thousand reads.
NUMBER_PAGES_READ_AHEAD: Number of pages read with a read ahead operation.
NUMBER_READ_AHEAD_EVICTED: Number of pages read with a read ahead operation that were never used and were then evicted.
READ_AHEAD_RATE: Similar to NUMBER_PAGES_READ_AHEAD, but this is a per-second rate.
READ_AHEAD_EVICTED_RATE: Similar to NUMBER_READ_AHEAD_EVICTED, but this is a per-second rate.
LRU_IO_TOTAL: Total number of pages read from or written to disk.
LRU_IO_CURRENT: Pages read from or written to disk within the last second.
UNCOMPRESS_TOTAL: Pages that have been uncompressed.
UNCOMPRESS_CURRENT: Pages that have been uncompressed within the last second.
The per-second values are reset after they are shown. The PAGES_MADE_YOUNG_RATE and PAGES_NOT_MADE_YOUNG_RATE values show us, respectively, how often old pages become new and how many old pages are never accessed within a reasonable amount of time. If the former value is too high, the old list is probably not big enough, and vice versa.
Comparing READ_AHEAD_RATE and READ_AHEAD_EVICTED_RATE is useful for tuning the read ahead feature. The READ_AHEAD_EVICTED_RATE value should be low, because it indicates how many pages read with read ahead operations were not useful. If their ratio is good but READ_AHEAD_RATE is low, the read ahead should probably be used more often. In this case, if the linear read ahead is used, we can try to increase or decrease innodb_read_ahead_threshold. Or, we can change the algorithm used (linear or random read ahead).
The columns whose names end with _RATE better describe the current server activities. They should be examined several times a day, and over the whole week or month, perhaps with the help of one or more monitoring tools.
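As a hedged illustration of what such a periodic check might look like (this is not code from the book, and it assumes the PyMySQL package with placeholder credentials), a script can read a few of these columns directly:
    # Hedged sketch (not from the book): sample some INNODB_BUFFER_POOL_STATS
    # columns, per buffer pool instance. Credentials are placeholders.
    import pymysql

    conn = pymysql.connect(host="localhost", user="root", password="secret")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT POOL_ID, HIT_RATE, READ_AHEAD_RATE, READ_AHEAD_EVICTED_RATE "
            "FROM information_schema.INNODB_BUFFER_POOL_STATS"
        )
        for pool_id, hit_rate, ra_rate, ra_evicted in cur.fetchall():
            print("pool %s: hit rate %s, read ahead %s/s, evicted %s/s"
                  % (pool_id, hit_rate, ra_rate, ra_evicted))
    conn.close()
Because the per-second rates are reset each time they are read, a script like this should sample at a fixed interval rather than on demand.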
Good, free software monitoring tools include Cacti and Nagios. The Percona Monitoring Tools package includes MariaDB (and MySQL) plugins that provide an interface to these tools. Dumping and loading the buffer pool In some cases, one may want to save the current contents of the buffer pool and reload it later. The most common case is when the server is stopped. Normally, on startup, the buffer pool is empty, and InnoDB needs to fill it with useful data. This process is called warm-up. Until the warm-up is complete, the InnoDB performance is lower than usual. Two variables help avoid the warm-up phase: innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup. If their value is ON, InnoDB automatically saves the buffer pool into a file at shut down and restores it at startup. Their default value is OFF. Turning them ON can be very useful, but remember the caveats: The startup and shutdown time might be longer. In some cases, we might prefer MariaDB to start more quickly even if it is slower during warm-up. We need the disk space necessary to store the buffer pool. The user may also want to dump the buffer pool at any moment and restore it without restarting the server. This is advisable when the buffer pool is optimal and some statements are going to heavily change its contents. A common example is when a big InnoDB table is fully scanned. This happens, for example, during logical backups. A full table scan will fill the old list with non-frequently accessed data. A good way to solve the problem is to dump the buffer pool before the table scan and reload it later. This operation can be performed by setting two special variables: innodb_buffer_pool_dump_now and innodb_buffer_pool_load_now. Reading the values of these variables always returns OFF. Setting the first variable to ON forces InnoDB to immediately dump the buffer pool into a file. Setting the latter variable to ON forces InnoDB to load the buffer pool from that file. In both cases, the progress of the dump or load operation is indicated by the Innodb_buffer_pool_dump_status and Innodb_buffer_pool_load_status status variables. If loading the buffer pool takes too long, it is possible to stop it by setting innodb_buffer_pool_load_abort to ON. The name and path of the dump file is specified in the innodb_buffer_pool_filename server variable. Of course, we should be sure that the chosen directory can contain the file, but it is much smaller than the memory used by the buffer pool. InnoDB change buffer The change buffer is a cache that is a part of the buffer pool. It contains dirty pages related to secondary indexes (not primary keys) that are not stored in the main part of the buffer pool. If the modified data is read later, it will be merged into the buffer pool. In older versions, this buffer was called the insert buffer, but now it is renamed, because it can handle deletions. The change buffer speeds up the following write operations: insertions: When new rows are written. deletions: When existing row are marked for deletion but not yet physically erased for performance reasons. purges: The physical elimination of previously marked rows and obsolete index values. This is periodically done by a dedicated thread. In some cases, we may want to disable the change buffer. For example, we may have a working set that only fits the memory if the change buffer is discarded. In this case, even after disabling it, we will still have all the frequently accessed secondary indexes in the buffer pool. 
Also, DML statements may be rare for our database, or we may have just a few secondary indexes: in these cases, the change buffer does not help. The change buffer can be configured using the following variables: innodb_change_buffer_max_size: This is the maximum size of the change buffer, expressed as a percentage of the buffer pool. The allowed range is 0 to 50, and the default value is 25. innodb_change_buffering: This determines which types of operations are cached by the change buffer. The allowed values are none (to disable the buffer), all, inserts, deletes, purges, and changes (to cache inserts and deletes, but not purges). The all value is the default value. Explaining the doublewrite buffer When InnoDB writes a page to disk, at least two events can interrupt the operation after it is started: a hardware failure or an OS failure. In the case of an OS failure, this should not be possible if the pages are not bigger than the blocks written by the system. In this case, the InnoDB redo and undo logs are not sufficient to recover the half-written page, because they only contain pages ID's, not their data. This improves the performance. To avoid half-written pages, InnoDB uses the doublewrite buffer. This mechanism involves writing every page twice. A page is valid after the second write is complete. When the server restarts, if a recovery occurs, half-written pages are discarded. The doublewrite buffer has a small impact on performance, because the writes are sequential, and are flushed to disk together. However, it is still possible to disable the doublewrite buffer by setting the innodb_doublewrite variable to OFF in the configuration file or by starting the server with the --skip-innodb-doublewrite parameter. This can be done if data correctness is not important. If performance is very important, and we use a fast storage device, we may note the overhead caused by the additional disk writes. But if data correctness is important to us, we do not want to simply disable it. MariaDB provides an alternative mechanism called atomic writes. These writes are like a transaction: they completely succeed or they completely fail. Half-written data is not possible. However, MariaDB does not directly implement this mechanism, so it can only be used on FusionIO storage devices using the DirectFS filesystem. FusionIO flash memories are very fast flash memories that can be used as block storage or DRAM memory. To enable this alternative mechanism, we can set innodb_use_atomic_writes to ON. This automatically disables the doublewrite buffer. Summary In this article, we discussed the main MariaDB buffers. The most important ones are the caches used by the storage engine. We dedicated much space to the InnoDB buffer pool, because it is more complex and, usually, InnoDB is the most used storage engine. Resources for Article:  Further resources on this subject: Building a Web Application with PHP and MariaDB – Introduction to caching [article] Installing MariaDB on Windows and Mac OS X [article] Using SHOW EXPLAIN with running queries [article]

Using R for Statistics, Research, and Graphics

In this article by David Alexander Lillis, author of R Graph Essentials, we will talk about R. Developed by Professor Ross Ihaka and Dr. Robert Gentleman at Auckland University (New Zealand) during the early 1990s, the R statistics environment is a real success story. R is open source software, which you can download in a couple of minutes from the Comprehensive R Archive Network (CRAN) website (http://cran.r-project.org/), and it combines a powerful programming language, outstanding graphics, and a comprehensive range of useful statistical functions. If you need a statistics environment that includes a programming language, R is ideal. It's true that the learning curve is longer than for spreadsheet-based packages, but once you master the R programming syntax, you can develop your own very powerful analytic tools. Many contributed packages are available on the web for use with R, and very often the analytic tools you need can be downloaded at no cost. (For more resources related to this topic, see here.)
The main problem for those new to R is the time required to master the programming language, but several nice graphical user interfaces, such as John Fox's R Commander package, are available, which make it much easier for the newcomer to develop proficiency in R than it used to be. For many statisticians and researchers, R is the package of choice because of its powerful programming language, the easy availability of code, and because it can import Excel spreadsheets, comma-separated variable (.csv) spreadsheets, and text files, as well as SPSS files, Stata files, and files produced within other statistical packages. You may be looking for a tool for your own data analysis. If so, let's take a brief look at what R can do for you.
Some basic R syntax
Data can be created in R or else read in from .csv or other files as objects. For example, you can read in the data contained within a .csv file called mydata.csv as follows:
    A <- read.csv("mydata.csv", header = TRUE)
    A
The object A now contains all the data of the original file. The syntax A[3,7] picks out the element in row 3 and column 7. The syntax A[14, ] selects the fourteenth row, and A[,6] selects the sixth column. For a numeric table, colMeans(A) and apply(A, 2, sd) find the mean and standard deviation of each column (mean(A) and sd(A), which older versions of R accepted for data frames, no longer work this way). The syntax B <- 3*A + 7 would triple each element of A, add 7 to each element, and store the result as the new object B. Now you could save this object as a .csv file called Outputfile.csv as follows:
    write.csv(B, file = "Outputfile.csv")
Statistical modeling
R provides a comprehensive range of basic statistical functions relating to the commonly-used distributions (normal distribution, t-distribution, Poisson, gamma, and so on), and many less well-known distributions. It also provides a range of non-parametric tests that are appropriate when your data are not distributed normally. Linear and non-linear regressions are easy to perform, and finding the optimum model (that is, by eliminating non-significant independent variables and non-significant factor interactions) is particularly easy. Implementing Generalized Linear Models and other commonly-used models such as Analysis of Variance, Multivariate Analysis of Variance, and Analysis of Covariance is also straightforward and, once you know the syntax, you may find that such tasks can be done more quickly in R than in other packages.
The usual post-hoc tests for identifying factor levels that are significantly different from the other levels (for example, the Tukey and Scheffé tests) are available, and testing for interactions between factors is easy. Factor Analysis, and the related Principal Components Analysis, are well-known data reduction techniques that enable you to explain your data in terms of smaller sets of independent variables (or factors). Both methods are available in R, and code for complex designs, including one- and two-way repeated measures and four-way ANOVA (for example, two repeated-measures and two between-subjects factors), can be written relatively easily or downloaded from various websites (for example, http://www.personality-project.org/r/). Other analytic tools include Cluster Analysis, Discriminant Analysis, Multidimensional Scaling, and Correspondence Analysis. R also provides various methods for fitting analytic models to data and for smoothing (for example, lowess and spline-based methods).

Miscellaneous packages for specialist methods

You can find some very useful packages of R code for fields as diverse as biometry, epidemiology, astrophysics, econometrics, financial and actuarial modeling, the social sciences, and psychology. For example, if you are interested in astrophysics, the Penn State Astrophysics School offers a nice website that includes both tutorials and code (http://www.iiap.res.in/astrostat/RTutorials.html). Here I'll mention just a few of the popular techniques.

Monte Carlo methods

A number of sources give excellent accounts of how to perform Monte Carlo simulations in R (that is, drawing samples from multidimensional distributions and estimating expected values). A valuable text is Christian Robert's book Introducing Monte Carlo Methods with R. Murali Haran gives another interesting astrophysical example on the CAStR website (http://www.stat.psu.edu/~mharan/MCMCtut/MCMC.html).

Structural Equation Modeling

Structural Equation Modeling (SEM) is becoming increasingly popular in the social sciences and economics as an alternative to other modeling techniques such as multiple regression, factor analysis, and analysis of covariance. Essentially, SEM is a kind of multiple regression that takes account of factor interactions, nonlinearities, measurement error, multiple latent independent variables, and latent dependent variables. Useful references for conducting SEM in R include those of Revelle, Farnsworth (2008), and Fox (2002 and 2006).

Data mining

A number of very useful resources are available for anyone contemplating data mining using R. For example, Luis Torgo has published a book on data mining using R, presenting case studies, along with the datasets and code, which the interested student can work through. Torgo's book provides the usual analytic and graphical techniques used every day by data miners, including visualization techniques, dealing with missing values, developing prediction models, and methods for evaluating the performance of your models. Also of interest to the data miner is the Rattle GUI (R Analytical Tool To Learn Easily). Rattle is a data mining facility for analyzing very large data sets. It provides many useful statistical and graphical data summaries, presents mechanisms for developing a variety of models, and summarizes the performance of your models.
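As a small, self-contained illustration of that kind of workflow (the data are simulated and all names are invented; this is a sketch rather than anything taken from Torgo's book or from Rattle), the following base R code splits a data set into training and test portions, fits a simple prediction model, and evaluates its performance on the held-out rows:

# Simulated data set: predict y from two numeric predictors
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)

# Hold out 30% of the rows as a test set
test_rows <- sample(n, size = 0.3 * n)
train <- dat[-test_rows, ]
test  <- dat[test_rows, ]

# Fit a prediction model on the training data
fit <- lm(y ~ x1 + x2, data = train)

# Evaluate performance on the unseen test data (root mean squared error)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))
rmse

The same train/test pattern carries over when the lm() call is replaced by a tree, a GLM, or any other model that supports predict().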
Graphics in R

Quite simply, the quality and range of graphics available through R is superb and, in my view, vastly superior to those of any other package I have encountered. Of course, you have to write the necessary code, but once you have mastered this skill, you have access to wonderful graphics. You can write your own code from scratch, but many websites provide helpful examples, complete with code, which you can download and modify to suit your own needs. R's base graphics (graphics created without the use of any additional contributed packages) are superb, but various graphics packages such as ggplot2 (and the associated qplot function) help you to create wonderful graphs. R's graphics capabilities include, but are not limited to, the following:

Base graphics in R
Basic graphics techniques and syntax
Creating scatterplots and line plots
Customizing axes, colors, and symbols
Adding text – legends, titles, and axis labels
Adding lines – interpolation lines, regression lines, and curves
Increasing complexity – graphing three variables, multiple plots, or multiple axes
Saving your plots to multiple formats – PDF, postscript, and JPG
Including mathematical expressions on your plots
Making graphs clear and pretty – including a grid, point labels, and shading
Shading and coloring your plot
Creating bar charts, histograms, boxplots, pie charts, and dotcharts
Adding loess smoothers
Scatterplot matrices
R's color palettes
Adding error bars

Creating graphs using qplot
Using basic qplot graphics techniques and syntax to customize in easy steps
Creating scatterplots and line plots in qplot
Mapping symbol size, symbol type, and symbol color to categorical data
Including regressions and confidence intervals on your graphs
Shading and coloring your graph
Creating bar charts, histograms, boxplots, pie charts, and dotcharts
Labeling points on your graph

Creating graphs using ggplot
Plotting options – backgrounds, sizes, transparencies, and colors
Superimposing points
Controlling symbol shapes and using pretty color schemes
Stacked, clustered, and paneled bar charts
Methods for detailed customization of lines, point labels, smoothers, confidence bands, and error bars

The following graph records information on the heights in centimeters and weights in kilograms of patients in a medical study. The curve in red gives a smoothed version of the data, created using locally weighted scatterplot smoothing. Both the graph and the modeling required to produce the smoothed curve were performed in R.

Here is another graph. It gives the heights and body masses of female patients receiving treatment in a hospital. Each patient is identified by name. This graph was created very easily using ggplot, and shows the default background produced by ggplot (a grey plotting background and white grid lines).

Next, we see a histogram of patients' heights and body masses, partitioned by gender. The bars are given in an orange and an ivory color. The ggplot package provides a wide range of colors and hues, as well as a wide range of color palettes.

Finally, we see a line graph of height against age for a group of four children. The graph includes both points and lines, and we have a unique color for each child. The ggplot package makes it possible to create attractive and effective graphs for research and data analysis.
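As a rough sketch of the kind of code behind plots like these (assuming the ggplot2 package is installed; the patients data frame here is invented for illustration), the following produces a base-graphics scatterplot with a lowess smoother, the qplot equivalent, and a ggplot version with a loess smoother and confidence band:

library(ggplot2)

# Invented patient data for illustration
set.seed(7)
patients <- data.frame(height = rnorm(50, 170, 10))
patients$weight <- 0.9 * patients$height - 80 + rnorm(50, sd = 6)

# Base graphics: scatterplot with a lowess smoother added in red
plot(patients$height, patients$weight,
     xlab = "Height (cm)", ylab = "Weight (kg)",
     main = "Weight against height")
lines(lowess(patients$height, patients$weight), col = "red", lwd = 2)

# qplot: the quick-plotting interface to ggplot2
qplot(height, weight, data = patients)

# ggplot: the same scatterplot with a loess smoother and confidence band
ggplot(patients, aes(x = height, y = weight)) +
  geom_point(colour = "steelblue") +
  geom_smooth(method = "loess")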
Summary

For many scientists and data analysts, mastery of R could be an investment for the future, particularly for those who are beginning their careers. The technology for handling scientific computation is advancing very quickly and is a major impetus for scientific advance. Some level of mastery of R has become, for many applications, essential for taking advantage of these developments. Spatial analysis, where R provides integrated access to capabilities that are otherwise spread across many different computer programs, is a good example.

A few years ago, I would not have recommended R as a statistics environment for generalist data analysts or postgraduate students, except those working directly in areas involving statistical modeling. However, many tutorials are downloadable from the Internet, and a number of organizations provide online tutorials and/or face-to-face workshops (for example, The Analysis Factor, http://www.theanalysisfactor.com/). In addition, the appearance of GUIs such as R Commander and the new iNZight GUI (designed for use in schools) makes it easier for non-specialists to learn and use R effectively. I am most happy to provide advice to anyone contemplating learning to use this outstanding statistical and research tool.

References

Some useful materials on R are as follows:

L'analyse des données. Tome 1: La taxinomie; Tome 2: L'analyse des correspondances, Benzécri, J. P., Dunod, Paris, 1973.
Computation of Correspondence Analysis, Blasius, J. and Greenacre, M. J. (1994). In M. J. Greenacre and J. Blasius (eds.), Correspondence Analysis in the Social Sciences, pp. 53–75, Academic Press, London.
Statistics: An Introduction using R, Crawley, M. J. (m.crawley@imperial.ac.uk), Imperial College, Silwood Park, Ascot, Berks. Published in 2005 by John Wiley & Sons, Ltd. (ISBN 0-470-02297-3). http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470022973,subjectCd-ST05.html and http://www3.imperial.ac.uk/naturalsciences/research/statisticsusingr.
Structural Equation Models: Appendix to An R and S-PLUS Companion to Applied Regression, Fox, John, http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-sems.pdf.
Getting Started with the R Commander, Fox, John, 26 August 2006.
The R Commander: A Basic-Statistics Graphical User Interface to R, Fox, John, Journal of Statistical Software, September 2005, Volume 14, Issue 9, http://www.jstatsoft.org/.
Structural Equation Modeling With the sem Package in R, Fox, John, Structural Equation Modeling, 13(3), 465–486, Lawrence Erlbaum Associates, Inc., 2006.
Biplots in Biomedical Research, Gabriel, K. R. and Odoroff, C., Statistics in Medicine, 9, 469–485, 1990.
Theory and Applications of Correspondence Analysis, Greenacre, M. J., Academic Press, London, 1984.
Using R for Data Analysis and Graphics: Introduction, Code and Commentary, Maindonald, J. H., Centre for Mathematics and its Applications, Australian National University.
Introducing Monte Carlo Methods with R, Use R! series, Robert, Christian P. and Casella, George, 2010, XX, 284 p., Softcover, ISBN 978-1-4419-1575-7.

Useful tutorials available on the web are as follows:

An Introduction to R: Examples for Actuaries, De Silva, N., 2006, http://toolkit.pbworks.com/f/R%20Examples%20for%20Actuaries%20v0.1-1.pdf.
Econometrics in R, Farnsworth, Grant V., October 26, 2008, http://cran.r-project.org/doc/contrib/Farnsworth-EconometricsInR.pdf.
An Introduction to the R Language, Harte, David, Statistics Research Associates Limited, www.statsresearch.co.nz.
Quick-R, Kabacoff, Rob, http://www.statmethods.net/index.html.
R for SAS and SPSS Users, Muenchen, Bob, http://RforSASandSPSSusers.com.
Statistical Analysis with R – a quick start, Nenadić, O. and Zucchini, Walter.
R for Beginners, Paradis, Emmanuel (paradis@isem.univ-montp2.fr), Institut des Sciences de l'Évolution, Université Montpellier II, F-34095 Montpellier cédex 05, France.
Data Mining with R: Learning with Case Studies, Torgo, Luis, http://www.liaad.up.pt/~ltorgo/DataMiningWithR/.
SimpleR – Using R for Introductory Statistics, Verzani, John, http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf.
Time Series Analysis and Its Applications: With R Examples, Shumway, R. H. and Stoffer, D. S., http://www.stat.pitt.edu/stoffer/tsa2/textRcode.htm#ch2.
The irises of the Gaspé peninsula, Anderson, E., Bulletin of the American Iris Society, 59, 2–5, 1935.

Resources for Article:

Further resources on this subject:
Aspects of Data Manipulation in R [Article]
Learning Data Analytics with R and Hadoop [Article]
First steps with R [Article]