Data | Tech News, Tutorials & Expert Insights


27 Apr 2015

19 min read

Integrating a D3.js visualization into a simple AngularJS application


In this article by Christoph Körner, author of the book Data Visualization with D3 and AngularJS, we will apply the acquired knowledge to integrate a D3.js visualization into a simple AngularJS application.

First, we will set up an AngularJS template that serves as a boilerplate for the examples and the application. We will see a typical directory structure for an AngularJS project and initialize a controller. Similar to the previous example, the controller will generate random data that we want to display in an autoupdating chart.

Next, we will wrap D3.js in a factory and create a directive for the visualization. You will learn how to isolate the components from each other. We will create a simple AngularJS directive and write a custom compile function to create and update the chart.

Setting up an AngularJS application

To get started with this article, I assume that you feel comfortable with the main concepts of AngularJS: the application structure, controllers, directives, services, dependency injection, and scopes. I will use these concepts without introducing them in great detail, so if you do not know about one of these topics, first try an intermediate AngularJS tutorial.

Organizing the directory

To begin with, we will create a simple AngularJS boilerplate for the examples and the visualization application. We will use this boilerplate during the development of the sample application.
Let's create a project root directory that contains the following files and folders:

bower_components/: This directory contains all third-party components
src/: This directory contains all source files
src/app.js: This file contains the source of the application
src/app.css: This file contains the CSS layout of the application
test/: This directory contains all test files (test/config/ contains all test configurations, test/spec/ contains all unit tests, and test/e2e/ contains all integration tests)
index.html: This is the starting point of the application

Installing AngularJS

In this article, we use AngularJS version 1.3.14, but different patch versions (~1.3.0) should also work fine with the examples.

Let's first install AngularJS with the Bower package manager. To do so, we execute the following command in the root directory of the project:

    bower install angular#1.3.14

Now, AngularJS is downloaded and installed to the bower_components/ directory. If you don't want to use Bower, you can also simply download the source files from the AngularJS website and put them in a libs/ directory.

Note that if you develop large AngularJS applications, you most likely want to create a separate bower.json file and keep track of all your third-party dependencies.

Bootstrapping the index file

We can move on to the next step and code the index.html file that serves as a starting point for the application and all examples of this section. We need to include the JavaScript application files and the corresponding CSS layouts, and the same for the chart component. Then, we need to initialize AngularJS by placing an ng-app attribute on the html tag; this will create the root scope of the application.
Here, we will call the AngularJS application myApp, as shown in the following code:

    <html ng-app="myApp">
    <head>
      <script src="bower_components/d3/d3.js" charset="UTF-8"></script>
      <script src="bower_components/angular/angular.js" charset="UTF-8"></script>
      <script src="src/app.js"></script>
      <link href="src/app.css" rel="stylesheet">
      <script src="src/chart.js"></script>
      <link href="src/chart.css" rel="stylesheet">
    </head>
    <body>
    </body>
    </html>

For all the examples in this section, I will use the exact same setup as the preceding code. I will only change the body of the HTML page or the JavaScript or CSS sources of the application. I will indicate which file the code belongs to with a comment in each code snippet.

If you are not using Bower and previously downloaded D3.js and AngularJS to a libs/ directory, refer to this directory when including the JavaScript files.

Adding a module and a controller

Next, we initialize the AngularJS module in the app.js file and create a main controller for the application. The controller should create random data (representing some simple logs) at a fixed interval. Let's generate a random number of visitors every second and store all data points on the scope as follows:

    /* src/app.js */
    // Application Module
    angular.module('myApp', [])

    // Main application controller
    .controller('MainCtrl', ['$scope', '$interval',
      function ($scope, $interval) {

        var time = new Date('2014-01-01 00:00:00 +0100');

        // Random data point generator
        var randPoint = function() {
          var rand = Math.random;
          return { time: time.toString(), visitors: rand()*100 };
        };

        // We store a list of logs
        $scope.logs = [ randPoint() ];

        // Advance the clock by one second and append a new point every second
        $interval(function() {
          time.setSeconds(time.getSeconds() + 1);
          $scope.logs.push(randPoint());
        }, 1000);
    }]);

In the preceding example, we define an array of logs on the scope that we initialize with a random point. Every second, we will push a new random point to the logs.
The points contain a number of visitors and a timestamp, starting with the date 2014-01-01 00:00:00 (timezone GMT+01) and counting up a second on each iteration. I want to keep it simple for now; therefore, we will use just a very basic example of random access log entries.

Consider using the cleaner controllerAs syntax for larger AngularJS applications because it makes the scopes in HTML templates explicit. However, for compatibility reasons, I will use the standard controller and $scope notation.

Integrating D3.js into AngularJS

We bootstrapped a simple AngularJS application in the previous section. Now, the goal is to integrate a D3.js component seamlessly into an AngularJS application, in an Angular way. This means that we have to design the AngularJS application and the visualization component such that the modules are fully encapsulated and reusable. In order to do so, we will use a separation on different levels:

Code of different components goes into different files
Code of the visualization library goes into a separate module
Inside a module, we divide the logic into controllers, services, and directives

This clear separation allows you to keep files and modules organized and clean. If at any time we want to replace the D3.js backend with a canvas pixel graphic, we can just implement it without interfering with the main application. This means that we want to use a new module for the visualization component and dependency injection.

These modules give us full control of the separate visualization component without touching the main application, and they make the component maintainable, reusable, and testable.
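The idea of replacing the rendering backend without touching the application can be sketched without AngularJS at all. The following is a minimal, framework-free illustration, not the article's implementation: the names createChart, svgRenderer, and canvasRenderer are hypothetical, and the renderers only collect strings instead of drawing anything.

```javascript
// Hypothetical sketch: createChart, svgRenderer, and canvasRenderer are
// illustrative names that do not appear in the article.

// Two interchangeable "backends" that expose the same drawPoint() contract.
function svgRenderer() {
  const output = [];
  return {
    drawPoint: function (x, y) {
      output.push('<circle cx="' + x + '" cy="' + y + '" r="2.5"/>');
    },
    output: output
  };
}

function canvasRenderer() {
  const output = [];
  return {
    drawPoint: function (x, y) {
      output.push('arc(' + x + ', ' + y + ', 2.5)');
    },
    output: output
  };
}

// The chart receives its backend as an argument instead of reaching for a
// global variable; this is the core idea behind dependency injection.
function createChart(renderer) {
  return {
    draw: function (points) {
      points.forEach(function (p) {
        renderer.drawPoint(p.x, p.y);
      });
      return renderer.output;
    }
  };
}

const points = [{ x: 10, y: 20 }, { x: 30, y: 40 }];
const svgResult = createChart(svgRenderer()).draw(points);
const canvasResult = createChart(canvasRenderer()).draw(points);
```

Because createChart() only relies on the drawPoint() contract, a test can pass in a stub renderer in the same way that the application passes in a real one.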
Organizing the directory

First, we add the new files for the visualization component to the project:

src/: This is the default directory to store all the file components for the project
src/chart.js: This is the JS source of the chart component
src/chart.css: This is the CSS layout for the chart component
test/config/: This directory contains all test configurations
test/spec/chart.spec.js: This file contains the unit tests of the chart component
test/e2e/chart.e2e.js: This file contains the integration tests of the chart component

If you develop large AngularJS applications, this is probably not the folder structure that you are aiming for. Especially in bigger applications, you will most likely want to have components in separate folders and directives and services in separate files.

Then, we will encapsulate the visualization from the main application and create the new myChart module for it. This will make it possible to inject the visualization component or parts of it (for example, just the chart directive) into the main application.

Wrapping D3.js

In this module, we will wrap D3.js, which is available via the global d3 variable, in a service; actually, we will use a factory to just return the reference to the d3 variable. This enables us to pass D3.js as a dependency inside the newly created module wherever we need it. The advantage of doing so is that the injectable d3 component, or some parts of it, can be mocked for testing easily.

Let's assume we are loading data from a remote resource and do not want to wait for the resource to load every time we test the component. Then, the fact that we can mock and override functions without having to modify anything within the component will become very handy.

Another great advantage is that we can define custom localization configurations directly in the factory. This will guarantee that we have the proper localization wherever we use D3.js in the component.
Moreover, in every component, we use the injected d3 variable in the private scope of a function and not in the global scope. This is absolutely necessary for clean and encapsulated components; we should never use any variables from the global scope within an AngularJS component.

Now, let's create a second module that stores all the visualization-specific code dependent on D3.js. Thus, we want to create an injectable factory for D3.js, as shown in the following code:

    /* src/chart.js */
    // Chart Module
    angular.module('myChart', [])

    // D3 Factory
    .factory('d3', function() {

      /* We could declare locales or other D3.js
         specific configurations here. */

      return d3;
    });

In the preceding example, we returned d3 from the global scope without modifying it. We can also define custom D3.js-specific configurations here (such as locales and formatters). We can go one step further and load the complete D3.js code inside this factory so that d3 will not be available in the global scope at all. However, we don't use this approach here to keep things as simple and understandable as possible.

We need to make this module or parts of it available to the main application. In AngularJS, we can do this by injecting the myChart module into the myApp application as follows:

    /* src/app.js */
    angular.module('myApp', ['myChart']);

Usually, we will just inject the directives and services of the visualization module that we want to use in the application, not the whole module. However, for the start and to access all parts of the visualization, we will leave it like this. We can now use the components of the chart module in the AngularJS application by injecting them into the controllers, services, and directives.

The boilerplate, with a simple chart.js and chart.css file, is now ready. We can start to design the chart directive.

A chart directive

Next, we want to create a reusable and testable chart directive. The first question that comes to mind is where to put which functionality.
Should we create an svg element as the parent for the directive or a div element? Should we draw a data point as a circle in svg and use ng-repeat to replicate these points in the chart? Or should we rather create and modify all data points with D3.js? I will answer all these questions in the following sections.

A directive for SVG

As a general rule, we can say that different concepts should be encapsulated so that they can be replaced anytime by a new technology. Hence, we will use AngularJS with an element directive as a parent element for the visualization. We will bind the data and the options of the chart to the private scope of the directive. In the directive itself, we will create the complete chart, including the parent svg container, the axes, and all data points, using D3.js.

Let's first add a simple directive for the chart component:

    /* src/chart.js */
    …

    // Scatter Chart Directive
    .directive('myScatterChart', ["d3",
      function(d3){

        return {
          restrict: 'E',
          scope: {
          },
          compile: function( element, attrs, transclude ) {

            // Create a SVG root element
            var svg = d3.select(element[0]).append('svg');

            // Return the link function
            return function(scope, element, attrs) { };
          }
        };
    }]);

In the preceding example, we first inject d3 into the directive by passing it as an argument to the caller function. Then, we return a directive as an element with a private scope. Next, we define a custom compile function that returns the link function of the directive. This is important because we need to create the svg container for the visualization during the compilation of the directive. Then, during the link phase of the directive, we need to draw the visualization.

Let's try to define some of these directives and look at the generated output.
We define three directives in the index.html file, as shown in the following code:

    <div ng-controller="MainCtrl">
      <my-scatter-chart class="chart"></my-scatter-chart>
      <my-scatter-chart class="chart"></my-scatter-chart>
      <my-scatter-chart class="chart"></my-scatter-chart>
    </div>

If we look at the output of the HTML page in the developer tools, we can see that for each base element of the directive, we created an svg parent element for the visualization:

Figure: Output of the HTML page

In the resulting DOM tree, we can see that three svg elements are appended to the directives. We can now start to draw the chart in these directives. Let's fill these elements with some awesome charts.

Implementing a custom compile function

First, let's add a data attribute to the isolated scope of the directive. This gives us access to the dataset, which we will later pass to the directive in the HTML template. Next, we extend the compile function of the directive to create a g group container for the data points and the axes. We will also add a watcher that checks for changes of the scope data array. Every time the data changes, we call a draw() function that redraws the chart of the directive. Let's get started:

    /* src/chart.js */
    ...
    // Scatter Chart Directive
    .directive('myScatterChart', ["d3",
      function(d3){

        // we will soon implement this function
        var draw = function(svg, width, height, data){ … };

        return {
          restrict: 'E',
          scope: {
            data: '='
          },
          compile: function( element, attrs, transclude ) {

            // Create a SVG root element
            var svg = d3.select(element[0]).append('svg');

            svg.append('g').attr('class', 'data');
            svg.append('g').attr('class', 'x-axis axis');
            svg.append('g').attr('class', 'y-axis axis');

            // Define the dimensions for the chart
            var width = 600, height = 300;

            // Return the link function
            return function(scope, element, attrs) {

              // Watch the data attribute of the scope
              scope.$watch('data', function(newVal, oldVal, scope) {

                // Update the chart
                draw(svg, width, height, scope.data);
              }, true);
            };
          }
        };
    }]);

Now, we implement the draw() function at the beginning of the directive.

Drawing charts

So far, the chart directive should look like the following code. We will now implement the draw() function and draw the axes and the time series data. We start by setting the height and width of the svg element as follows:

    /* src/chart.js */
    ...

    // Scatter Chart Directive
    .directive('myScatterChart', ["d3",
      function(d3){

        function draw(svg, width, height, data) {
          svg
            .attr('width', width)
            .attr('height', height);

          // code continues here
        }

        return {
          restrict: 'E',
          scope: {
            data: '='
          },
          compile: function( element, attrs, transclude ) {
            ...
          }
        };
    }]);

Axis, scale, range, and domain

We first need to create the scales for the data and then the axes for the chart. The implementation looks very similar to the scatter chart. We want to update the axes with the minimum and maximum values of the dataset; therefore, we also add this code to the draw() function:

    /* src/chart.js --> myScatterChart --> draw() */
    function draw(svg, width, height, data) {
      ...
      // Define a margin
      var margin = 30;

      // Define x-scale
      var xScale = d3.time.scale()
        .domain([
          d3.min(data, function(d) { return d.time; }),
          d3.max(data, function(d) { return d.time; })
        ])
        .range([margin, width-margin]);

      // Define x-axis
      var xAxis = d3.svg.axis()
        .scale(xScale)
        .orient('top')
        .tickFormat(d3.time.format('%S'));

      // Define y-scale (linear, because visitors is a plain number)
      var yScale = d3.scale.linear()
        .domain([0, d3.max(data, function(d) { return d.visitors; })])
        .range([margin, height-margin]);

      // Define y-axis
      var yAxis = d3.svg.axis()
        .scale(yScale)
        .orient('left')
        .tickFormat(d3.format('f'));

      // Draw x-axis
      svg.select('.x-axis')
        .attr("transform", "translate(0, " + margin + ")")
        .call(xAxis);

      // Draw y-axis
      svg.select('.y-axis')
        .attr("transform", "translate(" + margin + ")")
        .call(yAxis);
    }

In the preceding code, we create a time scale for the x-axis and a linear scale for the y-axis and adapt the domain of both axes to match the maximum value of the dataset (we can also use the d3.extent() function to return min and max at the same time). Then, we define the pixel range for our chart area. Next, we create two axis objects with the previously defined scales and specify the tick format of each axis. We want to display the number of seconds that have passed on the x-axis and an integer value of the number of visitors on the y-axis. In the end, we draw the axes by calling the axis generator on the axis selection.

Joining the data points

Now, we will draw the data points and the axes. We finish the draw() function with this code:

    /* src/chart.js --> myScatterChart --> draw() */
    function draw(svg, width, height, data) {
      ...
      // Add the new data points
      svg.select('.data')
        .selectAll('circle').data(data)
        .enter()
        .append('circle');

      // Update all data points
      svg.select('.data')
        .selectAll('circle').data(data)
        .attr('r', 2.5)
        .attr('cx', function(d) { return xScale(d.time); })
        .attr('cy', function(d) { return yScale(d.visitors); });
    }

In the preceding code, we first create circle elements in the enter join for the data points where no corresponding circle is found in the selection. Then, we update the attributes of the center point of all circle elements of the chart. Let's look at the generated output of the application:

Figure: Output of the chart directive

We notice that the axes and the whole chart scale as soon as new data points are added to the chart. In fact, this result looks very similar to the previous example, with the main difference that we used a directive to draw this chart. This means that the data of the visualization that belongs to the application is stored and updated in the application itself, whereas the directive is completely decoupled from the data.

To achieve a nice output like in the previous figure, we need to add some styles to the chart.css file, as shown in the following code:

    /* src/chart.css */
    .axis path, .axis line {
      fill: none;
      stroke: #999;
      shape-rendering: crispEdges;
    }
    .tick {
      font: 10px sans-serif;
    }
    circle {
      fill: steelblue;
    }

We need to disable the filling of the axis and enable crisp edges rendering; this will give the whole visualization a much better look.

Summary

In this article, you learned how to properly integrate a D3.js component into an AngularJS application, the Angular way. All files, modules, and components should be maintainable, testable, and reusable. You learned how to set up an AngularJS application and how to organize the folder structure for the visualization component. We put different responsibilities in different files and modules.
Every piece that we can separate from the main application can be reused in another application; the goal is to use as much modularization as possible. As a next step, we created the visualization directive by implementing a custom compile function. This gives us access to the first compilation of the element, where we can append the svg element as a parent for the visualization, along with the other container elements.
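The scale step inside draw() maps data values (the domain) onto pixel positions (the range). The following framework-free sketch shows that mapping for the chart dimensions used in the article; extent() and linearScale() are simplified, hypothetical stand-ins written for illustration and are not D3.js code, and the sample data is made up.

```javascript
// Returns [min, max] of the values produced by accessor, similar in spirit
// to d3.extent(); a simplified stand-in, not the D3.js implementation.
function extent(data, accessor) {
  const values = data.map(accessor);
  return [Math.min.apply(null, values), Math.max.apply(null, values)];
}

// Builds a function that maps a domain value linearly onto a pixel range.
function linearScale(domain, range) {
  const d0 = domain[0], d1 = domain[1];
  const r0 = range[0], r1 = range[1];
  return function (value) {
    if (d1 === d0) { return r0; } // avoid dividing by zero for a flat domain
    return r0 + (value - d0) / (d1 - d0) * (r1 - r0);
  };
}

// Made-up sample data: three points, one second apart.
const logs = [
  { time: new Date(2014, 0, 1, 0, 0, 0), visitors: 20 },
  { time: new Date(2014, 0, 1, 0, 0, 1), visitors: 80 },
  { time: new Date(2014, 0, 1, 0, 0, 2), visitors: 50 }
];

const margin = 30, width = 600, height = 300;

// Time axis: dates are converted to milliseconds before mapping.
const xScale = linearScale(
  extent(logs, function (d) { return d.time.getTime(); }),
  [margin, width - margin]
);

// Value axis: the domain starts at 0 and ends at the largest visitor count.
const yScale = linearScale(
  [0, extent(logs, function (d) { return d.visitors; })[1]],
  [margin, height - margin]
);

const firstX = xScale(logs[0].time.getTime()); // left edge of the plot area
const lastX = xScale(logs[2].time.getTime());  // right edge of the plot area
const midY = yScale(40);                       // halfway between 0 and 80
```

With these inputs, the earliest point lands on the left margin, the latest on the right margin, and a value halfway through the domain lands halfway through the pixel range. Each time a new point extends the domain, all existing points are remapped, which is why the whole chart rescales as data arrives.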



Packt

05 Aug 2011

5 min read

How to Perform Iteration on Sets in MDX


MDX with Microsoft SQL Server 2008 R2 Analysis Services Cookbook: More than 80 recipes for enriching your Business Intelligence solutions with high-performance MDX calculations and flexible MDX queries in this book and eBook

Iteration is a very natural way of thinking for us humans. We set a starting point, we step into a loop, and we end when a condition is met. While we're looping, we can do whatever we want: check, take, leave, and modify items in that set. Being able to break down the problems in steps makes us feel that we have things under control. However, by breaking down the problem, the query performance often breaks down as well. Therefore, we have to be extra careful with iterations when data is concerned.

If there's a way to manipulate the collection of members as one item, one set, without cutting that set into small pieces and iterating on individual members, we should use it. It's not always easy to find that way, but we should at least try.

Iterating on a set in order to reduce it

Getting ready

Start a new query in SSMS and check that you're working on the right database. Then write the following query:

    SELECT
        { [Measures].[Customer Count],
          [Measures].[Growth in Customer Base] } ON 0,
        NON EMPTY
        { [Date].[Fiscal].[Month].MEMBERS } ON 1
    FROM
        [Adventure Works]
    WHERE
        ( [Product].[Product Categories].[Subcategory].&[1] )

The query returns fiscal months on rows and two measures: a count of customers and their growth compared to the previous month. Mountain bikes are in the slicer. Now let's see how we can get the number of days the growth was positive for each period.

How to do it...

Follow these steps to reduce the initial set:

Create a new calculated measure in the query and name it Positive growth days.
Specify that you need the descendants of the current member on leaves.
Wrap the FILTER() function around it and specify the condition that the growth measure should be greater than zero.
Apply the COUNT() function on the complete expression to get the count of days.
Verify that the new calculated member's definition looks as follows:

    WITH
    MEMBER [Measures].[Positive growth days] AS
        FILTER(
            DESCENDANTS([Date].[Fiscal].CurrentMember, , leaves),
            [Measures].[Growth in Customer Base] > 0
        ).COUNT

Add the measure on columns.
Run the query and observe whether the results match the following image:

How it works...

The task says we need to count days for each time period and use only positive ones. Therefore, it might seem appropriate to perform iteration, which, in this case, can be performed using the FILTER() function. But there's a potential problem. We cannot expect to have days on rows, so we must use the DESCENDANTS() function to get all dates in the current context. Finally, in order to get the number of items that remain after filtering, we use the COUNT function.

There's more...

The FILTER() function is an iterative function that doesn't run in block mode; hence, it will slow down the query. In the introduction, we said that it's always wise to search for an alternative if available. Let's see if something can be done here.

A keen eye will notice a "count of filtered items" pattern in this expression. That pattern suggests the use of a set-based approach in the form of a SUM-IIF combination. The trick is to provide 1 for the True part of the condition taken from the FILTER() statement and null for the False part. The sum of ones will be equivalent to the count of filtered items.

In other words, once rewritten, that same calculated member would look like this:

    MEMBER [Measures].[Positive growth days] AS
        SUM(
            Descendants([Date].[Fiscal].CurrentMember, , leaves),
            IIF( [Measures].[Growth in Customer Base] > 0, 1, null)
        )

Execute the query using the new definition. Both the SUM() and the IIF() functions are optimized to run in block mode, especially when one of the branches in IIF() is null.
In this particular example, the impact on performance was not noticeable because the set of rows was relatively small. Applying this technique on large sets will result in drastic performance improvement as compared to the FILTER-COUNT approach. Be sure to remember that in future.

More information about this type of optimization can be found in Mosha Pasumansky's blog: http://tinyurl.com/SumIIF

Hints for query improvements

There are several ways you can avoid the FILTER() function in order to improve performance.

When you need to filter by non-numeric values (that is, properties or other metadata), you should consider creating an attribute hierarchy for often-searched items and then do one of the following:

Use a tuple when you need to get a value sliced by that new member
Use the EXCEPT() function when you need to negate that member on its own hierarchy (NOT or <>)
Use the EXISTS() function when you need to limit other hierarchies of the same dimension by that member
Use the NONEMPTY() function when you need to operate on other dimensions, that is, subcubes created with that new member
Use the 3-argument EXISTS() function instead of the NONEMPTY() function if you also want to get combinations with nulls in the corresponding measure group (nulls are available only when the NullProcessing property for a measure is set to Preserve)

When you need to filter by values and then count the members in that set, you should consider aggregate functions like SUM() with an IIF() part in the expression, as described earlier.
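The arithmetic behind the "count of filtered items" pattern can be checked outside of MDX. The sketch below uses plain JavaScript purely as an illustration: the function names and the sample values are made up, and it says nothing about block-mode evaluation, which is specific to the Analysis Services engine. It only shows that summing 1 for every matching item gives the same number as filtering first and counting afterwards.

```javascript
// Made-up daily growth values for one period; null stands for an empty cell.
const growthByDay = [3, -1, 0, 5, null, 2, -4];

// FILTER(...).COUNT style: build the reduced set first, then count it.
function countByFilter(values) {
  return values.filter(function (v) { return v > 0; }).length;
}

// SUM(..., IIF(condition, 1, null)) style: contribute 1 per matching item
// and nothing for the others, then add everything up.
function countBySumIif(values) {
  return values.reduce(function (sum, v) {
    const flag = v > 0 ? 1 : null;            // IIF(v > 0, 1, null)
    return flag === null ? sum : sum + flag;  // nulls do not add to the sum
  }, 0);
}

const filtered = countByFilter(growthByDay);
const summed = countBySumIif(growthByDay);
```

Both functions count the three positive values in the sample. The first one materializes an intermediate array, while the second one only accumulates a number, which loosely mirrors why the set-based form gives the engine more room to optimize.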



Packt

08 Mar 2018

16 min read

Spam Filtering - Natural Language Processing Approach




Amarabha Banerjee

04 Dec 2017

8 min read

Basics of Spark SQL and its components


The following is an excerpt from the book Learning Spark SQL by Aurobindo Sarkar. Spark SQL APIs provide an optimized interface that helps developers build distributed applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. This book provides you with an understanding of design and implementation best practices used to design and build real-world, Spark-based applications.

In this article, we give you an overview of Spark SQL and its components.

Introduction

Spark SQL is one of the most advanced components of Apache Spark. It has been a part of the core distribution since Spark 1.0 and supports Python, Scala, Java, and R programming APIs. As illustrated in the figure below, Spark SQL components provide the foundation for Spark machine learning applications, streaming applications, graph applications, and many other types of application architectures.

Such applications typically use Spark ML pipelines, Structured Streaming, and GraphFrames, which are all based on Spark SQL interfaces (DataFrame/Dataset API). These applications, along with constructs such as SQL, DataFrames, and the Datasets API, automatically receive the benefits of the Catalyst optimizer. This optimizer is also responsible for generating executable query plans based on the lower-level RDD interfaces.

SparkSession

SparkSession represents a unified entry point for manipulating data in Spark. It minimizes the number of different contexts a developer has to use while working with Spark. SparkSession replaces multiple context objects, such as the SparkContext, SQLContext, and HiveContext. These contexts are now encapsulated within the SparkSession object.

In Spark programs, we use the builder design pattern to instantiate a SparkSession object.
However, in the REPL environment (that is, in a Spark shell session), the SparkSession is automatically created and made available to you via an instance object called spark.

At this time, start the Spark shell on your computer to interactively execute the code snippets in this section. As the shell starts up, you will notice a bunch of messages appearing on your screen, as shown in the following figure.

Understanding Resilient Distributed Datasets (RDDs)

RDDs are Spark's primary distributed Dataset abstraction. An RDD is a collection of data that is immutable, distributed, lazily evaluated, type inferred, and cacheable. Prior to execution, the developer code (using higher-level constructs such as SQL, DataFrames, and Dataset APIs) is converted to a DAG of RDDs (ready for execution).

RDDs can be created by parallelizing an existing collection of data or by accessing a Dataset residing in an external storage system, such as the file system or various Hadoop-based data sources. The parallelized collections form a distributed Dataset that enables parallel operations on them.

An RDD can be created from the input file with the number of partitions specified, as shown:

    scala> val cancerRDD = sc.textFile("file:///Users/aurobindosarkar/Downloads/breast-cancer-wisconsin.data", 4)
    scala> cancerRDD.partitions.size
    res37: Int = 4

An RDD can be internally converted to a DataFrame by importing the spark.implicits package and using the toDF() method:

    scala> import spark.implicits._
    scala> val cancerDF = cancerRDD.toDF()

To create a DataFrame with a specific schema, we define a Row object for the rows contained in the DataFrame. Additionally, we split the comma-separated data, convert it to a list of fields, and then map it to the Row object.
Finally, we use createDataFrame() to create the DataFrame with a specified schema:

    import org.apache.spark.sql.Row

    // Convert one parsed line (a list of string fields) into a typed Row
    def row(line: List[String]): Row = {
      Row(line(0).toLong, line(1).toInt, line(2).toInt, line(3).toInt,
          line(4).toInt, line(5).toInt, line(6).toInt, line(7).toInt,
          line(8).toInt, line(9).toInt, line(10).toInt)
    }

    val data = cancerRDD.map(_.split(",").to[List]).map(row)
    // recordSchema is the schema defined earlier in the book (not shown in this excerpt)
    val cancerDF = spark.createDataFrame(data, recordSchema)

Further, we can easily convert the preceding DataFrame to a Dataset using the case class defined earlier:

    scala> val cancerDS = cancerDF.as[CancerClass]

RDD data is logically divided into a set of partitions; additionally, all input, intermediate, and output data is also represented as partitions. The number of RDD partitions defines the level of data fragmentation. These partitions are also the basic units of parallelism. Spark execution jobs are split into multiple stages, and as each stage operates on one partition at a time, it is very important to tune the number of partitions. Fewer partitions than active stages means your cluster could be under-utilized, while an excessive number of partitions could impact the performance due to higher disk and network I/O.

Understanding DataFrames and Datasets

A DataFrame is similar to a table in a relational database, a pandas dataframe, or a dataframe in R. It is a distributed collection of rows that is organized into columns. It uses the immutable, in-memory, resilient, distributed, and parallel capabilities of RDD, and applies a schema to the data. DataFrames are also evaluated lazily. Additionally, they provide a domain-specific language (DSL) for distributed data manipulation.

Conceptually, the DataFrame is an alias for a collection of generic objects, Dataset[Row], where a row is a generic untyped object. This means that syntax errors for DataFrames are caught during the compile stage; however, analysis errors are detected only during runtime.
DataFrames can be constructed from a wide array of sources, such as structured data files, Hive tables, databases, or RDDs. The source data can be read from local filesystems, HDFS, Amazon S3, and RDBMSs. In addition, other popular data formats, such as CSV, JSON, Avro, Parquet, and so on, are also supported. Additionally, you can also create and use custom data sources.

The DataFrame API supports Scala, Java, Python, and R programming APIs. The DataFrames API is declarative, and combined with procedural Spark code, it provides a much tighter integration between the relational and procedural processing in your applications. DataFrames can be manipulated using Spark's procedural API, or using relational APIs (with richer optimizations).

Understanding the Catalyst optimizer

The Catalyst optimizer is at the core of Spark SQL and is implemented in Scala. It enables several key features, such as schema inference (from JSON data), that are very useful in data analysis work.

The following figure shows the high-level transformation process from a developer's program containing DataFrames/Datasets to the final execution plan:

The internal representation of the program is a query plan. The query plan describes data operations such as aggregate, join, and filter, which match what is defined in your query. These operations generate a new Dataset from the input Dataset. After we have an initial version of the query plan ready, the Catalyst optimizer will apply a series of transformations to convert it to an optimized query plan. Finally, the Spark SQL code generation mechanism translates the optimized query plan into a DAG of RDDs that is ready for execution.

The query plans and the optimized query plans are internally represented as trees. So, at its core, the Catalyst optimizer contains a general library for representing trees and applying rules to manipulate them. On top of this library are several other libraries that are more specific to relational query processing.
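The notion of "representing trees and applying rules to manipulate them" can be illustrated with a tiny expression tree and a single rewrite rule. The sketch below is written in plain JavaScript for illustration only and is not Catalyst code: the node constructors, transformUp(), and foldConstants() are hypothetical names, loosely modeled on the general idea of constant folding.

```javascript
// Hypothetical expression tree nodes; these are not Catalyst classes.
function literal(value) { return { type: 'Literal', value: value }; }
function attribute(name) { return { type: 'Attribute', name: name }; }
function add(left, right) { return { type: 'Add', left: left, right: right }; }

// Applies a rule bottom-up: children are rewritten first, then the node.
// A new tree is returned; the input tree is left untouched.
function transformUp(node, rule) {
  if (node.type === 'Add') {
    node = add(transformUp(node.left, rule), transformUp(node.right, rule));
  }
  return rule(node);
}

// Rule: an Add of two literals can be replaced by a single literal.
function foldConstants(node) {
  if (node.type === 'Add' &&
      node.left.type === 'Literal' &&
      node.right.type === 'Literal') {
    return literal(node.left.value + node.right.value);
  }
  return node; // the rule does not apply, keep the node as it is
}

// Represents the expression x + (1 + 2).
const plan = add(attribute('x'), add(literal(1), literal(2)));

// After the rule is applied, the tree represents x + 3.
const optimized = transformUp(plan, foldConstants);
```

An optimizer built this way is essentially a collection of such small rules, each matching a tree pattern and returning an equivalent but cheaper tree, applied repeatedly until no rule changes the plan any further.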
Catalyst has two types of query plans: Logical and Physical Plans. The Logical Plan describes the computations on the Datasets without defining how to carry out the specific computations. Typically, the Logical Plan generates a list of attributes or columns as output under a set of constraints on the generated rows. The Physical Plan describes the computations on Datasets with specific definitions on how to execute them (it is executable).

Let's explore the transformation steps in more detail. The initial query plan is essentially an unresolved Logical Plan; that is, at this stage we don't know the source of the Datasets, the columns they contain, or the types of those columns. The first step in this pipeline is the analysis step. During analysis, the catalog information is used to convert the unresolved Logical Plan to a resolved Logical Plan. In the next step, a set of logical optimization rules is applied to the resolved Logical Plan, resulting in an optimized Logical Plan. In the step after that, the optimizer may generate multiple Physical Plans and compare their costs to pick the best one. The first version of the Cost-Based Optimizer (CBO), built on top of Spark SQL, was released in Spark 2.2. More details on cost-based optimization are presented in Chapter 11, Tuning Spark SQL Components for Performance.

All three (DataFrame, Dataset, and SQL) share the same optimization pipeline as illustrated in the following figure:

The primary goal of this article was to give an overview of Spark SQL and to help you become comfortable with the Spark environment through hands-on sessions (using public Datasets). If you liked our article, please be sure to check out Learning Spark SQL, which covers more useful techniques for data extraction and data analysis using Spark SQL.



Packt Editorial Staff

22 Apr 2019

3 min read

The hands-on guide to Machine Learning with R by Brett Lantz


If science fiction stories are to be believed, the invention of Artificial Intelligence inevitably leads to apocalyptic wars between machines and their makers. Thankfully, at the time of this writing, machines still require user input. Though your impressions of Machine Learning may be colored by these mass-media depictions, today's algorithms are too application-specific to pose any danger of becoming self-aware. The goal of today's Machine Learning is not to create an artificial brain, but rather to assist us with making sense of the world's massive data stores. Conceptually, the learning process involves the abstraction of data into a structured representation, and the generalization of the structure into action that can be evaluated for utility. In practical terms, a machine learner uses data containing examples and features of the concept to be learned, then summarizes this data in the form of a model, which is used for predictive or descriptive purposes. The field of machine learning provides a set of algorithms that transform data into actionable knowledge. Among the many possible methods, machine learning algorithms are chosen on the basis of the input data and the learning task. This fact makes machine learning well-suited to the present-day era of big data. Machine Learning with R, Third Edition introduces you to the fundamental concepts that define and differentiate the most commonly used machine learning approaches and how easy it is to use R to start applying machine learning to real-world problems. Many of the algorithms needed for machine learning are not included as part of the base R installation. Instead, the algorithms are available via a large community of experts who have shared their work freely. These powerful tools are available to download at no cost, but must be installed on top of base R manually. 
This book covers a small portion of all of R's machine learning packages and will get you up to speed with the learning landscape of machine learning with R. Machine Learning with R, Third Edition updates the classic R data science book with newer and better libraries, advice on ethical and bias issues in machine learning, and an introduction to deep learning. Whether you are an experienced R user or new to the language, Brett Lantz teaches you everything you need to uncover key insights, make new predictions, and visualize your findings. Introduction to Machine Learning with R Machine Learning with R How to make machine learning based recommendations using Julia [Tutorial]


Packt

30 Oct 2013

16 min read

IBM SPSS Modeler – Pushing the Limits


Using the Feature Selection node creatively to remove or decapitate perfect predictors

In this recipe, we will identify perfect or near-perfect predictors in order to ensure that they do not contaminate our model. Perfect predictors earn their name by being correct 100 percent of the time, usually indicating circular logic and not a prediction of value. It is a common and serious problem. When this occurs, we have accidentally allowed information into the model that could not possibly be known at the time of the prediction. Everyone 30 days late on their mortgage receives a late letter, but receiving a late letter is not a good predictor of lateness because their lateness caused the letter, not the other way around.

The rather colorful term decapitate is borrowed from the data miner Dorian Pyle. It is a reference to the fact that perfect predictors will be found at the top of any list of key drivers ("caput" means head in Latin). Therefore, to decapitate is to remove the variable at the top. Their status at the top of the list will be capitalized upon in this recipe.

The following table shows the three time periods: the past, the present, and the future. It is important to remember that, when we are making predictions, we can use information from the past to predict the present or the future, but we cannot use information from the future to predict the future. This seems obvious, but it is common to see analysts use information that was gathered after the date for which predictions are made. As an example, if a company sends out a notice after a customer has churned, you cannot say that the notice is predictive of churning.
| Customer | Contract Start (Past) | Expiration (Now) | Outcome (Future) | Renewal Date (Future) |
| Joe | January 1, 2010 | January 1, 2012 | Renewed | January 2, 2012 |
| Ann | February 15, 2010 | February 15, 2012 | Out of Contract | Null |
| Bill | March 21, 2010 | March 21, 2012 | Churn | NA |
| Jack | April 5, 2010 | April 5, 2012 | Renewed | April 9, 2012 |
| New Customer | 24 Months Ago | Today | ??? | ??? |

Getting ready

We will start with a blank stream, and will be using the cup98lrn reduced vars2.txt data set.

How to do it...

To identify perfect or near-perfect predictors in order to ensure that they do not contaminate our model:

1. Build a stream with a Source node, a Type node, and a Table node, then force instantiation by running the Table node.
2. Force TARGET_B to be flag and make it the target.
3. Add a Feature Selection Modeling node and run it.
4. Edit the resulting generated model and examine the results. In particular, focus on the top of the list.
5. Review what you know about the top variables, and check to see if any could be related to the target by definition or could possibly be based on information that actually postdates the information in the target.
6. Add a CHAID Modeling node, set it to run in Interactive mode, and run it.
7. Examine the first branch, looking for any child node that might be perfectly predicted; that is, look for child nodes whose members are all found in one category.
8. Continue steps 6 and 7 for the first several variables.
9. Variables that are problematic (steps 5 and/or 7) need to be set to None in the Type node.

How it works...

Which variables need decapitation? The problem is information that, although it was known at the time that you extracted it, was not known at the time of decision. In this case, the time of decision is the decision that the potential donor made to donate or not to donate. Was the amount, Target_D, known before the decision was made to donate? Clearly not. No information that dates after the information in the target variable can ever be used in a predictive model.
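The check in steps 6 and 7, looking for child nodes whose members all fall in one category, can be approximated outside Modeler. The following is a rough sketch in Python under stated assumptions: the function name pure_branches and the four records are made up for illustration, and the code does not reproduce CHAID or the Feature Selection node.

```python
from collections import defaultdict


def pure_branches(rows, feature, target):
    """Return {feature value: record count} for values whose records
    all share a single target value (a perfectly predicted branch)."""
    outcomes = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        outcomes[row[feature]].add(row[target])
        counts[row[feature]] += 1
    return {value: counts[value] for value, seen in outcomes.items() if len(seen) == 1}


# Made-up records: the late letter is sent *because* the customer was late,
# so it splits the target perfectly, while region does not.
records = [
    {"late_letter": "yes", "region": "east", "late": 1},
    {"late_letter": "yes", "region": "west", "late": 1},
    {"late_letter": "no", "region": "east", "late": 0},
    {"late_letter": "no", "region": "west", "late": 0},
]

suspicious = pure_branches(records, "late_letter", "late")
harmless = pure_branches(records, "region", "late")
print(suspicious, harmless)
```

A variable flagged this way is only a candidate for removal; as the recipe stresses, the decision still depends on knowing whether the information postdates the target.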
This recipe is built of the following foundation—variables with this problem will float up to the top of the Feature Selection results. They may not always be perfect predictors, but perfect predictors always must go. For example, you might find that, if a customer initially rejects or postpones a purchase, there should be a follow up sales call in 90 days. They are recorded as rejected offer in the campaign, and as a result most of them had a follow up call in 90 days after the campaign. Since a couple of the follow up calls might not have happened, it won't be a perfect predictor, but it still must go. Note that variables such as RFA_2 and RFA_2A are both very recent information and highly predictive. Are they a problem? You can't be absolutely certain without knowing the data. Here the information recorded in these variables is calculated just prior to the campaign. If the calculation was made just after, they would have to go. The CHAID tree almost certainly would have shown evidence of perfect prediction in this case. There's more... Sometimes a model has to have a lot of lead time; predicting today's weather is a different challenge than next year's prediction in the farmer's almanac. When more lead time is desired you could consider dropping all of the _2 series variables. What would the advantage be? What if you were buying advertising space and there was a 45 day delay for the advertisement to appear? If the _2 variables occur between your advertising deadline and your campaign you might have to use information attained in the _3 campaign. Next-Best-Offer for large datasets Association models have been the basis for next-best-offer recommendation engines for a long time. Recommendation engines are widely used for presenting customers with cross-sell offers. For example, if a customer purchases a shirt, pants, and a belt; which shoes would he also likely buy? 
This type of analysis is often called market-basket analysis, as we are trying to understand which items customers purchase in the same basket/transaction. Recommendations must be very granular (for example, at the product level) to be usable at the check-out register, website, and so on. For example, knowing that female customers purchase a wallet 63.9 percent of the time when they buy a purse is not directly actionable. However, knowing that customers who purchase a specific purse (for example, SKU 25343) also purchase a specific wallet (for example, SKU 98343) 51.8 percent of the time can be the basis for future recommendations.

Product-level recommendations require the analysis of massive data sets (that is, millions of rows). Usually, this data is in the form of sales transactions where each line item (that is, row of data) represents a single product. The line items are tied together by a single transaction ID. IBM SPSS Modeler association models support both tabular and transactional data. The tabular format requires each product to be represented as a column. As most product-level recommendations would contain thousands of products, this format is not practical. The transactional format uses the transactional data directly and requires only two inputs, the transaction ID and the product/item.

Getting ready

This example uses the files transactions.sav and scoring.csv.

How to do it...

To recommend the next best offer for large datasets:

1. Start with a new stream by navigating to File | New Stream.
2. Go to File | Stream Properties from the IBM SPSS Modeler menu bar. On the Options tab, change the Maximum members for nominal fields to 50000. Click on OK.
3. Add a Statistics File source node to the upper left of the stream. Set the file field by navigating to transactions.sav. On the Types tab, change the Product_Code field to Nominal and click on the Read Values button. Click on OK.
4. Add a CARMA Modeling node connected to the Statistics File source node in step 3. On the Fields tab, click on Use custom settings and check the Use transactional format check box. Select Transaction_ID as the ID field and Product_Code as the Content field. On the Model tab of the CARMA Modeling node, change the Minimum rule support (%) to 0.0 and the Minimum rule confidence (%) to 5.0.
5. Click on the Run button to build the model. Double-click the generated model to ensure that you have approximately 40,000 rules.
6. Add a Var File source node to the middle left of the stream. Set the file field by navigating to scoring.csv. On the Types tab, click on the Read Values button. Click on the Preview button to preview the data. Click on OK to dismiss all dialogs.
7. Add a Sort node connected to the Var File node in step 6. Choose Transaction_ID and Line_Number (with Ascending sort) by clicking the down arrow on the right of the dialog. Click on OK.
8. Connect the Sort node in step 7 to the generated model (replacing the current link).
9. Add an Aggregate node connected to the generated model.
10. Add a Merge node connected to the generated model. Connect the Aggregate node in step 9 to the Merge node. On the Merge tab, choose Keys as the Merge Method, select Transaction_ID, and click on the right arrow. Click on OK.
11. Add a Select node connected to the Merge node in step 10. Set the condition to Record_Count = Line_Number. Click on OK. At this point, the stream should look as follows:
12. Add a Table node connected to the Select node in step 11. Right-click on the Table node and click on Run to see the next-best-offer for the input data.

How it works...

In steps 1-5, we set up the CARMA model to use the transactional data (without needing to restructure the data). CARMA was selected over Apriori for its improved performance and stability with large data sets. For recommendation engines, the settings for the Model tab are somewhat arbitrary and are driven by the practical limitations of the number of rules generated.
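The two thresholds set on the Model tab can be made concrete with a small calculation over transactional data. This is a minimal sketch, not CARMA itself: the function rule_stats and the four transactions are hypothetical, support is expressed here as a percentage of all transactions (an assumption made to match the percentage-based settings), and only the two SKU numbers are borrowed from the purse and wallet example above.

```python
def rule_stats(transactions, antecedents, consequents):
    """Return (support %, confidence %) for the rule antecedents -> consequents.

    transactions maps a transaction ID to the list of product codes it contains.
    """
    antecedents, consequents = set(antecedents), set(consequents)
    baskets = [set(items) for items in transactions.values()]
    with_antecedents = [b for b in baskets if antecedents <= b]
    with_rule = [b for b in with_antecedents if consequents <= b]
    support = 100.0 * len(with_rule) / len(baskets)
    confidence = (
        100.0 * len(with_rule) / len(with_antecedents) if with_antecedents else 0.0
    )
    return support, confidence


# Made-up transactions keyed by transaction ID.
transactions = {
    "T1": [25343, 98343],
    "T2": [25343],
    "T3": [25343, 98343, 11111],
    "T4": [11111],
}

support, confidence = rule_stats(transactions, [25343], [98343])
print(f"support={support:.1f}%  confidence={confidence:.1f}%")
```

With a minimum confidence of 5.0 percent, a rule like this one would be kept; raising the threshold above its confidence would drop it, which is how the settings trade the number of rules against their strength.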
Lowering the thresholds for confidence and rule support generates more rules. Having more rules can have a negative impact on scoring performance but will result in more (albeit weaker) recommendations.

Rule Support: How many transactions contain the entire rule (that is, both the antecedents ("if" products) and the consequents ("then" products)).
Confidence: If a transaction contains all the antecedents ("if" products), what percentage of the time does it contain the consequents ("then" products)?

In step 5, when we examine the model, we see the generated Association Rules with the corresponding rule support and confidence values. In the remaining steps (7-12), we score a new transaction and generate 3 next-best-offers based on the model containing the Association Rules. Since the model was built with transactional data, the scoring data must also be transactional. This means that each row is scored using the current row and the prior rows with the same transaction ID. The only row we generally care about is the last row for each transaction, where all the data has been presented to the model. To accomplish this, we count the number of rows for each transaction and select the line number that equals the total row count (that is, the last row for each transaction).

Notice that the model returns 3 recommended products, each with a confidence, in order of decreasing confidence. A next-best-offer engine would present the customer with the best option first (or potentially all three options ordered by decreasing confidence). Note that, if there is no rule that applies to the transaction, nulls will be returned in some or all of the corresponding columns.

There's more...

In this recipe, you'll notice that we generate recommendations across the entire transactional data set.
By using all transactions, we are creating generalized next-best-offer recommendations; however, we know that we can probably segment (that is, cluster) our customers into different behavioral groups (for example, fashion conscious, value shoppers, and so on). Partitioning the transactions by behavioral segment and generating separate models for each segment will result in rules that are more accurate and actionable for each group. The biggest challenge with this approach is that you will have to identify the customer segment for each customer before making recommendations (that is, scoring). A unified approach would be to use the general recommendations for a customer until a customer segment can be assigned, and then use the segmented models.

Correcting a confusion matrix for an imbalanced target variable by incorporating priors

Classification models generate probabilities and a predicted class value. When there is a significant imbalance in the proportion of True values in the target variable, the confusion matrix as seen in the Analysis node output will show that the model has all predicted class values equal to the False value, leading an analyst to conclude the model is not effective and needs to be retrained. Most often, the conventional wisdom is to use a Balance node to balance the proportion of True and False values in the target variable, thus eliminating the problem in the confusion matrix. However, in many cases, the classifier is working fine without the Balance node; it is the interpretation of the model that is biased. Each model generates a probability that the record belongs to the True class, and the predicted class is derived from this value by applying a threshold of 0.5. Often, no record has a propensity that high, resulting in every predicted class value being assigned False.
In this recipe, we learn how to adjust the predicted class for classification problems with imbalanced data by incorporating the prior probability of the target variable.

Getting ready

This recipe uses the datafile cup98lrn_reduced_vars3.sav and the stream Recipe – correct with priors.str.

How to do it...

To incorporate prior probabilities when there is an imbalanced target variable: Open the stream Recipe – correct with priors.str by navigating to File | Open Stream. Make sure the datafile points to the correct path to the datafile cup98lrn_reduced_vars3.sav. Open the generated model TARGET_B, and open the Settings tab. Note that compute Raw Propensity is checked. Close the generated model. Duplicate the generated model by copying and pasting the node in the stream. Connect the duplicated model to the original generated model. Add a Type node to the stream and connect it to the generated model. Open the Type node and scroll to the bottom of the list. Note that the fields related to the two models have not yet been instantiated. Click on Read Values so that they are fully instantiated. Insert a Filler node and connect it to the Type node. Open the Filler node and, in the variable list, select $N1-TARGET_B. Inside the Condition section, type '$RP1-TARGET_B' >= TARGET_B_Mean. Click on OK to dismiss the Filler node (after exiting the Expression Builder). Insert an Analysis node to the stream. Open the Analysis node and click on the check box for Coincidence Matrices. Click on OK. Run the stream to the Analysis node. Notice that the coincidence matrix (confusion matrix) for $N-TARGET_B has no predictions with value = 1, but the coincidence matrix for the second model, the one adjusted by step 7 ($N1-TARGET_B), has more than 30 percent of the records labeled as value = 1.

How it works...

Classification algorithms do not generate categorical predictions; they generate probabilities, likelihoods, or confidences.
For this data set, the target variable, TARGET_B, has two values: 1 and 0. The classifier output from any classification algorithm will be a number between 0 and 1. To convert the probability to a 1 or 0 label, the probability is thresholded, and the default in Modeler (and most predictive analytics software) is a threshold of 0.5. This recipe changes that default threshold to the prior probability.

The proportion of TARGET_B = 1 values in the data is 5.1 percent, and therefore this is the classic imbalanced target variable problem. One solution to this problem is to resample the data so that the proportions of 1s and 0s are equal, normally achieved through use of the Balance node in Modeler. Moreover, one can create the Balance node by running a Distribution node for TARGET_B and using the Generate | Balance node (reduce) option. The justification for balancing the sample is that, if one doesn't do it, all the records will be classified with value = 0.

The reason all the classification decisions have value 0 is not that the Neural Network isn't working properly. Consider the histogram of predictions from the Neural Network shown in the following screenshot. Notice that the maximum value of the predictions is less than 0.4, but the center of density is about 0.05. The actual shape of the histogram and the maximum predicted value depend on the Neural Network; some may have maximum values slightly above 0.5.

If the threshold for the classification decision is set to 0.5, since no neural network predicted confidence is greater than 0.5, all of the classification labels will be 0. However, if one sets the threshold to the TARGET_B prior probability, 0.051, many of the predictions will exceed that value and be labeled as 1. We can see the result of the new threshold by color-coding the histogram of the previous figure with the new class label, in the following screenshot.

This recipe used a Filler node to modify the existing predicted target value.
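The effect of moving the threshold from 0.5 to the prior can be seen with a few numbers. This is an illustrative sketch only: the propensity scores are made up, the function name label is hypothetical, and the prior of 0.051 is the TARGET_B proportion quoted above. It mirrors the logic of the Filler condition rather than anything Modeler runs internally.

```python
def label(propensities, threshold):
    """Assign class 1 when the propensity meets or exceeds the threshold."""
    return [1 if p >= threshold else 0 for p in propensities]


# Made-up raw propensities, all below 0.5 as in the histogram described above.
scores = [0.02, 0.04, 0.06, 0.12, 0.38]
prior = 0.051  # proportion of TARGET_B = 1 quoted in the text

default_labels = label(scores, 0.5)
prior_labels = label(scores, prior)
print(default_labels)
print(prior_labels)
```

Nothing about the model changes between the two runs; only the cut-off applied to its output does, which is why the recipe treats this as a matter of interpretation rather than retraining.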
The categorical prediction from the Neural Network whose prediction is being changed is $N1-TARGET_B. The $ variables are special field names that are used automatically in the Analysis node and Evaluation node. It's possible to construct one's own $ fields with a Derive node, but it is safer to modify the one that's already in the data.

There's more...

The same procedure defined in this recipe works for other modeling algorithms as well, including logistic regression. Decision trees are a different matter. Consider the following screenshot. This result, stating that the C5 tree didn't split at all, is the result of the imbalanced target variable.

Rather than balancing the sample, there are other ways to get a tree built. For C&RT or Quest trees, go to the Build Options, select the Costs & Priors item, and for priors select Equal for all classes (equal priors). This option forces C&RT to treat the two classes mathematically as if their counts were equal. It is equivalent to running the Balance node to boost samples so that there are equal numbers of 0s and 1s. However, it does so without adding records to the data, which would slow down training; equal priors is purely a mathematical reweighting.

The C5 tree doesn't have the option of setting priors. An alternative, one that will work not only with C5 but also with C&RT, CHAID, and Quest trees, is to change the Misclassification Costs so that the cost of classifying a one as a zero is 20, approximately the ratio of the 95 percent 0s to the 5 percent 1s.


Packt

29 Jan 2016

6 min read

FPGA Mining


In this article by Albert Szmigielski, author of the book Bitcoin Essentials, we will take a look at mining with Field-Programmable Gate Arrays, or FPGAs. These are integrated circuits that can be configured for a specific purpose after manufacturing. In the case of bitcoin mining, they are configured to perform the SHA-256 hash function, which is used to mine bitcoins. FPGAs have a slight advantage over GPUs for mining. The period of FPGA mining of bitcoins was rather short (just under a year), as faster machines became available. The advent of ASIC technology for bitcoin mining compelled a lot of miners to make the move from FPGAs to ASICs. Nevertheless, FPGA mining is worth learning about. We will look at the following:

- Pros and cons of FPGA mining
- FPGA versus other hardware mining
- Best practices when mining with FPGAs
- Discussion of profitability

Pros and cons of FPGA mining

Mining with an FPGA has its advantages and disadvantages. Let's examine these in order to better understand if and when it is appropriate to use FPGAs to mine bitcoins. As you may recall, mining started on CPUs, moved over to GPUs, and then people discovered that FPGAs could be used for mining as well.

Pros of FPGA mining

FPGA mining is the third step in mining hardware evolution. FPGAs are faster and more efficient than GPUs. In brief, mining bitcoins with FPGAs has the following advantages:

- FPGAs are faster than GPUs and CPUs
- FPGAs are more electricity-efficient per unit of hashing than CPUs or GPUs

Cons of FPGA mining

FPGAs are rather difficult to source and program. They are not usually sold in stores open to the public. We have not touched upon programming FPGAs to mine bitcoins as it is assumed that the reader has already acquired preprogrammed FPGAs. There are several good resources regarding FPGA programming on the Internet. Electricity costs are also an issue with FPGAs, although not as big as with GPUs.
To summarize, mining bitcoins with FPGAs has the following disadvantages:

- Electricity costs
- Hardware costs
- Fierce competition with other miners

Best practices when mining with FPGAs

Let's look at the recommended things to do when mining with FPGAs. Mining is fun, and it could also be profitable if several factors are taken into account. Make sure that all your FPGAs have adequate cooling. Additional fans beyond what is provided by the manufacturer are always a good idea. Remove dust frequently, as a buildup of dust might have a detrimental effect on cooling efficiency, and therefore, mining speed. For your particular mining machine, look up all the optimization tweaks online in order to get all the hashing power possible out of the device.

When setting up a mining operation for profit, keep in mind that electricity costs will be a large percentage of your overall costs. Seek a location with the lowest electricity rates. Think about cooling costs; perhaps it would be most beneficial to mine somewhere where the climate is cooler. When purchasing FPGAs, make sure you calculate hashes per dollar of hardware costs, and also hashes per unit of electricity used. In mining, electricity has the biggest cost after hardware, and electricity will exceed the cost of the hardware over time. Keep in mind that hardware costs fall over time, so purchasing your equipment in stages rather than all at once may be desirable.

To summarize, keep in mind these factors when mining with FPGAs:

- Adequate cooling
- Optimization
- Electricity costs
- Hardware cost per MH/s

Benchmarks of mining speeds with different FPGAs

As we have mentioned before, the Bitcoin network hash rate is really high now. Mining even with FPGAs does not guarantee profits. This is due to the fact that during the mining process, you are competing with other miners to try to solve a block.
If those other miners are running a larger percentage of the total mining power, you will be at a disadvantage, as they are more likely to solve a block. To compare the mining speed of a few FPGAs, look at the following table:

| FPGA | Mining speed (MH/s) | Power used (Watts) |
| Bitcoin Dominator X5000 | 100 | 6.8 |
| Icarus | 380 | 19.2 |
| Lancelot | 400 | 26 |
| ModMiner Quad | 800 | 40 |
| Butterflylabs Mini Rig | 25,200 | 1250 |

Comparison of the mining speed of different FPGAs

FPGA versus GPU and CPU mining

FPGAs hash much faster than CPUs and GPUs. The fastest in our list reaches 25,200 MH/s. They are also more efficient with respect to the use of electricity per hashing unit. The increase in hashing speed in FPGAs is a significant improvement over GPUs and even more so over CPUs.

The profitability of FPGA mining

In calculating your potential profit, keep in mind the following factors:

- The cost of your FPGAs
- Electricity costs to run the hardware
- Cooling costs (FPGAs generate a decent amount of heat)
- Your percentage of the total network hashing power

To calculate the expected rewards from mining, we can do the following: First, calculate what percentage of total hashing power you command. To look up the network mining speed, execute the getmininginfo command in the console of the Bitcoin Core wallet. We will do our calculations with an FPGA that can hash at 1 GH/s, that is, 0.001 TH/s. If the Bitcoin network hashes at 400,000 TH/s, then our proportion of the hashing power is 0.001 / 400,000 = 0.0000000025 of the total mining power. A bitcoin block is found, on average, every 10 minutes, which makes six per hour and 144 for a 24-hour period. The current reward per block is 25 BTC; therefore, in a day, we have 144 * 25 = 3600 BTC mined. If we command a certain percentage of the mining power, then on average we should earn that proportion of newly minted bitcoins.
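The arithmetic of this calculation, including the final multiplication, can be written as a short function. This is a minimal sketch using only the figures quoted in the article (1 GH/s miner, 400,000 TH/s network, 25 BTC reward, 144 blocks per day); the function name expected_daily_btc and its parameter names are made up for illustration.

```python
def expected_daily_btc(miner_ths, network_ths, block_reward_btc, blocks_per_day=144):
    """Average BTC earned per day for a miner with the given share of hash power."""
    share = miner_ths / network_ths
    return share * block_reward_btc * blocks_per_day


# 1 GH/s expressed in TH/s, against a 400,000 TH/s network and a 25 BTC reward.
share = 0.001 / 400_000
daily_btc = expected_daily_btc(0.001, 400_000, 25)
print(f"share of network: {share:.10f}")
print(f"expected BTC per day: {daily_btc:.6f}")
```

Because both the network hash rate and the block reward change over time, the inputs should be refreshed before relying on the result.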
Multiplying our portion of the hashing power by the number of bitcoins mined daily, we arrive at the following:

0.0000000025 * 3600 BTC = 0.000009 BTC

This is roughly $0.0025 for a 24-hour period. For up-to-date profitability information, you can look at https://www.multipool.us/, which publishes the average profitability per gigahash of mining power.

Summary

In this article, we explored FPGA mining. We examined the advantages and disadvantages of mining with FPGAs. It would serve any miner well to ponder them when deciding to start mining or when thinking about improving current mining operations. We touched upon some best practices that we recommend keeping in mind. We also investigated the profitability of mining, given current conditions. A simple way of calculating your average earnings was also presented. We concluded that mining competition is fierce; therefore, any improvements you can make will serve you well.

Resources for Article:

Further resources on this subject:

- Bitcoins – Pools and Mining [article]
- Protecting Your Bitcoins [article]
- E-commerce with MEAN [article]



Packt

20 Mar 2015

29 min read

Introducing Interactive Plotting


This article is written by Benjamin V. Root, the author of Interactive Applications using Matplotlib. The goal of any interactive application is to provide as much information as possible while minimizing complexity. If it can't provide the information the users need, then it is useless to them. However, if the application is too complex, then the information's signal gets lost in the noise of the complexity. A graphical presentation often strikes the right balance.

The Matplotlib library can help you present your data as graphs in your application. Anybody can make a simple interactive application without knowing anything about draw buffers, event loops, or even what a GUI toolkit is. And yet, the Matplotlib library will cede as much control as desired to allow even the most savvy GUI developer to create a masterful application from scratch. Like much of the Python language, Matplotlib's philosophy is to give the developer full control, but without being stupidly unhelpful and tedious.

Installing Matplotlib

There are many ways to install Matplotlib on your system. While the library used to have a reputation for being difficult to install on non-Linux systems, it has come a long way since then, along with the rest of the Python ecosystem. Refer to the following command:

$ pip install matplotlib

Most likely, the preceding command will work just fine from the command line. Python Wheels (the next-generation Python package format that has replaced "eggs") for Matplotlib are now available from PyPI for Windows and Mac OS X systems. This method also works for Linux users; however, it might be preferable to install it via the system's built-in package manager.

While the core Matplotlib library can be installed with few dependencies, it is a part of a much larger scientific computing ecosystem known as SciPy. Displaying your data is often the easiest part of your application.
Processing it is much more difficult, and the SciPy ecosystem most likely has the packages you need to do that. For basic numerical processing and N-dimensional data arrays, there is NumPy. For more advanced but general data processing tools, there is the SciPy package (the name was so catchy, it ended up being used to refer to many different things in the community). For more domain-specific needs, there are "Sci-Kits" such as scikit-learn for artificial intelligence, scikit-image for image processing, and statsmodels for statistical modeling. Another very useful library for data processing is pandas.

This was just a short summary of the packages available in the SciPy ecosystem. Manually managing all of their installations, updates, and dependencies would be difficult for many who just simply want to use the tools. Luckily, there are several distributions of the SciPy Stack available that can keep the menagerie under control. The following are Python distributions that include the SciPy Stack along with many other popular Python packages or make the packages easily available through package management software:

- Anaconda from Continuum Analytics
- Canopy from Enthought
- SciPy Superpack
- Python(x, y) (Windows only)
- WinPython (Windows only)
- Pyzo (Python 3 only)
- Algorete Loopy from Dartmouth College

Show() your work

With Matplotlib installed, you are now ready to make your first simple plot. Matplotlib has multiple layers. Pylab is the topmost layer, often used for quick one-off plotting from within a live Python session. Start up your favorite Python interpreter and type the following:

    >>> from pylab import *
    >>> plot([1, 2, 3, 2, 1])

Nothing happened! This is because Matplotlib, by default, will not display anything until you explicitly tell it to do so. The Matplotlib library is often used for automated image generation from within Python scripts, with no need for any interactivity.
Also, most users would not be done with their plotting yet and would find it distracting to have a plot come up automatically. When you are ready to see your plot, use the following command:

    >>> show()

Interactive navigation

A figure window should now appear, and the Python interpreter is not available for any additional commands. By default, showing a figure will block the execution of your scripts and interpreter. However, this does not mean that the figure is not interactive. As you mouse over the plot, you will see the plot coordinates in the lower right-hand corner. The figure window will also have a toolbar. From left to right, the following are the tools:

- Home, Back, and Forward: These are similar to those of a web browser. These buttons help you navigate through the previous views of your plot. The "Home" button will take you back to the first view when the figure was opened. "Back" will take you to the previous view, while "Forward" will return you to the view you left by going back.
- Pan (and zoom): This button has two modes: pan and zoom. Press the left mouse button and hold it to pan the figure. If you press x or y while panning, the motion will be constrained to just the x or y axis, respectively. Press the right mouse button to zoom. The plot will be zoomed in or out in proportion to the right/left and up/down movements. Use the x, y, or Ctrl key to constrain the zoom to the x axis or the y axis or preserve the aspect ratio, respectively.
- Zoom-to-rectangle: Press the left mouse button and drag the cursor to a new location and release. The axes view limits will be zoomed to the rectangle you just drew. Zoom out using your right mouse button, placing the current view into the region defined by the rectangle you just drew.
- Subplot configuration: This button brings up a tool to modify plot spacing.
- Save: This button brings up a dialog that allows you to save the current figure.

The figure window is also responsive to the keyboard.
The default keymap is fairly extensive (and will be covered fully later), but some of the basic hot keys are the Home key for resetting the plot view, the left and right keys for back and forward actions, p for pan/zoom mode, o for zoom-to-rectangle mode, and Ctrl + s to trigger a file save. When you are done viewing your figure, close the window as you would close any other application window, or use Ctrl + w.

Interactive plotting

When we did the previous example, no plots appeared until show() was called. Furthermore, no new commands could be entered into the Python interpreter until all the figures were closed. As you will soon learn, once a figure is closed, the plot it contains is lost, which means that you would have to repeat all the commands again in order to show() it again, perhaps with some modification or additional plot.

Matplotlib ships with its interactive plotting mode off by default. There are a couple of ways to turn the interactive plotting mode on. The main way is by calling the ion() function (for Interactive ON). Interactive plotting mode can be turned on at any time and turned off with ioff(). Once this mode is turned on, the next plotting command will automatically trigger an implicit show() command. Furthermore, you can continue typing commands into the Python interpreter. You can modify the current figure, create new figures, and close existing ones at any time, all from the current Python session.

Scripted plotting

Python is known for more than just its interactive interpreters; it is also a fully fledged programming language that allows its users to easily create programs. Having a script to display plots from daily reports can greatly improve your productivity. Alternatively, you perhaps need a tool that can produce some simple plots of the data from whatever mystery data file you have come across on the network share.
Here is a simple example of how to use Matplotlib's pyplot API and the argparse Python standard library tool to create a simple CSV plotting script called plotfile.py.

Code: chp1/plotfile.py

    #!/usr/bin/env python
    from argparse import ArgumentParser
    import matplotlib.pyplot as plt

    if __name__ == '__main__':
        parser = ArgumentParser(description="Plot a CSV file")
        parser.add_argument("datafile", help="The CSV File")
        # Require at least one column name
        parser.add_argument("columns", nargs='+',
                            help="Names of columns to plot")
        parser.add_argument("--save", help="Save the plot as...")
        parser.add_argument("--no-show", action="store_true",
                            help="Don't show the plot")
        args = parser.parse_args()

        # Note: newer Matplotlib releases no longer ship plt.plotfile();
        # this script targets the releases current when the article was written.
        plt.plotfile(args.datafile, args.columns)
        # Save before showing: once the window is closed, the plot is lost
        if args.save:
            plt.savefig(args.save)
        if not args.no_show:
            plt.show()

Note the two optional command-line arguments: --save and --no-show. With the --save option, the user can have the plot automatically saved (the graphics format is determined automatically from the filename extension). Also, the user can choose not to display the plot, which when coupled with the --save option might be desirable if the user is trying to plot several CSV files.

When calling this script to show a plot, the execution of the script will stop at the call to plt.show(). If the interactive plotting mode were on, then the execution of the script would continue past show(), terminating the script, thus automatically closing out any figures before the user has had a chance to view them. This is why the interactive plotting mode is turned off by default in Matplotlib.

Also note that the call to plt.savefig() is before the call to plt.show(). As mentioned before, when the figure window is closed, the plot is lost. You cannot save a plot after it has been closed.

Getting help

We have covered how to install Matplotlib and gone over how to make very simple plots from a Python session or a Python script. Most likely, this went very smoothly for you.
You may be very curious and want to learn more about the many kinds of plots this library has to offer, or maybe you want to learn how to make new kinds of plots. Help comes in many forms. The Matplotlib website (http://matplotlib.org) is the primary online resource for Matplotlib. It contains examples, FAQs, API documentation, and, most importantly, the gallery.

Gallery

Many users of Matplotlib are faced with the question, "I want to make a plot that has this data along with that data in the same figure, but it needs to look like this other plot I have seen." Text-based searches on graphing concepts are difficult, especially if you are unfamiliar with the terminology. The gallery showcases the variety of ways in which one can make plots, all using the Matplotlib library. Browse through the gallery, click on any figure that has pieces of what you want in your plot, and see the code that generated it. Soon enough, you will be like a chef, mixing and matching components to produce that perfect graph.

Mailing lists and forums

When you are simply stuck and cannot figure out how to get something to work, or just need some hints on how to get started, you will find much of the community at the Matplotlib-users mailing list. This mailing list is an excellent source of information with many friendly members who just love to help out newcomers. Be persistent! While many questions do get answered fairly quickly, some will fall through the cracks. Try rephrasing your question or including a plot showing your attempts so far. The people at Matplotlib-users love plots, so an image that shows what is wrong often gets the quickest response. A newer community resource is StackOverflow, which has many very knowledgeable users who are able to answer difficult questions.

From front to backend

So far, we have shown you bits and pieces of two of Matplotlib's topmost abstraction layers: pylab and pyplot. The layer below them is the object-oriented layer (the OO layer).
To develop any type of application, you will want to use this layer. Mixing the pylab/pyplot layers with the OO layer will lead to very confusing behaviors when dealing with multiple plots and figures.

Below the OO layer is the backend interface. Everything above this interface level in Matplotlib is completely platform-agnostic. It will work the same regardless of whether it is in an interactive GUI or comes from a driver script running on a headless server. The backend interface abstracts away all those considerations so that you can focus on what is most important: writing code to visualize your data.

There are several backend implementations that are shipped with Matplotlib. These backends are responsible for taking the figures represented by the OO layer and interpreting them for whichever "display device" they implement. The backends are chosen automatically but can be explicitly set, if needed.

Interactive versus non-interactive

There are two main classes of backends: ones that provide interactive figures and ones that don't. Interactive backends are ones that support a particular GUI, such as Tcl/Tkinter, GTK, Qt, Cocoa/Mac OS X, wxWidgets, and Cairo. With the exception of the Cocoa/Mac OS X backend, all interactive backends can be used on Windows, Linux, and Mac OS X. Therefore, when you make an interactive Matplotlib application that you wish to distribute to users of any of those platforms, unless you are embedding Matplotlib, you will not have to concern yourself with writing a single line of code for any of these toolkits; it has already been done for you!

Non-interactive backends are used to produce image files. There are backends to produce PostScript/EPS, Adobe PDF, and Scalable Vector Graphics (SVG) as well as rasterized image files such as PNG, BMP, and JPEG.
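As a minimal sketch of what non-interactive output looks like in practice, the following renders the same small plot to three of the file formats named above and keeps the results in memory. The helper name render_to_bytes is hypothetical, and the script assumes the Agg backend (covered later in this article) so that no GUI toolkit is loaded.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: select a non-interactive backend before importing pyplot
import matplotlib.pyplot as plt


def render_to_bytes(fmt):
    """Hypothetical helper: draw a small line plot and return it encoded as `fmt`."""
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3, 2, 1])
    buffer = io.BytesIO()
    # savefig() picks the matching non-interactive renderer from the format name
    fig.savefig(buffer, format=fmt)
    plt.close(fig)  # release the figure once it has been written
    return buffer.getvalue()


outputs = {fmt: render_to_bytes(fmt) for fmt in ("png", "pdf", "svg")}
for fmt, payload in outputs.items():
    print(fmt, len(payload), "bytes")
```

No window ever opens here; the plotting code is the same as it would be for an interactive session, and only the output device differs.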
Anti-grain geometry

The open secret behind the high quality of Matplotlib's rasterized images is its use of the Anti-Grain Geometry (AGG) library (http://agg.sourceforge.net/antigrain.com/index.html). The quality of the graphics generated from AGG is far superior to that of most other toolkits available. Therefore, not only is AGG used to produce rasterized image files, but it is also utilized in most of the interactive backends. Matplotlib maintains and ships with its own fork of the library in order to ensure you have consistent, high quality image products across all platforms and toolkits. What you see on your screen in your interactive figure window will be the same as the PNG file that is produced when you call savefig().

Selecting your backend

When you install Matplotlib, a default backend is chosen for you based upon your OS and the available GUI toolkits. For example, on Mac OS X systems, your installation of the library will most likely set the default interactive backend to MacOSX, or CocoaAgg for older Macs. Meanwhile, Windows users will most likely have a default of TkAgg or Qt5Agg. In most situations, the choice of interactive backends will not matter. However, in certain situations, it may be necessary to force a particular backend to be used. For example, on a headless server without an active graphics session, you would most likely need to force the use of the non-interactive Agg backend:

    import matplotlib
    matplotlib.use("Agg")

When done prior to any plotting commands, this will avoid loading any GUI toolkits, thereby bypassing problems that occur when a GUI fails on a headless server. Any call to show() effectively becomes a no-op (and the execution of the script is not blocked).

Another purpose of setting your backend is for scenarios when you want to embed your plot in a native GUI application. In that case, you will need to explicitly state which GUI toolkit you are using.
Finally, some users simply like the look and feel of some GUI toolkits better than others. They may wish to change the default backend via the backend parameter in the matplotlibrc configuration file. Most likely, your rc file can be found in the .matplotlib directory or the .config/matplotlib directory under your home folder. If you can't find it, then use the following set of commands:

    >>> import matplotlib
    >>> matplotlib.matplotlib_fname()
    u'/home/ben/.config/matplotlib/matplotlibrc'

Here is an example of the relevant section in my matplotlibrc file:

    #### CONFIGURATION BEGINS HERE
    # the default backend; one of GTK GTKAgg GTKCairo GTK3Agg
    # GTK3Cairo CocoaAgg MacOSX QtAgg Qt4Agg TkAgg WX WXAgg Agg Cairo
    # PS PDF SVG
    # You can also deploy your own backend outside of matplotlib by
    # referring to the module name (which must be in the PYTHONPATH)
    # as 'module://my_backend'
    #backend : GTKAgg
    #backend : QT4Agg
    backend : TkAgg
    # If you are using the Qt4Agg backend, you can choose here
    # to use the PyQt4 bindings or the newer PySide bindings to
    # the underlying Qt4 toolkit.
    #backend.qt4 : PyQt4 # PyQt4 | PySide

This is the global configuration file that is used if one isn't found in the current working directory when Matplotlib is imported. The settings contained in this configuration serve as default values for many parts of Matplotlib. In particular, we see that the choice of backends can be easily set without having to use a single line of code.

The Matplotlib figure-artist hierarchy

Everything that can be drawn in Matplotlib is called an artist. Any artist can have child artists that are also drawable. This forms the basis of a hierarchy of artist objects that Matplotlib sends to a backend for rendering. At the root of this artist tree is the figure. In the examples so far, we have not explicitly created any figures. The pylab and pyplot interfaces will create the figures for us.
However, when creating advanced interactive applications, it is highly recommended that you explicitly create your figures. You will especially want to do this if you have multiple figures being displayed at the same time. This is the entry into the OO layer of Matplotlib:

    fig = plt.figure()

Canvassing the figure

The figure is, quite literally, your canvas. Its primary component is the FigureCanvas instance upon which all drawing occurs. Unless you are embedding your Matplotlib figures into a GUI application, it is very unlikely that you will need to interact with this object directly. Instead, as plotting commands are issued, artist objects are added to the canvas automatically.

While any artist can be added directly to the figure, usually only Axes objects are added. A figure can have many axes objects, typically called subplots. Much like the figure object, our examples so far have not explicitly created any axes objects to use. This is because the pylab and pyplot interfaces will also automatically create and manage axes objects for a figure if needed. For the same reason as for figures, you will want to explicitly create these objects when building your interactive applications. If an axes or a figure is not provided, then the pyplot layer will have to make assumptions about which axes or figure you mean to apply a plotting command to. While this might be fine for simple situations, these assumptions get hairy very quickly in non-trivial applications. Luckily, it is easy to create both your figure and its axes using a single command:

    fig, axes = plt.subplots(2, 1) # 2x1 grid of subplots

These objects are highly advanced complex units that most developers will utilize for their plotting needs. Once placed on the figure canvas, the axes object will provide the ticks, axis labels, axes title(s), and the plotting area.
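The explicit style described above can be sketched as follows. This is only an illustration under stated assumptions: the data values, titles, and the variable names top and bottom are made up for the example, and the Agg backend is selected so the script also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run; omit this line in an interactive session
import matplotlib.pyplot as plt

# Create the figure and both axes up front instead of relying on pyplot's
# notion of a "current" figure or axes.
fig, axes = plt.subplots(2, 1)  # 2x1 grid of subplots
top, bottom = axes

# Every plotting call names the axes it draws on, so nothing is ambiguous.
top.plot([1, 2, 3, 2, 1])
top.set_title("Top axes")
bottom.plot([3, 1, 2, 1, 3])
bottom.set_title("Bottom axes")
bottom.set_xlabel("Sample index")

print(len(fig.axes), "axes attached to the figure")
```

Because each call is made on a specific axes object, adding a second figure later does not change which subplot a given line ends up on.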
An axes is an artist that manages all of its scale and coordinate transformations (for example, log scaling and polar coordinates), automated tick labeling, and automated axis limits. In addition to these responsibilities, an axes object provides a wide assortment of plotting functions. A sampling of plotting functions is as follows:

    Function     Description
    bar          Make a bar plot
    barbs        Plot a two-dimensional field of barbs
    boxplot      Make a box and whisker plot
    cohere       Plot the coherence between x and y
    contour      Plot contours
    errorbar     Plot an errorbar graph
    hexbin       Make a hexagonal binning plot
    hist         Plot a histogram
    imshow       Display an image on the axes
    pcolor       Create a pseudocolor plot of a two-dimensional array
    pcolormesh   Plot a quadrilateral mesh
    pie          Plot a pie chart
    plot         Plot lines and/or markers
    quiver       Plot a two-dimensional field of arrows
    sankey       Create a Sankey flow diagram
    scatter      Make a scatter plot of x versus y
    stem         Create a stem plot
    streamplot   Draw streamlines of a vector flow

This application will be a storm track editing application. Given a series of radar images, the user can circle each storm cell they see in the radar image and link those storm cells across time. The application will need the ability to save and load track data and provide the user with mechanisms to edit the data. Along the way, we will learn about Matplotlib's structure, its artists, the callback system, doing animations, and finally, embedding this application within a larger GUI application.

So, to begin, we first need to be able to view a radar image. There are many ways to load data into a Python program but one particular favorite among meteorologists is the Network Common Data Form (NetCDF) file. The SciPy package has built-in support for NetCDF version 3, so we will be using an hour's worth of radar reflectivity data prepared using this format from a NEXRAD site near Oklahoma City, OK on the evening of May 10, 2010, which produced numerous tornadoes and severe storms.
The NetCDF binary file is particularly nice to work with because it can hold multiple data variables in a single file, with each variable having an arbitrary number of dimensions. Furthermore, metadata can be attached to each variable and to the dataset itself, allowing you to self-document data files. This particular data file has three variables, namely Reflectivity, lat, and lon, to record the radar reflectivity values and the latitude and longitude coordinates of each pixel in the reflectivity data. The reflectivity data is three-dimensional, with the first dimension as time and the other two dimensions as latitude and longitude. The following code example shows how easy it is to load this data and display the first image frame using SciPy and Matplotlib.

Code: chp1/simple_radar_viewer.py

    import matplotlib.pyplot as plt
    from scipy.io import netcdf_file

    ncf = netcdf_file('KTLX_20100510_22Z.nc')
    data = ncf.variables['Reflectivity']
    lats = ncf.variables['lat']
    lons = ncf.variables['lon']
    i = 0  # index of the frame to display

    cmap = plt.get_cmap('gist_ncar')
    cmap.set_under('lightgrey')  # values below vmin are drawn in light grey

    fig, ax = plt.subplots(1, 1)
    # Pass the modified colormap object so that set_under() takes effect
    im = ax.imshow(data[i], origin='lower',
                   extent=(lons[0], lons[-1], lats[0], lats[-1]),
                   vmin=0.1, vmax=80, cmap=cmap)
    cb = fig.colorbar(im)
    cb.set_label('Reflectivity (dBZ)')
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    plt.show()

Running this script should result in a figure window that will display the first frame of our storms. The plot has a colorbar and the axes ticks label the latitudes and longitudes of our data. What is probably most important in this example is the imshow() call. Traditionally, the origin of image data is shown in the upper-left corner, and Matplotlib follows this tradition by default. However, this particular dataset was saved with its origin in the lower-left corner, so we need to state this with the origin parameter. The extent parameter is a tuple describing the data extent of the image.
By default, it is assumed to be at (0, 0) and (N - 1, M - 1) for an MxN shaped image. The vmin and vmax parameters are a good way to ensure consistency of your colormap regardless of your input data. If these two parameters are not supplied, then imshow() will use the minimum and maximum of the input data to determine the colormap. This would be undesirable as we move towards displaying arbitrary frames of radar data. Finally, one can explicitly specify the colormap to use for the image. The gist_ncar colormap is very similar to the official NEXRAD colormap for radar data, so we will use it here.

The gist_ncar colormap, along with some other colormaps packaged with Matplotlib such as the default jet colormap, is actually terrible for visualization. See the Choosing Colormaps page of the Matplotlib website for an explanation of why, and guidance on how to choose a better colormap.

The menagerie of artists

Whenever a plotting function is called, the input data and parameters are processed to produce new artists to represent the data. These artists are either primitives or collections thereof. They are called primitives because they represent basic drawing components such as lines, images, polygons, and text. It is with these primitives that your data can be represented as bar charts, line plots, errorbars, or any other kinds of plots.

Primitives

There are four drawing primitives in Matplotlib: Line2D, AxesImage, Patch, and Text. It is from these primitive artists that all other artist objects are derived, and they comprise everything that can be drawn in a figure.

A Line2D object uses a list of coordinates to draw line segments in between. Typically, the individual line segments are straight, and curves can be approximated with many vertices; however, curves can be specified to draw arcs, circles, or any other Bezier-approximated curves.
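To make the Line2D description concrete, here is a minimal sketch that builds the primitive by hand and attaches it to an axes, rather than going through a plotting function such as plot(). The coordinates are made up for illustration, and the Agg backend is an assumption so that the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run; omit in an interactive session
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

fig, ax = plt.subplots()

# A Line2D is just a list of x and y coordinates joined by straight segments.
xs = [0, 1, 2, 3]
ys = [0, 1, 0, 1]
line = Line2D(xs, ys, color="b", linewidth=2)

# add_line() registers the artist as a child of the axes and records its
# data limits; autoscale_view() then applies those limits to the view.
ax.add_line(line)
ax.autoscale_view()

print(ax.get_xlim(), ax.get_ylim())
```

This is roughly the work that ax.plot() does on your behalf, which is why most code never constructs a Line2D directly.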
An AxesImage class will take two-dimensional data and coordinates and display an image of that data with a colormap applied to it. There are actually other kinds of basic image artists available besides AxesImage, but they are typically for very special uses. AxesImage objects can be very tricky to deal with, so it is often best to use the imshow() plotting method to create and return these objects.

A Patch object is an arbitrary two-dimensional object that has a single color for its "face." A polygon object is a specific instance of the slightly more general patch. These objects have a "path" (much like a Line2D object) that specifies segments that would enclose a face with a single color. The path is known as an "edge," and can have its own color as well. Besides the Polygons that one sees for bar plots and pie charts, Patch objects are also used to create arrows, legend boxes, and the markers used in scatter plots and elsewhere.

Finally, the Text object takes a Python string, a point coordinate, and various font parameters to form the text that annotates plots. Matplotlib primarily uses TrueType fonts. It searches for fonts available on your system, also ships with a few FreeType2 fonts, and uses Bitstream Vera by default. Additionally, a Text object can defer to LaTeX to render its text, if desired.

While specific artist classes will have their own set of properties that make sense for the particular art object they represent, there are several common properties that can be set. The following table is a listing of some of these properties.

    Property   Meaning
    alpha      0 represents transparent and 1 represents opaque
    color      Color name or other color specification
    visible    Boolean to flag whether to draw the artist or not
    zorder     Value of the draw order in the layering engine

Let's extend the radar image example by loading already saved polygons of storm cells with the polygon_loader() function from the tutorial.py file.
Code: chp1/simple_storm_cell_viewer.py

    import matplotlib.pyplot as plt
    from scipy.io import netcdf_file
    from matplotlib.patches import Polygon
    from tutorial import polygon_loader

    ncf = netcdf_file('KTLX_20100510_22Z.nc')
    data = ncf.variables['Reflectivity']
    lats = ncf.variables['lat']
    lons = ncf.variables['lon']
    i = 0  # index of the frame to display

    cmap = plt.get_cmap('gist_ncar')
    cmap.set_under('lightgrey')

    fig, ax = plt.subplots(1, 1)
    # Pass the modified colormap object so that set_under() takes effect
    im = ax.imshow(data[i], origin='lower',
                   extent=(lons[0], lons[-1], lats[0], lats[-1]),
                   vmin=0.1, vmax=80, cmap=cmap)
    cb = fig.colorbar(im)

    polygons = polygon_loader('polygons.shp')
    # Draw every storm cell outline recorded for frame i
    for poly in polygons[i]:
        p = Polygon(poly, lw=3, fc='k', ec='w', alpha=0.45)
        ax.add_artist(p)
    cb.set_label("Reflectivity (dBZ)")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    plt.show()

The polygon data returned from polygon_loader() is a dictionary of lists keyed by a frame index. Each list contains Nx2 numpy arrays of vertex coordinates in longitude and latitude. The vertices form the outline of a storm cell. The Polygon constructor, like all other artist objects, takes many optional keyword arguments. First, lw is short for linewidth (referring to the outline of the polygon), which we specify to be three points wide. Next is fc, which is short for facecolor, and is set to black ('k'). This is the color of the filled-in region of the polygon. Then edgecolor (ec) is set to white ('w') to help the polygons stand out against a dark background. Finally, we set the alpha argument to be slightly less than half to make the polygon fairly transparent so that one can still see the reflectivity data beneath the polygons.

Note a particular difference between how we plotted the image using imshow() and how we plotted the polygons using polygon artists. For polygons, we called a constructor and then explicitly called ax.add_artist() to add each polygon instance as a child of the axes.
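The constructor-then-add_artist() pattern can be tried without the radar data or the tutorial module. In the sketch below, the polygons dictionary is a hypothetical stand-in with the same shape as the polygon_loader() result (frame index mapped to a list of Nx2 arrays); the vertex values and axis limits are invented, and the Agg backend is assumed for a headless run.

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # assumption: headless run; omit in an interactive session
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

# Hypothetical stand-in for polygon_loader(): frame index -> list of Nx2 arrays
polygons = {
    0: [
        np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.5]]),
        np.array([[3.0, 1.0], [4.0, 1.0], [4.0, 2.0], [3.0, 2.0]]),
    ]
}

fig, ax = plt.subplots(1, 1)
for poly in polygons[0]:
    # Same styling as in the storm cell viewer: thick white edge, black face
    p = Polygon(poly, lw=3, fc='k', ec='w', alpha=0.45)
    ax.add_artist(p)

# add_artist() does not adjust the data limits, so the view is set by hand here.
ax.set_xlim(-1, 5)
ax.set_ylim(-1, 3)

added = [child for child in ax.get_children() if isinstance(child, Polygon)]
print(len(added), "polygons attached to the axes")
```

In the storm cell viewer the limits did not need to be set by hand because imshow() had already established them.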
Meanwhile, imshow() is a plotting function that will do all of the hard work in validating the inputs, building the AxesImage instance, making all necessary modifications to the axes instance (such as setting the limits and aspect ratio), and most importantly, adding the artist object to the axes. Finally, all plotting functions in Matplotlib return the artist or list of artist objects that they create. In most cases, you will not need to save this return value in a variable because there is nothing else to do with it. In this case, we only needed the returned AxesImage so that we could pass it to the fig.colorbar() method. This is so that it would know what to base the colorbar upon.

The plotting functions in Matplotlib exist to provide convenience and simplicity to what can often be very tricky to get right by yourself. They are not magic! They use the same OO interface that is accessible to application developers. Therefore, anyone can write their own plotting functions to make complicated plots easier to perform.

Collections

Any artist that has child artists (such as a figure or an axes) is called a container. A special kind of container in Matplotlib is called a Collection. A collection usually contains a list of primitives of the same kind that should all be treated similarly. For example, a CircleCollection would have a list of Circle objects, all with the same color, size, and edge width. Individual values for artists in the collection can also be set.

A collection makes management of many artists easier. This becomes especially important when considering the number of artist objects that may be needed for scatter plots, bar charts, or any other kind of plot or diagram. Some collections are not just simply a list of primitives, but are artists in their own right. These special kinds of collections take advantage of various optimizations that can be assumed when rendering similar or identical things.
RegularPolyCollection, for example, just needs to know the points of a single polygon relative to its center (such as a star or box) and then just needs a list of all the center coordinates, avoiding the need to store all the vertices of every polygon in its collection in memory.

In the following example, we will display storm tracks as a LineCollection. Note that instead of using ax.add_artist() (which would work), we will use ax.add_collection(). This has the added benefit of performing special handling on the object to determine its bounding box so that the axes object can incorporate the limits of this collection with any other plotted objects to automatically set its own limits, which we trigger with the ax.autoscale(True) call.

Code: chp1/linecoll_track_viewer.py

    import matplotlib.pyplot as plt
    from matplotlib.collections import LineCollection
    from tutorial import track_loader

    tracks = track_loader('polygons.shp')
    # Filter out non-tracks (unassociated polygons given trackID of -9)
    tracks = {tid: t for tid, t in tracks.items() if tid != -9}

    fig, ax = plt.subplots(1, 1)
    # list() turns the dictionary view into a plain sequence of segments
    lc = LineCollection(list(tracks.values()), color='b')
    ax.add_collection(lc)
    ax.autoscale(True)
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    plt.show()

This was much easier than the radar image example: Matplotlib took care of all the limit setting automatically. Such features are extremely useful for writing generic applications that do not wish to concern themselves with such details.

Summary

In this article, we introduced you to the foundational concepts of Matplotlib. Using show(), you showed your first plot with only three lines of Python. With this plot up on your screen, you learned some of the basic interactive features built into Matplotlib, such as panning, zooming, and the myriad of key bindings that are available. Then we discussed the difference between interactive and non-interactive plotting modes and the difference between scripted and interactive plotting.
You now know where to go online for more information, examples, and forum discussions of Matplotlib when it comes time for you to work on your next Matplotlib project.

Next, we discussed the architectural concepts of Matplotlib: backends, figures, axes, and artists. Then we started our construction project, an interactive storm cell tracking application. We saw how to plot a radar image using a pre-existing plotting function, as well as how to display polygons and lines as artists and collections. While creating these objects, we had a glimpse of how to customize the properties of these objects for our display needs, learning some of the property and styling names. We also learned some of the steps one needs to consider when creating their own plotting functions, such as autoscaling.

Resources for Article

Further resources on this subject:

- The plot function [article]
- First Steps [article]
- Machine Learning in IPython with scikit-learn [article]


Packt

13 Oct 2016

25 min read

Reconstructing 3D Scenes



Packt

30 Oct 2013

9 min read

Getting started with Haskell


So what is Haskell? It is a fast, type-safe, purely functional programming language with powerful type inference. Having said that, let us try to understand what it gives us.

First, a purely functional programming language means that, in general, functions in Haskell don't have side effects. There is a special type for impure functions, or functions with side effects.

Then, Haskell has a strong, static type system with automatic and robust type inference. In practice, this means that you do not usually need to specify the types of functions, and the type checker does not allow passing incompatible types. In strongly typed languages, types are considered to be a specification, due to the Curry-Howard correspondence, the direct relationship between programs and mathematical proofs. Greatly simplified, the correspondence states that if a value of the type exists (that is, the type is inhabited), then the corresponding mathematical proof is correct. Or, jokingly, if a program compiles, then there is a 99 percent probability that it works according to specification. Whether the types conform to the specification written in natural language is still an open question, though; Haskell won't help you with it.

The Haskell platform

The glorious Glasgow Haskell Compilation System, or simply Glasgow Haskell Compiler (GHC), is the most widely used Haskell compiler. It is the current de facto standard. The compiler is packaged into the Haskell platform, which follows Python's principle, "Batteries included". The platform is updated twice a year with new compilers and libraries. It usually includes a compiler, an interactive Read-Evaluate-Print Loop (REPL) interpreter, Haskell 98/2010 libraries (the so-called Prelude) that include most of the common definitions and functions, and a set of commonly used libraries.
If you are on Windows or Mac OS X, it is strongly recommended that you use the prepackaged installers of the Haskell platform from http://www.haskell.org/platform/.

Quick tour of Haskell

Before we start developing, we should be familiar with a few basic features of Haskell: laziness, data types, pattern matching, type classes, and the basic notion of monads.

Laziness

Haskell is a language with lazy evaluation. From a programmer's point of view, this means that a value is evaluated if and only if it is really needed. Imperative languages usually have strict evaluation; that is, function arguments are evaluated before the function is applied. To see the difference, let's take a look at a simple expression in Haskell:

    let x = 2 + 3

In a strict (eager) language, the 2 + 3 expression would be immediately evaluated to 5, whereas in Haskell only a promise to do this evaluation is created, and the value is computed only when it is needed. In other words, this statement just introduces a definition of x that might be used afterwards, unlike in a strict language, where it is an assignment that stores the computed value, 5, in a memory cell named x. This strategy also allows evaluations to be shared: because a computation runs only when its result is needed, the result can be memoized and reused. This can reduce the running time by an exponential factor compared with strict evaluation.

Laziness also makes it possible to work with infinite data structures. For instance, we can construct an infinite list of natural numbers as follows:

    let ns = [1..]

Moreover, we can manipulate it as if it were a normal list, although some caution is needed, as you can end up in an infinite loop. We can take the first five elements of this infinite list by means of the built-in function take:

    take 5 ns

Running this example in GHCi gives [1,2,3,4,5].
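For readers coming from Python, the two ideas in this section (a deferred computation whose result is shared once computed, and a lazily produced infinite sequence) can be imitated with a small sketch. This is only a rough analogue and says nothing about how GHC implements laziness; the Thunk class and the add helper are hypothetical names introduced for illustration.

```python
from itertools import count, islice


class Thunk:
    """A deferred computation that runs at most once (hypothetical helper)."""

    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None

    def force(self):
        # Run the computation on first use only, then reuse the stored result.
        if not self._done:
            self._value = self._compute()
            self._done = True
            self._compute = None
        return self._value


evaluations = []


def add():
    # Record every real evaluation so the sharing is observable.
    evaluations.append("2 + 3")
    return 2 + 3


x = Thunk(add)        # like `let x = 2 + 3`: nothing is computed yet
first = x.force()     # the addition happens here
second = x.force()    # the stored result is reused

ns = count(1)                      # lazily yields 1, 2, 3, ... without end
first_five = list(islice(ns, 5))   # rough counterpart of `take 5 ns`

print(first, second, len(evaluations), first_five)
```

Unlike the Haskell list ns, a Python iterator is consumed as it is read, so the analogy only holds for a single pass over the sequence.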
Functions as first-class citizens

The notion of a function is one of the core ideas in functional languages, and Haskell is no exception. The definition of a function includes the function body and an optional type declaration. For instance, the take function is defined in the Prelude as follows:

    take :: Int -> [a] -> [a]
    take = ...

Here, the type declaration says that the function takes an integer and a list of values of type a as arguments, and returns a new list of the same type. Haskell also allows partial application of a function. For example, you can construct a function that takes the first five elements of a list as follows:

    take5 :: [a] -> [a]
    take5 = take 5

Functions are also values themselves; that is, you may pass a function as an argument to another function. The Prelude defines map as a function that takes a function:

    map :: (a -> b) -> [a] -> [b]

map takes a function and applies it to each element of a list. Thus, functions are first-class citizens in the language, and it is possible to manipulate them as if they were ordinary values.

Data types

Data types are at the core of a strongly typed language such as Haskell. The distinctive feature of Haskell data types is that they are all immutable; that is, once a value has been constructed, it cannot be changed. This might seem strange at first sight, but in the long run it has a few positive consequences. First, it makes it easier to parallelize computations. Second, all data is referentially transparent; that is, there is no difference between a reference to a value and the value itself. These two properties allow the compiler to reason about code optimization at a higher level than a C/C++ compiler can.
Let us consider the following data type definitions in Haskell:

    type TimeStamp = Integer
    newtype Price = Price Double
    data Quote = AskQuote TimeStamp Price | BidQuote TimeStamp Price

These illustrate the three common ways to define data types in Haskell:

The first declaration just creates a synonym for an existing data type, and the type checker won't prevent you from using Integer instead of TimeStamp. You can also use a value of type TimeStamp with any function that expects an Integer.
The second declaration creates a new type for prices, and you are not allowed to use Double instead of Price. The compiler raises an error in such cases and thus enforces correct usage.
The last declaration introduces an Algebraic Data Type (ADT) and says that a value of type Quote can be constructed either by the data constructor AskQuote or by BidQuote, each with a time stamp and a price as its parameters. Quote itself is called a type constructor.

A type constructor can be parameterized by type variables. For example, the frequently used standard types Maybe and Either are defined as follows:

    data Maybe a = Nothing | Just a
    data Either a b = Left a | Right b

The type variables a and b can be substituted by any other type.

Type classes

Type classes in Haskell are not classes as in object-oriented languages. They are more like interfaces with optional implementations. Similar features exist in other languages under names such as traits and mix-ins, but unlike those, type classes in Haskell enable ad hoc polymorphism; that is, a function can be applied to arguments of different types. This is also known as function overloading or operator overloading. A polymorphic function can specify a different implementation for each type. In principle, a type class consists of function declarations over some values.
Eq is a standard type class that specifies how to compare two values of the same type:

    class Eq a where
      (==) :: a -> a -> Bool
      (/=) :: a -> a -> Bool
      (/=) x y = not $ (==) x y

Here, Eq is the name of the type class, a is a type variable, and == and /= are the operations defined in the type class. This definition means that a type a belongs to the class Eq if the operations == and /= are defined for it. Moreover, the definition provides a default implementation of /=, so if you decide to implement this class for some data type, you need to provide an implementation of only one operation, ==.

A number of useful type classes are defined in the Prelude. For example, Eq is used to define equality between two values; Ord specifies a total ordering; Show and Read introduce a String representation of a value; and the Enum type class describes enumerations, that is, data types with nullary constructors. It would be quite tedious to implement these trivial but useful type classes for each data type, so Haskell supports automatic derivation of most of the standard type classes:

    newtype Price = Price Double deriving (Eq, Show, Read, Ord)

Now values of type Price can be converted to and from String by means of the show and read functions, and compared with each other using the equality and ordering operators. Those who want to know all the details can look them up in the Haskell language report.

IO monad

Being a purely functional language, Haskell requires a marker for impure functions, and the IO monad is that marker. The main function is the entry point of any Haskell program. It cannot be a pure function because it changes the state of the “World”, at least by creating a new process in the OS. Let us take a look at its type:

    main :: IO ()
    main = do
      putStrLn "Hello, World!"

The IO () type indicates a computation that performs some I/O operations and returns an empty result. There is really only one way to perform I/O in Haskell: use it in the main procedure of your program.
It is also not possible to execute an I/O action from an arbitrary function, unless that function is in the IO monad and is called from main, directly or indirectly.

Summary

In this article, we walked through the main distinctive features of the Haskell language and learned a bit of its history.


Packt

16 Oct 2009

9 min read

Oracle Wallet Manager



Packt

20 Aug 2014

7 min read

Amazon DynamoDB - Modelling relationships, Error handling


In this article by Tanmay Deshpande, author of Mastering DynamoDB, we revisit the core concepts of DynamoDB and discover more about its features and implementation.

Amazon DynamoDB is a fully managed, cloud-hosted NoSQL database. It provides fast and predictable performance with the ability to scale seamlessly. It allows you to store and retrieve any amount of data and serve any level of network traffic without any operational burden. DynamoDB offers numerous other advantages, such as consistent and predictable performance, flexible data modeling, and durability. With just a few clicks in the Amazon Web Services console, you can create your own DynamoDB table and scale provisioned throughput up or down without taking down your application even for a millisecond. DynamoDB uses solid state disks (SSDs) to store the data, which contributes to the durability of the work you are doing. It also automatically replicates the data across other AWS Availability Zones, which provides built-in high availability and reliability.

Before we start discussing the details of DynamoDB, let's try to understand what NoSQL databases are and when to choose DynamoDB over an RDBMS. With the rise in data volume, variety, and velocity, it became clear that RDBMSes were neither designed to cope with the scale and flexibility challenges that developers face when building modern applications, nor able to take advantage of cheap commodity hardware. In addition, a schema has to be provided before any data is added, which restricts developers from making their applications flexible. NoSQL databases, on the other hand, are fast, provide flexible schema operations, and make effective use of cheap storage. For these reasons, NoSQL is quickly becoming popular in the developer community. However, one has to be careful about when to go for NoSQL and when to stick with an RDBMS.
Sticking with a relational database makes sense when you know that the schema is more or less static, strong consistency is a must, and the data is not going to be very large in volume. But when you want to build an Internet-scale application whose schema is likely to evolve over time, whose storage is going to be really big, and whose operations can tolerate eventual consistency, NoSQL is the way to go.

There are various types of NoSQL databases. The following is a list of NoSQL database types and popular examples:

Document Store – MongoDB, CouchDB, MarkLogic, etc.
Column Store – HBase, Cassandra, etc.
Key Value Store – DynamoDB, Azure, Redis, etc.
Graph Databases – Neo4J, DEX, etc.

Most of these NoSQL solutions are open source, except for a few, such as DynamoDB and Azure, which are available as services over the Internet. DynamoDB, being a key-value store, indexes data only on primary keys, and one needs to go through the primary key to access a given attribute. Let's start learning more about DynamoDB by having a look at its history.

DynamoDB's History

Amazon's e-commerce platform had a huge set of decoupled services that were developed and managed individually, and each service exposed an API to be consumed by the others. Earlier, each service had direct database access, which was a major bottleneck. In terms of scalability, Amazon's requirements exceeded what any third-party vendor could provide at the time. DynamoDB was built to address Amazon's needs for high availability, extreme scalability, and durability. Earlier, Amazon stored its production data in relational databases, and services were provided for all required operations. Later, however, they realized that most of the services accessed data only through its primary key and did not need complex queries to fetch the required data; in addition, maintaining these RDBMS systems required high-end hardware and skilled personnel.
To overcome all of these issues, Amazon's engineering team built a NoSQL database that addresses them. In 2007, Amazon published a research paper on Dynamo, which combined the best ideas from the database and key-value store worlds and inspired many open source projects at the time, Cassandra, Voldemort, and Riak among them. You can find the paper at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Even though Dynamo had great features that took care of all the engineering needs, it was not widely adopted within Amazon itself at that time, as it was not a fully managed service. When Amazon released S3 and SimpleDB, engineering teams were quite excited to adopt those rather than Dynamo, as DynamoDB was a bit expensive at that time due to SSDs. Finally, after rounds of improvement, Amazon released Dynamo as a cloud-based service, and since then it has been one of the most widely used NoSQL databases. Before its release to the public cloud in 2012, DynamoDB was the core storage service for Amazon's e-commerce platform, where it started out with the shopping cart and session management services. Any downtime or degradation in performance had a major impact on Amazon's business, and any financial impact was strictly unacceptable; DynamoDB proved itself to be the best choice in the end. Now let's look at DynamoDB in more detail.

What is DynamoDB?

DynamoDB is a fully managed, Internet-scalable, easily administered, and cost-effective NoSQL database. It is part of the database-as-a-service offering of Amazon Web Services. The above diagram shows how Amazon offers its various cloud services and where exactly DynamoDB is placed. AWS RDS is Amazon's relational database as a service over the Internet, while SimpleDB and DynamoDB are NoSQL databases as a service. Both SimpleDB and DynamoDB are fully managed, non-relational services. DynamoDB is built for fast, seamless scalability and high performance.
It runs on SSDs to provide faster responses and has no limits on request capacity or storage. It automatically partitions your data throughout the cluster to meet expectations, whereas SimpleDB has a storage limit of 10 GB and can serve only a limited number of requests per second. In SimpleDB, we also have to manage our own partitions. So, depending on your needs, you have to choose the right solution.

To use DynamoDB, the first and foremost requirement is an AWS account. Through the easy-to-use AWS management console, you can create new tables directly by providing the necessary information, and can start loading data into them within a few minutes.

Modelling Relationships

As with any other database, modeling relationships in DynamoDB is quite interesting, even though it is a NoSQL database. People often get confused about how to model the relationships between various tables; in this section, we try to simplify this problem. To understand relationships better, let's use our example of a book store, where we have entities such as Book, Author, Publisher, and so on.

One to One

In this type of relationship, one record of a table is related to only one record of another table. In our book store application, we have the BookInfo and BookDetails tables. The BookInfo table holds short information about a book, which can be used to display book information on a web page, whereas the BookDetails table is used when someone explicitly needs to see all the details of a book. This design helps keep the system healthy: even if there is a high volume of requests on one table, the other table remains up and running to do what it is supposed to do. The following diagram shows what the table structure would look like.

One to many

In this type of relationship, one record from an entity is related to more than one record in another entity.
In the book store application, we can have a Publisher Book table that keeps information about the relationship between books and publishers. Here, we can use Publisher Id as the hash key and Book Id as the range key. The following diagram shows what the table structure would look like.

Many to many

A many-to-many relationship means that many records from one entity are related to many records from another entity. In the book store application, a book can be written by multiple authors, and an author can write multiple books. In this case, we should use two tables, each with both hash and range keys.
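The one-to-many and many-to-many layouts described above can be sketched with a toy in-memory table. This is a hedged illustration only: KeyedTable, link, and all the identifiers and titles below are hypothetical, and the code does not call DynamoDB or any AWS SDK. It only shows how a hash key groups related items and how a range key distinguishes them within the group.

```python
from collections import defaultdict


class KeyedTable:
    """Toy in-memory table addressed by a hash key plus a range key.

    Hypothetical stand-in for illustration; it does not talk to DynamoDB.
    """

    def __init__(self):
        # hash key -> {range key -> item}
        self._partitions = defaultdict(dict)

    def put(self, hash_key, range_key, item):
        self._partitions[hash_key][range_key] = item

    def query(self, hash_key):
        """Return every item stored under one hash key, ordered by range key."""
        partition = self._partitions.get(hash_key, {})
        return [partition[key] for key in sorted(partition)]


# One to many: Publisher Id is the hash key, Book Id is the range key.
publisher_book = KeyedTable()
publisher_book.put("pub-1", "book-1", {"book_id": "book-1", "title": "First"})
publisher_book.put("pub-1", "book-2", {"book_id": "book-2", "title": "Second"})
publisher_book.put("pub-2", "book-3", {"book_id": "book-3", "title": "Third"})

# Many to many: two tables, one per direction of the relationship.
book_author = KeyedTable()  # hash key: Book Id, range key: Author Id
author_book = KeyedTable()  # hash key: Author Id, range key: Book Id


def link(book_id, author_id):
    """Record one book/author pair in both tables."""
    book_author.put(book_id, author_id, author_id)
    author_book.put(author_id, book_id, book_id)


link("book-1", "author-a")
link("book-1", "author-b")
link("book-2", "author-a")

titles_for_pub1 = [item["title"] for item in publisher_book.query("pub-1")]
authors_of_book1 = book_author.query("book-1")
books_by_author_a = author_book.query("author-a")
print(titles_for_pub1, authors_of_book1, books_by_author_a)
```

Keeping two tables for the many-to-many case means each direction of the relationship can be answered with a single lookup by hash key, at the cost of writing every pair twice.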


Savia Lobo

05 Jun 2019

5 min read

Roger McNamee on Silicon Valley’s obsession for building “data voodoo dolls”


The Canadian Parliament's Standing Committee on Access to Information, Privacy and Ethics hosted the hearing of the International Grand Committee on Big Data, Privacy and Democracy from Monday, May 27 to Wednesday, May 29. Witnesses from at least 11 countries appeared before representatives to testify on how governments can protect democracy and citizen rights in the age of big data. This section of the hearing, which took place on May 28, includes Roger McNamee’s take on why Silicon Valley wants to build data voodoo dolls for users. Roger McNamee is the author of Zucked: Waking up to the Facebook Catastrophe. His remarks in this section of the hearing build on previous hearing presentations by Professor Zuboff, Professor Park Ben Scott and the previous talk by Jim Balsillie.

He started off by saying, “Beginning in 2004, I noticed a transformation in the culture of Silicon Valley and over the course of a decade customer focused models were replaced by the relentless pursuit of global scale, monopoly, and massive wealth.” McNamee says that Google wants to make the world more efficient; it wants to eliminate the user stress that results from too many choices. Google knew that society would not permit a business model based on denying consumer choice and free will, so it covered its tracks. Beginning around 2012, Facebook adopted a similar strategy, later followed by Amazon, Microsoft, and others. For Google and Facebook, the business is behavioral prediction, which they use to build a high-resolution data avatar of every consumer, a voodoo doll if you will.
They gather a tiny amount of data from user posts and queries, but the vast majority of their data comes from surveillance, web tracking, scanning emails and documents, data from apps and third parties, and ambient surveillance from products like Alexa, Google Assistant, Sidewalk Labs, and Pokemon Go. Google and Facebook use data voodoo dolls to provide their customers, who are marketers, with perfect information about every consumer. They use the same data to manipulate consumer choices; just as in China, behavioral manipulation is the goal. The algorithms of Google and Facebook are tuned to keep users on site and active, preferably by pressing emotional buttons that reveal each user's true self. For most users, this means content that provokes fear or outrage. Hate speech, disinformation, and conspiracy theories are catnip for these algorithms. The design of these platforms treats all content precisely the same, whether it be hard news from a reliable site, a warning about an emergency, or a conspiracy theory. The platforms make no judgments; users choose, aided by algorithms that reinforce past behavior. The result is 2.5 billion Truman Shows on Facebook, each a unique world with its own facts. In the U.S., nearly 40% of the population identifies with at least one thing that is demonstrably false; this undermines democracy.

“The people at Google and Facebook are not evil; they are the products of an American business culture with few rules, where misbehavior seldom results in punishment,” he says. Unlike industrial businesses, internet platforms are highly adaptable, and this is the challenge. If you take away one opportunity, they will move on to the next one, and they are moving upmarket, getting rid of the middlemen. Today, they apply behavioral prediction to advertising, but they have already set their sights on transportation and financial services.
This is not an argument against undermining their advertising business but rather a warning that it may be a Pyrrhic victory. If the goals are to protect democracy and personal liberty, McNamee told the committee, they have to be bold. They have to force a radical transformation of the business model of internet platforms. That would mean, at a minimum, banning web tracking, scanning of email and documents, third-party commerce and data, and ambient surveillance. A second option would be to tax micro-targeted advertising to make it economically unattractive. But you also need to create space for alternative business models built on trust that lasts. Startups can happen anywhere; they can come from each of your countries. At the end of the day, though, the most effective path to reform would be to shut down the platforms, at least temporarily, as Sri Lanka did. Any country can go first. The platforms have left you no choice; the time has come to call their bluff. Companies with responsible business models will emerge overnight to fill the void.

McNamee explains, “When they (organizations) gather all of this data, the purpose of it is to create a high resolution avatar of each and every human being. Doesn't matter whether they use their systems or not, they collect it on absolutely everybody. In the Caribbean, Voodoo was essentially this notion that you create a doll, an avatar, such that you can poke it with a pin and the person would experience that pain, right, and so it becomes literally a representation of the human being.”

To know more, you can listen to the full hearing video titled “Meeting No. 152 ETHI - Standing Committee on Access to Information, Privacy and Ethics” on ParlVU.


Pravin Dhandre

02 Jul 2018

8 min read

Administration rights for Power BI users


In this tutorial, you will learn about administration rights and rules for Power BI users. This includes setting and monitoring rules such as who in the organization can utilize which feature, how Power BI Premium capacity is allocated and by whom, and other settings such as embed codes and custom visuals. This article is an excerpt from a book written by Brett Powell titled Mastering Microsoft Power BI.

The admin portal is accessible to Office 365 Global Administrators and users mapped to the Power BI service administrator role. To open the admin portal, log in to the Power BI service and select the Admin portal item from the Settings (Gear icon) menu in the top right, as shown in the following screenshot:

All Power BI users, including Power BI free users, are able to access the Admin portal. However, users who are not admins can only view the Capacity settings page. The Power BI service administrators and Office 365 global administrators have view and edit access to the following seven pages:

Administrators of Power BI most commonly utilize the Tenant settings and Capacity settings as described in the Tenant Settings and Power BI Premium Capacities sections later in this tutorial. However, the admin portal can also be used to manage any approved custom visuals for the organization.

Usage metrics

The Usage metrics page of the Admin portal provides admins with a Power BI dashboard of several top metrics, such as the most consumed dashboards and the most consumed dashboards by workspace. However, the dashboard cannot be modified and the tiles of the dashboard are not linked to any underlying reports or separate dashboards to support further analysis. Given these limitations, alternative monitoring solutions are recommended, such as the Office 365 audit logs and usage metric datasets specific to Power BI apps. Details of both monitoring options are included in the app usage metrics and Power BI audit log activities sections later in this chapter.
Users and Audit logs

The Users and Audit logs pages only provide links to the Office 365 admin center. In the admin center, Power BI users can be added, removed and managed. If audit logging is enabled for the organization via the Create audit logs for internal activity and auditing and compliance tenant setting, this audit log data can be retrieved from the Office 365 Security & Compliance Center or via PowerShell. This setting is noted in the following section regarding the Tenant settings tab of the Power BI admin portal. An Office 365 license is not required to utilize the Office 365 admin center for Power BI license assignments or to retrieve Power BI audit log activity.

Tenant settings

The Tenant settings page of the Admin portal allows administrators to enable or disable various features of the Power BI web service. Likewise, the administrator could allow only a certain security group to embed Power BI content in SaaS applications such as SharePoint Online. The following diagram identifies the 18 tenant settings currently available in the admin portal and the scope available to administrators for configuring each setting:

From a data security perspective, the first seven settings within the Export and Sharing and Content packs and apps groups are most important. For example, many organizations choose to disable the Publish to web feature for the entire organization. Additionally, only certain security groups may be allowed to export data or to print hard copies of reports and dashboards. As shown in the Scope column of the previous table and the following example, granular security group configurations are available to minimize risk and manage the overall deployment. Currently, only one tenant setting is available for custom visuals and this setting (Custom visuals settings) can be enabled or disabled for the entire organization only.
For organizations that wish to restrict or prohibit custom visuals for security reasons, this setting can be used to eliminate the ability to add, view, share, or interact with custom visuals. More granular controls to this setting are expected later in 2018, such as the ability to define users or security groups of users who are allowed to use custom visuals. In the following screenshot from the Tenant settings page of the Admin portal, only the users within the BI Admin security group who are not also members of the BI Team security group are allowed to publish apps to the entire organization:

For example, a report author who also helps administer the On-premises data gateway via the BI Admin security group would be denied the ability to publish apps to the organization given membership in the BI Team security group. Many of the tenant setting configurations will be simpler than this example, particularly for smaller organizations or at the beginning of Power BI deployments. However, as adoption grows and the team responsible for Power BI changes, it's important that the security groups created to help administer these settings are kept up to date.

Embed Codes

Embed Codes are created and stored in the Power BI service when the Publish to web feature is utilized. As described in the Publish to web section of the previous chapter, this feature allows a Power BI report to be embedded in any website or shared via URL on the public internet. Users with edit rights to the workspace of the published to web content are able to manage the embed codes themselves from within the workspace. However, the admin portal provides visibility and access to embed codes across all workspaces, as shown in the following screenshot:

Via the Actions commands on the far right of the Embed Codes page, a Power BI Admin can view the report in a browser (diagonal arrow) or remove the embed code.
The Embed Codes page can be helpful to periodically monitor the usage of the Publish to web feature and for scenarios in which data was included in a publish to web report that shouldn't have been, and thus needs to be removed. As shown in the Power BI Tenant settings table referenced in the previous section, this feature can be enabled or disabled for the entire organization or for specific users within security groups.

Organizational Custom visuals

The Custom Visuals page allows admins to upload and manage custom visuals (.pbiviz files) that have been approved for use within the organization. For example, an organization may have proprietary custom visuals developed internally, which it wishes to expose to business users. Alternatively, the organization may wish to define a set of approved custom visuals, such as only the custom visuals that have been certified by Microsoft. In the following screenshot, the Chiclet Slicer custom visual is added as an organizational custom visual from the Organizational visuals page of the Power BI admin portal:

The Organizational visuals page provides a link (Add a custom visual) to launch the form and identifies all uploaded visuals, as well as their last update. Once a visual has been uploaded, it can be deleted but not updated or modified. Therefore, when a new version of an organizational visual becomes available, this visual can be added to the list of organizational visuals with a descriptive title (Chiclet Slicer v2.0). Deleting an organizational custom visual will cause any reports that use this visual to stop rendering. The following screenshot reflects the uploaded Chiclet Slicer custom visual on the Organization visuals page:

Once the custom visual has been uploaded as an organizational custom visual, it will be accessible to users in Power BI Desktop.
In the following screenshot from Power BI Desktop, the user has opened the MARKETPLACE of custom visuals and selected MY ORGANIZATION:

In this screenshot, rather than searching through the MARKETPLACE, the user can go directly to visuals defined by the organization. The marketplace of custom visuals can be launched via either the Visualizations pane or the From Marketplace icon on the Home tab of the ribbon. Organizational custom visuals are not supported for reports or dashboards shared with external users. Additionally, organizational custom visuals used in reports that utilize the publish to web feature will not render outside the Power BI tenant. Moreover, organizational custom visuals are currently a preview feature. Therefore, users must enable the My organization custom visuals feature via the Preview features tab of the Options window in Power BI Desktop.

This tutorial has introduced the features and processes involved in administering Power BI for an organization. These include configuring tenant settings in the Power BI admin portal, analyzing the usage of Power BI assets, and monitoring overall user activity via the Office 365 audit logs. If you found this tutorial useful, do check out the book Mastering Microsoft Power BI to develop visually rich, immersive, and interactive Power BI reports and dashboards.


Pravin Dhandre

07 Feb 2018

7 min read

Building a Linear Regression Model in Python for developers


This article is an excerpt from a book by Rodolfo Bonnin titled Machine Learning for Developers. This book is a systematic guide for developers to implement various Machine Learning techniques and develop efficient and intelligent applications.

Let’s start with one of the most well-known toy datasets, explore it, and select one of the dimensions to learn how to build a linear regression model for its values. We begin by importing all the libraries (scikit-learn, seaborn, and matplotlib); one of the excellent features of Seaborn is its ability to define very professional-looking style settings. In this case, we will use the whitegrid style:

    import numpy as np
    from sklearn import datasets
    # The original imported seaborn.apionly, which newer seaborn releases no longer provide
    import seaborn as sns
    # Jupyter/IPython magic: render plots inline in the notebook
    %matplotlib inline
    import matplotlib.pyplot as plt
    sns.set(style='whitegrid', context='notebook')

The Iris Dataset

It’s time to load the Iris dataset. This is one of the most well-known historical datasets. You will find it in many books and publications. Given the good properties of the data, it is useful for classification and regression examples. The Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) contains 50 records for each of the three types of iris, 150 lines in total, over five fields. Each line contains the following measurements:

Sepal length in cm
Sepal width in cm
Petal length in cm
Petal width in cm

The final field is the type of flower (setosa, versicolor, or virginica). Let’s use the load_dataset method to create a matrix of values from the dataset:

    iris2 = sns.load_dataset('iris')

In order to understand the dependencies between variables, we will implement the covariance operation.
The function receives two arrays as parameters and returns the covariance(x,y) value:

    def covariance(X, Y):
        # Sample covariance: sum of mean-centered products divided by n - 1
        xhat = np.mean(X)
        yhat = np.mean(Y)
        epsilon = 0
        for x, y in zip(X, Y):
            epsilon = epsilon + (x - xhat) * (y - yhat)
        return epsilon / (len(X) - 1)

Let's try the implemented function and compare it with the NumPy function. Note that we calculate cov(a,b), while NumPy generates a matrix of all the combinations cov(a,a), cov(a,b), cov(b,a), and cov(b,b), so our result should be equal to the values at positions (1,0) and (0,1) of that matrix:

    print(covariance([1,3,4], [1,0,2]))
    print(np.cov([1,3,4], [1,0,2]))

    0.5
    [[ 2.33333333  0.5       ]
     [ 0.5         1.        ]]

Having done a minimal amount of testing, we can now define the correlation function. Like covariance, it receives two arrays and uses them to get the final value:

    def correlation(X, Y):
        # ddof=1 selects the sample standard deviation (n - 1 denominator), matching covariance()
        return covariance(X, Y) / (np.std(X, ddof=1) * np.std(Y, ddof=1))

Let's test this function with two sample arrays, and compare the result with the (0,1) and (1,0) values of the correlation matrix from NumPy:

    print(correlation([1,1,4,3], [1,0,2,2]))
    print(np.corrcoef([1,1,4,3], [1,0,2,2]))

    0.870388279778
    [[ 1.          0.87038828]
     [ 0.87038828  1.        ]]

Getting an intuitive idea with Seaborn pairplot

A very good idea when starting work on a problem is to get a graphical representation of all the possible variable combinations. Seaborn's pairplot function provides a complete graphical summary of all the variable pairs, represented as scatterplots, with the univariate distribution of each variable on the matrix diagonal. Let's look at how this plot type shows the dependencies between all the variables, and look for a linear relationship as a base to test our regression methods:

    # Recent seaborn releases call this parameter height; older releases called it size
    sns.pairplot(iris2, height=3.0)

Figure: Pairplot of all the variables in the dataset.

Let's select two variables that, from our initial analysis, appear to be linearly dependent.
They are petal_width and petal_length:

    X = iris2['petal_width']
    Y = iris2['petal_length']

Let's now take a look at this variable combination, which shows a clear linear tendency:

    plt.scatter(X, Y)

Figure: Scatter plot of the chosen variables. This is the distribution of data that we will try to model with our linear prediction function.

Creating the prediction function

First, let's define the function that will abstractly represent the modeled data as a linear function of the form y = beta*x + alpha:

    def predict(alpha, beta, x_i):
        return beta * x_i + alpha

Defining the error function

It's now time to define the function that will show us the difference between predictions and the expected output during training. We have two main alternatives: measuring the absolute difference between the values (L1), or measuring a variant of the square of the difference (L2). Let's define both versions, including the first formulation inside the second. Note that error returns the signed difference; its absolute value gives the L1 term, and squaring it gives the L2 term:

    def error(alpha, beta, x_i, y_i):  # L1 (signed residual)
        return y_i - predict(alpha, beta, x_i)

    def sum_sq_e(alpha, beta, x, y):  # L2
        return sum(error(alpha, beta, x_i, y_i) ** 2
                   for x_i, y_i in zip(x, y))

Correlation fit

Now, we will define a function implementing the correlation method to find the parameters for our regression:

    def correlation_fit(x, y):
        beta = correlation(x, y) * np.std(y, ddof=1) / np.std(x, ddof=1)
        alpha = np.mean(y) - beta * np.mean(x)
        return alpha, beta

Let's then run the fitting function and print the fitted parameters:

    alpha, beta = correlation_fit(X, Y)
    print(alpha)
    print(beta)

    1.08355803285
    2.22994049512

Let's now graph the regressed line with the data in order to intuitively show the appropriateness of the solution:

    plt.scatter(X, Y)
    xr = np.arange(0, 3.5)
    plt.plot(xr, (xr * beta) + alpha)

Figure: Final regressed line, drawn with the slope and intercept we just calculated.

Polynomial regression and an introduction to underfitting and overfitting

When looking for a model, one of the main characteristics we look for is the power of generalizing with a simple functional expression. When we increase the complexity of the model, we may build a model that fits the training data well but is too optimized for that particular subset of data; this is overfitting. Underfitting, on the other hand, applies to situations where the model is too simple for the data. In this case, the data can be represented fairly well with a simple linear model.

In the following example, we will work on the same problem as before, using the scikit-learn library to fit polynomials of increasingly high degree to the data. Going beyond the normal threshold of a quadratic function, we will see how the function tries to fit every wrinkle in the data, but when we extrapolate, the values outside the normal range are clearly out of range:

    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline

    ix = iris2['petal_width']
    iy = iris2['petal_length']

    # generate points used to represent the fitted function
    x_plot = np.linspace(0, 2.6, 100)

    # create matrix versions of these arrays
    # (.values is used because newer pandas versions reject Series[:, np.newaxis])
    X = ix.values[:, np.newaxis]
    X_plot = x_plot[:, np.newaxis]

    plt.scatter(ix, iy, s=30, marker='o', label="training points")

    for count, degree in enumerate([3, 6, 20]):
        model = make_pipeline(PolynomialFeatures(degree), Ridge())
        model.fit(X, iy)
        y_plot = model.predict(X_plot)
        plt.plot(x_plot, y_plot, label="degree %d" % degree)

    plt.legend(loc='upper left')
    plt.show()

The combined graph shows how the different polynomials describe the data population in different ways. The degree-20 polynomial clearly adjusts very closely to the training dataset, but beyond the known values it diverges almost spectacularly, going against the goal of generalizing to future data.
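The effect described above can also be measured rather than only plotted. The following is a minimal sketch, not the article's method: it swaps the Ridge pipeline for NumPy's plain least-squares Polynomial.fit and uses synthetic data in place of the Iris measurements. The slope, intercept, noise level, and the every-third-point hold-out are all assumptions chosen for illustration. It prints the mean squared error on the training points and on held-out points for several degrees.

```python
import numpy as np

# Synthetic stand-in for the petal data (assumption: the slope, intercept and
# noise level are illustrative values, not measurements from Iris).
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 2.5, size=60)
y = 2.2 * x + 1.1 + rng.normal(scale=0.4, size=x.size)

# Hold out every third point so the error can be measured on unseen data.
test_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]


def mse(model, xs, ys):
    """Mean squared error of a fitted polynomial on the points (xs, ys)."""
    return float(np.mean((model(xs) - ys) ** 2))


train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    # Polynomial.fit rescales x internally, which keeps higher degrees stable.
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = mse(model, x_train, y_train)
    test_mse[degree] = mse(model, x_test, y_test)
    print(degree, train_mse[degree], test_mse[degree])
```

Because each higher-degree polynomial contains the lower-degree ones as special cases, the training error can only stay the same or decrease as the degree grows. The held-out error is the number to watch when judging whether a more complex model is overfitting.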
Figure (output of the scikit-learn polynomial example): Curve fitting of the initial dataset, with polynomials of increasing degree.

With this, we have explored how to develop a linear regression model in Python and how to make predictions using the designed model. We've reviewed ways to identify and optimize the correlation between the prediction and the expected output using simple, well-defined functions. If you enjoyed our post, check out Machine Learning for Developers to uncover advanced tools for building machine learning applications at your fingertips.
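As a cross-check on the correlation fit used in this article: the slope it computes, beta = r * s_y / s_x, simplifies to cov(x, y) / var(x), which is the ordinary least-squares slope. The sketch below illustrates this on a small hand-made sample. The sample values are hypothetical, and correlation_fit is re-implemented here with np.corrcoef so the block runs on its own; it is not the article's exact code.

```python
import numpy as np


def correlation_fit(x, y):
    """Slope and intercept from the correlation and the sample standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    beta = r * np.std(y, ddof=1) / np.std(x, ddof=1)
    alpha = np.mean(y) - beta * np.mean(x)
    return alpha, beta


# Small hand-made sample (hypothetical values, not taken from the Iris data).
x = [0.2, 0.4, 1.3, 1.5, 2.0, 2.3]
y = [1.4, 1.7, 4.0, 4.5, 5.1, 6.0]

alpha, beta = correlation_fit(x, y)

# np.polyfit with degree 1 returns the least-squares [slope, intercept].
ls_beta, ls_alpha = np.polyfit(x, y, 1)

print(alpha, beta)
print(ls_alpha, ls_beta)
```

Both lines of output should agree up to floating-point rounding, which is a quick way to confirm that a hand-written fitting function behaves as intended before applying it to real data.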

How-To Tutorials - Data

Integrating a D3.js visualization into a simple AngularJS application

How to Perform Iteration on Sets in MDX

Spam Filtering - Natural Language Processing Approach

Basics of Spark SQL and its components

The hands-on guide to Machine Learning with R by Brett Lantz

IBM SPSS Modeler – Pushing the Limits

FPGA Mining

Introducing Interactive Plotting

Reconstructing 3D Scenes

Getting started with Haskell

Trending Topics

Oracle Wallet Manager

Amazon DynamoDB - Modelling relationships, Error handling

Roger McNamee on Silicon Valley’s obsession for building “data voodoo dolls”

Administration rights for Power BI users

Building a Linear Regression Model in Python for developers
