Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-understanding-hbase-ecosystem
Packt
24 Nov 2014
11 min read
Save for later

Understanding the HBase Ecosystem

Packt
24 Nov 2014
11 min read
This article by Shashwat Shriparv, author of the book, Learning HBase, will introduce you to the world of HBase. (For more resources related to this topic, see here.) HBase is a horizontally scalable, distributed, open source, and a sorted map database. It runs on top of Hadoop file system that is Hadoop Distributed File System (HDFS). HBase is a NoSQL nonrelational database that doesn't always require a predefined schema. It can be seen as a scaling flexible, multidimensional spreadsheet where any structure of data is fit with on-the-fly addition of new column fields, and fined column structure before data can be inserted or queried. In other words, HBase is a column-based database that runs on top of Hadoop distributed file system and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and more flexible schema. HBase is modeled on Google BigTable. It was inspired by Google BigTable, which is a compressed, high-performance, proprietary data store built on the Google filesystem. HBase was developed as a Hadoop subproject to support storage of structural data, which can take advantage of most distributed files systems (typically, the Hadoop Distributed File System known as HDFS). The following table contains key information about HBase and its features: Features Description Developed by Apache Written in Java Type Column oriented License Apache License Lacking features of relational databases SQL support, relations, primary, foreign, and unique key constraints, normalization Website http://hbase.apache.org Distributions Apache, Cloudera Download link http://mirrors.advancedhosters.com/apache/hbase/ Mailing lists The user list: user-subscribe@hbase.apache.org The developer list: dev-subscribe@hbase.apache.org Blog http://blogs.apache.org/hbase/ HBase layout on top of Hadoop The following figure represents the layout information of HBase on top of Hadoop: There is more than one ZooKeeper in the setup, which provides high availability of master status; a RegionServer may contain multiple rations. The RegionServers run on the machines where DataNodes run. There can be as many RegionServers as DataNodes. RegionServers can have multiple HRegions; one HRegion can have one HLog and multiple HFiles with its associate's MemStore. HBase can be seen as a master-slave database where the master is called HMaster, which is responsible for coordination between client application and HRegionServer. It is also responsible for monitoring and recording metadata changes and management. Slaves are called HRegionServers, which serve the actual tables in form of regions. These regions are the basic building blocks of the HBase tables, which contain distribution of tables. So, HMaster and RegionServer work in coordination to serve the HBase tables and HBase cluster. Usually, HMaster is co-hosted with Hadoop NameNode daemon process on a server and communicates to DataNode daemon for reading and writing data on HDFS. The RegionServer runs or is co-hosted on the Hadoop DataNodes. Comparing architectural differences between RDBMs and HBase Let's list the major differences between relational databases and HBase: Relational databases HBase Uses tables as databases Uses regions as databases Filesystems supported are FAT, NTFS, and EXT Filesystem supported is HDFS The technique used to store logs is commit logs The technique used to store logs is Write-Ahead Logs (WAL) The reference system used is coordinate system The reference system used is ZooKeeper Uses the primary key Uses the row key Partitioning is supported Sharding is supported Use of rows, columns, and cells Use of rows, column families, columns, and cells HBase features Let's see the major features of HBase that make it one of the most useful databases for the current and future industry: Automatic failover and load balancing: HBase runs on top of HDFS, which is internally distributed and automatically recovered using multiple block allocation and replications. It works with multiple HMasters and region servers. This failover is also facilitated using HBase and RegionServer replication. Automatic sharding: An HBase table is made up of regions that are hosted by RegionServers and these regions are distributed throughout the RegionServers on different DataNodes. HBase provides automatic and manual splitting of these regions to smaller subregions, once it reaches a threshold size to reduce I/O time and overhead. Hadoop/HDFS integration: It's important to note that HBase can run on top of other filesystems as well. While HDFS is the most common choice as it supports data distribution and high availability using distributed Hadoop, for which we just need to set some configuration parameters and enable HBase to communicate to Hadoop, an out-of-the-box underlying distribution is provided by HDFS. Real-time, random big data access: HBase uses log-structured merge-tree (LSM-tree) as data storage architecture internally, which merges smaller files to larger files periodically to reduce disk seeks. MapReduce: HBase has a built-in support of Hadoop MapReduce framework for fast and parallel processing of data stored in HBase. You can search for the Package org.apache.hadoop.hbase.mapreduce for more details. Java API for client access: HBase has a solid Java API support (client/server) for easy development and programming. Thrift and a RESTtful web service: HBase not only provides a thrift and RESTful gateway but also web service gateways for integrating and accessing HBase besides Java code (HBase Java APIs) for accessing and working with HBase. Support for exporting metrics via the Hadoop metrics subsystem: HBase provides Java Management Extensions (JMX) and exporting matrix for monitoring purposes with tools such as Ganglia and Nagios. Distributed: HBase works when used with HDFS. It provides coordination with Hadoop so that distribution of tables, high availability, and consistency is supported by it. Linear scalability (scale out): Scaling of HBase is not scale up but scale out, which means that we don't need to make servers more powerful but we add more machines to its cluster. We can add more nodes to the cluster on the fly. As soon as a new RegionServer node is up, the cluster can begin rebalancing, start the RegionServer on the new node, and it is scaled up, it is as simple as that. Column oriented: HBase stores each column separately in contrast with most of the relational databases, which uses stores or are row-based storage. So in HBase, columns are stored contiguously and not the rows. More about row- and column-oriented databases will follow. HBase shell support: HBase provides a command-line tool to interact with HBase and perform simple operations such as creating tables, adding data, and scanning data. This also provides full-fledged command-line tool using which we can interact with HBase and perform operations such as creating table, adding data, removing data, and a few other administrative commands. Sparse, multidimensional, sorted map database: HBase is a sparse, multidimensional, sorted map-based database, which supports multiple versions of the same record. Snapshot support: HBase supports taking snapshots of metadata for getting the previous or correct state form of data. HBase in the Hadoop ecosystem Let's see where HBase sits in the Hadoop ecosystem. In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem: HBase can work as a separate entity on the local filesystem (which is not really effective as no distribution is provided) as well as in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services, a distributed files system (HDFS) for storage and a MapReduce framework for processing in a parallel mode. When there was a need to store structured data (data in the form of tables, rows and columns), which most of the programmers are already familiar with, the programmers were finding it difficult to process the data that was stored on HDFS as an unstructured flat file format. This led to the evolution of HBase, which provided a way to store data in a structural way. Consider that we have got a CSV file stored on HDFS and we need to query from it. We would need to write a Java code for this, which wouldn't be a good option. It would be better if we could specify the data key and fetch the data from that file. So, what we can do here is create a schema or table with the same structure of CSV file to store the data of the CSV file in the HBase table and query using HBase APIs, or HBase shell using key. Data representation in HBase Let's look into the representation of rows and columns in HBase table: An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys to identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or the data. So, we have been through the introduction of HBase; now, let's see what Hadoop and its components are in brief. It is assumed here that you are already familiar with Hadoop; if not, following a brief introduction about Hadoop will help you to understand it. Hadoop Hadoop is an underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework which supports large dataset storage. It provides distributed file system and MapReduce, which is a distributed programming framework. It provides a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on a system with tens to tens of thousands of nodes. The underlying distributed file system provides large-scale storage, rapid data access. It has the following submodules: Hadoop Common: This is the core component that supports the other Hadoop modules. It is like the master components facilitating communication and coordination between different Hadoop modules. Hadoop distributed file system: This is the underlying distributed file system, which is abstracted on the top of the local filesystem that provides high throughput of read and write operations of data on Hadoop. Hadoop YARN: This is the new framework that is shipped with newer releases of Hadoop. It provides job scheduling and job and resource management. Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large data and datasets. Other Hadoop subprojects are HBase, Hive, Ambari, Avro, Cassandra (Cassandra isn't a Hadoop subproject, it's a related project; they solve similar problems in different ways), Mahout, Pig, Spark, ZooKeeper (ZooKeeper isn't a Hadoop subproject. It's a dependency shared by many distributed systems), and so on. All of these have different usability and the combination of all these subprojects forms the Hadoop ecosystem. Core daemons of Hadoop The following are the core daemons of Hadoop: NameNode: This stores and manages all metadata about the data present on the cluster, so it is the single point of contact to Hadoop. In the new release of Hadoop, we have an option of more than one NameNode for high availability. JobTracker: This runs on the NameNode and performs the MapReduce of the jobs submitted to the cluster. SecondaryNameNode: This maintains the backup of metadata present on the NameNode, and also records the file system changes. DataNode: This will contain the actual data. TaskTracker: This will perform tasks on the local data assigned by the JobTracker. The preceding are the daemons in the case of Hadoop v1 or earlier. In newer versions of Hadoop, we have ResourceManager instead of JobTracker, the node manager instead of TaskTrackers, and the YARN framework instead of a simple MapReduce framework. The following is the comparison between daemons in Hadoop 1 and Hadoop 2: Hadoop 1 Hadoop 2 HDFS NameNode Secondary NameNode DataNode   NameNode (more than one active/standby) Checkpoint node DataNode Processing MapReduce v1 JobTracker TaskTracker   YARN (MRv2) ResourceManager NodeManager Application Master Comparing HBase with Hadoop As we now know what HBase and what Hadoop are, let's have a comparison between HDFS and HBase for better understanding: Hadoop/HDFS HBase This provide filesystem for distributed storage This provides tabular column-oriented data storage This is optimized for storage of huge-sized files with no random read/write of these files This is optimized for tabular data with random read/write facility This uses flat files This uses key-value pairs of data The data model is not flexible Provides a flexible data model This uses file system and processing framework This uses tabular storage with built-in Hadoop MapReduce support This is mostly optimized for write-once read-many This is optimized for both read/write many Summary So in this article, we discussed the introductory aspects of HBase and it's features. We have also discussed HBase's components and their place in the HBase ecosystem. Resources for Article: Further resources on this subject: The HBase's Data Storage [Article] HBase Administration, Performance Tuning [Article] Comparative Study of NoSQL Products [Article]
Read more
  • 0
  • 0
  • 15533

article-image-plot-function
Packt
18 Nov 2014
17 min read
Save for later

The plot function

Packt
18 Nov 2014
17 min read
In this article by L. Felipe Martins, the author of the book, IPython Notebook Essentials, has discussed about the plot() function, which is an important aspect of matplotlib, an IPython library for production of publication-quality graphs. (For more resources related to this topic, see here.) The plot() function is the workhorse of the matplotlib library. In this section, we will explore the line-plotting and formatting capabilities included in this function. To make things a bit more concrete, let's consider the formula for logistic growth, as follows: This model is frequently used to represent growth that shows an initial exponential phase, and then is eventually limited by some factor. The examples are the population in an environment with limited resources and new products and/or technological innovations, which initially attract a small and quickly growing market but eventually reach a saturation point. A common strategy to understand a mathematical model is to investigate how it changes as the parameters defining it are modified. Let's say, we want to see what happens to the shape of the curve when the parameter b changes. To be able to do what we want more efficiently, we are going to use a function factory. This way, we can quickly create logistic models with arbitrary values for r, a, b, and c. Run the following code in a cell: def make_logistic(r, a, b, c):    def f_logistic(t):        return a / (b + c * exp(-r * t))    return f_logistic The function factory pattern takes advantage of the fact that functions are first-class objects in Python. This means that functions can be treated as regular objects: they can be assigned to variables, stored in lists in dictionaries, and play the role of arguments and/or return values in other functions. In our example, we define the make_logistic() function, whose output is itself a Python function. Notice how the f_logistic() function is defined inside the body of make_logistic() and then returned in the last line. Let's now use the function factory to create three functions representing logistic curves, as follows: r = 0.15 a = 20.0 c = 15.0 b1, b2, b3 = 2.0, 3.0, 4.0 logistic1 = make_logistic(r, a, b1, c) logistic2 = make_logistic(r, a, b2, c) logistic3 = make_logistic(r, a, b3, c) In the preceding code, we first fix the values of r, a, and c, and define three logistic curves for different values of b. The important point to notice is that logistic1, logistic2, and logistic3 are functions. So, for example, we can use logistic1(2.5) to compute the value of the first logistic curve at the time 2.5. We can now plot the functions using the following code: tmax = 40 tvalues = linspace(0, tmax, 300) plot(tvalues, logistic1(tvalues)) plot(tvalues, logistic2(tvalues)) plot(tvalues, logistic3(tvalues)) The first line in the preceding code sets the maximum time value, tmax, to be 40. Then, we define the set of times at which we want the functions evaluated with the assignment, as follows: tvalues = linspace(0, tmax, 300) The linspace() function is very convenient to generate points for plotting. The preceding code creates an array of 300 equally spaced points in the interval from 0 to tmax. Note that, contrary to other functions, such as range() and arange(), the right endpoint of the interval is included by default. (To exclude the right endpoint, use the endpoint=False option.) After defining the array of time values, the plot() function is called to graph the curves. In its most basic form, it plots a single curve in a default color and line style. In this usage, the two arguments are two arrays. The first array gives the horizontal coordinates of the points being plotted, and the second array gives the vertical coordinates. A typical example will be the following function call: plot(x,y) The variables x and y must refer to NumPy arrays (or any Python iterable values that can be converted into an array) and must have the same dimensions. The points plotted have coordinates as follows: x[0], y[0] x[1], y[1] x[2], y[2] … The preceding command will produce the following plot, displaying the three logistic curves: You may have noticed that before the graph is displayed, there is a line of text output that looks like the following: [<matplotlib.lines.Line2D at 0x7b57c50>] This is the return value of the last call to the plot() function, which is a list (or with a single element) of objects of the Line2D type. One way to prevent the output from being shown is to enter None as the last row in the cell. Alternatively, we can assign the return value of the last call in the cell to a dummy variable: _dummy_ = plot(tvalues, logistic3(tvalues)) The plot() function supports plotting several curves in the same function call. We need to change the contents of the cell that are shown in the following code and run it again: tmax = 40 tvalues = linspace(0, tmax, 300) plot(tvalues, logistic1(tvalues),      tvalues, logistic2(tvalues),      tvalues, logistic3(tvalues)) This form saves some typing but turns out to be a little less flexible when it comes to customizing line options. Notice that the text output produced now is a list with three elements: [<matplotlib.lines.Line2D at 0x9bb6cc0>, <matplotlib.lines.Line2D at 0x9bb6ef0>, <matplotlib.lines.Line2D at 0x9bb9518>] This output can be useful in some instances. For now, we will stick with using one call to plot() for each curve, since it produces code that is clearer and more flexible. Let's now change the line options in the plot and set the plot bounds. Change the contents of the cell to read as follows: plot(tvalues, logistic1(tvalues),      linewidth=1.5, color='DarkGreen', linestyle='-') plot(tvalues, logistic2(tvalues),      linewidth=2.0, color='#8B0000', linestyle=':') plot(tvalues, logistic3(tvalues),      linewidth=3.5, color=(0.0, 0.0, 0.5), linestyle='--') axis([0, tmax, 0, 11.]) None Running the preceding command lines will produce the following plots: The options set in the preceding code are as follows: The first curve is plotted with a line width of 1.5, with the HTML color of DarkGreen, and a filled-line style The second curve is plotted with a line width of 2.0, colored with the RGB value given by the hexadecimal string '#8B0000', and a dotted-line style The third curve is plotted with a line width of 3.0, colored with the RGB components, (0.0, 0.0, 0.5), and a dashed-line style Notice that there are different ways of specifying a fixed color: a HTML color name, a hexadecimal string, or a tuple of floating-point values. In the last case, the entries in the tuple represent the intensity of the red, green, and blue colors, respectively, and must be floating-point values between 0.0 and 1.0. A complete list of HTML name colors can be found at http://www.w3schools.com/html/html_colornames.asp. Editor's Tip: For more insights on colors, check out https://dgtl.link/colors Line styles are specified by a symbolic string. The allowed values are shown in the following table: Symbol string Line style '-' Solid (the default) '--' Dashed ':' Dotted '-.' Dash-dot 'None', '', or '' Not displayed After the calls to plot(), we set the graph bounds with the function call: axis([0, tmax, 0, 11.]) The argument to axis() is a four-element list that specifies, in this order, the maximum and minimum values of the horizontal coordinates, and the maximum and minimum values of the vertical coordinates. It may seem non-intuitive that the bounds for the variables are set after the plots are drawn. In the interactive mode, matplotlib remembers the state of the graph being constructed, and graphics objects are updated in the background after each command is issued. The graph is only rendered when all computations in the cell are done so that all previously specified options take effect. Note that starting a new cell clears all the graph data. This interactive behavior is part of the matplotlib.pyplot module, which is one of the components imported by pylab. Besides drawing a line connecting the data points, it is also possible to draw markers at specified points. Change the graphing commands indicated in the following code snippet, and then run the cell again: plot(tvalues, logistic1(tvalues),      linewidth=1.5, color='DarkGreen', linestyle='-',      marker='o', markevery=50, markerfacecolor='GreenYellow',      markersize=10.0) plot(tvalues, logistic2(tvalues),      linewidth=2.0, color='#8B0000', linestyle=':',      marker='s', markevery=50, markerfacecolor='Salmon',      markersize=10.0) plot(tvalues, logistic3(tvalues),      linewidth=2.0, color=(0.0, 0.0, 0.5), linestyle='--',      marker = '*', markevery=50, markerfacecolor='SkyBlue',      markersize=12.0) axis([0, tmax, 0, 11.]) None Now, the graph will look as shown in the following figure: The only difference from the previous code is that now we added options to draw markers. The following are the options we use: The marker option specifies the shape of the marker. Shapes are given as symbolic strings. In the preceding examples, we use 'o' for a circular marker, 's' for a square, and '*' for a star. A complete list of available markers can be found at http://matplotlib.org/api/markers_api.html#module-matplotlib.markers. The markevery option specifies a stride within the data points for the placement of markers. In our example, we place a marker after every 50 data points. The markercolor option specifies the color of the marker. The markersize option specifies the size of the marker. The size is given in pixels. There are a large number of other options that can be applied to lines in matplotlib. A complete list is available at http://matplotlib.org/api/artist_api.html#module-matplotlib.lines. Adding a title, labels, and a legend The next step is to add a title and labels for the axes. Just before the None line, add the following three lines of code to the cell that creates the graph: title('Logistic growth: a={:5.2f}, c={:5.2f}, r={:5.2f}'.format(a, c, r)) xlabel('$t$') ylabel('$N(t)=a/(b+ce^{-rt})$') In the first line, we call the title() function to set the title of the plot. The argument can be any Python string. In our example, we use a formatted string: title('Logistic growth: a={:5.2f}, b={:5.2f}, r={:5.2f}'.format(a, c, r)) We use the format() method of the string class. The formats are placed between braces, as in {:5.2f}, which specifies a floating-point format with five spaces and two digits of precision. Each of the format specifiers is then associated sequentially with one of the data arguments of the method. A full documentation covering the details of string formatting is available at https://docs.python.org/2/library/string.html. The axis labels are set in the calls: xlabel('$t$') ylabel('$N(t)=a/(b+ce^{-rt})$') As in the title() functions, the xlabel() and ylabel() functions accept any Python string. Note that in the '$t$' and '$N(t)=a/(b+ce^{-rt}$' strings, we use LaTeX to format the mathematical formulas. This is indicated by the dollar signs, $...$, in the string. After the addition of a title and labels, our graph looks like the following: Next, we need a way to identify each of the curves in the picture. One way to do that is to use a legend, which is indicated as follows: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)]) The legend() function accepts a list of strings. Each string is associated with a curve in the order they are added to the plot. Notice that we are again using formatted strings. Unfortunately, the preceding code does not produce great results. The legend, by default, is placed in the top-right corner of the plot, which, in this case, hides part of the graph. This is easily fixed using the loc option in the legend function, as shown in the following code: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)], loc='upper left') Running this code, we obtain the final version of our logistic growth plot, as follows: The legend location can be any of the strings: 'best', 'upper right', 'upper left', 'lower left', 'lower right', 'right', 'center left', 'center right', 'lower center', 'upper center', and 'center'. It is also possible to specify the location of the legend precisely with the bbox_to_anchor option. To see how this works, modify the code for the legend as follows: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)], bbox_to_anchor=(0.9,0.35)) Notice that the bbox_to_anchor option, by default, uses a coordinate system that is not the same as the one we specified for the plot. The x and y coordinates of the box in the preceding example are interpreted as a fraction of the width and height, respectively, of the whole figure. A little trial-and-error is necessary to place the legend box precisely where we want it. Note that the legend box can be placed outside the plot area. For example, try the coordinates (1.32,1.02). The legend() function is quite flexible and has quite a few other options that are documented at http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend. Text and annotations In this subsection, we will show how to add annotations to plots in matplotlib. We will build a plot demonstrating the fact that the tangent to a curve must be horizontal at the highest and lowest points. We start by defining the function associated with the curve and the set of values at which we want the curve to be plotted, which is shown in the following code: f = lambda x: (x**3 - 6*x**2 + 9*x + 3) / (1 + 0.25*x**2) xvalues = linspace(0, 5, 200) The first line in the preceding code uses a lambda expression to define the f() function. We use this approach here because the formula for the function is a simple, one-line expression. The general form of a lambda expression is as follows: lambda <arguments> : <return expression> This expression by itself creates an anonymous function that can be used in any place that a function object is expected. Note that the return value must be a single expression and cannot contain any statements. The formula for the function may seem unusual, but it was chosen by trial-and-error and a little bit of calculus so that it produces a nice graph in the interval from 0 to 5. The xvalues array is defined to contain 200 equally spaced points on this interval. Let's create an initial plot of our curve, as shown in the following code: plot(xvalues, f(xvalues), lw=2, color='FireBrick') axis([0, 5, -1, 8]) grid() xlabel('$x$') ylabel('$f(x)$') title('Extreme values of a function') None # Prevent text output Most of the code in this segment is explained in the previous section. The only new bit is that we use the grid() function to draw a grid. Used with no arguments, the grid coincides with the tick marks on the plot. As everything else in matplotlib, grids are highly customizable. Check the documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.grid. When the preceding code is executed, the following plot is produced: Note that the curve has a highest point (maximum) and a lowest point (minimum). These are collectively called the extreme values of the function (on the displayed interval, this function actually grows without bounds as x becomes large). We would like to locate these on the plot with annotations. We will first store the relevant points as follows: x_min = 3.213 f_min = f(x_min) x_max = 0.698 f_max = f(x_max) p_min = array([x_min, f_min]) p_max = array([x_max, f_max]) print p_min print p_max The variables, x_min and f_min, are defined to be (approximately) the coordinates of the lowest point in the graph. Analogously, x_max and f_max represent the highest point. Don't be concerned with how these points were found. For the purposes of graphing, even a rough approximation by trial-and-error would suffice. Now, add the following code to the cell that draws the plot, right below the title() command, as shown in the following code: arrow_props = dict(facecolor='DimGray', width=3, shrink=0.05,              headwidth=7) delta = array([0.1, 0.1]) offset = array([1.0, .85]) annotate('Maximum', xy=p_max+delta, xytext=p_max+offset,          arrowprops=arrow_props, verticalalignment='bottom',          horizontalalignment='left', fontsize=13) annotate('Minimum', xy=p_min-delta, xytext=p_min-offset,          arrowprops=arrow_props, verticalalignment='top',          horizontalalignment='right', fontsize=13) Run the cell to produce the plot shown in the following diagram: In the code, start by assigning the variables arrow_props, delta, and offset, which will be used to set the arguments in the calls to annotate(). The annotate() function adds a textual annotation to the graph with an optional arrow indicating the point being annotated. The first argument of the function is the text of the annotation. The next two arguments give the locations of the arrow and the text: xy: This is the point being annotated and will correspond to the tip of the arrow. We want this to be the maximum/minimum points, p_min and p_max, but we add/subtract the delta vector so that the tip is a bit removed from the actual point. xytext: This is the point where the text will be placed as well as the base of the arrow. We specify this as offsets from p_min and p_max using the offset vector. All other arguments of annotate() are formatting options: arrowprops: This is a Python dictionary containing the arrow properties. We predefine the dictionary, arrow_props, and use it here. Arrows can be quite sophisticated in matplotlib, and you are directed to the documentation for details. verticalalignment and horizontalalignment: These specify how the arrow should be aligned with the text. fontsize: This signifies the size of the text. Text is also highly configurable, and the reader is directed to the documentation for details. The annotate() function has a huge number of options; for complete details of what is available, users should consult the documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.annotate for the full details. We now want to add a comment for what is being demonstrated by the plot by adding an explanatory textbox. Add the following code to the cell right after the calls to annotate(): bbox_props = dict(boxstyle='round', lw=2, fc='Beige') text(2, 6, 'Maximum and minimum pointsnhave horizontal tangents',      bbox=bbox_props, fontsize=12, verticalalignment='top') The text()function is used to place text at an arbitrary position of the plot. The first two arguments are the position of the textbox, and the third argument is a string containing the text to be displayed. Notice the use of 'n' to indicate a line break. The other arguments are configuration options. The bbox argument is a dictionary with the options for the box. If omitted, the text will be displayed without any surrounding box. In the example code, the box is a rectangle with rounded corners, with a border width of 2 pixels and the face color, beige. As a final detail, let's add the tangent lines at the extreme points. Add the following code: plot([x_min-0.75, x_min+0.75], [f_min, f_min],      color='RoyalBlue', lw=3) plot([x_max-0.75, x_max+0.75], [f_max, f_max],      color='RoyalBlue', lw=3) Since the tangents are segments of straight lines, we simply give the coordinates of the endpoints. The reason to add the code for the tangents at the top of the cell is that this causes them to be plotted first so that the graph of the function is drawn at the top of the tangents. This is the final result: The examples we have seen so far only scratch the surface of what is possible with matplotlib. The reader should read the matplotlib documentation for more examples. Summary In this article, we learned how to use matplotlib to produce presentation-quality plots. We covered two-dimensional plots and how to set plot options, and annotate and configure plots. You also learned how to add labels, titles, and legends. Edited on July 27, 2018 to replace a broken reference link. Resources for Article: Further resources on this subject: Installing NumPy, SciPy, matplotlib, and IPython [Article] SciPy for Computational Geometry [Article] Fast Array Operations with NumPy [Article]
Read more
  • 0
  • 0
  • 10842

article-image-hbases-data-storage
Packt
13 Nov 2014
9 min read
Save for later

The HBase's Data Storage

Packt
13 Nov 2014
9 min read
In this article by Nishant Garg author of HBase Essentials, we will look at HBase's data storage from its architectural view point. (For more resources related to this topic, see here.) For most of the developers or users, the preceding topics are not of big interest, but for an administrator, it really makes sense to understand how underlying data is stored or replicated within HBase. Administrators are the people who deal with HBase, starting from its installation to cluster management (performance tuning, monitoring, failure, recovery, data security and so on). Let's start with data storage in HBase first. Data storage In HBase, tables are split into smaller chunks that are distributed across multiple servers. These smaller chunks are called regions and the servers that host regions are called RegionServers. The master process handles the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. In HBase implementation, the HRegionServer and HRegion classes represent the region server and the region, respectively. HRegionServer contains the set of HRegion instances available to the client and handles two types of files for data storage: HLog (the write-ahead log file, also known as WAL) HFile (the real data storage file) In HBase, there is a system-defined catalog table called hbase:meta that keeps the list of all the regions for user-defined tables. In older versions prior to 0.96.0, HBase had two catalog tables called-ROOT- and .META. The -ROOT- table was used to keep track of the location of the .META table. Version 0.96.0 onwards, the -ROOT- table is removed. The .META table is renamed as hbase:meta. Now, the location of .META is stored in Zookeeper. The following is the structure of the hbase:meta table. Key—the region key of the format ([table],[region start key],[region id]). A region with an empty start key is the first region in a table. The values are as follows: info:regioninfo(serialized the HRegionInfo instance for this region) info:server(server:port of the RegionServer containing this region) info:serverstartcode(start time of the RegionServer process that contains this region) When the table is split, two new columns will be created as info:splitA and info:splitB. These columns represent the two newly created regions. The values for these columns are also serialized as HRegionInfo instances. Once the split process is complete, the row that contains the old region information is deleted. In the case of data reading, the client application first connects to ZooKeeper and looks up the location of the hbase:meta table. For the next client, the HTable instance queries the hbase:meta table and finds out the region that contains the rows of interest and also locate the region server that is serving the identified region. The information about the region and region server is then cached by the client application for future interactions and avoids the lookup process. If the region is reassigned by the load balancer process or if the region server has expired, fresh lookup is done on the hbase:meta catalog table to get the new location of the user table region and cache is updated accordingly. At the object level, the HRegionServer class is responsible to create a connection with the region by creating HRegion objects. This HRegion instance sets up a store instance that has one or more StoreFile instances (wrapped around HFile) and MemStore. MemStore accumulates the data edits as it happens and buffers them into the memory. This is also important for accessing the recent edits of table data. As shown in the preceding diagram, the HRegionServer instance (the region server) contains the map of HRegion instances (regions) and also has an HLog instance that represents the WAL. There is a single block cache instance at the region-server level, which holds data from all the regions hosted on that region server. A block cache instance is created at the time of the region server startup and it can have an implementation of LruBlockCache, SlabCache, or BucketCache. The block cache also supports multilevel caching; that is, a block cache might have first-level cache, L1, as LruBlockCache and second-level cache, L2, as SlabCache or BucketCache. All these cache implementations have their own way of managing the memory; for example, LruBlockCache is like a data structure and resides on the JVM heap whereas the other two types of implementation also use memory outside of the JVM heap. HLog (the write-ahead log – WAL) In the case of writing the data, when the client calls HTable.put(Put), the data is first written to the write-ahead log file (which contains actual data and sequence number together represented by the HLogKey class) and also written in MemStore. Writing data directly into MemStrore can be dangerous as it is a volatile in-memory buffer and always open to the risk of losing data in case of a server failure. Once MemStore is full, the contents of the MemStore are flushed to the disk by creating a new HFile on the HDFS. While inserting data from the HBase shell, the flush command can be used to write the in-memory (memstore) data to the store files. If there is a server failure, the WAL can effectively retrieve the log to get everything up to where the server was prior to the crash failure. Hence, the WAL guarantees that the data is never lost. Also, as another level of assurance, the actual write-ahead log resides on the HDFS, which is a replicated filesystem. Any other server having a replicated copy can open the log. The HLog class represents the WAL. When an HRegion object is instantiated, the single HLog instance is passed on as a parameter to the constructor of HRegion. In the case of an update operation, it saves the data directly to the shared WAL and also keeps track of the changes by incrementing the sequence numbers for each edits. WAL uses a Hadoop SequenceFile, which stores records as sets of key-value pairs. Here, the HLogKey instance represents the key, and the key-value represents the rowkey, column family, column qualifier, timestamp, type, and value along with the region and table name where data needs to be stored. Also, the structure starts with two fixed-length numbers that indicate the size and value of the key. The following diagram shows the structure of a key-value pair: The WALEdit class instance takes care of atomicity at the log level by wrapping each update. For example, in the case of a multicolumn update for a row, each column is represented as a separate KeyValue instance. If the server fails after updating few columns to the WAL, it ends up with only a half-persisted row and the remaining updates are not persisted. Atomicity is guaranteed by wrapping all updates that comprise multiple columns into a single WALEdit instance and writing it in a single operation. For durability, a log writer's sync() method is called, which gets the acknowledgement from the low-level filesystem on each update. This method also takes care of writing the WAL to the replication servers (from one datanode to another). The log flush time can be set to as low as you want, or even be kept in sync for every edit to ensure high durability but at the cost of performance. To take care of the size of the write ahead log file, the LogRoller instance runs as a background thread and takes care of rolling log files at certain intervals (the default is 60 minutes). Rolling of the log file can also be controlled based on the size and hbase.regionserver.logroll.multiplier. It rotates the log file when it becomes 90 percent of the block size, if set to 0.9. HFile (the real data storage file) HFile represents the real data storage file. The files contain a variable number of data blocks and fixed number of file info blocks and trailer blocks. The index blocks records the offsets of the data and meta blocks. Each data block contains a magic header and a number of serialized KeyValue instances. The default size of the block is 64 KB and can be as large as the block size. Hence, the default block size for files in HDFS is 64 MB, which is 1,024 times the HFile default block size but there is no correlation between these two blocks. Each key-value in the HFile is represented as a low-level byte array. Within the HBase root directory, we have different files available at different levels. Write-ahead log files represented by the HLog instances are created in a directory called WALs under the root directory defined by the hbase.rootdir property in hbase-site.xml. This WALs directory also contains a subdirectory for each HRegionServer. In each subdirectory, there are several write-ahead log files (because of log rotation). All regions from that region server share the same HLog files. In HBase, every table also has its own directory created under the data/default directory. This data/default directory is located under the root directory defined by the hbase.rootdir property in hbase-site.xml. Each table directory contains a file called .tableinfo within the .tabledesc folder. This .tableinfo file stores the metadata information about the table, such as table and column family schemas, and is represented as the serialized HTableDescriptor class. Each table directory also has a separate directory for every region comprising the table, and the name of this directory is created using the MD5 hash portion of a region name. The region directory also has a .regioninfo file that contains the serialized information of the HRegionInfo instance for the given region. Once the region exceeds the maximum configured region size, it splits and a matching split directory is created within the region directory. This size is configured using the hbase.hregion.max.filesize property or the configuration done at the column-family level using the HColumnDescriptor instance. In the case of multiple flushes by the MemStore, the number of files might get increased on this disk. The compaction process running in the background combines the files to the largest configured file size and also triggers region split. Summary In this article, we have learned about the internals of HBase and how it stores the data. Resources for Article: Further resources on this subject: Big Data Analysis [Article] Advanced Hadoop MapReduce Administration [Article] HBase Administration, Performance Tuning [Article]
Read more
  • 0
  • 0
  • 5620

article-image-postmodel-workflow
Packt
04 Nov 2014
23 min read
Save for later

Postmodel Workflow

Packt
04 Nov 2014
23 min read
 This article written by Trent Hauck, the author of scikit-learn Cookbook, Packt Publishing, will cover the following recipes: K-fold cross validation Automatic cross validation Cross validation with ShuffleSplit Stratified k-fold Poor man's grid search Brute force grid search Using dummy estimators to compare results (For more resources related to this topic, see here.) Even though by design the articles are unordered, you could argue by virtue of the art of data science, we've saved the best for last. For the most part, each recipe within this article is applicable to the various models we've worked with. In some ways, you can think about this article as tuning the parameters and features. Ultimately, we need to choose some criteria to determine the "best" model. We'll use various measures to define best. Then in the Cross validation with ShuffleSplit recipe, we will randomize the evaluation across subsets of the data to help avoid overfitting. K-fold cross validation In this recipe, we'll create, quite possibly, the most important post-model validation exercise—cross validation. We'll talk about k-fold cross validation in this recipe. There are several varieties of cross validation, each with slightly different randomization schemes. K-fold is perhaps one of the most well-known randomization schemes. Getting ready We'll create some data and then fit a classifier on the different folds. It's probably worth mentioning that if you can keep a holdout set, then that would be best. For example, we have a dataset where N = 1000. If we hold out 200 data points, then use cross validation between the other 800 points to determine the best parameters. How to do it... First, we'll create some fake data, then we'll examine the parameters, and finally, we'll look at the size of the resulting dataset: >>> N = 1000>>> holdout = 200>>> from sklearn.datasets import make_regression>>> X, y = make_regression(1000, shuffle=True) Now that we have the data, let's hold out 200 points, and then go through the fold scheme like we normally would: >>> X_h, y_h = X[:holdout], y[:holdout]>>> X_t, y_t = X[holdout:], y[holdout:]>>> from sklearn.cross_validation import KFold K-fold gives us the option of choosing how many folds we want, if we want the values to be indices or Booleans, if want to shuffle the dataset, and finally, the random state (this is mainly for reproducibility). Indices will actually be removed in later versions. It's assumed to be True. Let's create the cross validation object: >>> kfold = KFold(len(y_t), n_folds=4) Now, we can iterate through the k-fold object: >>> output_string = "Fold: {}, N_train: {}, N_test: {}">>> for i, (train, test) in enumerate(kfold):       print output_string.format(i, len(y_t[train]),       len(y_t[test]))Fold: 0, N_train: 600, N_test: 200Fold: 1, N_train: 600, N_test: 200Fold: 2, N_train: 600, N_test: 200Fold: 3, N_train: 600, N_test: 200 Each iteration should return the same split size. How it works... It's probably clear, but k-fold works by iterating through the folds and holds out 1/n_folds * N, where N for us was len(y_t). From a Python perspective, the cross validation objects have an iterator that can be accessed by using the in operator. Often times, it's useful to write a wrapper around a cross validation object that will iterate a subset of the data. For example, we may have a dataset that has repeated measures for data points or we may have a dataset with patients and each patient having measures. We're going to mix it up and use pandas for this part: >>> import numpy as np>>> import pandas as pd>>> patients = np.repeat(np.arange(0, 100, dtype=np.int8), 8)>>> measurements = pd.DataFrame({'patient_id': patients,                   'ys': np.random.normal(0, 1, 800)}) Now that we have the data, we only want to hold out certain customers instead of data points: >>> custids = np.unique(measurements.patient_id)>>> customer_kfold = KFold(custids.size, n_folds=4)>>> output_string = "Fold: {}, N_train: {}, N_test: {}">>> for i, (train, test) in enumerate(customer_kfold):       train_cust_ids = custids[train]       training = measurements[measurements.patient_id.isin(                 train_cust_ids)]       testing = measurements[~measurements.patient_id.isin(                 train_cust_ids)]       print output_string.format(i, len(training), len(testing))Fold: 0, N_train: 600, N_test: 200Fold: 1, N_train: 600, N_test: 200Fold: 2, N_train: 600, N_test: 200Fold: 3, N_train: 600, N_test: 200 Automatic cross validation We've looked at the using cross validation iterators that scikit-learn comes with, but we can also use a helper function to perform cross validation for use automatically. This is similar to how other objects in scikit-learn are wrapped by helper functions, pipeline for instance. Getting ready First, we'll need to create a sample classifier; this can really be anything, a decision tree, a random forest, whatever. For us, it'll be a random forest. We'll then create a dataset and use the cross validation functions. How to do it... First import the ensemble module and we'll get started: >>> from sklearn import ensemble>>> rf = ensemble.RandomForestRegressor(max_features='auto') Okay, so now, let's create some regression data: >>> from sklearn import datasets>>> X, y = datasets.make_regression(10000, 10) Now that we have the data, we can import the cross_validation module and get access to the functions we'll use: >>> from sklearn import cross_validation>>> scores = cross_validation.cross_val_score(rf, X, y)>>> print scores[ 0.86823874 0.86763225 0.86986129] How it works... For the most part, this will delegate to the cross validation objects. One nice thing is that, the function will handle performing the cross validation in parallel. We can activate verbose mode play by play: >>> scores = cross_validation.cross_val_score(rf, X, y, verbose=3, cv=4)[CV] no parameters to be set[CV] no parameters to be set, score=0.872866 - 0.7s[CV] no parameters to be set[CV] no parameters to be set, score=0.873679 - 0.6s[CV] no parameters to be set[CV] no parameters to be set, score=0.878018 - 0.7s[CV] no parameters to be set[CV] no parameters to be set, score=0.871598 - 0.6s[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.7s[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 2.6s finished As we can see, during each iteration, we scored the function. We also get an idea of how long the model runs. It's also worth knowing that we can score our function predicated on which kind of model we're trying to fit. Cross validation with ShuffleSplit ShuffleSplit is one of the simplest cross validation techniques. This cross validation technique will simply take a sample of the data for the number of iterations specified. Getting ready ShuffleSplit is another cross validation technique that is very simple. We'll specify the total elements in the dataset, and it will take care of the rest. We'll walk through an example of estimating the mean of a univariate dataset. This is somewhat similar to resampling, but it'll illustrate one reason why we want to use cross validation while showing cross validation. How to do it... First, we need to create the dataset. We'll use NumPy to create a dataset, where we know the underlying mean. We'll sample half of the dataset to estimate the mean and see how close it is to the underlying mean: >>> import numpy as np>>> true_loc = 1000>>> true_scale = 10>>> N = 1000>>> dataset = np.random.normal(true_loc, true_scale, N)>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.hist(dataset, color='k', alpha=.65, histtype='stepfilled');>>> ax.set_title("Histogram of dataset");>>> f.savefig("978-1-78398-948-5_06_06.png") NumPy will give the following output: Now, let's take the first half of the data and guess the mean: >>> from sklearn import cross_validation>>> holdout_set = dataset[:500]>>> fitting_set = dataset[500:]>>> estimate = fitting_set[:N/2].mean()>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.set_title("True Mean vs Regular Estimate")>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,             alpha=.65, label='true mean')>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,             alpha=.65, label='regular estimate')>>> ax.set_xlim(999, 1001)>>> ax.legend()>>> f.savefig("978-1-78398-948-5_06_07.png") We'll get the following output: Now, we can use ShuffleSplit to fit the estimator on several smaller datasets: >>> from sklearn.cross_validation import ShuffleSplit>>> shuffle_split = ShuffleSplit(len(fitting_set))>>> mean_p = []>>> for train, _ in shuffle_split:       mean_p.append(fitting_set[train].mean())       shuf_estimate = np.mean(mean_p)>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,             alpha=.65, label='true mean')>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,             alpha=.65, label='regular estimate')>>> ax.vlines(shuf_estimate, 0, 1, color='b', linestyles='-', lw=5,             alpha=.65, label='shufflesplit estimate')>>> ax.set_title("All Estimates")>>> ax.set_xlim(999, 1001)>>> ax.legend(loc=3) The output will be as follows: As we can see, we got an estimate that was similar to what we expected, but we were able to take many samples to get that estimate. Stratified k-fold In this recipe, we'll quickly look at stratified k-fold valuation. We've walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is nice because its scheme is specifically designed to maintain the class proportions. Getting ready We're going to create a small dataset. In this dataset, we will then use stratified k-fold validation. We want it small so that we can see the variation. For larger samples. it probably won't be as big of a deal. We'll then plot the class proportions at each step to illustrate how the class proportions are maintained: >>> from sklearn import datasets>>> X, y = datasets.make_classification(n_samples=int(1e3), weights=[1./11]) Let's check the overall class weight distribution: >>> y.mean()0.90300000000000002 Roughly, 90.5 percent of the samples are 1, with the balance 0. How to do it... Let's create a stratified k-fold object and iterate it through each fold. We'll measure the proportion of verse that are 1. After that we'll plot the proportion of classes by the split number to see how and if it changes. This code will hopefully illustrate how this is beneficial. We'll also plot this code against a basic ShuffleSplit: >>> from sklearn import cross_validation>>> n_folds = 50>>> strat_kfold = cross_validation.StratifiedKFold(y,                 n_folds=n_folds)>>> shuff_split = cross_validation.ShuffleSplit(n=len(y),                 n_iter=n_folds)>>> kfold_y_props = []>>> shuff_y_props = []>>> for (k_train, k_test), (s_train, s_test) in zip(strat_kfold,         shuff_split):        kfold_y_props.append(y[k_train].mean())       shuff_y_props.append(y[s_train].mean()) Now, let's plot the proportions over each fold: >>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.plot(range(n_folds), shuff_y_props, label="ShuffleSplit",           color='k')>>> ax.plot(range(n_folds), kfold_y_props, label="Stratified",           color='k', ls='--')>>> ax.set_title("Comparing class proportions.")>>> ax.legend(loc='best') The output will be as follows: We can see that the proportion of each fold for stratified k-fold is stable across folds. How it works... Stratified k-fold works by taking the y value. First, getting the overall proportion of the classes, then intelligently splitting the training and test set into the proportions. This will generalize to multiple labels: >>> import numpy as np>>> three_classes = np.random.choice([1,2,3], p=[.1, .4, .5],                   size=1000)>>> import itertools as it>>> for train, test in cross_validation.StratifiedKFold(three_classes, 5):       print np.bincount(three_classes[train])[ 0 90 314 395][ 0 90 314 395][ 0 90 314 395][ 0 91 315 395][ 0 91 315 396] As we can see, we got roughly the sample sizes of each class for our training and testing proportions. Poor man's grid search In this recipe, we're going to introduce grid search with basic Python, though we will use sklearn for the models and matplotlib for the visualization. Getting ready In this recipe, we will perform the following tasks: Design a basic search grid in the parameter space Iterate through the grid and check the loss/score function at each point in the parameter space for the dataset Choose the point in the parameter space that minimizes/maximizes the evaluation function Also, the model we'll fit is a basic decision tree classifier. Our parameter space will be 2 dimensional to help us with the visualization: The parameter space will then be the Cartesian product of the those two sets: We'll see in a bit how we can iterate through this space with itertools. Let's create the dataset and then get started: >>> from sklearn import datasets>>> X, y = datasets.make_classification(n_samples=2000, n_features=10) How to do it... Earlier we said that we'd use grid search to tune two parameters—criteria and max_features. We need to represent those as Python sets, and then use itertools product to iterate through them: >>> criteria = {'gini', 'entropy'}>>> max_features = {'auto', 'log2', None}>>> import itertools as it>>> parameter_space = it.product(criteria, max_features) Great! So now that we have the parameter space, let's iterate through it and check the accuracy of each model as specified by the parameters. Then, we'll store that accuracy so that we can compare different parameter spaces. We'll also use a test and train split of 50, 50: import numpy as nptrain_set = np.random.choice([True, False], size=len(y))from sklearn.tree import DecisionTreeClassifieraccuracies = {}for criterion, max_feature in parameter_space:   dt = DecisionTreeClassifier(criterion=criterion,         max_features=max_feature)   dt.fit(X[train_set], y[train_set])   accuracies[(criterion, max_feature)] = (dt.predict(X[~train_set])                                         == y[~train_set]).mean()>>> accuracies{('entropy', None): 0.974609375, ('entropy', 'auto'): 0.9736328125,('entropy', 'log2'): 0.962890625, ('gini', None): 0.9677734375, ('gini','auto'): 0.9638671875, ('gini', 'log2'): 0.96875} So we now have the accuracies and its performance. Let's visualize the performance: >>> from matplotlib import pyplot as plt>>> from matplotlib import cm>>> cmap = cm.RdBu_r>>> f, ax = plt.subplots(figsize=(7, 4))>>> ax.set_xticklabels([''] + list(criteria))>>> ax.set_yticklabels([''] + list(max_features))>>> plot_array = []>>> for max_feature in max_features:m = []>>> for criterion in criteria:       m.append(accuracies[(criterion, max_feature)])       plot_array.append(m)>>> colors = ax.matshow(plot_array, vmin=np.min(accuracies.values())             - 0.001, vmax=np.max(accuracies.values()) + 0.001,             cmap=cmap)>>> f.colorbar(colors) The following is the output: It's fairly easy to see which one performed best here. Hopefully, you can see how this process can be taken to the further stage with a brute force method. How it works... This works fairly simply, we just have to perform the following steps: Choose a set of parameters. Iterate through them and find the accuracy of each step. Find the best performer by visual inspection. Brute force grid search In this recipe, we'll do an exhaustive grid search through scikit-learn. This is basically the same thing we did in the previous recipe, but we'll utilize built-in methods. We'll also walk through an example of performing randomized optimization. This is an alternative to brute force search. Essentially, we're trading computer cycles to make sure that we search the entire space. We were fairly calm in the last recipe. However, you could imagine a model that has several steps, first imputation for fix missing data, then PCA reduce the dimensionality to classification. Your parameter space could get very large, very fast; therefore, it can be advantageous to only search a part of that space. Getting ready To get started, we'll need to perform the following steps: Create some classification data. We'll then create a LogisticRegression object that will be the model we're fitting. After that, we'll create the search objects, GridSearch and RandomizedSearchCV. How to do it... Run the following code to create some classification data: >>> from sklearn.datasets import make_classification>>> X, y = make_classification(1000, n_features=5) Now, we'll create our logistic regression object: >>> from sklearn.linear_model import LogisticRegression>>> lr = LogisticRegression(class_weight='auto') We need to specify the parameters we want to search. For GridSearch, we can just specify the ranges that we care about, but for RandomizedSearchCV, we'll need to actually specify the distribution over the same space from which to sample: >>> lr.fit(X, y)LogisticRegression(C=1.0, class_weight={0: 0.25, 1: 0.75},                   dual=False,fit_intercept=True,                  intercept_scaling=1, penalty='l2',                   random_state=None, tol=0.0001)>>> grid_search_params = {'penalty': ['l1', 'l2'],'C': [1, 2, 3, 4]} The only change we'll need to make is to describe the C parameter as a probability distribution. We'll keep it simple right now, though we will use scipy to describe the distribution: >>> import scipy.stats as st>>> import numpy as np>>> random_search_params = {'penalty': ['l1', 'l2'],'C': st.randint(1, 4)} How it works... Now, we'll fit the classifier. This works by passing lr to the parameter search objects: >>> from sklearn.grid_search import GridSearchCV, RandomizedSearchCV>>> gs = GridSearchCV(lr, grid_search_params) GridSearchCV implements the same API as the other models: >>> gs.fit(X, y)GridSearchCV(cv=None, estimator=LogisticRegression(C=1.0,             class_weight='auto', dual=False, fit_intercept=True,             intercept_scaling=1, penalty='l2', random_state=None,             tol=0.0001), fit_params={}, iid=True, loss_func=None,             n_jobs=1, param_grid={'penalty': ['l1', 'l2'], 'C':             [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True,             score_func=None, scoring=None, verbose=0) As we can see with the param_grid parameter, our penalty and C are both arrays. To access the scores, we can use the grid_scores_ attribute of the grid search. We also want to find the optimal set of parameters. We can also look at the marginal performance of the grid search: >>> gs.grid_scores_[mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 1},mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 2},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 2},mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 3},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 3},mean: 0.90100, std: 0.01258, params: {'penalty': 'l1', 'C': 4},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 4}] We might want to get the max score: >>> gs.grid_scores_[1][1]0.90100000000000002>>> max(gs.grid_scores_, key=lambda x: x[1])mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1} The parameters obtained are the best choices for our logistic regression. Using dummy estimators to compare results This recipe is about creating fake estimators; this isn't the pretty or exciting stuff, but it is worthwhile to have a reference point for the model you'll eventually build. Getting ready In this recipe, we'll perform the following tasks: Create some data random data. Fit the various dummy estimators. We'll perform these two steps for regression data and classification data. How to do it... First, we'll create the random data: >>> from sklearn.datasets import make_regression, make_classification# classification if for later>>> X, y = make_regression()>>> from sklearn import dummy>>> dumdum = dummy.DummyRegressor()>>> dumdum.fit(X, y)DummyRegressor(constant=None, strategy='mean') By default, the estimator will predict by just taking the mean of the values and predicting the mean values: >>> dumdum.predict(X)[:5]array([ 2.23297907, 2.23297907, 2.23297907, 2.23297907, 2.23297907]) There are other two other strategies we can try. We can predict a supplied constant (refer to constant=None from the preceding command). We can also predict the median value. Supplying a constant will only be considered if strategy is "constant". Let's have a look: >>> predictors = [("mean", None),                 ("median", None),                 ("constant", 10)]>>> for strategy, constant in predictors:       dumdum = dummy.DummyRegressor(strategy=strategy,                 constant=constant)>>> dumdum.fit(X, y)>>> print "strategy: {}".format(strategy), ",".join(map(str,         dumdum.predict(X)[:5]))strategy: mean 2.23297906733,2.23297906733,2.23297906733,2.23297906733,2.23297906733strategy: median 20.38535248,20.38535248,20.38535248,20.38535248,20.38535248strategy: constant 10.0,10.0,10.0,10.0,10.0 We actually have four options for classifiers. These strategies are similar to the continuous case, it's just slanted toward classification problems: >>> predictors = [("constant", 0),                 ("stratified", None),                 ("uniform", None),                 ("most_frequent", None)] We'll also need to create some classification data: >>> X, y = make_classification()>>> for strategy, constant in predictors:       dumdum = dummy.DummyClassifier(strategy=strategy,                 constant=constant)       dumdum.fit(X, y)       print "strategy: {}".format(strategy), ",".join(map(str,             dumdum.predict(X)[:5]))strategy: constant 0,0,0,0,0strategy: stratified 1,0,0,1,0strategy: uniform 0,0,0,1,1strategy: most_frequent 1,1,1,1,1 How it works... It's always good to test your models against the simplest models and that's exactly what the dummy estimators give you. For example, imagine a fraud model. In this model, only 5 percent of the data set is fraud. Therefore, we can probably fit a pretty good model just by never guessing any fraud. We can create this model by using the stratified strategy, using the following command. We can also get a good example of why class imbalance causes problems: >>> X, y = make_classification(20000, weights=[.95, .05])>>> dumdum = dummy.DummyClassifier(strategy='most_frequent')>>> dumdum.fit(X, y)DummyClassifier(constant=None, random_state=None, strategy='most_frequent')>>> from sklearn.metrics import accuracy_score>>> print accuracy_score(y, dumdum.predict(X))0.94575 We were actually correct very often, but that's not the point. The point is that this is our baseline. If we cannot create a model for fraud that is more accurate than this, then it isn't worth our time. Summary This article taught us how we can take a basic model produced from one of the recipes and tune it so that we can achieve better results than we could with the basic model. Resources for Article: Further resources on this subject: Specialized Machine Learning Topics [article] Machine Learning in IPython with scikit-learn [article] Our First Machine Learning Method – Linear Classification [article]
Read more
  • 0
  • 0
  • 2245

article-image-loading-data-creating-app-and-adding-dashboards-and-reports-splunk
Packt
31 Oct 2014
13 min read
Save for later

Loading data, creating an app, and adding dashboards and reports in Splunk

Packt
31 Oct 2014
13 min read
In this article by Josh Diakun, Paul R Johnson, and Derek Mock, authors of Splunk Operational Intelligence Cookbook, we will take a look at how to load sample data into Splunk, how to create an application, and how to add dashboards and reports in Splunk. (For more resources related to this topic, see here.) Loading the sample data While most of the data you will index with Splunk will be collected in real time, there might be instances where you have a set of data that you would like to put into Splunk, either to backfill some missing or incomplete data, or just to take advantage of its searching and reporting tools. This recipe will show you how to perform one-time bulk loads of data from files located on the Splunk server. We will also use this recipe to load the data samples that will be used as we build our Operational Intelligence app in Splunk. There are two files that make up our sample data. The first is access_log, which represents data from our web layer and is modeled on an Apache web server. The second file is app_log, which represents data from our application layer and is modeled on the log4j application log data. Getting ready To step through this recipe, you will need a running Splunk server and should have a copy of the sample data generation app (OpsDataGen.spl). (This file is part of the downloadable code bundle, which is available on the book's website.) How to do it... Follow the given steps to load the sample data generator on your system: Log in to your Splunk server using your credentials. From the home launcher, select the Apps menu in the top-left corner and click on Manage Apps. Select Install App from file. Select the location of the OpsDataGen.spl file on your computer, and then click on the Upload button to install the application. After installation, a message should appear in a blue bar at the top of the screen, letting you know that the app has installed successfully. You should also now see the OpsDataGen app in the list of apps. By default, the app installs with the data-generation scripts disabled. In order to generate data, you will need to enable either a Windows or Linux script, depending on your Splunk operating system. To enable the script, select the Settings menu from the top-right corner of the screen, and then select Data inputs. From the Data inputs screen that follows, select Scripts. On the Scripts screen, locate the OpsDataGen script for your operating system and click on Enable. For Linux, it will be $SPLUNK_HOME/etc/apps/OpsDataGen/bin/AppGen.path For Windows, it will be $SPLUNK_HOMEetcappsOpsDataGenbinAppGen-win.path The following screenshot displays both the Windows and Linux inputs that are available after installing the OpsDataGen app. It also displays where to click to enable the correct one based on the operating system Splunk is installed on. Select the Settings menu from the top-right corner of the screen, select Data inputs, and then select Files & directories. On the Files & directories screen, locate the two OpsDataGen inputs for your operating system and for each click on Enable. For Linux, it will be: $SPLUNK_HOME/etc/apps/OpsDataGen/data/access_log $SPLUNK_HOME/etc/apps/OpsDataGen/data/app_log For Windows, it will be: $SPLUNK_HOMEetcappsOpsDataGendataaccess_log $SPLUNK_HOMEetcappsOpsDataGendataapp_log The following screenshot displays both the Windows and Linux inputs that are available after installing the OpsDataGen app. It also displays where to click to enable the correct one based on the operating system Splunk is installed on. The data will now be generated in real time. You can test this by navigating to the Splunk search screen and running the following search over an All time (real-time) time range: index=main sourcetype=log4j OR sourcetype=access_combined After a short while, you should see data from both source types flowing into Splunk, and the data generation is now working as displayed in the following screenshot: How it works... In this case, you installed a Splunk application that leverages a scripted input. The script we wrote generates data for two source types. The access_combined source type contains sample web access logs, and the log4j source type contains application logs. Creating an Operational Intelligence application This recipe will show you how to create an empty Splunk app that we will use as the starting point in building our Operational Intelligence application. Getting ready To step through this recipe, you will need a running Splunk Enterprise server, with the sample data loaded from the previous recipe. You should be familiar with navigating the Splunk user interface. How to do it... Follow the given steps to create the Operational Intelligence application: Log in to your Splunk server. From the top menu, select Apps and then select Manage Apps. Click on the Create app button. Complete the fields in the box that follows. Name the app Operational Intelligence and give it a folder name of operational_intelligence. Add in a version number and provide an author name. Ensure that Visible is set to Yes, and the barebones template is selected. When the form is completed, click on Save. This should be followed by a blue bar with the message, Successfully saved operational_intelligence. Congratulations, you just created a Splunk application! How it works... When an app is created through the Splunk GUI, as in this recipe, Splunk essentially creates a new folder (or directory) named operational_intelligence within the $SPLUNK_HOME/etc/apps directory. Within the $SPLUNK_HOME/etc/apps/operational_intelligence directory, you will find four new subdirectories that contain all the configuration files needed for our barebones Operational Intelligence app that we just created. The eagle-eyed among you would have noticed that there were two templates, barebones and sample_app, out of which any one could have been selected when creating the app. The barebones template creates an application with nothing much inside of it, and the sample_app template creates an application populated with sample dashboards, searches, views, menus, and reports. If you wish to, you can also develop your own custom template if you create lots of apps, which might enforce certain color schemes for example. There's more... As Splunk apps are just a collection of directories and files, there are other methods to add apps to your Splunk Enterprise deployment. Creating an application from another application It is relatively simple to create a new app from an existing app without going through the Splunk GUI, should you wish to do so. This approach can be very useful when we are creating multiple apps with different inputs.conf files for deployment to Splunk Universal Forwarders. Taking the app we just created as an example, copy the entire directory structure of the operational_intelligence app and name it copied_app. cp -r $SPLUNK_HOME$/etc/apps/operational_intelligence/* $SPLUNK_HOME$/etc/apps/copied_app Within the directory structure of copied_app, we must now edit the app.conf file in the default directory. Open $SPLUNK_HOME$/etc/apps/copied_app/default/app.conf and change the label field to My Copied App, provide a new description, and then save the conf file. ## Splunk app configuration file#[install]is_configured = 0[ui]is_visible = 1label = My Copied App[launcher]author = John Smithdescription = My Copied applicationversion = 1.0 Now, restart Splunk, and the new My Copied App application should now be seen in the application menu. $SPLUNK_HOME$/bin/splunk restart Downloading and installing a Splunk app Splunk has an entire application website with hundreds of applications, created by Splunk, other vendors, and even users of Splunk. These are great ways to get started with a base application, which you can then modify to meet your needs. If the Splunk server that you are logged in to has access to the Internet, you can click on the Apps menu as you did earlier and then select the Find More Apps button. From here, you can search for apps and install them directly. An alternative way to install a Splunk app is to visit http://apps.splunk.com and search for the app. You will then need to download the application locally. From your Splunk server, click on the Apps menu and then on the Manage Apps button. After that, click on the Install App from File button and upload the app you just downloaded, in order to install it. Once the app has been installed, go and look at the directory structure that the installed application just created. Familiarize yourself with some of the key files and where they are located. When downloading applications from the Splunk apps site, it is best practice to test and verify them in a nonproduction environment first. The Splunk apps site is community driven and, as a result, quality checks and/or technical support for some of the apps might be limited. Adding dashboards and reports Dashboards are a great way to present many different pieces of information. Rather than having lots of disparate dashboards across your Splunk environment, it makes a lot of sense to group related dashboards into a common Splunk application, for example, putting operational intelligence dashboards into a common Operational Intelligence application. In this recipe, you will learn how to move the dashboards and associated reports into our new Operational Intelligence application. Getting ready To step through this recipe, you will need a running Splunk Enterprise server, with the sample data loaded from the Loading the sample data recipe. You should be familiar with navigating the Splunk user interface. How to do it... Follow these steps to move your dashboards into the new application: Log in to your Splunk server. Select the newly created Operational Intelligence application. From the top menu, select Settings and then select the User interface menu item. Click on the Views section. In the App Context dropdown, select Searching & Reporting (search) or whatever application you were in when creating the dashboards: Locate the website_monitoring dashboard row in the list of views and click on the Move link to the right of the row. In the Move Object pop up, select the Operational Intelligence (operational_intelligence) application that was created earlier and then click on the Move button. A message bar will then be displayed at the top of the screen to confirm that the dashboard was moved successfully. Repeat from step 5 to move the product_monitoring dashboard as well. After the Website Monitoring and Product Monitoring dashboards have been moved, we now want to move all the reports that were created, as these power the dashboards and provide operational intelligence insight. From the top menu, select Settings and this time select Searches, reports, and alerts. Select the Search & Reporting (search) context and filter by cp0* to view the searches (reports) that are created. Click on the Move link of the first cp0* search in the list. Select to move the object to the Operational Intelligence (operational_intelligence) application and click on the Move button. A message bar will then be displayed at the top of the screen to confirm that the dashboard was moved successfully. Select the Search & Reporting (search) context and repeat from step 11 to move all the other searches over to the new Operational Intelligence application—this seems like a lot but will not take you long! All of the dashboards and reports are now moved over to your new Operational Intelligence application. How it works... In the previous recipe, we revealed how Splunk apps are essentially just collections of directories and files. Dashboards are XML files found within the $SPLUNK_HOME/etc/apps directory structure. When moving a dashboard from one app to another, Splunk is essentially just moving the underlying file from a directory inside one app to a directory in the other app. In this recipe, you moved the dashboards from the Search & Reporting app to the Operational Intelligence app, as represented in the following screenshot: As visualizations on the dashboards leverage the underlying saved searches (or reports), you also moved these reports to the new app so that the dashboards maintain permissions to access them. Rather than moving the saved searches, you could have changed the permissions of each search to Global such that they could be seen from all the other apps in Splunk. However, the other reason you moved the reports was to keep everything contained within a single Operational Intelligence application, which you will continue to build on going forward. It is best practice to avoid setting permissions to Global for reports and dashboards, as this makes them available to all the other applications when they most likely do not need to be. Additionally, setting global permissions can make things a little messy from a housekeeping perspective and crowd the lists of reports and views that belong to specific applications. The exception to this rule might be for knowledge objects such as tags, event types, macros, and lookups, which often have advantages to being available across all applications. There's more… As you went through this recipe, you likely noticed that the dashboards had application-level permissions, but the reports had private-level permissions. The reports are private as this is the default setting in Splunk when they are created. This private-level permission restricts access to only your user account and admin users. In order to make the reports available to other users of your application, you will need to change the permissions of the reports to Shared in App as we did when adjusting the permissions of reports. Changing the permissions of saved reports Changing the sharing permission levels of your reports from the default Private to App is relatively straightforward: Ensure that you are in your newly created Operational Intelligence application. Select the Reports menu item to see the list of reports. Click on Edit next to the report you wish to change the permissions for. Then, click on Edit Permissions from the drop-down list. An Edit Permissions pop-up box will appear. In the Display for section, change from Owner to App, and then, click on Save. The box will close, and you will see that the Sharing permissions in the table will now display App for the specific report. This report will now be available to all the users of your application. Summary In this article, we loaded the sample data into Splunk. We also saw how to organize dashboards and knowledge into a custom Splunk app. Resources for Article: Further resources on this subject: VWorking with Pentaho Mobile BI [Article] Visualization of Big Data [Article] Highlights of Greenplum [Article]
Read more
  • 0
  • 0
  • 10125

article-image-hosting-service-iis-using-tcp-protocol
Packt
30 Oct 2014
8 min read
Save for later

Hosting the service in IIS using the TCP protocol

Packt
30 Oct 2014
8 min read
In this article by Mike Liu, the author of WCF Multi-layer Services Development with Entity Framework, Fourth Edtion, we will learn how to create and host a service in IIS using the TCP protocol. (For more resources related to this topic, see here.) Hosting WCF services in IIS using the HTTP protocol gives the best interoperability to the service, because the HTTP protocol is supported everywhere today. However, sometimes interoperability might not be an issue. For example, the service may be invoked only within your network with all Microsoft clients only. In this case, hosting the service by using the TCP protocol might be a better solution. Benefits of hosting a WCF service using the TCP protocol Compared to HTTP, there are a few benefits in hosting a WCF service using the TCP protocol: It supports connection-based, stream-oriented delivery services with end-to-end error detection and correction It is the fastest WCF binding for scenarios that involve communication between different machines It supports duplex communication, so it can be used to implement duplex contracts It has a reliable data delivery capability (this is applied between two TCP/IP nodes and is not the same thing as WS-ReliableMessaging, which applies between endpoints) Preparing the folders and files First, we need to prepare the folders and files for the host application, just as we did for hosting the service using the HTTP protocol. We will use the previous HTTP hosting application as the base to create the new TCP hosting application: Create the folders: In Windows Explorer, create a new folder called HostIISTcp under C:SOAwithWCFandEFProjectsHelloWorld and a new subfolder called bin under the HostIISTcp folder. You should now have the following new folders: C:SOAwithWCFandEFProjectsHelloWorld HostIISTcp and a bin folder inside the HostIISTcp folder. Copy the files: Now, copy all the files from the HostIIS hosting application folder at C:SOAwithWCFandEFProjectsHelloWorldHostIIS to the new folder that we created at C:SOAwithWCFandEFProjectsHelloWorldHostIISTcp. Create the Visual Studio solution folder: To make it easier to be viewed and managed from the Visual Studio Solution Explorer, you can add a new solution folder, HostIISTcp, to the solution and add the Web.config file to this folder. Add another new solution folder, bin, under HostIISTcp and add the HelloWorldService.dll and HelloWorldService.pdb files under this bin folder. Add the following post-build events to the HelloWorldService project, so next time, all the files will be copied automatically when the service project is built: xcopy "$(AssemblyName).dll" "C:SOAwithWCFandEFProjectsHelloWorldHostIISTcpbin" /Y xcopy "$(AssemblyName).pdb" "C:SOAwithWCFandEFProjectsHelloWorldHostIISTcpbin" /Y Modify the Web.config file: The Web.config file that we have copied from HostIIS is using the default basicHttpBinding as the service binding. To make our service use the TCP binding, we need to change the binding to TCP and add a TCP base address. Open the Web.config file and add the following node to it under the <system.serviceModel> node: <services> <service name="HelloWorldService.HelloWorldService">    <endpoint address="" binding="netTcpBinding"    contract="HelloWorldService.IHelloWorldService"/>    <host>      <baseAddresses>        <add baseAddress=        "net.tcp://localhost/HelloWorldServiceTcp/"/>      </baseAddresses>    </host> </service> </services> In this new services node, we have defined one service called HelloWorldService.HelloWorldService. The base address of this service is net.tcp://localhost/HelloWorldServiceTcp/. Remember, we have defined the host activation relative address as ./HelloWorldService.svc, so we can invoke this service from the client application with the following URL: http://localhost/HelloWorldServiceTcp/HelloWorldService.svc. For the file-less WCF activation, if no endpoint is defined explicitly, HTTP and HTTPS endpoints will be defined by default. In this example, we would like to expose only one TCP endpoint, so we have added an endpoint explicitly (as soon as this endpoint is added explicitly, the default endpoints will not be added). If you don't add this TCP endpoint explicitly here, the TCP client that we will create in the next section will still work, but on the client config file you will see three endpoints instead of one and you will have to specify which endpoint you are using in the client program. The following is the full content of the Web.config file: <?xml version="1.0"?> <!-- For more information on how to configure your ASP.NET application, please visit http://go.microsoft.com/fwlink/?LinkId=169433 --> <configuration> <system.web>    <compilation debug="true" targetFramework="4.5"/>    <httpRuntime targetFramework="4.5" /> </system.web>   <system.serviceModel>    <serviceHostingEnvironment >      <serviceActivations>        <add factory="System.ServiceModel.Activation.ServiceHostFactory"          relativeAddress="./HelloWorldService.svc"          service="HelloWorldService.HelloWorldService"/>      </serviceActivations>    </serviceHostingEnvironment>      <behaviors>      <serviceBehaviors>        <behavior>          <serviceMetadata httpGetEnabled="true"/>        </behavior>      </serviceBehaviors>    </behaviors>    <services>      <service name="HelloWorldService.HelloWorldService">        <endpoint address="" binding="netTcpBinding"         contract="HelloWorldService.IHelloWorldService"/>        <host>          <baseAddresses>            <add baseAddress=            "net.tcp://localhost/HelloWorldServiceTcp/"/>          </baseAddresses>        </host>      </service>    </services> </system.serviceModel>   </configuration> Enabling the TCP WCF activation for the host machine By default, the TCP WCF activation service is not enabled on your machine. This means your IIS server won't be able to host a WCF service with the TCP protocol. You can follow these steps to enable the TCP activation for WCF services: Go to Control Panel | Programs | Turn Windows features on or off. Expand the Microsoft .Net Framework 3.5.1 node on Windows 7 or .Net Framework 4.5 Advanced Services on Windows 8. Check the checkbox for Windows Communication Foundation Non-HTTP Activation on Windows 7 or TCP Activation on Windows 8. The following screenshot depicts the options required to enable WCF activation on Windows 7: The following screenshot depicts the options required to enable TCP WCF activation on Windows 8: Repair the .NET Framework: After you have turned on the TCP WCF activation, you have to repair .NET. Just go to Control Panel, click on Uninstall a Program, select Microsoft .NET Framework 4.5.1, and then click on Repair. Creating the IIS application Next, we need to create an IIS application named HelloWorldServiceTcp to host the WCF service, using the TCP protocol. Follow these steps to create this application in IIS: Open IIS Manager. Add a new IIS application, HelloWorldServiceTcp, pointing to the HostIISTcp physical folder under your project's folder. Choose DefaultAppPool as the application pool for the new application. Again, make sure your default app pool is a .NET 4.0.30319 application pool. Enable the TCP protocol for the application. Right-click on HelloWorldServiceTcp, select Manage Application | Advanced Settings, and then add net.tcp to Enabled Protocols. Make sure you use all lowercase letters and separate it from the existing HTTP protocol with a comma. Now the service is hosted in IIS using the TCP protocol. To view the WSDL of the service, browse to http://localhost/HelloWorldServiceTcp/HelloWorldService.svc and you should see the service description and a link to the WSDL of the service. Testing the WCF service hosted in IIS using the TCP protocol Now, we have the service hosted in IIS using the TCP protocol; let's create a new test client to test it: Add a new console application project to the solution, named HelloWorldClientTcp. Add a reference to System.ServiceModel in the new project. Add a service reference to the WCF service in the new project, naming the reference HelloWorldServiceRef and use the URL http://localhost/HelloWorldServiceTcp/HelloWorldService.svc?wsdl. You can still use the SvcUtil.exe command-line tool to generate the proxy and config files for the service hosted with TCP, just as we did in previous sections. Actually, behind the scenes Visual Studio is also calling SvcUtil.exe to generate the proxy and config files. Add the following code to the Main method of the new project: var client = new HelloWorldServiceRef.HelloWorldServiceClient (); Console.WriteLine(client.GetMessage("Mike Liu")); Finally, set the new project as the startup project. Now, if you run the program, you will get the same result as before; however, this time the service is hosted in IIS using the TCP protocol. Summary In this article, we created and tested an IIS application to host the service with the TCP protocol. Resources for Article: Further resources on this subject: Microsoft WCF Hosting and Configuration [Article] Testing and Debugging Windows Workflow Foundation 4.0 (WF) Program [Article] Applying LINQ to Entities to a WCF Service [Article]
Read more
  • 0
  • 0
  • 17467
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-theming-highcharts
Packt
30 Oct 2014
10 min read
Save for later

Theming with Highcharts

Packt
30 Oct 2014
10 min read
Besides the charting capabilities offered by Highcharts, theming is yet another strong feature of Highcharts. With its extensive theming API, charts can be customized completely to match the branding of a website or an app. Almost all of the chart elements are customizable through this API. In this article by Bilal Shahid, author of Highcharts Essentials, we will do the following things: (For more resources related to this topic, see here.) Use different fill types and fonts Create a global theme for our charts Use jQuery easing for animations Using Google Fonts with Highcharts Google provides an easy way to include hundreds of high quality web fonts to web pages. These fonts work in all major browsers and are served by Google CDN for lightning fast delivery. These fonts can also be used with Highcharts to further polish the appearance of our charts. This section assumes that you know the basics of using Google Web Fonts. If you are not familiar with them, visit https://developers.google.com/fonts/docs/getting_started. We will style the following example with Google Fonts. We will use the Merriweather family from Google Fonts and link to its style sheet from our web page inside the <head> tag: <link href='http://fonts.googleapis.com/css?family=Merriweather:400italic,700italic' rel='stylesheet' type='text/css'> Having included the style sheet, we can actually use the font family in our code for the labels in yAxis: yAxis: [{ ... labels: {    style: {      fontFamily: 'Merriweather, sans-serif',      fontWeight: 400,      fontStyle: 'italic',      fontSize: '14px',      color: '#ffffff'    } } }, { ... labels: {    style: {      fontFamily: 'Merriweather, sans-serif',      fontWeight: 700,      fontStyle: 'italic',      fontSize: '21px',      color: '#ffffff'    },    ... } }] For the outer axis, we used a font size of 21px with font weight of 700. For the inner axis, we lowered the font size to 14px and used font weight of 400 to compensate for the smaller font size. The following is the modified speedometer: In the next section, we will continue with the same example to include jQuery UI easing in chart animations. Using jQuery UI easing for series animation Animations occurring at the point of initialization of charts can be disabled or customized. The customization requires modifying two properties: animation.duration and animation.easing. The duration property accepts the number of milliseconds for the duration of the animation. The easing property can have various values depending on the framework currently being used. For a standalone jQuery framework, the values can be either linear or swing. Using the jQuery UI framework adds a couple of more options for the easing property to choose from. In order to follow this example, you must include the jQuery UI framework to the page. You can also grab the standalone easing plugin from http://gsgd.co.uk/sandbox/jquery/easing/ and include it inside your <head> tag. We can now modify the series to have a modified animation: plotOptions: { ... series: {    animation: {      duration: 1000,      easing: 'easeOutBounce'    } } } The preceding code will modify the animation property for all the series in the chart to have duration set to 1000 milliseconds and easing to easeOutBounce. Each series can have its own different animation by defining the animation property separately for each series as follows: series: [{ ... animation: {    duration: 500,    easing: 'easeOutBounce' } }, { ... animation: {    duration: 1500,    easing: 'easeOutBounce' } }, { ... animation: {      duration: 2500,    easing: 'easeOutBounce' } }] Different animation properties for different series can pair nicely with column and bar charts to produce visually appealing effects. Creating a global theme for our charts A Highcharts theme is a collection of predefined styles that are applied before a chart is instantiated. A theme will be applied to all the charts on the page after the point of its inclusion, given that the styling options have not been modified within the chart instantiation. This provides us with an easy way to apply custom branding to charts without the need to define styles over and over again. In the following example, we will create a basic global theme for our charts. This way, we will get familiar with the fundamentals of Highcharts theming and some API methods. We will define our theme inside a separate JavaScript file to make the code reusable and keep things clean. Our theme will be contained in an options object that will, in turn, contain styling for different Highcharts components. Consider the following code placed in a file named custom-theme.js. This is a basic implementation of a Highcharts custom theme that includes colors and basic font styles along with some other modifications for axes: Highcharts.customTheme = {      colors: ['#1BA6A6', '#12734F', '#F2E85C', '#F27329', '#D95D30', '#2C3949', '#3E7C9B', '#9578BE'],      chart: {        backgroundColor: {            radialGradient: {cx: 0, cy: 1, r: 1},            stops: [                [0, '#ffffff'],                [1, '#f2f2ff']            ]        },        style: {            fontFamily: 'arial, sans-serif',            color: '#333'        }    },    title: {        style: {            color: '#222',            fontSize: '21px',            fontWeight: 'bold'        }    },    subtitle: {        style: {            fontSize: '16px',            fontWeight: 'bold'        }    },    xAxis: {        lineWidth: 1,        lineColor: '#cccccc',        tickWidth: 1,        tickColor: '#cccccc',        labels: {            style: {                fontSize: '12px'            }        }    },    yAxis: {        gridLineWidth: 1,        gridLineColor: '#d9d9d9',        labels: {           style: {                fontSize: '12px'            }        }    },    legend: {        itemStyle: {            color: '#666',            fontSize: '9px'        },        itemHoverStyle:{            color: '#222'        }      } }; Highcharts.setOptions( Highcharts.customTheme ); We start off by modifying the Highcharts object to include an object literal named customTheme that contains styles for our charts. Inside customTheme, the first option we defined is for series colors. We passed an array containing eight colors to be applied to series. In the next part, we defined a radial gradient as a background for our charts and also defined the default font family and text color. The next two object literals contain basic font styles for the title and subtitle components. Then comes the styles for the x and y axes. For the xAxis, we define lineColor and tickColor to be #cccccc with the lineWidth value of 1. The xAxis component also contains the font style for its labels. The y axis gridlines appear parallel to the x axis that we have modified to have the width and color at 1 and #d9d9d9 respectively. Inside the legend component, we defined styles for the normal and mouse hover states. These two states are stated by itemStyle and itemHoverStyle respectively. In normal state, the legend will have a color of #666 and font size of 9px. When hovered over, the color will change to #222. In the final part, we set our theme as the default Highcharts theme by using an API method Highcharts.setOptions(), which takes a settings object to be applied to Highcharts; in our case, it is customTheme. The styles that have not been defined in our custom theme will remain the same as the default theme. This allows us to partially customize a predefined theme by introducing another theme containing different styles. In order to make this theme work, include the file custom-theme.js after the highcharts.js file: <script src="js/highcharts.js"></script> <script src="js/custom-theme.js"></script> The output of our custom theme is as follows: We can also tell our theme to include a web font from Google without having the need to include the style sheet manually in the header, as we did in a previous section. For that purpose, Highcharts provides a utility method named Highcharts.createElement(). We can use it as follows by placing the code inside the custom-theme.js file: Highcharts.createElement( 'link', {    href: 'http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,700italic,400,300,700',    rel: 'stylesheet',    type: 'text/css' }, null, document.getElementsByTagName( 'head' )[0], null ); The first argument is the name of the tag to be created. The second argument takes an object as tag attributes. The third argument is for CSS styles to be applied to this element. Since, there is no need for CSS styles on a link element, we passed null as its value. The final two arguments are for the parent node and padding, respectively. We can now change the default font family for our charts to 'Open Sans': chart: {    ...    style: {        fontFamily: "'Open Sans', sans-serif",        ...    } } The specified Google web font will now be loaded every time a chart with our custom theme is initialized, hence eliminating the need to manually insert the required font style sheet inside the <head> tag. This screenshot shows a chart with 'Open Sans' Google web font. Summary In this article, you learned about incorporating Google fonts and jQuery UI easing into our chart for enhanced styling. Resources for Article: Further resources on this subject: Integrating with other Frameworks [Article] Highcharts [Article] More Line Charts, Area Charts, and Scatter Plots [Article]
Read more
  • 0
  • 0
  • 8224

article-image-data-visualization
Packt
27 Oct 2014
8 min read
Save for later

Data visualization

Packt
27 Oct 2014
8 min read
Data visualization is one of the most important tasks in data science track. Through effective visualization we can easily uncover underlying pattern among variables with doing any sophisticated statistical analysis. In this cookbook we have focused on graphical analysis using R in a very simple way with each independent example. We have covered default R functionality along with more advance visualization techniques such as lattice, ggplot2, and three-dimensional plots. Readers will not only learn the code to produce the graph but also learn why certain code has been written with specific examples. R Graphs Cookbook Second Edition written by Jaynal Abedin and Hrishi V. Mittal is such a book where the user will learn how to produce various graphs using R and how to customize them and finally how to make ready for publication. This practical recipe book starts with very brief description about R graphics system and then gradually goes through basic to advance plots with examples. Beside the R default graphics this recipe book introduces advance graphic system such as lattice and ggplot2; the grammar of graphics. We have also provided examples on how to inspect large dataset using advanced visualization such as tableplot and three dimensional visualizations. We also cover the following topics: How to create various types of bar charts using default R functions, lattice and ggplot2 How to produce density plots along with histograms using lattice and ggplot2 and customized them for publication How to produce graphs of frequency tabulated data How to inspect large dataset by simultaneously visualizing numeric and categorical variables in a single plot How to annotate graphs using ggplot2 (For more resources related to this topic, see here.) This recipe book is targeted to those reader groups who already exposed to R programming and want to learn effective graphics with the power of R and its various libraries. This hands-on guide starts with very short introduction to R graphics system and then gets straight to the point – actually creating graphs, instead of just theoretical learning. Each recipe is specifically tailored to full fill reader’s appetite for visually representing the data in the best way possible. Now, we will present few examples so that you can have an idea about the content of this recipe book: The ggplot2 R package is based on The Grammar of Graphics by Leland Wilkinson, Springer). Using this package, we can produce a variety of traditional graphics, and the user can produce their customized graphs as well. The beauty of this package is in its layered graphics facilities; through the use of layered graphics utilities, we can produce almost any kind of data visualization. Recently, ggplot2 is the most searched keyword in the R community, including the most popular R blog (www.r-bloggers.com). The comprehensive theme system allows the user to produce publication quality graphs with a variety of themes of choice. If we want to explain this package in a single sentence, then we can say that if whatever we can think about data visualization can be structured in a data frame, the visualization is a matter of few seconds. In the specific chapter on ggplot2 , we will see different examples and use themes to produce publication quality graphs. However, in this introductory chapter, we will show you one of the important features of the ggplot2 package that produces various types of graphs. The main function is ggplot(), but with the help of a different geom function, we can easily produce different types of graphs, such as the following: geom_point(): This will create scatter plot geom_line(): This will create a line chart geom_bar(): This will create a bar chart geom_boxplot(): This will create a box plot geom_text(): This will write certain text inside the plot area Now, we will see a simple example of the use of different geom functions with the default R mtcars dataset: # loading ggplot2 library library(ggplot2) # creating a basic ggplot object p <- ggplot(data=mtcars) # Creating scatter plot of mpg and disp variable p1 <- p+geom_point(aes(x=disp,y=mpg)) # creating line chart from the same ggplot object but different # geom function p2 <- p+geom_line(aes(x=disp,y=mpg)) # creating bar chart of mpg variable p3 <- p+geom_bar(aes(x=mpg)) # creating boxplot of mpg over gear p4 <- p+geom_boxplot(aes(x=factor(gear),y=mpg)) # writing certain text into the scatter plot p5 <- p1+geom_text(x=200,y=25,label="Scatter plot") The visualization of the preceding five plot will look like the following figure: Visualizing an empirical Cumulative Distribution function The empirical Cumulative Distribution function (CDF) is the non-parametric maximum-likelihood estimation of the CDF. In this recipe, we will see how the empirical CDF can be produced. Getting ready To produce this plot, we need to use the latticeExtra library. We will use the simulated dataset as shown in the following code: # Set a seed value to make the data reproducible set.seed(12345) qqdata <-data.frame(disA=rnorm(n=100,mean=20,sd=3),                disB=rnorm(n=100,mean=25,sd=4),                disC=rnorm(n=100,mean=15,sd=1.5),                age=sample((c(1,2,3,4)),size=100,replace=T),                sex=sample(c("Male","Female"),size=100,replace=T),                 econ_status=sample(c("Poor","Middle","Rich"),                size=100,replace=T)) How to do it… To plot an empirical CDF, we first need to call the latticeExtra library (note that this library has a dependency on RColorBrewer). Now, to plot the empirical CDF, we can use the following simple code: library(latticeExtra) ecdfplot(~disA|sex,data=qqdata) Graph annotation with ggplot To produce publication-quality data visualization, we often need to annotate the graph with various texts, symbols, or even shapes. In this recipe, we will see how we can easily annotate an existing graph. Getting ready In this recipe, we will use the disA and disD variables from ggplotdata. Let's call ggplotdata for this recipe. We also need to call the grid and gridExtra libraries for this recipe. How to do it... In this recipe, we will execute the following annotation on an existing scatter plot. So, the whole procedure will be as follows: Create a scatter plot Add customized text within the plot Highlight certain region to indicate extreme values Draw a line segment with an arrow within the scatter plot to indicate a single extreme observation Now, we will implement each of the steps one by one: library(grid) library(gridExtra) # creating scatter plot and print it annotation_obj <- ggplot(data=ggplotdata,aes(x=disA,y=disD))+geom_point() annotation_obj # Adding custom text at (18,29) position annotation_obj1 <- annotation_obj + annotate(geom="text",x=18,y=29,label="Extreme value",size=3) annotation_obj1 # Highlight certain regions with a box annotation_obj2 <- annotation_obj1+ annotate("rect", xmin = 24, xmax = 27,ymin=17,ymax=22,alpha = .2) annotation_obj2 # Drawing line segment with arrow annotation_obj3 <- annotation_obj2+ annotate("segment",x = 16,xend=17.5,y=25,yend=27.5,colour="red", arrow = arrow(length = unit(0.5, "cm")),size=2) annotation_obj3 The preceding four steps are displayed in the following single graph: How it works... The annotate() function takes input of a geom such as “segment”, “text” etc, and then it takes another input regarding position of that geom that is where to draw or where to place.. In this particular recipe, we used three geom instances, such as text to write customized text within the plot, rect to highlight a certain region in the plot, and segment to draw an arrow. The alpha argument represents the transparency of the region and size argument to represent the size of the text and line width of the line segment. Summary This article just gives a sample recipe of what kind of recipes are included in the book, and how the structure of each recipe is. Resources for Article: Further resources on this subject: Using R for Statistics, Research, and Graphics [Article] First steps with R [Article] Aspects of Data Manipulation in R [Article]
Read more
  • 0
  • 0
  • 3747

article-image-emr-architecture
Packt
27 Oct 2014
6 min read
Save for later

The EMR Architecture

Packt
27 Oct 2014
6 min read
This article is written by Amarkant Singh and Vijay Rayapati, the authors of Learning Big Data with Amazon Elastic MapReduce. The goal of this article is to introduce you to the EMR architecture and EMR use cases. (For more resources related to this topic, see here.) Traditionally, very few companies had access to large-scale infrastructure to build Big Data applications. However, cloud computing has democratized the access to infrastructure allowing developers and companies to quickly perform new experiments without worrying about the need for setting up or scaling infrastructure. A cloud provides an infrastructure as a service platform to allow businesses to build applications and host them reliably with scalable infrastructure. It includes a variety of application-level services to help developers to accelerate their development and deployment times. Amazon EMR is one of the hosted services provided by AWS and is built on top of a scalable AWS infrastructure to build Big Data applications. The EMR architecture Let's get familiar with the EMR. This section outlines the key concepts of EMR. Hadoop offers distributed processing by using the MapReduce framework for execution of tasks on a set of servers or compute nodes (also known as a cluster). One of the nodes in the Hadoop cluster will be controlling the distribution of tasks to other nodes and it's called the Master Node. The nodes executing the tasks using MapReduce are called Slave Nodes: Amazon EMR is designed to work with many other AWS services such as S3 for input/output data storage, DynamoDB, and Redshift for output data. EMR uses AWS CloudWatch metrics to monitor the cluster performance and raise notifications for user-specified alarms. We can create on-demand Hadoop clusters using EMR while storing the input and output data in S3 without worrying about managing a 24*7 cluster or HDFS for data storage. The Amazon EMR job flow is shown in the following diagram: Types of nodes Amazon EMR provides three different roles for the servers or nodes in the cluster and they map to the Hadoop roles of master and slave nodes. When you create an EMR cluster, then it's called a Job Flow, which has been created to execute a set of jobs or job steps one after the other: Master node: This node controls and manages the cluster. It distributes the MapReduce tasks to nodes in the cluster and monitors the status of task execution. Every EMR cluster will have only one master node in a master instance group. Core nodes: These nodes will execute MapReduce tasks and provide HDFS for storing the data related to task execution. The EMR cluster will have core nodes as part of it in a core instance group. The core node is related to the slave node in Hadoop. So, basically these nodes have two-fold responsibility: the first one is to execute the map and reduce tasks allocated by the master and the second is to hold the data blocks. Task nodes: These nodes are used for only MapReduce task execution and they are optional while launching the EMR cluster. The task node is related to the slave node in Hadoop and is part of a task instance group in EMR. When you scale down your clusters, you cannot remove any core nodes. This is because EMR doesn't want to let you lose your data blocks. You can remove nodes from a task group while scaling down your cluster. You should also be using only task instance groups to have spot instances, as spot instances can be taken away as per your bid price and you would not want to lose your data blocks. You can launch a cluster having just one node, that is, with just one master node and no other nodes. In that case, the same node will act as both master and core nodes. For simplicity, you can assume a node as EC2 server in EMR. EMR use cases Amazon EMR can be used to build a variety of applications such as recommendation engines, data analysis, log processing, event/click stream analysis, data transformations (ETL), fraud detection, scientific simulations, genomics, financial analysis, or data correlation in various industries. The following section outlines some of the use cases in detail. Web log processing We can use EMR to process logs to understand the usage of content such as video, file downloads, top web URLs accessed by end users, user consumption from different parts of the world, and many more. We can process any web or mobile application logs using EMR to understand specific business insights relevant for your business. We can move all our web access application or mobile logs to Amazon S3 for analysis using EMR even if we are not using AWS for running our production applications. Clickstream analysis By using clickstream analysis, we can segment users into different groups and understand their behaviors with respect to advertisements or application usage. Ad networks or advertisers can perform clickstream analysis on ad-impression logs to deliver more effective campaigns or advertisements to end users. Reports generated from this analysis can include various metrics such as source traffic distribution, purchase funnel, lead source ROI, and abandoned carts among others. Product recommendation engine Recommendation engines can be built using EMR for e-commerce, retail, or web businesses. Many of the e-commerce businesses have a large inventory of products across different categories while regularly adding new products or categories. It will be very difficult for end users to search and identify the products quickly. With recommendation engines, we can help end users to quickly find relevant products or suggest products based on what they are viewing and so on. We may also want to notify users via an e-mail based on their past purchase behavior. Scientific simulations When you need distributed processing with large-scale infrastructure for scientific or research simulations, EMR can be of great help. We can quickly launch large clusters in a matter of minutes and install specific MapReduce programs for analysis using EMR. AWS also offers genomics datasets for free on S3. Data transformations We can perform complex extract, transform, and load (ETL) processes using EMR for either data analysis or data warehousing needs. It can be as simple as transforming XML file data into JSON data for further usage or moving all financial transaction records of a bank into a common date-time format for archiving purposes. You can also use EMR to move data between different systems in AWS such as DynamoDB, Redshift, S3, and many more. Summary In this article, we learned about the EMR architecture. We understood the concepts related to EMR for various node types in detail. Resources for Article: Further resources on this subject: Introduction to MapReduce [Article] Understanding MapReduce [Article] HDFS and MapReduce [Article]
Read more
  • 0
  • 0
  • 7334

article-image-clustering-k-means
Packt
27 Oct 2014
9 min read
Save for later

Clustering with K-Means

Packt
27 Oct 2014
9 min read
In this article by Gavin Hackeling, the author of Mastering Machine Learning with scikit-Learn, we will discuss an unsupervised learning task called clustering. Clustering is used to find groups of similar observations within a set of unlabeled data. We will discuss the K-Means clustering algorithm, apply it to an image compression problem, and learn to measure its performance. Finally, we will work through a semi-supervised learning problem that combines clustering with classification. Clustering, or cluster analysis, is the task of grouping observations such that members of the same group, or cluster, are more similar to each other by some metric than they are to the members of the other clusters. As with supervised learning, we will represent an observation as an n-dimensional vector. For example, assume that your training data consists of the samples plotted in the following figure: Clustering might reveal the following two groups, indicated by squares and circles. Clustering could also reveal the following four groups: Clustering is commonly used to explore a data set. Social networks can be clustered to identify communities, and to suggest missing connections between people. In biology, clustering is used to find groups of genes with similar expression patterns. Recommendation systems sometimes employ clustering to identify products or media that might appeal to a user. In marketing, clustering is used to find segments of similar consumers. In the following sections we will work through an example of using the K-Means algorithm to cluster a data set. Clustering with the K-Means Algorithm The K-Means algorithm is a clustering method that is popular because of its speed and scalability. K-Means is an iterative process of moving the centers of the clusters, or the centroids, to the mean position of their constituent points, and re-assigning instances to their closest clusters. The titular K is a hyperparameter that specifies the number of clusters that should be created; K-Means automatically assigns observations to clusters but cannot determine the appropriate number of clusters. K must be a positive integer that is less than the number of instances in the training set. Sometimes the number of clusters is specified by the clustering problem's context. For example, a company that manufactures shoes might know that it is able to support manufacturing three new models. To understand what groups of customers to target with each model, it surveys customers and creates three clusters from the results. That is, the value of K was specified by the problem's context. Other problems may not require a specific number of clusters, and the optimal number of clusters may be ambiguous. We will discuss a heuristic for estimating the optimal number of clusters called the elbow method later in this article. The parameters of K-Means are the positions of the clusters' centroids and the observations that are assigned to each cluster. Like generalized linear models and decision trees, the optimal values of K-Means' parameters are found by minimizing a cost function. The cost function for K-Means is given by the following equation: where µk is the centroid for cluster k. The cost function sums the distortions of the clusters. Each cluster's distortion is equal to the sum of the squared distances between its centroid and its constituent instances. The distortion is small for compact clusters, and large for clusters that contain scattered instances. The parameters that minimize the cost function are learned through an iterative process of assigning observations to clusters and then moving the clusters. First, the clusters' centroids are initialized to random positions. In practice, setting the centroids' positions equal to the positions of randomly selected observations yields the best results. During each iteration, K-Means assigns observations to the cluster that they are closest to, and then moves the centroids to their assigned observations' mean location. Let's work through an example by hand using the training data shown in the following table. Instance X0 X1 1 7 5 2 5 7 3 7 7 4 3 3 5 4 6 6 1 4 7 0 0 8 2 2 9 8 7 10 6 8 11 5 5 12 3 7 There are two explanatory variables; each instance has two features. The instances are plotted in the following figure. Assume that K-Means initializes the centroid for the first cluster to the fifth instance and the centroid for the second cluster to the eleventh instance. For each instance, we will calculate its distance to both centroids, and assign it to the cluster with the closest centroid. The initial assignments are shown in the “Cluster” column of the following table. Instance X0 X1 C1 distance C2 distance Last Cluster Cluster Changed? 1 7 5 3.16228 2 None C2 Yes 2 5 7 1.41421 2 None C1 Yes 3 7 7 3.16228 2.82843 None C2 Yes 4 3 3 3.16228 2.82843 None C2 Yes 5 4 6 0 1.41421 None C1 Yes 6 1 4 3.60555 4.12311 None C1 Yes 7 0 0 7.21110 7.07107 None C2 Yes 8 2 2 4.47214 4.24264 None C2 Yes 9 8 7 4.12311 3.60555 None C2 Yes 10 6 8 2.82843 3.16228 None C1 Yes 11 5 5 1.41421 0 None C2 Yes 12 3 7 1.41421 2.82843 None C1 Yes C1 centroid 4 6           C2 centroid 5 5           The plotted centroids and the initial cluster assignments are shown in the following graph. Instances assigned to the first cluster are marked with “Xs”, and instances assigned to the second cluster are marked with dots. The markers for the centroids are larger than the markers for the instances. Now we will move both centroids to the means of their constituent instances, re-calculate the distances of the training instances to the centroids, and re-assign the instances to the closest centroids. Instance X0 X1 C1 distance C2 distance Last Cluster New Cluster Changed? 1 7 5 3.492850 2.575394 C2 C2 No 2 5 7 1.341641 2.889107 C1 C1 No 3 7 7 3.255764 3.749830 C2 C1 Yes 4 3 3 3.492850 1.943067 C2 C2 No 5 4 6 0.447214 1.943067 C1 C1 No 6 1 4 3.687818 3.574285 C1 C2 Yes 7 0 0 7.443118 6.169378 C2 C2 No 8 2 2 4.753946 3.347250 C2 C2 No 9 8 7 4.242641 4.463000 C2 C1 Yes 10 6 8 2.720294 4.113194 C1 C1 No 11 5 5 1.843909 0.958315 C2 C2 No 12 3 7 1 3.260775 C1 C1 No C1 centroid 3.8 6.4           C2 centroid 4.571429 4.142857           The new clusters are plotted in the following graph. Note that the centroids are diverging, and several instances have changed their assignments. Now we will move the centroids to the means of their constituents' locations again, and re-assign the instances to their nearest centroids. The centroids continue to diverge, as shown in the following figure. None of the instances' centroid assignments will change in the next iteration; K-Means will continue iterating until some stopping criteria is satisfied. Usually, this criteria is either a threshold for the difference between the values of the cost function for subsequent iterations, or a threshold for the change in the positions of the centroids between subsequent iterations. If these stopping criteria are small enough, K-Means will converge on an optimum. This optimum will not necessarily be the global optimum. Local Optima Recall that K-Means initially sets the positions of the clusters' centroids to the positions of randomly selected observations. Sometimes the random initialization is unlucky, and the centroids are set to positions that cause K-Means to converge to a local optimum. For example, assume that K-Means randomly initializes two cluster centroids to the following positions: K-Means will eventually converge on a local optimum like that shown in the following figure. These clusters may be informative, but it is more likely that the top and bottom groups of observations are more informative clusters. To avoid local optima, K-Means is often repeated dozens or hundreds of times. In each iteration, it is randomly initialized to different starting cluster positions. The initialization that minimizes the cost function best is selected. The Elbow Method If K is not specified by the problem's context, the optimal number of clusters can be estimated using a technique called the elbow method. The elbow method plots the value of the cost function produced by different values of K. As K increases, the average distortion will decrease; each cluster will have fewer constituent instances, and the instances will be closer to their respective centroids. However, the improvements to the average distortion will decline as K increases. The value of K at which the improvement to the distortion declines the most is called the elbow. Let's use the elbow method to choose the number of clusters for a data set. The following scatter plot visualizes a data set with two obvious clusters. We will calculate and plot the mean distortion of the clusters for each value of K from one to ten with the following: >>> import numpy as np>>> from sklearn.cluster import KMeans>>> from scipy.spatial.distance import cdist>>> import matplotlib.pyplot as plt>>> cluster1 = np.random.uniform(0.5, 1.5, (2, 10))>>> cluster2 = np.random.uniform(3.5, 4.5, (2, 10))>>> X = np.hstack((cluster1, cluster2)).T>>> X = np.vstack((x, y)).T>>> K = range(1, 10)>>> meandistortions = []>>> for k in K:>>> kmeans = KMeans(n_clusters=k)>>> kmeans.fit(X)>>> meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])>>> plt.plot(K, meandistortions, 'bx-')>>> plt.xlabel('k')>>> plt.ylabel('Average distortion')>>> plt.title('Selecting k with the Elbow Method')>>> plt.show() The average distortion improves rapidly as we increase K from one to two. There is little improvement for values of K greater than two. Now let's use the elbow method on the following data set with three clusters: The following is the elbow plot for the data set. From this we can see that the rate of improvement to the average distortion declines the most when adding a fourth cluster. That is, the elbow method confirms that K should be set to three for this data set. Summary In this article we explained what clustering is and we talked about the two methods available for clustering Resources for Article: Further resources on this subject: Machine Learning in IPython with scikit-learn [Article] Machine Learning in Bioinformatics [Article] Specialized Machine Learning Topics [Article]
Read more
  • 0
  • 0
  • 10283
article-image-installing-numpy-scipy-matplotlib-ipython
Packt Editorial Staff
12 Oct 2014
7 min read
Save for later

Installing NumPy, SciPy, matplotlib, and IPython

Packt Editorial Staff
12 Oct 2014
7 min read
This article written by Ivan Idris, author of the book, Python Data Analysis, will guide you to install NumPy, SciPy, matplotlib, and IPython. We can find a mind map describing software that can be used for data analysis at https://www.xmind.net/m/WvfC/. Obviously, we can't install all of this software in this article. We will install NumPy, SciPy, matplotlib, and IPython on different operating systems. [box type="info" align="" class="" width=""]Packt has the following books that are focused on NumPy: NumPy Beginner's Guide Second Edition, Ivan Idris NumPy Cookbook, Ivan Idris Learning NumPy Array, Ivan Idris [/box] SciPy is a scientific Python library, which supplements and slightly overlaps NumPy. NumPy and SciPy, historically shared their codebase but were later separated. matplotlib is a plotting library based on NumPy. IPython provides an architecture for interactive computing. The most notable part of this project is the IPython shell. Software used The software used in this article is based on Python, so it is required to have Python installed. On some operating systems, Python is already installed. You, however, need to check whether the Python version is compatible with the software version you want to install. There are many implementations of Python, including commercial implementations and distributions. [box type="note" align="" class="" width=""]You can download Python from https://www.python.org/download/. On this website, we can find installers for Windows and Mac OS X, as well as source archives for Linux, Unix, and Mac OS X.[/box] The software we will install has binary installers for Windows, various Linux distributions, and Mac OS X. There are also source distributions, if you prefer that. You need to have Python 2.4.x or above installed on your system. Python 2.7.x is currently the best Python version to have because most Scientific Python libraries support it. Python 2.7 will be supported and maintained until 2020. After that, we will have to switch to Python 3. Installing software and setup on Windows Installing on Windows is, fortunately, a straightforward task that we will cover in detail. You only need to download an installer, and a wizard will guide you through the installation steps. We will give steps to install NumPy here. The steps to install the other libraries are similar. The actions we will take are as follows: Download installers for Windows from the SourceForge website (refer to the following table). The latest release versions may change, so just choose the one that fits your setup best. Library URL Latest Version NumPy http://sourceforge.net/projects/numpy/files/ 1.8.1 SciPy http://sourceforge.net/projects/scipy/files/ 0.14.0 matplotlib http://sourceforge.net/projects/matplotlib/files/ 1.3.1 IPython http://archive.ipython.org/release/ 2.0.0 Choose the appropriate version. In this example, we chose numpy-1.8.1-win32-superpack-python2.7.exe. Open the EXE installer by double-clicking on it. Now, we can see a description of NumPy and its features. Click on the Next button.If you have Python installed, it should automatically be detected. If it is not detected, maybe your path settings are wrong. Click on the Next button if Python is found; otherwise, click on the Cancel button and install Python (NumPy cannot be installed without Python). Click on the Next button. This is the point of no return. Well, kind of, but it is best to make sure that you are installing to the proper directory and so on and so forth. Now the real installation starts. This may take a while. [box type="note" align="" class="" width=""]The situation around installers is rapidly evolving. Other alternatives exist in various stage of maturity (see https://www.scipy.org/install.html). It might be necessary to put the msvcp71.dll file in your C:Windowssystem32 directory. You can get it from http://www.dll-files.com/dllindex/dll-files.shtml?msvcp71.[/box] Installing software and setup on Linux Installing the recommended software on Linux depends on the distribution you have. We will discuss how you would install NumPy from the command line, although, you could probably use graphical installers; it depends on your distribution (distro). The commands to install matplotlib, SciPy, and IPython are the same – only the package names are different. Installing matplotlib, SciPy, and IPython is recommended, but optional. Most Linux distributions have NumPy packages. We will go through the necessary steps for some of the popular Linux distros: Run the following instructions from the command line for installing NumPy on Red Hat: $ yum install python-numpy To install NumPy on Mandriva, run the following command-line instruction: $ urpmi python-numpy To install NumPy on Gentoo run the following command-line instruction: $ sudo emerge numpy To install NumPy on Debian or Ubuntu, we need to type the following: $ sudo apt-get install python-numpy The following table gives an overview of the Linux distributions and corresponding package names for NumPy, SciPy, matplotlib, and IPython. Linux distribution NumPy SciPy matplotlib IPython Arch Linux python-numpy python-scipy python-matplotlib Ipython Debian python-numpy python-scipy python-matplotlib Ipython Fedora numpy python-scipy python-matplotlib Ipython Gentoo dev-python/numpy scipy matplotlib ipython OpenSUSE python-numpy, python-numpy-devel python-scipy python-matplotlib ipython Slackware numpy scipy matplotlib ipython Installing software and setup on Mac OS X You can install NumPy, matplotlib, and SciPy on the Mac with a graphical installer or from the command line with a port manager such as MacPorts, depending on your preference. Prerequisite is to install XCode as it is not part of OS X releases. We will install NumPy with a GUI installer using the following steps: We can get a NumPy installer from the SourceForge website http://sourceforge.net/projects/numpy/files/. Similar files exist for matplotlib and SciPy. Just change numpy in the previous URL to scipy or matplotlib. IPython didn't have a GUI installer at the time of writing. Download the appropriate DMG file usually the latest one is the best.Another alternative is the SciPy Superpack (https://github.com/fonnesbeck/ScipySuperpack). Whichever option you choose it is important to make sure that updates which impact the system Python library don't negatively influence already installed software by not building against the Python library provided by Apple. Open the DMG file (in this example, numpy-1.8.1-py2.7-python.org-macosx10.6.dmg). Double-click on the icon of the opened box, the one having a subscript that ends with .mpkg. We will be presented with the welcome screen of the installer. Click on the Continue button to go to the Read Me screen, where we will be presented with a short description of NumPy. Click on the Continue button to the License the screen. Read the license, click on the Continue button and then on the Accept button, when prompted to accept the license. Continue through the next screens and click on the Finish button at the end. Alternatively, we can install NumPy, SciPy, matplotlib, and IPython through the MacPorts route, with Fink or Homebrew. The following installation steps shown, installs all these packages. [box type="info" align="" class="" width=""]For installing with MacPorts, type the following command: sudo port install py-numpy py-scipy py-matplotlib py- ipython [/box] Installing with setuptools If you have pip you can install NumPy, SciPy, matplotlib and IPython with the following commands. pip install numpy pip install scipy pip install matplotlib pip install ipython It may be necessary to prepend sudo to these commands, if your current user doesn't have sufficient rights on your system. Summary In this article, we installed NumPy, SciPy, matplotlib and IPython on Windows, Mac OS X and Linux. Resources for Article: Further resources on this subject: Plotting Charts with Images and Maps Importing Dynamic Data Python 3: Designing a Tasklist Application
Read more
  • 0
  • 0
  • 64635

article-image-indexing-and-performance-tuning
Packt
10 Oct 2014
40 min read
Save for later

Indexing and Performance Tuning

Packt
10 Oct 2014
40 min read
In this article by Hans-Jürgen Schönig, author of the book PostgreSQL Administration Essentials, you will be guided through PostgreSQL indexing, and you will learn how to fix performance issues and find performance bottlenecks. Understanding indexing will be vital to your success as a DBA—you cannot count on software engineers to get this right straightaway. It will be you, the DBA, who will face problems caused by bad indexing in the field. For the sake of your beloved sleep at night, this article is about PostgreSQL indexing. (For more resources related to this topic, see here.) Using simple binary trees In this section, you will learn about simple binary trees and how the PostgreSQL optimizer treats the trees. Once you understand the basic decisions taken by the optimizer, you can move on to more complex index types. Preparing the data Indexing does not change user experience too much, unless you have a reasonable amount of data in your database—the more data you have, the more indexing can help to boost things. Therefore, we have to create some simple sets of data to get us started. Here is a simple way to populate a table: test=# CREATE TABLE t_test (id serial, name text);CREATE TABLEtest=# INSERT INTO t_test (name) SELECT 'hans' FROM   generate_series(1, 2000000);INSERT 0 2000000test=# INSERT INTO t_test (name) SELECT 'paul' FROM   generate_series(1, 2000000);INSERT 0 2000000 In our example, we created a table consisting of two columns. The first column is simply an automatically created integer value. The second column contains the name. Once the table is created, we start to populate it. It's nice and easy to generate a set of numbers using the generate_series function. In our example, we simply generate two million numbers. Note that these numbers will not be put into the table; we will still fetch the numbers from the sequence using generate_series to create two million hans and rows featuring paul, shown as follows: test=# SELECT * FROM t_test LIMIT 3;id | name----+------1 | hans2 | hans3 | hans(3 rows) Once we create a sufficient amount of data, we can run a simple test. The goal is to simply count the rows we have inserted. The main issue here is: how can we find out how long it takes to execute this type of query? The timing command will do the job for you: test=# timingTiming is on. As you can see, timing will add the total runtime to the result. This makes it quite easy for you to see if a query turns out to be a problem or not: test=# SELECT count(*) FROM t_test;count---------4000000(1 row)Time: 316.628 ms As you can see in the preceding code, the time required is approximately 300 milliseconds. This might not sound like a lot, but it actually is. 300 ms means that we can roughly execute three queries per CPU per second. On an 8-Core box, this would translate to roughly 25 queries per second. For many applications, this will be enough; but do you really want to buy an 8-Core box to handle just 25 concurrent users, and do you want your entire box to work just on this simple query? Probably not! Understanding the concept of execution plans It is impossible to understand the use of indexes without understanding the concept of execution plans. Whenever you execute a query in PostgreSQL, it generally goes through four central steps, described as follows: Parser: PostgreSQL will check the syntax of the statement. Rewrite system: PostgreSQL will rewrite the query (for example, rules and views are handled by the rewrite system). Optimizer or planner: PostgreSQL will come up with a smart plan to execute the query as efficiently as possible. At this step, the system will decide whether or not to use indexes. Executor: Finally, the execution plan is taken by the executor and the result is generated. Being able to understand and read execution plans is an essential task of every DBA. To extract the plan from the system, all you need to do is use the explain command, shown as follows: test=# explain SELECT count(*) FROM t_test;                             QUERY PLAN                            ------------------------------------------------------Aggregate (cost=71622.00..71622.01 rows=1 width=0)   -> Seq Scan on t_test (cost=0.00..61622.00                         rows=4000000 width=0)(2 rows)Time: 0.370 ms In our case, it took us less than a millisecond to calculate the execution plan. Once you have the plan, you can read it from right to left. In our case, PostgreSQL will perform a sequential scan and aggregate the data returned by the sequential scan. It is important to mention that each step is assigned to a certain number of costs. The total cost for the sequential scan is 61,622 penalty points (more details about penalty points will be outlined a little later). The overall cost of the query is 71,622.01. What are costs? Well, costs are just an arbitrary number calculated by the system based on some rules. The higher the costs, the slower a query is expected to be. Always keep in mind that these costs are just a way for PostgreSQL to estimate things—they are in no way a reliable number related to anything in the real world (such as time or amount of I/O needed). In addition to the costs, PostgreSQL estimates that the sequential scan will yield around four million rows. It also expects the aggregation to return just a single row. These two estimates happen to be precise, but it is not always so. Calculating costs When in training, people often ask how PostgreSQL does its cost calculations. Consider a simple example like the one we have next. It works in a pretty simple way. Generally, there are two types of costs: I/O costs and CPU costs. To come up with I/O costs, we have to figure out the size of the table we are dealing with first: test=# SELECT pg_relation_size('t_test'),   pg_size_pretty(pg_relation_size('t_test'));pg_relation_size | pg_size_pretty------------------+----------------       177127424 | 169 MB(1 row) The pg_relation_size command is a fast way to see how large a table is. Of course, reading a large number (many digits) is somewhat hard, so it is possible to fetch the size of the table in a much prettier format. In our example, the size is roughly 170 MB. Let's move on now. In PostgreSQL, a table consists of 8,000 blocks. If we divide the size of the table by 8,192 bytes, we will end up with exactly 21,622 blocks. This is how PostgreSQL estimates I/O costs of a sequential scan. If a table is read completely, each block will receive exactly one penalty point, or any number defined by seq_page_cost: test=# SHOW seq_page_cost;seq_page_cost---------------1(1 row) To count this number, we have to send four million rows through the CPU (cpu_tuple_cost), and we also have to count these 4 million rows (cpu_operator_cost). So, the calculation looks like this: For the sequential scan: 21622*1 + 4000000*0.01 (cpu_tuple_cost) = 61622 For the aggregation: 61622 + 4000000*0.0025 (cpu_operator_cost) = 71622 This is exactly the number that we see in the plan. Drawing important conclusions Of course, you will never do this by hand. However, there are some important conclusions to be drawn: The cost model in PostgreSQL is a simplification of the real world The costs can hardly be translated to real execution times The cost of reading from a slow disk is the same as the cost of reading from a fast disk It is hard to take caching into account If the optimizer comes up with a bad plan, it is possible to adapt the costs either globally in postgresql.conf, or by changing the session variables, shown as follows: test=# SET seq_page_cost TO 10;SET This statement inflated the costs at will. It can be a handy way to fix the missed estimates, leading to bad performance and, therefore, to poor execution times. This is what the query plan will look like using the inflated costs: test=# explain SELECT count(*) FROM t_test;                     QUERY PLAN                              -------------------------------------------------------Aggregate (cost=266220.00..266220.01 rows=1 width=0)   -> Seq Scan on t_test (cost=0.00..256220.00          rows=4000000 width=0)(2 rows) It is important to understand the PostgreSQL code model in detail because many people have completely wrong ideas about what is going on inside the PostgreSQL optimizer. Offering a basic explanation will hopefully shed some light on this important topic and allow administrators a deeper understanding of the system. Creating indexes After this introduction, we can deploy our first index. As we stated before, runtimes of several hundred milliseconds for simple queries are not acceptable. To fight these unusually high execution times, we can turn to CREATE INDEX, shown as follows: test=# h CREATE INDEXCommand:     CREATE INDEXDescription: define a new indexSyntax:CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ]ON table_name [ USING method ]   ( { column_name | ( expression ) }[ COLLATE collation ] [ opclass ][ ASC | DESC ] [ NULLS { FIRST | LAST } ]   [, ...] )   [ WITH ( storage_parameter = value [, ... ] ) ]   [ TABLESPACE tablespace_name ]   [ WHERE predicate ] In the most simplistic case, we can create a normal B-tree index on the ID column and see what happens: test=# CREATE INDEX idx_id ON t_test (id);CREATE INDEXTime: 3996.909 ms B-tree indexes are the default index structure in PostgreSQL. Internally, they are also called B+ tree, as described by Lehman-Yao. On this box (AMD, 4 Ghz), we can build the B-tree index in around 4 seconds, without any database side tweaks. Once the index is in place, the SELECT command will be executed at lightning speed: test=# SELECT * FROM t_test WHERE id = 423423;   id   | name--------+------423423 | hans(1 row)Time: 0.384 ms The query executes in less than a millisecond. Keep in mind that this already includes displaying the data, and the query is a lot faster internally. Analyzing the performance of a query How do we know that the query is actually a lot faster? In the previous section, you saw EXPLAIN in action already. However, there is a little more to know about this command. You can add some instructions to EXPLAIN to make it a lot more verbose, as shown here: test=# h EXPLAINCommand:     EXPLAINDescription: show the execution plan of a statementSyntax:EXPLAIN [ ( option [, ...] ) ] statementEXPLAIN [ ANALYZE ] [ VERBOSE ] statement In the preceding code, the term option can be one of the following:    ANALYZE [ boolean ]   VERBOSE [ boolean ]   COSTS [ boolean ]   BUFFERS [ boolean ]   TIMING [ boolean ]   FORMAT { TEXT | XML | JSON | YAML } Consider the following example: test=# EXPLAIN (ANALYZE TRUE, VERBOSE true, COSTS TRUE,   TIMING true) SELECT * FROM t_test WHERE id = 423423;               QUERY PLAN        ------------------------------------------------------Index Scan using idx_id on public.t_test(cost=0.43..8.45 rows=1 width=9)(actual time=0.016..0.018 rows=1 loops=1)   Output: id, name   Index Cond: (t_test.id = 423423)Total runtime: 0.042 ms(4 rows)Time: 0.536 ms The ANALYZE function does a special form of execution. It is a good way to figure out which part of the query burned most of the time. Again, we can read things inside out. In addition to the estimated costs of the query, we can also see the real execution time. In our case, the index scan takes 0.018 milliseconds. Fast, isn't it? Given these timings, you can see that displaying the result actually takes a huge fraction of the time. The beauty of EXPLAIN ANALYZE is that it shows costs and execution times for every step of the process. This is important for you to familiarize yourself with this kind of output because when a programmer hits your desk complaining about bad performance, it is necessary to dig into this kind of stuff quickly. In many cases, the secret to performance is hidden in the execution plan, revealing a missing index or so. It is recommended to pay special attention to situations where the number of expected rows seriously differs from the number of rows really processed. Keep in mind that the planner is usually right, but not always. Be cautious in case of large differences (especially if this input is fed into a nested loop). Whenever a query feels slow, we always recommend to take a look at the plan first. In many cases, you will find missing indexes. The internal structure of a B-tree index Before we dig further into the B-tree indexes, we can briefly discuss what an index actually looks like under the hood. Understanding the B-tree internals Consider the following image that shows how things work: In PostgreSQL, we use the so-called Lehman-Yao B-trees (check out http://www.cs.cmu.edu/~dga/15-712/F07/papers/Lehman81.pdf). The main advantage of the B-trees is that they can handle concurrency very nicely. It is possible that hundreds or thousands of concurrent users modify the tree at the same time. Unfortunately, there is not enough room in this book to explain precisely how this works. The two most important issues of this tree are the facts that I/O is done in 8,000 chunks and that the tree is actually a sorted structure. This allows PostgreSQL to apply a ton of optimizations. Providing a sorted order As we stated before, a B-tree provides the system with sorted output. This can come in quite handy. Here is a simple query to make use of the fact that a B-tree provides the system with sorted output: test=# explain SELECT * FROM t_test ORDER BY id LIMIT 3;                   QUERY PLAN                                    ------------------------------------------------------Limit (cost=0.43..0.67 rows=3 width=9)   -> Index Scan using idx_id on t_test(cost=0.43..320094.43 rows=4000000 width=9)(2 rows) In this case, we are looking for the three smallest values. PostgreSQL will read the index from left to right and stop as soon as enough rows have been returned. This is a very common scenario. Many people think that indexes are only about searching, but this is not true. B-trees are also present to help out with sorting. Why do you, the DBA, care about this stuff? Remember that this is a typical use case where a software developer comes to your desk, pounds on the table, and complains. A simple index can fix the problem. Combined indexes Combined indexes are one more source of trouble if they are not used properly. A combined index is an index covering more than one column. Let's drop the existing index and create a combined index (make sure your seq_page_cost variable is set back to default to make the following examples work): test=# DROP INDEX idx_combined;DROP INDEXtest=# CREATE INDEX idx_combined ON t_test (name, id);CREATE INDEX We defined a composite index consisting of two columns. Remember that we put the name before the ID. A simple query will return the following execution plan: test=# explain analyze SELECT * FROM t_test   WHERE id = 10;               QUERY PLAN                                              -------------------------------------------------Seq Scan on t_test (cost=0.00..71622.00 rows=1   width=9)(actual time=181.502..351.439 rows=1 loops=1)   Filter: (id = 10)   Rows Removed by Filter: 3999999Total runtime: 351.481 ms(4 rows) There is no proper index for this, so the system will fall back to a sequential scan. Why is there no proper index? Well, try to look up for first names only in the telephone book. This is not going to work because a telephone book is sorted by location, last name, and first name. The same applies to our index. A B-tree works basically on the same principles as an ordinary paper phone book. It is only useful if you look up the first couple of values, or simply all of them. Here is an example: test=# explain analyze SELECT * FROM t_test   WHERE id = 10 AND name = 'joe';     QUERY PLAN                                                    ------------------------------------------------------Index Only Scan using idx_combined on t_test   (cost=0.43..6.20 rows=1 width=9)(actual time=0.068..0.068 rows=0 loops=1)   Index Cond: ((name = 'joe'::text) AND (id = 10))   Heap Fetches: 0Total runtime: 0.108 ms(4 rows) In this case, the combined index comes up with a high speed result of 0.1 ms, which is not bad. After this small example, we can turn to an issue that's a little bit more complex. Let's change the costs of a sequential scan to 100-times normal: test=# SET seq_page_cost TO 100;SET Don't let yourself be fooled into believing that an index is always good: test=# explain analyze SELECT * FROM t_testWHERE id = 10;                   QUERY PLAN                ------------------------------------------------------Index Only Scan using idx_combined on t_test(cost=0.43..91620.44 rows=1 width=9)(actual time=0.362..177.952 rows=1 loops=1)   Index Cond: (id = 10)   Heap Fetches: 1Total runtime: 177.983 ms(4 rows) Just look at the execution times. We are almost as slow as a sequential scan here. Why does PostgreSQL use the index at all? Well, let's assume we have a very broad table. In this case, sequentially scanning the table is expensive. Even if we have to read the entire index, it can be cheaper than having to read the entire table, at least if there is enough hope to reduce the amount of data by using the index somehow. So, in case you see an index scan, also take a look at the execution times and the number of rows used. The index might not be perfect, but it's just an attempt by PostgreSQL to avoid the worse to come. Keep in mind that there is no general rule (for example, more than 25 percent of data will result in a sequential scan) for sequential scans. The plans depend on a couple of internal issues, such as physical disk layout (correlation) and so on. Partial indexes Up to now, an index covered the entire table. This is not always necessarily the case. There are also partial indexes. When is a partial index useful? Consider the following example: test=# CREATE TABLE t_invoice (   id     serial,   d     date,   amount   numeric,   paid     boolean);CREATE TABLEtest=# CREATE INDEX idx_partial   ON   t_invoice (paid)   WHERE   paid = false;CREATE INDEX In our case, we create a table storing invoices. We can safely assume that the majority of the invoices are nicely paid. However, we expect a minority to be pending, so we want to search for them. A partial index will do the job in a highly space efficient way. Space is important because saving on space has a couple of nice side effects, such as cache efficiency and so on. Dealing with different types of indexes Let's move on to an important issue: not everything can be sorted easily and in a useful way. Have you ever tried to sort circles? If the question seems odd, just try to do it. It will not be easy and will be highly controversial, so how do we do it best? Would we sort by size or coordinates? Under any circumstances, using a B-tree to store circles, points, or polygons might not be a good idea at all. A B-tree does not do what you want it to do because a B-tree depends on some kind of sorting order. To provide end users with maximum flexibility and power, PostgreSQL provides more than just one index type. Each index type supports certain algorithms used for different purposes. The following list of index types is available in PostgreSQL (as of Version 9.4.1): btree: These are the high-concurrency B-trees gist: This is an index type for geometric searches (GIS data) and for KNN-search gin: This is an index type optimized for Full-Text Search (FTS) sp-gist: This is a space-partitioned gist As we mentioned before, each type of index serves different purposes. We highly encourage you to dig into this extremely important topic to make sure that you can help software developers whenever necessary. Unfortunately, we don't have enough room in this book to discuss all the index types in greater depth. If you are interested in finding out more, we recommend checking out information on my website at http://www.postgresql-support.de/slides/2013_dublin_indexing.pdf. Alternatively, you can look up the official PostgreSQL documentation, which can be found at http://www.postgresql.org/docs/9.4/static/indexes.html. Detecting missing indexes Now that we have covered the basics and some selected advanced topics of indexing, we want to shift our attention to a major and highly important administrative task: hunting down missing indexes. When talking about missing indexes, there is one essential query I have found to be highly valuable. The query is given as follows: test=# xExpanded display (expanded) is on.test=# SELECT   relname, seq_scan, seq_tup_read,     idx_scan, idx_tup_fetch,     seq_tup_read / seq_scanFROM   pg_stat_user_tablesWHERE   seq_scan > 0ORDER BY seq_tup_read DESC;-[ RECORD 1 ]-+---------relname       | t_user  seq_scan     | 824350      seq_tup_read | 2970269443530idx_scan     | 0      idx_tup_fetch | 0      ?column?     | 3603165 The pg_stat_user_tables option contains statistical information about tables and their access patterns. In this example, we found a classic problem. The t_user table has been scanned close to 1 million times. During these sequential scans, we processed close to 3 trillion rows. Do you think this is unusual? It's not nearly as unusual as you might think. In the last column, we divided seq_tup_read through seq_scan. Basically, this is a simple way to figure out how many rows a typical sequential scan has used to finish. In our case, 3.6 million rows had to be read. Do you remember our initial example? We managed to read 4 million rows in a couple of hundred milliseconds. So, it is absolutely realistic that nobody noticed the performance bottleneck before. However, just consider burning, say, 300 ms for every query thousands of times. This can easily create a heavy load on a totally unnecessary scale. In fact, a missing index is the key factor when it comes to bad performance. Let's take a look at the table description now: test=# d t_user                         Table "public.t_user"Column | Type   |       Modifiers                    ----------+---------+-------------------------------id      | integer | not null default               nextval('t_user_id_seq'::regclass)email   | text   |passwd   | text   |Indexes:   "t_user_pkey" PRIMARY KEY, btree (id) This is really a classic example. It is hard to tell how often I have seen this kind of example in the field. The table was probably called customer or userbase. The basic principle of the problem was always the same: we got an index on the primary key, but the primary key was never checked during the authentication process. When you log in to Facebook, Amazon, Google, and so on, you will not use your internal ID, you will rather use your e-mail address. Therefore, it should be indexed. The rules here are simple: we are searching for queries that needed many expensive scans. We don't mind sequential scans as long as they only read a handful of rows or as long as they show up rarely (caused by backups, for example). We need to keep expensive scans in mind, however ("expensive" in terms of "many rows needed"). Here is an example code snippet that should not bother us at all: -[ RECORD 1 ]-+---------relname       | t_province  seq_scan     | 8345345      seq_tup_read | 100144140idx_scan     | 0      idx_tup_fetch | 0      ?column?     | 12 The table has been read 8 million times, but in an average, only 12 rows have been returned. Even if we have 1 million indexes defined, PostgreSQL will not use them because the table is simply too small. It is pretty hard to tell which columns might need an index from inside PostgreSQL. However, taking a look at the tables and thinking about them for a minute will, in most cases, solve the riddle. In many cases, things are pretty obvious anyway, and developers will be able to provide you with a reasonable answer. As you can see, finding missing indexes is not hard, and we strongly recommend checking this system table once in a while to figure out whether your system works nicely. There are a couple of tools, such as pgbadger, out there that can help us to monitor systems. It is recommended that you make use of such tools. There is not only light, there is also some shadow. Indexes are not always good. They can also cause considerable overhead during writes. Keep in mind that when you insert, modify, or delete data, you have to touch the indexes as well. The overhead of useless indexes should never be underestimated. Therefore, it makes sense to not just look for missing indexes, but also for spare indexes that don't serve a purpose anymore. Detecting slow queries Now that we have seen how to hunt down tables that might need an index, we can move on to the next example and try to figure out the queries that cause most of the load on your system. Sometimes, the slowest query is not the one causing a problem; it is a bunch of small queries, which are executed over and over again. In this section, you will learn how to track down such queries. To track down slow operations, we can rely on a module called pg_stat_statements. This module is available as part of the PostgreSQL contrib section. Installing a module from this section is really easy. Connect to PostgreSQL as a superuser, and execute the following instruction (if contrib packages have been installed): test=# CREATE EXTENSION pg_stat_statements;CREATE EXTENSION This module will install a system view that will contain all the relevant information we need to find expensive operations: test=# d pg_stat_statements         View "public.pg_stat_statements"       Column       |       Type       | Modifiers---------------------+------------------+-----------userid             | oid             |dbid               | oid             |queryid             | bigint          |query               | text             |calls               | bigint           |total_time         | double precision |rows               | bigint           |shared_blks_hit     | bigint           |shared_blks_read   | bigint           |shared_blks_dirtied | bigint           |shared_blks_written | bigint           |local_blks_hit     | bigint           |local_blks_read     | bigint           |local_blks_dirtied | bigint           |local_blks_written | bigint           |temp_blks_read     | bigint           |temp_blks_written   | bigint           |blk_read_time       | double precision |blk_write_time     | double precision | In this view, we can see the queries we are interested in, the total execution time (total_time), the number of calls, and the number of rows returned. Then, we will get some information about the I/O behavior (more on caching later) of the query as well as information about temporary data being read and written. Finally, the last two columns will tell us how much time we actually spent on I/O. The final two fields are active when track_timing in postgresql.conf has been enabled and will give vital insights into potential reasons for disk wait and disk-related speed problems. The blk_* prefix will tell us how much time a certain query has spent reading and writing to the operating system. Let's see what happens when we want to query the view: test=# SELECT * FROM pg_stat_statements;ERROR: pg_stat_statements must be loaded via   shared_preload_libraries The system will tell us that we have to enable this module; otherwise, data won't be collected. All we have to do to make this work is to add the following line to postgresql.conf: shared_preload_libraries = 'pg_stat_statements' Then, we have to restart the server to enable it. We highly recommend adding this module to the configuration straightaway to make sure that a restart can be avoided and that this data is always around. Don't worry too much about the performance overhead of this module. Tests have shown that the impact on performance is so low that it is even too hard to measure. Therefore, it might be a good idea to have this module activated all the time. If you have configured things properly, finding the most time-consuming queries should be simple: SELECT *FROM   pg_stat_statementsORDER   BY total_time DESC; The important part here is that PostgreSQL can nicely group queries. For instance: SELECT * FROM foo WHERE bar = 1;SELECT * FROM foo WHERE bar = 2; PostgreSQL will detect that this is just one type of query and replace the two numbers in the WHERE clause with a placeholder indicating that a parameter was used here. Of course, you can also sort by any other criteria: highest I/O time, highest number of calls, or whatever. The pg_stat_statement function has it all, and things are available in a way that makes the data very easy and efficient to use. How to reset statistics Sometimes, it is necessary to reset the statistics. If you are about to track down a problem, resetting can be very beneficial. Here is how it works: test=# SELECT pg_stat_reset();pg_stat_reset---------------(1 row)test=# SELECT pg_stat_statements_reset();pg_stat_statements_reset--------------------------(1 row) The pg_stat_reset command will reset the entire system statistics (for example, pg_stat_user_tables). The second call will wipe out pg_stat_statements. Adjusting memory parameters After we find the slow queries, we can do something about them. The first step is always to fix indexing and make sure that sane requests are sent to the database. If you are requesting stupid things from PostgreSQL, you can expect trouble. Once the basic steps have been performed, we can move on to the PostgreSQL memory parameters, which need some tuning. Optimizing shared buffers One of the most essential memory parameters is shared_buffers. What are shared buffers? Let's assume we are about to read a table consisting of 8,000 blocks. PostgreSQL will check if the buffer is already in cache (shared_buffers), and if it is not, it will ask the underlying operating system to provide the database with the missing 8,000 blocks. If we are lucky, the operating system has a cached copy of the block. If we are not so lucky, the operating system has to go to the disk system and fetch the data (worst case). So, the more data we have in cache, the more efficient we will be. Setting shared_buffers to the right value is more art than science. The general guideline is that shared_buffers should consume 25 percent of memory, but not more than 16 GB. Very large shared buffer settings are known to cause suboptimal performance in some cases. It is also not recommended to starve the filesystem cache too much on behalf of the database system. Mentioning the guidelines does not mean that it is eternal law—you really have to see this as a guideline you can use to get started. Different settings might be better for your workload. Remember, if there was an eternal law, there would be no setting, but some autotuning magic. However, a contrib module called pg_buffercache can give some insights into what is in cache at the moment. It can be used as a basis to get started on understanding what is going on inside the PostgreSQL shared buffer. Changing shared_buffers can be done in postgresql.conf, shown as follows: shared_buffers = 4GB In our example, shared buffers have been set to 4GB. A database restart is needed to activate the new value. In PostgreSQL 9.4, some changes were introduced. Traditionally, PostgreSQL used a classical System V shared memory to handle the shared buffers. Starting with PostgreSQL 9.3, mapped memory was added, and finally, it was in PostgreSQL 9.4 that a config variable was introduced to configure the memory technique PostgreSQL will use, shown as follows: dynamic_shared_memory_type = posix# the default is the first option     # supported by the operating system:     #   posix     #   sysv     #   windows     #   mmap     # use none to disable dynamic shared memory The default value on the most common operating systems is basically fine. However, feel free to experiment with the settings and see what happens performance wise. Considering huge pages When a process uses RAM, the CPU marks this memory as used by this process. For efficiency reasons, the CPU usually allocates RAM by chunks of 4,000 bytes. These chunks are called pages. The process address space is virtual, and the CPU and operating system have to remember which process belongs to which page. The more pages you have, the more time it takes to find where the memory is mapped. When a process uses 1 GB of memory, it means that 262.144 blocks have to be looked up. Most modern CPU architectures support bigger pages, and these pages are called huge pages (on Linux). To tell PostgreSQL that this mechanism can be used, the following config variable can be changed in postgresql.conf: huge_pages = try                     # on, off, or try Of course, your Linux system has to know about the use of huge pages. Therefore, you can do some tweaking, as follows: grep Hugepagesize /proc/meminfoHugepagesize:     2048 kB In our case, the size of the huge pages is 2 MB. So, if there is 1 GB of memory, 512 huge pages are needed. The number of huge pages can be configured and activated by setting nr_hugepages in the proc filesystem. Consider the following example: echo 512 > /proc/sys/vm/nr_hugepages Alternatively, we can use the sysctl command or change things in /etc/sysctl.conf: sysctl -w vm.nr_hugepages=512 Huge pages can have a significant impact on performance. Tweaking work_mem There is more to PostgreSQL memory configuration than just shared buffers. The work_mem parameter is widely used for operations such as sorting, aggregating, and so on. Let's illustrate the way work_mem works with a short, easy-to-understand example. Let's assume it is an election day and three parties have taken part in the elections. The data is as follows: test=# CREATE TABLE t_election (id serial, party text);test=# INSERT INTO t_election (party)SELECT 'socialists'   FROM generate_series(1, 439784);test=# INSERT INTO t_election (party)SELECT 'conservatives'   FROM generate_series(1, 802132);test=# INSERT INTO t_election (party)SELECT 'liberals'   FROM generate_series(1, 654033); We add some data to the table and try to count how many votes each party has: test=# explain analyze SELECT party, count(*)   FROM   t_election   GROUP BY 1;       QUERY PLAN                                                        ------------------------------------------------------HashAggregate (cost=39461.24..39461.26 rows=3     width=11) (actual time=609.456..609.456   rows=3 loops=1)     Group Key: party   -> Seq Scan on t_election (cost=0.00..29981.49     rows=1895949 width=11)   (actual time=0.007..192.934 rows=1895949   loops=1)Planning time: 0.058 msExecution time: 609.481 ms(5 rows) First of all, the system will perform a sequential scan and read all the data. This data is passed on to a so-called HashAggregate. For each party, PostgreSQL will calculate a hash key and increment counters as the query moves through the tables. At the end of the operation, we will have a chunk of memory with three values and three counters. Very nice! As you can see, the explain analyze statement does not take more than 600 ms. Note that the real execution time of the query will be a lot faster. The explain analyze statement does have some serious overhead. Still, it will give you valuable insights into the inner workings of the query. Let's try to repeat this same example, but this time, we want to group by the ID. Here is the execution plan: test=# explain analyze SELECT id, count(*)   FROM   t_election     GROUP BY 1;       QUERY PLAN                                                          ------------------------------------------------------GroupAggregate (cost=253601.23..286780.33 rows=1895949     width=4) (actual time=1073.769..1811.619     rows=1895949 loops=1)     Group Key: id   -> Sort (cost=253601.23..258341.10 rows=1895949   width=4) (actual time=1073.763..1288.432   rows=1895949 loops=1)         Sort Key: id       Sort Method: external sort Disk: 25960kB         -> Seq Scan on t_election         (cost=0.00..29981.49 rows=1895949 width=4)     (actual time=0.013..235.046 rows=1895949     loops=1)Planning time: 0.086 msExecution time: 1928.573 ms(8 rows) The execution time rises by almost 2 seconds and, more importantly, the plan changes. In this scenario, there is no way to stuff all the 1.9 million hash keys into a chunk of memory because we are limited by work_mem. Therefore, PostgreSQL has to find an alternative plan. It will sort the data and run GroupAggregate. How does it work? If you have a sorted list of data, you can count all equal values, send them off to the client, and move on to the next value. The main advantage is that we don't have to keep the entire result set in memory at once. With GroupAggregate, we can basically return aggregations of infinite sizes. The downside is that large aggregates exceeding memory will create temporary files leading to potential disk I/O. Keep in mind that we are talking about the size of the result set and not about the size of the underlying data. Let's try the same thing with more work_mem: test=# SET work_mem TO '1 GB';SETtest=# explain analyze SELECT id, count(*)   FROM t_election   GROUP BY 1;         QUERY PLAN                                                        ------------------------------------------------------HashAggregate (cost=39461.24..58420.73 rows=1895949     width=4) (actual time=857.554..1343.375   rows=1895949 loops=1)   Group Key: id   -> Seq Scan on t_election (cost=0.00..29981.49   rows=1895949 width=4)   (actual time=0.010..201.012   rows=1895949 loops=1)Planning time: 0.113 msExecution time: 1478.820 ms(5 rows) In this case, we adapted work_mem for the current session. Don't worry, changing work_mem locally does not change the parameter for other database connections. If you want to change things globally, you have to do so by changing things in postgresql.conf. Alternatively, 9.4 offers a command called ALTER SYSTEM SET work_mem TO '1 GB'. Once SELECT pg_reload_conf() has been called, the config parameter is changed as well. What you see in this example is that the execution time is around half a second lower than before. PostgreSQL switches back to the more efficient plan. However, there is more; work_mem is also in charge of efficient sorting: test=# explain analyze SELECT * FROM t_election ORDER BY id DESC;     QUERY PLAN                                                          ------------------------------------------------------Sort (cost=227676.73..232416.60 rows=1895949 width=15)   (actual time=695.004..872.698 rows=1895949   loops=1)   Sort Key: id   Sort Method: quicksort Memory: 163092kB   -> Seq Scan on t_election (cost=0.00..29981.49   rows=1895949 width=15) (actual time=0.013..188.876rows=1895949 loops=1)Planning time: 0.042 msExecution time: 995.327 ms(6 rows) In our example, PostgreSQL can sort the entire dataset in memory. Earlier, we had to perform a so-called "external sort Disk", which is way slower because temporary results have to be written to disk. The work_mem command is used for some other operations as well. However, sorting and aggregation are the most common use cases. Keep in mind that work_mem should not be abused, and work_mem can be allocated to every sorting or grouping operation. So, more than just one work_mem amount of memory might be allocated by a single query. Improving maintenance_work_mem To control the memory consumption of administrative tasks, PostgreSQL offers a parameter called maintenance_work_mem. It is used to handle index creations as well as VACUUM. Usually, creating an index (B-tree) is mostly related to sorting, and the idea of maintenance_work_mem is to speed things up. However, things are not as simple as they might seem. People might assume that increasing the parameter will always speed things up, but this is not necessarily true; in fact, smaller values might even be beneficial. We conducted some research to solve this riddle. The in-depth results of this research can be found at http://www.cybertec.at/adjusting-maintenance_work_mem/. However, indexes are not the only beneficiaries. The maintenance_work_mem command is also here to help VACUUM clean out indexes. If maintenance_work_mem is too low, you might see VACUUM scanning tables repeatedly because dead items cannot be stored in memory during VACUUM. This is something that should basically be avoided. Just like all other memory parameters, maintenance_work_mem can be set per session, or it can be set globally in postgresql.conf. Adjusting effective_cache_size The number of shared_buffers assigned to PostgreSQL is not the only cache in the system. The operating system will also cache data and do a great job of improving speed. To make sure that the PostgreSQL optimizer knows what to expect from the operation system, effective_cache_size has been introduced. The idea is to tell PostgreSQL how much cache there is going to be around (shared buffers + operating system side cache). The optimizer can then adjust its costs and estimates to reflect this knowledge. It is recommended to always set this parameter; otherwise, the planner might come up with suboptimal plans. Summary In this article, you learned how to detect basic performance bottlenecks. In addition to this, we covered the very basics of the PostgreSQL optimizer and indexes. At the end of the article, some important memory parameters were presented. Resources for Article: Further resources on this subject: PostgreSQL 9: Reliable Controller and Disk Setup [article] Running a PostgreSQL Database Server [article] PostgreSQL: Tips and Tricks [article]
Read more
  • 0
  • 0
  • 3541

article-image-machine-learning-ipython-scikit-learn-0
Packt
25 Sep 2014
13 min read
Save for later

Machine Learning in IPython with scikit-learn

Packt
25 Sep 2014
13 min read
This article written by Daniele Teti, the author of Ipython Interactive Computing and Visualization Cookbook, the basics of the machine learning scikit-learn package (http://scikit-learn.org) is introduced. Its clean API makes it really easy to define, train, and test models. Plus, scikit-learn is specifically designed for speed and (relatively) big data. (For more resources related to this topic, see here.) A very basic example of linear regression in the context of curve fitting is shown here. This toy example will allow to illustrate key concepts such as linear models, overfitting, underfitting, regularization, and cross-validation. Getting ready You can find all instructions to install scikit-learn on the main documentation. For more information, refer to http://scikit-learn.org/stable/install.html. With anaconda, you can type conda install scikit-learn in a terminal. How to do it... We will generate a one-dimensional dataset with a simple model (including some noise), and we will try to fit a function to this data. With this function, we can predict values on new data points. This is a curve fitting regression problem. First, let's make all the necessary imports: In [1]: import numpy as np        import scipy.stats as st        import sklearn.linear_model as lm        import matplotlib.pyplot as plt        %matplotlib inline We now define a deterministic nonlinear function underlying our generative model: In [2]: f = lambda x: np.exp(3 * x) We generate the values along the curve on [0,2]: In [3]: x_tr = np.linspace(0., 2, 200)        y_tr = f(x_tr) Now, let's generate data points within [0,1]. We use the function f and we add some Gaussian noise: In [4]: x = np.array([0, .1, .2, .5, .8, .9, 1])        y = f(x) + np.random.randn(len(x)) Let's plot our data points on [0,1]: In [5]: plt.plot(x_tr[:100], y_tr[:100], '--k')        plt.plot(x, y, 'ok', ms=10) In the image, the dotted curve represents the generative model. Now, we use scikit-learn to fit a linear model to the data. There are three steps. First, we create the model (an instance of the LinearRegression class). Then, we fit the model to our data. Finally, we predict values from our trained model. In [6]: # We create the model.        lr = lm.LinearRegression()        # We train the model on our training dataset.        lr.fit(x[:, np.newaxis], y)        # Now, we predict points with our trained model.        y_lr = lr.predict(x_tr[:, np.newaxis]) We need to convert x and x_tr to column vectors, as it is a general convention in scikit-learn that observations are rows, while features are columns. Here, we have seven observations with one feature. We now plot the result of the trained linear model. We obtain a regression line in green here: In [7]: plt.plot(x_tr, y_tr, '--k')        plt.plot(x_tr, y_lr, 'g')        plt.plot(x, y, 'ok', ms=10)        plt.xlim(0, 1)        plt.ylim(y.min()-1, y.max()+1)        plt.title("Linear regression") The linear fit is not well-adapted here, as the data points are generated according to a nonlinear model (an exponential curve). Therefore, we are now going to fit a nonlinear model. More precisely, we will fit a polynomial function to our data points. We can still use linear regression for this, by precomputing the exponents of our data points. This is done by generating a Vandermonde matrix, using the np.vander function. We will explain this trick in How it works…. In the following code, we perform and plot the fit: In [8]: lrp = lm.LinearRegression()        plt.plot(x_tr, y_tr, '--k')        for deg in [2, 5]:            lrp.fit(np.vander(x, deg + 1), y)            y_lrp = lrp.predict(np.vander(x_tr, deg + 1))            plt.plot(x_tr, y_lrp,                      label='degree ' + str(deg))             plt.legend(loc=2)            plt.xlim(0, 1.4)            plt.ylim(-10, 40)            # Print the model's coefficients.            print(' '.join(['%.2f' % c for c in                            lrp.coef_]))        plt.plot(x, y, 'ok', ms=10)      plt.title("Linear regression") 25.00 -8.57 0.00 -132.71 296.80 -211.76 72.80 -8.68 0.00 We have fitted two polynomial models of degree 2 and 5. The degree 2 polynomial appears to fit the data points less precisely than the degree 5 polynomial. However, it seems more robust; the degree 5 polynomial seems really bad at predicting values outside the data points (look for example at the x 1 portion). This is what we call overfitting; by using a too complex model, we obtain a better fit on the trained dataset, but a less robust model outside this set. Note the large coefficients of the degree 5 polynomial; this is generally a sign of overfitting. We will now use a different learning model, called ridge regression. It works like linear regression except that it prevents the polynomial's coefficients from becoming too big. This is what happened in the previous example. By adding a regularization term in the loss function, ridge regression imposes some structure on the underlying model. We will see more details in the next section. The ridge regression model has a meta-parameter, which represents the weight of the regularization term. We could try different values with trials and errors, using the Ridge class. However, scikit-learn includes another model called RidgeCV, which includes a parameter search with cross-validation. In practice, it means that we don't have to tweak this parameter by hand—scikit-learn does it for us. As the models of scikit-learn always follow the fit-predict API, all we have to do is replace lm.LinearRegression() by lm.RidgeCV() in the previous code. We will give more details in the next section. In [9]: ridge = lm.RidgeCV()        plt.plot(x_tr, y_tr, '--k')               for deg in [2, 5]:             ridge.fit(np.vander(x, deg + 1), y);            y_ridge = ridge.predict(np.vander(x_tr, deg+1))            plt.plot(x_tr, y_ridge,                      label='degree ' + str(deg))            plt.legend(loc=2)            plt.xlim(0, 1.5)           plt.ylim(-5, 80)            # Print the model's coefficients.            print(' '.join(['%.2f' % c                            for c in ridge.coef_]))               plt.plot(x, y, 'ok', ms=10)        plt.title("Ridge regression") 11.36 4.61 0.00 2.84 3.54 4.09 4.14 2.67 0.00 This time, the degree 5 polynomial seems more precise than the simpler degree 2 polynomial (which now causes underfitting). Ridge regression reduces the overfitting issue here. Observe how the degree 5 polynomial's coefficients are much smaller than in the previous example. How it works... In this section, we explain all the aspects covered in this article. The scikit-learn API scikit-learn implements a clean and coherent API for supervised and unsupervised learning. Our data points should be stored in a (N,D) matrix X, where N is the number of observations and D is the number of features. In other words, each row is an observation. The first step in a machine learning task is to define what the matrix X is exactly. In a supervised learning setup, we also have a target, an N-long vector y with a scalar value for each observation. This value is continuous or discrete, depending on whether we have a regression or classification problem, respectively. In scikit-learn, models are implemented in classes that have the fit() and predict() methods. The fit() method accepts the data matrix X as input, and y as well for supervised learning models. This method trains the model on the given data. The predict() method also takes data points as input (as a (M,D) matrix). It returns the labels or transformed points as predicted by the trained model. Ordinary least squares regression Ordinary least squares regression is one of the simplest regression methods. It consists of approaching the output values yi with a linear combination of Xij: Here, w = (w1, ..., wD) is the (unknown) parameter vector. Also, represents the model's output. We want this vector to match the data points y as closely as possible. Of course, the exact equality cannot hold in general (there is always some noise and uncertainty—models are always idealizations of reality). Therefore, we want to minimize the difference between these two vectors. The ordinary least squares regression method consists of minimizing the following loss function: This sum of the components squared is called the L2 norm. It is convenient because it leads to differentiable loss functions so that gradients can be computed and common optimization procedures can be performed. Polynomial interpolation with linear regression Ordinary least squares regression fits a linear model to the data. The model is linear both in the data points Xiand in the parameters wj. In our example, we obtain a poor fit because the data points were generated according to a nonlinear generative model (an exponential function). However, we can still use the linear regression method with a model that is linear in wj, but nonlinear in xi. To do this, we need to increase the number of dimensions in our dataset by using a basis of polynomial functions. In other words, we consider the following data points: Here, D is the maximum degree. The input matrix X is therefore the Vandermonde matrix associated to the original data points xi. For more information on the Vandermonde matrix, refer to http://en.wikipedia.org/wiki/Vandermonde_matrix. Here, it is easy to see that training a linear model on these new data points is equivalent to training a polynomial model on the original data points. Ridge regression Polynomial interpolation with linear regression can lead to overfitting if the degree of the polynomials is too large. By capturing the random fluctuations (noise) instead of the general trend of the data, the model loses some of its predictive power. This corresponds to a divergence of the polynomial's coefficients wj. A solution to this problem is to prevent these coefficients from growing unboundedly. With ridge regression (also known as Tikhonov regularization) this is done by adding a regularization term to the loss function. For more details on Tikhonov regularization, refer to http://en.wikipedia.org/wiki/Tikhonov_regularization. By minimizing this loss function, we not only minimize the error between the model and the data (first term, related to the bias), but also the size of the model's coefficients (second term, related to the variance). The bias-variance trade-off is quantified by the hyperparameter , which specifies the relative weight between the two terms in the loss function. Here, ridge regression led to a polynomial with smaller coefficients, and thus a better fit. Cross-validation and grid search A drawback of the ridge regression model compared to the ordinary least squares model is the presence of an extra hyperparameter . The quality of the prediction depends on the choice of this parameter. One possibility would be to fine-tune this parameter manually, but this procedure can be tedious and can also lead to overfitting problems. To solve this problem, we can use a grid search; we loop over many possible values for , and we evaluate the performance of the model for each possible value. Then, we choose the parameter that yields the best performance. How can we assess the performance of a model with a given value? A common solution is to use cross-validation. This procedure consists of splitting the dataset into a train set and a test set. We fit the model on the train set, and we test its predictive performance on the test set. By testing the model on a different dataset than the one used for training, we reduce overfitting. There are many ways to split the initial dataset into two parts like this. One possibility is to remove one sample to form the train set and to put this one sample into the test set. This is called Leave-One-Out cross-validation. With N samples, we obtain N sets of train and test sets. The cross-validated performance is the average performance on all these set decompositions. As we will see later, scikit-learn implements several easy-to-use functions to do cross-validation and grid search. In this article, there exists a special estimator called RidgeCV that implements a cross-validation and grid search procedure that is specific to the ridge regression model. Using this model ensures that the best hyperparameter is found automatically for us. There's more… Here are a few references about least squares: Ordinary least squares on Wikipedia, available at http://en.wikipedia.org/wiki/Ordinary_least_squares Linear least squares on Wikipedia, available at http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics) Here are a few references about cross-validation and grid search: Cross-validation in scikit-learn's documentation, available at http://scikit-learn.org/stable/modules/cross_validation.html Grid search in scikit-learn's documentation, available at http://scikit-learn.org/stable/modules/grid_search.html Cross-validation on Wikipedia, available at http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 Here are a few references about scikit-learn: scikit-learn basic tutorial available at http://scikit-learn.org/stable/tutorial/basic/tutorial.html scikit-learn tutorial given at the SciPy 2013 conference available at https://github.com/jakevdp/sklearn_scipy2013 Summary Using the scikit-learn Python package, this article illustrates fundamental data mining and machine learning concepts such as supervised and unsupervised learning, classification, regression, feature selection, feature extraction, overfitting, regularization, cross-validation, and grid search. Resources for Article: Further resources on this subject: Driving Visual Analyses with Automobile Data (Python) [Article] Fast Array Operations with NumPy [Article] Python 3: Designing a Tasklist Application [Article]
Read more
  • 0
  • 0
  • 3237
article-image-how-to-build-a-recommender-by-running-mahout-on-spark
Pat Ferrel
24 Sep 2014
7 min read
Save for later

How to Build a Recommender by Running Mahout on Spark

Pat Ferrel
24 Sep 2014
7 min read
Mahout on Spark: Recommenders There are big changes happening in Apache Mahout. For several years it was the go-to machine learning library for Hadoop. It contained most of the best-in-class algorithms for scalable machine learning, which means clustering, classification, and recommendations. But it was written for Hadoop and MapReduce. Today a number of new parallel execution engines show great promise in speeding calculations by 10-100x (Spark, H2O, Flink). That means that instead of buying 10 computers for a cluster, one will do. That should get your manager’s attention. After releasing Mahout 0.9, the team decided to begin an aggressive retool using Spark, building in the flexibility to support other engines, and both H2O and Flink have shown active interest. This post is about moving the heart of Mahout’s item-based collaborative filtering recommender to Spark. Where we are Mahout is currently on the 1.0 snapshot version, meaning we are working on what will be released as 1.0. For the past year or, some of the team has been working on a Scala-based DSL (Domain Specific Language), which looks like Scala with R-like algebraic expressions. Since Scala supports not only operator overloading but functional programming, it is a natural choice for building distributed code with rich linear algebra expressions. Currently we have an interactive shell that runs Scala with all of the R-like expression support. Think of it as R but supporting truly huge data in a completely distributed way. Many algorithms—the ones that can be expressed as simple linear algebra equations—are implemented with relative ease (SSVD, PCA). Scala also has lazy evaluation, which allows Mahout to slide a modern optimizer underneath the DSL. When an end product of a calculation is needed, the optimizer figures out the best path to follow and spins off the most efficient Spark jobs to accomplish the whole. Recommenders One of the first things we want to implement is the popular item-based recommenders. But here, too, we’ll introduce many innovations. It still starts from some linear algebra. Let’s take the case of recommending purchases on an e-commerce site. The problem can be defined in the following code example: r_p = recommendations for purchases for a given user. This is a vector of item-ids and strengths of recommendation. h_p = history of purchases for a given user A = the matrix of all purchases by all users. Rows are users, columns items, for now we will just flag a purchase so the matrix is all ones and zeros. r_p = [A’A]h_p A’A is the matrix A transposed then multiplied by A. This is the core cooccurrence or indicator matrix used in this style of recommender. Using the Mahout Scala DSL we could write the recommender as: val recs = (A.t %*% A) * userHist This would produce a reasonable recommendation, but from experience we know that A’A is better calculated using a method called the Log Likelihood Ratio, which is a probabilistic measure of the importance of a cooccurrence (http://en.wikipedia.org/wiki/Likelihood-ratio_test). In general, when you see something like A’A, it can be replaced with a similarity comparison for each row with every other row. This will produce a matrix whose rows are items and whose columns are the same items. The magnitude of the value in the matrix determines the strength of similarity of row item to the column item. In recommenders, the more similar the items, the more they were purchased by similar people. The previous line of code is replaced by the following: val recs = CooccurrenceAnalysis.cooccurence(A)* userHist However, this would take time to execute for each user as they visit the e-commerce site, so we’ll handle that outside of Mahout. First let’s talk about data preparation. Item Similarity Driver Creating the indicator matrix (A’A) is the core of this type of recommender. We have a quick flexible way to create this using text log files and creating output that is in an easy form to digest. The job of data prep is greatly streamlined in the Mahout 1.0 snapshot. In the past a user would have to do all the data prep themselves. This requires translating their own user and item IDs into Mahout IDs, putting the data into text tuple files and feeding them to the recommender. Out the other end you’d get a Hadoop binary file called a sequence file and you’d have to translate the Mahout IDs into something your application could understand. This is no longer required. To make this process much simpler we created the spark-itemsimilarity command line tool. After installing Mahout, Hadoop, and Spark, and assuming you have logged user purchases in some directories in HDFS, we can probably read them in, calculate the indicator matrix, and write it out with no other prep required. The spark-itemsimilarity command line takes in text-delimited files, extracts the user ID and item ID, runs the cooccurrence analysis, and outputs a text file with your application’s user and item IDs restored. Here is the sample input file where we’ve specified a simple comma-delimited format that field holds the user ID, the item ID, and the filter—purchase: Thu Jul 10 10:52:10.996,u1,purchase,iphone Fri Jul 11 13:52:51.018,u1,purchase,ipad Fri Jul 11 21:42:26.232,u2,purchase,nexus Sat Jul 12 09:43:12.434,u2,purchase,galaxy Sat Jul 12 19:58:09.975,u3,purchase,surface Sat Jul 12 23:08:03.457,u4,purchase,iphone Sun Jul 13 14:43:38.363,u4,purchase,galaxy spark-itemsimilarity will create a Spark distributed dataset (rdd) to back the Mahout DRM (distributed row matrix) that holds this data: User/item iPhone IPad Nexus Galaxy Surface u1 1 1 0 0 0 u2 0 0 1 1 0 u3 0 0 0 0 1 u4 1 0 0 1 0 The output of the job is the LLR computed “indicator matrix” and will contain this data: Item/item iPhone iPad Nexus Galaxy Surface iPhone 0 1.726092435 0 0 0 iPad 1.726092435 0 0 0 0 Nexus 0 0 0 1.726092435 0 Galaxy 0 0 1.726092435 0 0 Surface 0 0 0 0 0 Reading this, we see that self-similarities have been removed so the diagonal is all zeros. The iPhone is similar to the iPad and the Galaxy is similar to the Nexus. The output of the spark-itemsimilarity job can be formatted in various ways but by default it looks like this: galaxy<tab>nexus:1.7260924347106847 ipad<tab>iphone:1.7260924347106847 nexus<tab>galaxy:1.7260924347106847 iphone<tab>ipad:1.7260924347106847 surface On the e-commerce site for the page displaying the Nexus we can show that the Galaxy was purchased by the same people. Notice that application specific IDs are preserved here and the text file is very easy to parse in text-delimited format. The numeric values are the strength of similarity, and for the cases where there are many similar products, you can sort on that value if you want to show only the highest weighted recommendations. Still, this is only part way to an individual recommender. We have done the [A’A] part, but now we need to do the [A’A]h_p. Using the current user’s purchase history will personalize the recommendations. The next post in this series will talk about using a search engine to take this last step. About the author Pat is a serial entrepreneur, consultant, and Apache Mahout committer working on the next generation of Spark-based recommenders. He lives in Seattle and can be contacted through his site https://finderbots.com or @occam on Twitter. Want more Spark? We've got you covered - click here.
Read more
  • 0
  • 0
  • 5915

article-image-visualization-tool-understand-data
Packt
22 Sep 2014
23 min read
Save for later

Visualization as a Tool to Understand Data

Packt
22 Sep 2014
23 min read
In this article by Nazmus Saquib, the author of Mathematica Data Visualization, we will look at a few simple examples that demonstrate the importance of data visualization. We will then discuss the types of datasets that we will encounter over the course of this book, and learn about the Mathematica interface to get ourselves warmed up for coding. (For more resources related to this topic, see here.) In the last few decades, the quick growth in the volume of information we produce and the capacity of digital information storage have opened a new door for data analytics. We have moved on from the age of terabytes to that of petabytes and exabytes. Traditional data analysis is now augmented with the term big data analysis, and computer scientists are pushing the bounds for analyzing this huge sea of data using statistical, computational, and algorithmic techniques. Along with the size, the types and categories of data have also evolved. Along with the typical and popular data domain in Computer Science (text, image, and video), graphs and various categorical data that arise from Internet interactions have become increasingly interesting to analyze. With the advances in computational methods and computing speed, scientists nowadays produce an enormous amount of numerical simulation data that has opened up new challenges in the field of Computer Science. Simulation data tends to be structured and clean, whereas data collected or scraped from websites can be quite unstructured and hard to make sense of. For example, let's say we want to analyze some blog entries in order to find out which blogger gets more follows and referrals from other bloggers. This is not as straightforward as getting some friends' information from social networking sites. Blog entries consist of text and HTML tags; thus, a combination of text analytics and tag parsing, coupled with a careful observation of the results would give us our desired outcome. Regardless of whether the data is simulated or empirical, the key word here is observation. In order to make intelligent observations, data scientists tend to follow a certain pipeline. The data needs to be acquired and cleaned to make sure that it is ready to be analyzed using existing tools. Analysis may take the route of visualization, statistics, and algorithms, or a combination of any of the three. Inference and refining the analysis methods based on the inference is an iterative process that needs to be carried out several times until we think that a set of hypotheses is formed, or a clear question is asked for further analysis, or a question is answered with enough evidence. Visualization is a very effective and perceptive method to make sense of our data. While statistics and algorithmic techniques provide good insights about data, an effective visualization makes it easy for anyone with little training to gain beautiful insights about their datasets. The power of visualization resides not only in the ease of interpretation, but it also reveals visual trends and patterns in data, which are often hard to find using statistical or algorithmic techniques. It can be used during any step of the data analysis pipeline—validation, verification, analysis, and inference—to aid the data scientist. How have you visualized your data recently? If you still have not, it is okay, as this book will teach you exactly that. However, if you had the opportunity to play with any kind of data already, I want you to take a moment and think about the techniques you used to visualize your data so far. Make a list of them. Done? Do you have 2D and 3D plots, histograms, bar charts, and pie charts in the list? If yes, excellent! We will learn how to style your plots and make them more interactive using Mathematica. Do you have chord diagrams, graph layouts, word cloud, parallel coordinates, isosurfaces, and maps somewhere in that list? If yes, then you are already familiar with some modern visualization techniques, but if you have not had the chance to use Mathematica as a data visualization language before, we will explore how visualization prototypes can be built seamlessly in this software using very little code. The aim of this book is to teach a Mathematica beginner the data-analysis and visualization powerhouse built into Mathematica, and at the same time, familiarize the reader with some of the modern visualization techniques that can be easily built with Mathematica. We will learn how to load, clean, and dissect different types of data, visualize the data using Mathematica's built-in tools, and then use the Mathematica graphics language and interactivity functions to build prototypes of a modern visualization. The importance of visualization Visualization has a broad definition, and so does data. The cave paintings drawn by our ancestors can be argued as visualizations as they convey historical data through a visual medium. Map visualizations were commonly used in wars since ancient times to discuss the past, present, and future states of a war, and to come up with new strategies. Astronomers in the 17th century were believed to have built the first visualization of their statistical data. In the 18th century, William Playfair invented many of the popular graphs we use today (line, bar, circle, and pie charts). Therefore, it appears as if many, since ancient times, have recognized the importance of visualization in giving some meaning to data. To demonstrate the importance of visualization in a simple mathematical setting, consider fitting a line to a given set of points. Without looking at the data points, it would be unwise to try to fit them with a model that seemingly lowers the error bound. It should also be noted that sometimes, the data needs to be changed or transformed to the correct form that allows us to use a particular tool. Visualizing the data points ensures that we do not fall into any trap. The following screenshot shows the visualization of a polynomial as a circle: Figure1.1 Fitting a polynomial In figure 1.1, the points are distributed around a circle. Imagine we are given these points in a Cartesian space (orthogonal x and y coordinates), and we are asked to fit a simple linear model. There is not much benefit if we try to fit these points to any polynomial in a Cartesian space; what we really need to do is change the parameter space to polar coordinates. A 1-degree polynomial in polar coordinate space (essentially a circle) would nicely fit these points when they are converted to polar coordinates, as shown in figure 1.1. Visualizing the data points in more complicated but similar situations can save us a lot of trouble. The following is a screenshot of Anscombe's quartet: Figure1.2 Anscombe's quartet, generated using Mathematica Downloading the color images of this book We also provide you a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/2999OT_coloredimages.PDF. Anscombe's quartet (figure 1.2), named after the statistician Francis Anscombe, is a classic example of how simple data visualization like plotting can save us from making wrong statistical inferences. The quartet consists of four datasets that have nearly identical statistical properties (such as mean, variance, and correlation), and gives rise to the same linear model when a regression routine is run on these datasets. However, the second dataset does not really constitute a linear relationship; a spline would fit the points better. The third dataset (at the bottom-left corner of figure 1.2) actually has a different regression line, but the outlier exerts enough influence to force the same regression line on the data. The fourth dataset is not even a linear relationship, but the outlier enforces the same regression line again. These two examples demonstrate the importance of "seeing" our data before we blindly run algorithms and statistics. Fortunately, for visualization scientists like us, the world of data types is quite vast. Every now and then, this gives us the opportunity to create new visual tools other than the traditional graphs, plots, and histograms. These visual signatures and tools serve the same purpose that the graph plotting examples previously just did—spy and investigate data to infer valuable insights—but on different types of datasets other than just point clouds. Another important use of visualization is to enable the data scientist to interactively explore the data. Two features make today's visualization tools very attractive—the ability to view data from different perspectives (viewing angles) and at different resolutions. These features facilitate the investigator in understanding both the micro- and macro-level behavior of their dataset. Types of datasets There are many different types of datasets that a visualization scientist encounters in their work. This book's aim is to prepare an enthusiastic beginner to delve into the world of data visualization. Certainly, we will not comprehensively cover each and every visualization technique out there. Our aim is to learn to use Mathematica as a tool to create interactive visualizations. To achieve that, we will focus on a general classification of datasets that will determine which Mathematica functions and programming constructs we should learn in order to visualize the broad class of data covered in this book. Tables The table is one of the most common data structures in Computer Science. You might have already encountered this in a computer science, database, or even statistics course, but for the sake of completeness, we will describe the ways in which one could use this structure to represent different kinds of data. Consider the following table as an example:   Attribute 1 Attribute 2 … Item 1       Item 2       Item 3       When storing datasets in tables, each row in the table represents an instance of the dataset, and each column represents an attribute of that data point. For example, a set of two-dimensional Cartesian vectors can be represented as a table with two attributes, where each row represents a vector, and the attributes are the x and y coordinates relative to an origin. For three-dimensional vectors or more, we could just increase the number of attributes accordingly. Tables can be used to store more advanced forms of scientific, time series, and graph data. We will cover some of these datasets over the course of this book, so it is a good idea for us to get introduced to them now. Here, we explain the general concepts. Scalar fields There are many kinds of scientific dataset out there. In order to aid their investigations, scientists have created their own data formats and mathematical tools to analyze the data. Engineers have also developed their own visualization language in order to convey ideas in their community. In this book, we will cover a few typical datasets that are widely used by scientists and engineers. We will eventually learn how to create molecular visualizations and biomedical dataset exploration tools when we feel comfortable manipulating these datasets. In practice, multidimensional data (just like vectors in the previous example) is usually augmented with one or more characteristic variable values. As an example, let's think about how a physicist or an engineer would keep track of the temperature of a room. In order to tackle the problem, they would begin by measuring the geometry and the shape of the room, and put temperature sensors at certain places to measure the temperature. They will note the exact positions of those sensors relative to the room's coordinate system, and then, they will be all set to start measuring the temperature. Thus, the temperature of a room can be represented, in a discrete sense, by using a set of points that represent the temperature sensor locations and the actual temperature at those points. We immediately notice that the data is multidimensional in nature (the location of a sensor can be considered as a vector), and each data point has a scalar value associated with it (temperature). Such a discrete representation of multidimensional data is quite widely used in the scientific community. It is called a scalar field. The following screenshot shows the representation of a scalar field in 2D and 3D: Figure1.3 In practice, scalar fields are discrete and ordered Figure 1.3 depicts how one would represent an ordered scalar field in 2D or 3D. Each point in the 2D field has a well-defined x and y location, and a single temperature value gets associated with it. To represent a 3D scalar field, we can think of it as a set of 2D scalar field slices placed at a regular interval along the third dimension. Each point in the 3D field is a point that has {x, y, z} values, along with a temperature value. A scalar field can be represented using a table. We will denote each {x, y} point (for 2D) or {x, y, z} point values (for 3D) as a row, but this time, an additional attribute for the scalar value will be created in the table. Thus, a row will have the attributes {x, y, z, T}, where T is the temperature associated with the point defined by the x, y, and z coordinates. This is the most common representation of scalar fields. A widely used visualization technique to analyze scalar fields is to find out the isocontours or isosurfaces of interest. However, for now, let's take a look at the kind of application areas such analysis will enable one to pursue. Instead of temperature, one could think of associating regularly spaced points with any relevant scalar value to form problem-specific scalar fields. In an electrochemical simulation, it is important to keep track of the charge density in the simulation space. Thus, the chemist would create a scalar field with charge values at specific points. For an aerospace engineer, it is quite important to understand how air pressure varies across airplane wings; they would keep track of the pressure by forming a scalar field of pressure values. Scalar field visualization is very important in many other significant areas, ranging from from biomedical analysis to particle physics. In this book, we will cover how to visualize this type of data using Mathematica. Time series Another widely used data type is the time series. A time series is a sequence of data points that are measured usually over a uniform interval of time. Time series arise in many fields, but in today's world, they are mostly known for their applications in Economics and Finance. Other than these, they are frequently used in statistics, weather prediction, signal processing, astronomy, and so on. It is not the purpose of this book to describe the theory and mathematics of time series data. However, we will cover some of Mathematica's excellent capabilities for visualizing time series, and in the course of this book, we will construct our own visualization tool to view time series data. Time series can be easily represented using tables. Each row of the time series table will represent one point in the series, with one attribute denoting the time stamp—the time at which the data point was recorded, and the other attribute storing the actual data value. If the starting time and the time interval are known, then we can get rid of the time attribute and simply store the data value in each row. The actual timestamp of each value can be calculated using the initial time and time interval. Images and videos can be represented as tables too, with pixel-intensity values occupying each entry of the table. As we focus on visualization and not image processing, we will skip those types of data. Graphs Nowadays, graphs arise in all contexts of computer science and social science. This particular data structure provides a way to convert real-world problems into a set of entities and relationships. Once we have a graph, we can use a plethora of graph algorithms to find beautiful insights about the dataset. Technically, a graph can be stored as a table. However, Mathematica has its own graph data structure, so we will stick to its norm. Sometimes, visualizing the graph structure reveals quite a lot of hidden information. Graph visualization itself is a challenging problem, and is an active research area in computer science. A proper visualization layout, along with proper color maps and size distribution, can produce very useful outputs. Text The most common form of data that we encounter everywhere is text. Mathematica does not provide any specific visualization package for state-of-the-art text visualization methods. Cartographic data As mentioned before, map visualization is one of the ancient forms of visualization known to us. Nowadays, with the advent of GPS, smartphones, and publicly available country-based data repositories, maps are providing an excellent way to contrast and compare different countries, cities, or even communities. Cartographic data comes in various forms. A common form of a single data item is one that includes latitude, longitude, location name, and an attribute (usually numerical) that records a relevant quantity. However, instead of a latitude and longitude coordinate, we may be given a set of polygons that describe the geographical shape of the place. The attributable quantity may not be numerical, but rather something qualitative, like text. Thus, there is really no standard form that one can expect when dealing with cartographic data. Fortunately, Mathematica provides us with excellent data-mining and dissecting capabilities to build custom formats out of the data available to us. . Mathematica as a tool for visualization At this point, you might be wondering why Mathematica is suited for visualizing all the kinds of datasets that we have mentioned in the preceding examples. There are many excellent tools and packages out there to visualize data. Mathematica is quite different from other languages and packages because of the unique set of capabilities it presents to its user. Mathematica has its own graphics language, with which graphics primitives can be interactively rendered inside the worksheet. This makes Mathematica's capability similar to many widely used visualization languages. Mathematica provides a plethora of functions to combine these primitives and make them interactive. Speaking of interactivity, Mathematica provides a suite of functions to interactively display any of its process. Not only visualization, but any function or code evaluation can be interactively visualized. This is particularly helpful when managing and visualizing big datasets. Mathematica provides many packages and functions to visualize the kinds of datasets we have mentioned so far. We will learn to use the built-in functions to visualize structured and unstructured data. These functions include point, line, and surface plots; histograms; standard statistical charts; and so on. Other than these, we will learn to use the advanced functions that will let us build our own visualization tools. Another interesting feature is the built-in datasets that this software provides to its users. This feature provides a nice playground for the user to experiment with different datasets and visualization functions. From our discussion so far, we have learned that visualization tools are used to analyze very large datasets. While Mathematica is not really suited for dealing with petabytes or exabytes of data (and many other popularly used visualization tools are not suited for that either), often, one needs to build quick prototypes of such visualization tools using smaller sample datasets. Mathematica is very well suited to prototype such tools because of its efficient and fast data-handling capabilities, along with its loads of convenient functions and user-friendly interface. It also supports GPU and other high-performance computing platforms. Although it is not within the scope of this book, a user who knows how to harness the computing power of Mathematica can couple that knowledge with visualization techniques to build custom big data visualization solutions. Another feature that Mathematica presents to a data scientist is the ability to keep the workflow within one worksheet. In practice, many data scientists tend to do their data analysis with one package, visualize their data with another, and export and present their findings using something else. Mathematica provides a complete suite of a core language, mathematical and statistical functions, a visualization platform, and versatile data import and export features inside a single worksheet. This helps the user focus on the data instead of irrelevant details. By now, I hope you are convinced that Mathematica is worth learning for your data-visualization needs. If you still do not believe me, I hope I will be able to convince you again at the end of the book, when we will be done developing several visualization prototypes, each requiring only few lines of code! Getting started with Mathematica We will need to know a few basic Mathematica notebook essentials. Assuming you already have Mathematica installed on your computer, let's open a new notebook by navigating to File|New|Notebook, and do the following experiments. Creating and selecting cells In Mathematica, a chunk of code or any number of mathematical expressions can be written within a cell. Each cell in the notebook can be evaluated to see the output immediately below it. To start a new cell, simply start typing at the position of the blinking cursor. Each cell can be selected by clicking on the respective rightmost bracket. To select multiple cells, press Ctrl + right-mouse button in Windows or Linux (or cmd + right-mouse button on a Mac) on each of the cells. The following screenshot shows several cells selected together, along with the output from each cell: Figure1.4 Selecting and evaluating cells in Mathematica We can place a new cell in between any set of cells in order to change the sequence of instruction execution. Use the mouse to place the cursor in between two cells, and start typing your commands to create a new cell. We can also cut, copy, and paste cells by selecting them and applying the usual shortcuts (for example, Ctrl + C, Ctrl + X, and Ctrl + V in Windows/Linux, or cmd + C, cmd + X, and cmd + V in Mac) or using the Edit menu bar. In order to delete cell(s), select the cell(s) and press the Delete key. Evaluating a cell A cell can be evaluated by pressing Shift + Enter. Multiple cells can be selected and evaluated in the same way. To evaluate the full notebook, press Ctrl + A (to select all the cells) and then press Shift + Enter. In this case, the cells will be evaluated one after the other in the sequence in which they appear in the notebook. To see examples of notebooks filled with commands, code, and mathematical expressions, you can open the notebooks supplied with this article, which are the polar coordinates fitting and Anscombe's quartet examples, and select each cell (or all of them) and evaluate them. If we evaluate a cell that uses variables declared in a previous cell, and the previous cell was not already evaluated, then we may get errors. It is possible that Mathematica will treat the unevaluated variables as a symbolic expression; in that case, no error will be displayed, but the results will not be numeric anymore. Suppressing output from a cell If we don't wish to see the intermediate output as we load data or assign values to variables, we can add a semicolon (;) at the end of each line that we want to leave out from the output. Cell formatting Mathematica input cells treat everything inside them as mathematical and/or symbolic expressions. By default, every new cell you create by typing at the horizontal cursor will be an input expression cell. However, you can convert the cell to other formats for convenient typesetting. In order to change the format of cell(s), select the cell(s) and navigate to Format|Style from the menu bar, and choose a text format style from the available options. You can add mathematical symbols to your text by selecting Palettes|Basic Math Assistant. Note that evaluating a text cell will have no effect/output. Commenting We can write any comment in a text cell as it will be ignored during the evaluation of our code. However, if we would like to write a comment inside an input cell, we use the (* operator to open a comment and the *) operator to close it, as shown in the following code snippet: (* This is a comment *) The shortcut Ctrl + / (cmd + / in Mac) is used to comment/uncomment a chunk of code too. This operation is also available in the menu bar. Downloading the example code You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. Aborting evaluation We can abort the currently running evaluation of a cell by navigating to Evaluation|Abort Evaluation in the menu bar, or simply by pressing Alt + . (period). This is useful when you want to end a time-consuming process that you suddenly realize will not give you the correct results at the end of the evaluation, or end a process that might use up the available memory and shut down the Mathematica kernel. Further reading The history of visualization deserves a separate book, as it is really fascinating how the field has matured over the centuries, and it is still growing very strongly. Michael Friendly, from York University, published a historical development paper that is freely available online, titled Milestones in History of Data Visualization: A Case Study in Statistical Historiography. This is an entertaining compilation of the history of visualization methods. The book The Visual Display of Quantitative Information by Edward R. Tufte published by Graphics Press USA, is an excellent resource and a must-read for every data visualization practitioner. This is a classic book on the theory and practice of data graphics and visualization. Since we will not have the space to discuss the theory of visualization, the interested reader can consider reading this book for deeper insights. Summary In this article, we discussed the importance of data visualization in different contexts. We also introduced the types of dataset that will be visualized over the course of this book. The flexibility and power of Mathematica as a visualization package was discussed, and we will see the demonstration of these properties throughout the book with beautiful visualizations. Finally, we have taken the first step to writing code in Mathematica. Resources for Article: Further resources on this subject: Driving Visual Analyses with Automobile Data (Python) [article] Importing Dynamic Data [article] Interacting with Data for Dashboards [article]
Read more
  • 0
  • 0
  • 9283
Modal Close icon
Modal Close icon