Introducing Interactive Plotting

Packt
20 Mar 2015
29 min read
This article is written by Benjamin V. Root, the author of Interactive Applications using Matplotlib. The goal of any interactive application is to provide as much information as possible while minimizing complexity. If it can't provide the information the users need, then it is useless to them. However, if the application is too complex, then the information's signal gets lost in the noise of the complexity. A graphical presentation often strikes the right balance. The Matplotlib library can help you present your data as graphs in your application. Anybody can make a simple interactive application without knowing anything about draw buffers, event loops, or even what a GUI toolkit is. And yet, the Matplotlib library will cede as much control as desired to allow even the most savvy GUI developer to create a masterful application from scratch. Like much of the Python language, Matplotlib's philosophy is to give the developer full control, but without being stupidly unhelpful and tedious.

Installing Matplotlib

There are many ways to install Matplotlib on your system. While the library used to have a reputation for being difficult to install on non-Linux systems, it has come a long way since then, along with the rest of the Python ecosystem. Refer to the following command:

$ pip install matplotlib

Most likely, the preceding command would work just fine from the command line. Python Wheels (the next-generation Python package format that has replaced "eggs") for Matplotlib are now available from PyPI for Windows and Mac OS X systems. This method would also work for Linux users; however, it might be more favorable to install it via the system's built-in package manager.

While the core Matplotlib library can be installed with few dependencies, it is part of a much larger scientific computing ecosystem known as SciPy. Displaying your data is often the easiest part of your application. Processing it is much more difficult, and the SciPy ecosystem most likely has the packages you need to do that. For basic numerical processing and N-dimensional data arrays, there is NumPy. For more advanced but general data processing tools, there is the SciPy package (the name was so catchy, it ended up being used to refer to many different things in the community). For more domain-specific needs, there are "Sci-Kits" such as scikit-learn for machine learning, scikit-image for image processing, and statsmodels for statistical modeling. Another very useful library for data processing is pandas.

This was just a short summary of the packages available in the SciPy ecosystem. Manually managing all of their installations, updates, and dependencies would be difficult for many who simply want to use the tools. Luckily, there are several distributions of the SciPy Stack available that can keep the menagerie under control. The following are Python distributions that include the SciPy Stack along with many other popular Python packages, or that make the packages easily available through package management software:

Anaconda from Continuum Analytics
Canopy from Enthought
SciPy Superpack
Python(x, y) (Windows only)
WinPython (Windows only)
Pyzo (Python 3 only)
Algorete Loopy from Dartmouth College

Show() your work

With Matplotlib installed, you are now ready to make your first simple plot. Matplotlib has multiple layers. Pylab is the topmost layer, often used for quick one-off plotting from within a live Python session.
Start up your favorite Python interpreter and type the following:

>>> from pylab import *
>>> plot([1, 2, 3, 2, 1])

Nothing happened! This is because Matplotlib, by default, will not display anything until you explicitly tell it to do so. The Matplotlib library is often used for automated image generation from within Python scripts, with no need for any interactivity. Also, most users would not be done with their plotting yet and would find it distracting to have a plot come up automatically. When you are ready to see your plot, use the following command:

>>> show()

Interactive navigation

A figure window should now appear, and the Python interpreter is not available for any additional commands. By default, showing a figure will block the execution of your scripts and interpreter. However, this does not mean that the figure is not interactive. As you mouse over the plot, you will see the plot coordinates in the lower right-hand corner. The figure window will also have a toolbar. From left to right, the following are the tools:

Home, Back, and Forward: These are similar to those of a web browser. These buttons help you navigate through the previous views of your plot. The "Home" button will take you back to the first view when the figure was opened. "Back" will take you to the previous view, while "Forward" will move you ahead again through the views you have already visited.

Pan (and zoom): This button has two modes: pan and zoom. Press the left mouse button and hold it to pan the figure. If you press x or y while panning, the motion will be constrained to just the x or y axis, respectively. Press the right mouse button to zoom. The plot will be zoomed in or out proportionate to the right/left and up/down movements. Use the X, Y, or Ctrl key to constrain the zoom to the x axis or the y axis or preserve the aspect ratio, respectively.

Zoom-to-rectangle: Press the left mouse button and drag the cursor to a new location and release. The axes view limits will be zoomed to the rectangle you just drew. Zoom out using your right mouse button, placing the current view into the region defined by the rectangle you just drew.

Subplot configuration: This button brings up a tool to modify plot spacing.

Save: This button brings up a dialog that allows you to save the current figure.

The figure window is also responsive to the keyboard. The default keymap is fairly extensive (and will be covered fully later), but some of the basic hot keys are the Home key for resetting the plot view, the left and right keys for back and forward actions, p for pan/zoom mode, o for zoom-to-rectangle mode, and Ctrl + s to trigger a file save. When you are done viewing your figure, close the window as you would close any other application window, or use Ctrl + w.

Interactive plotting

When we did the previous example, no plots appeared until show() was called. Furthermore, no new commands could be entered into the Python interpreter until all the figures were closed. As you will soon learn, once a figure is closed, the plot it contains is lost, which means that you would have to repeat all the commands again in order to show() it again, perhaps with some modification or additional plot. Matplotlib ships with its interactive plotting mode off by default. There are a couple of ways to turn the interactive plotting mode on. The main way is by calling the ion() function (for Interactive ON). Interactive plotting mode can be turned on at any time and turned off with ioff(). Once this mode is turned on, the next plotting command will automatically trigger an implicit show() command. Furthermore, you can continue typing commands into the Python interpreter. You can modify the current figure, create new figures, and close existing ones at any time, all from the current Python session.
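For example, continuing the pylab session from earlier, a minimal sketch of an interactive session might look like the following (the exact window behavior depends on your backend):

>>> ion()                          # turn interactive plotting mode on
>>> plot([1, 2, 3, 2, 1])          # a figure window appears immediately
>>> title("An interactive plot")   # the open figure updates in place
>>> ioff()                         # turn interactive plotting mode back off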
Scripted plotting

Python is known for more than just its interactive interpreters; it is also a fully fledged programming language that allows its users to easily create programs. Having a script to display plots from daily reports can greatly improve your productivity. Alternatively, you perhaps need a tool that can produce some simple plots of the data from whatever mystery data file you have come across on the network share. Here is a simple example of how to use Matplotlib's pyplot API and the argparse Python standard library tool to create a simple CSV plotting script called plotfile.py.

Code: chp1/plotfile.py

#!/usr/bin/env python
from argparse import ArgumentParser
import matplotlib.pyplot as plt

if __name__ == '__main__':
    parser = ArgumentParser(description="Plot a CSV file")
    parser.add_argument("datafile", help="The CSV File")
    # Require at least one column name
    parser.add_argument("columns", nargs='+',
                        help="Names of columns to plot")
    parser.add_argument("--save", help="Save the plot as...")
    parser.add_argument("--no-show", action="store_true",
                        help="Don't show the plot")
    args = parser.parse_args()

    plt.plotfile(args.datafile, args.columns)
    if args.save:
        plt.savefig(args.save)
    if not args.no_show:
        plt.show()

Note the two optional command-line arguments: --save and --no-show. With the --save option, the user can have the plot automatically saved (the graphics format is determined automatically from the filename extension). Also, the user can choose not to display the plot, which when coupled with the --save option might be desirable if the user is trying to plot several CSV files. When calling this script to show a plot, the execution of the script will stop at the call to plt.show(). If the interactive plotting mode was on, then the execution of the script would continue past show(), terminating the script, thus automatically closing out any figures before the user has had a chance to view them. This is why the interactive plotting mode is turned off by default in Matplotlib. Also note that the call to plt.savefig() is before the call to plt.show(). As mentioned before, when the figure window is closed, the plot is lost. You cannot save a plot after it has been closed.
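As a hypothetical illustration (the CSV filename and column names below are made up, not from the book), the script might then be invoked like this:

$ python plotfile.py measurements.csv temperature pressure --save report.png --no-show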
Getting help

We have covered how to install Matplotlib and went over how to make very simple plots from a Python session or a Python script. Most likely, this went very smoothly for you. You may be very curious and want to learn more about the many kinds of plots this library has to offer, or maybe you want to learn how to make new kinds of plots. Help comes in many forms. The Matplotlib website (http://matplotlib.org) is the primary online resource for Matplotlib. It contains examples, FAQs, API documentation, and, most importantly, the gallery.

Gallery

Many users of Matplotlib are often faced with the question, "I want to make a plot that has this data along with that data in the same figure, but it needs to look like this other plot I have seen." Text-based searches on graphing concepts are difficult, especially if you are unfamiliar with the terminology. The gallery showcases the variety of ways in which one can make plots, all using the Matplotlib library. Browse through the gallery, click on any figure that has pieces of what you want in your plot, and see the code that generated it. Soon enough, you will be like a chef, mixing and matching components to produce that perfect graph.

Mailing lists and forums

When you are simply stuck and cannot figure out how to get something to work, or just need some hints on how to get started, you will find much of the community at the Matplotlib-users mailing list. This mailing list is an excellent resource of information with many friendly members who just love to help out newcomers. Be persistent! While many questions do get answered fairly quickly, some will fall through the cracks. Try rephrasing your question, or include a plot showing your attempts so far. The people at Matplotlib-users love plots, so an image that shows what is wrong often gets the quickest response. A newer community resource is StackOverflow, which has many very knowledgeable users who are able to answer difficult questions.

From front to backend

So far, we have shown you bits and pieces of two of Matplotlib's topmost abstraction layers: pylab and pyplot. The layer below them is the object-oriented layer (the OO layer). To develop any type of application, you will want to use this layer. Mixing the pylab/pyplot layers with the OO layer will lead to very confusing behaviors when dealing with multiple plots and figures. Below the OO layer is the backend interface. Everything above this interface level in Matplotlib is completely platform-agnostic. It will work the same regardless of whether it is in an interactive GUI or comes from a driver script running on a headless server. The backend interface abstracts away all those considerations so that you can focus on what is most important: writing code to visualize your data. There are several backend implementations that are shipped with Matplotlib. These backends are responsible for taking the figures represented by the OO layer and interpreting them for whichever "display device" they implement. The backends are chosen automatically but can be explicitly set, if needed.

Interactive versus non-interactive

There are two main classes of backends: ones that provide interactive figures and ones that don't. Interactive backends are ones that support a particular GUI, such as Tcl/Tkinter, GTK, Qt, Cocoa/Mac OS X, wxWidgets, and Cairo. With the exception of the Cocoa/Mac OS X backend, all interactive backends can be used on Windows, Linux, and Mac OS X. Therefore, when you make an interactive Matplotlib application that you wish to distribute to users of any of those platforms, unless you are embedding Matplotlib, you will not have to concern yourself with writing a single line of code for any of these toolkits; it has already been done for you! Non-interactive backends are used to produce image files. There are backends to produce Postscript/EPS, Adobe PDF, and Scalable Vector Graphics (SVG) as well as rasterized image files such as PNG, BMP, and JPEGs.

Anti-grain geometry

The open secret behind the high quality of Matplotlib's rasterized images is its use of the Anti-Grain Geometry (AGG) library (http://agg.sourceforge.net/antigrain.com/index.html). The quality of the graphics generated from AGG is far superior to that of most other toolkits available. Therefore, not only is AGG used to produce rasterized image files, but it is also utilized in most of the interactive backends as well.
Matplotlib maintains and ships with its own fork of the library in order to ensure you have consistent, high quality image products across all platforms and toolkits. What you see on your screen in your interactive figure window will be the same as the PNG file that is produced when you call savefig().

Selecting your backend

When you install Matplotlib, a default backend is chosen for you based upon your OS and the available GUI toolkits. For example, on Mac OS X systems, your installation of the library will most likely set the default interactive backend to MacOSX or CocoaAgg for older Macs. Meanwhile, Windows users will most likely have a default of TkAgg or Qt5Agg. In most situations, the choice of interactive backends will not matter. However, in certain situations, it may be necessary to force a particular backend to be used. For example, on a headless server without an active graphics session, you would most likely need to force the use of the non-interactive Agg backend:

import matplotlib
matplotlib.use("Agg")

When done prior to any plotting commands, this will avoid loading any GUI toolkits, thereby bypassing problems that occur when a GUI fails on a headless server. Any call to show() effectively becomes a no-op (and the execution of the script is not blocked).
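As a minimal sketch of this headless workflow (not from the book; the data and output filename are made up), a script could select the Agg backend and write straight to a file:

import matplotlib
matplotlib.use("Agg")          # must be done before pyplot is imported
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
ax.plot([1, 2, 3, 2, 1])
fig.savefig("report.png")      # no GUI toolkit is ever loaded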
Another purpose of setting your backend is for scenarios when you want to embed your plot in a native GUI application. Therefore, you will need to explicitly state which GUI toolkit you are using. Finally, some users simply like the look and feel of some GUI toolkits better than others. They may wish to change the default backend via the backend parameter in the matplotlibrc configuration file. Most likely, your rc file can be found in the .matplotlib directory or the .config/matplotlib directory under your home folder. If you can't find it, then use the following set of commands:

>>> import matplotlib
>>> matplotlib.matplotlib_fname()
u'/home/ben/.config/matplotlib/matplotlibrc'

Here is an example of the relevant section in my matplotlibrc file:

#### CONFIGURATION BEGINS HERE

# the default backend; one of GTK GTKAgg GTKCairo GTK3Agg
# GTK3Cairo CocoaAgg MacOSX QtAgg Qt4Agg TkAgg WX WXAgg Agg Cairo
# PS PDF SVG
# You can also deploy your own backend outside of matplotlib by
# referring to the module name (which must be in the PYTHONPATH)
# as 'module://my_backend'
#backend     : GTKAgg
#backend     : QT4Agg
backend     : TkAgg

# If you are using the Qt4Agg backend, you can choose here
# to use the PyQt4 bindings or the newer PySide bindings to
# the underlying Qt4 toolkit.
#backend.qt4 : PyQt4       # PyQt4 | PySide

This is the global configuration file that is used if one isn't found in the current working directory when Matplotlib is imported. The settings contained in this configuration serve as default values for many parts of Matplotlib. In particular, we see that the choice of backends can be easily set without having to use a single line of code.

The Matplotlib figure-artist hierarchy

Everything that can be drawn in Matplotlib is called an artist. Any artist can have child artists that are also drawable. This forms the basis of a hierarchy of artist objects that Matplotlib sends to a backend for rendering. At the root of this artist tree is the figure. In the examples so far, we have not explicitly created any figures. The pylab and pyplot interfaces will create the figures for us. However, when creating advanced interactive applications, it is highly recommended that you explicitly create your figures. You will especially want to do this if you have multiple figures being displayed at the same time. This is the entry into the OO layer of Matplotlib:

fig = plt.figure()

Canvassing the figure

The figure is, quite literally, your canvas. Its primary component is the FigureCanvas instance upon which all drawing occurs. Unless you are embedding your Matplotlib figures into a GUI application, it is very unlikely that you will need to interact with this object directly. Instead, as plotting commands are issued, artist objects are added to the canvas automatically. While any artist can be added directly to the figure, usually only Axes objects are added. A figure can have many axes objects, typically called subplots. Much like the figure object, our examples so far have not explicitly created any axes objects to use. This is because the pylab and pyplot interfaces will also automatically create and manage axes objects for a figure if needed. For the same reason as for figures, you will want to explicitly create these objects when building your interactive applications. If an axes or a figure is not provided, then the pyplot layer will have to make assumptions about which axes or figure you mean to apply a plotting command to. While this might be fine for simple situations, these assumptions get hairy very quickly in non-trivial applications. Luckily, it is easy to create both your figure and its axes using a single command:

fig, axes = plt.subplots(2, 1)   # 2x1 grid of subplots

These objects are highly advanced, complex units that most developers will utilize for their plotting needs. Once placed on the figure canvas, the axes object will provide the ticks, axis labels, axes title(s), and the plotting area. An axes is an artist that manages all of its scale and coordinate transformations (for example, log scaling and polar coordinates), automated tick labeling, and automated axis limits. In addition to these responsibilities, an axes object provides a wide assortment of plotting functions. A sampling of plotting functions is as follows:

bar: Make a bar plot
barbs: Plot a two-dimensional field of barbs
boxplot: Make a box and whisker plot
cohere: Plot the coherence between x and y
contour: Plot contours
errorbar: Plot an errorbar graph
hexbin: Make a hexagonal binning plot
hist: Plot a histogram
imshow: Display an image on the axes
pcolor: Create a pseudocolor plot of a two-dimensional array
pcolormesh: Plot a quadrilateral mesh
pie: Plot a pie chart
plot: Plot lines and/or markers
quiver: Plot a two-dimensional field of arrows
sankey: Create a Sankey flow diagram
scatter: Make a scatter plot of x versus y
stem: Create a stem plot
streamplot: Draw streamlines of a vector flow
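To give a feel for this axes-level API, here is a small sketch (not from the book, using made-up random data) that exercises two of the methods listed above on a pair of subplots:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)           # made-up sample data
fig, (ax1, ax2) = plt.subplots(2, 1)   # 2x1 grid of subplots
ax1.hist(data, bins=30)                # histogram on the first axes
ax1.set_title("hist")
ax2.scatter(data[:-1], data[1:])       # scatter plot on the second axes
ax2.set_title("scatter")
plt.show()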
The running example that we will start building in this article is a storm track editing application. Given a series of radar images, the user can circle each storm cell they see in the radar image and link those storm cells across time. The application will need the ability to save and load track data and provide the user with mechanisms to edit the data. Along the way, we will learn about Matplotlib's structure, its artists, the callback system, doing animations, and finally, embedding this application within a larger GUI application. So, to begin, we first need to be able to view a radar image. There are many ways to load data into a Python program, but one particular favorite among meteorologists is the Network Common Data Form (NetCDF) file. The SciPy package has built-in support for NetCDF version 3, so we will be using an hour's worth of radar reflectivity data prepared using this format from a NEXRAD site near Oklahoma City, OK on the evening of May 10, 2010, which produced numerous tornadoes and severe storms. The NetCDF binary file is particularly nice to work with because it can hold multiple data variables in a single file, with each variable having an arbitrary number of dimensions. Furthermore, metadata can be attached to each variable and to the dataset itself, allowing you to self-document data files. This particular data file has three variables, namely Reflectivity, lat, and lon, to record the radar reflectivity values and the latitude and longitude coordinates of each pixel in the reflectivity data. The reflectivity data is three-dimensional, with the first dimension as time and the other two dimensions as latitude and longitude. The following code example shows how easy it is to load this data and display the first image frame using SciPy and Matplotlib.

Code: chp1/simple_radar_viewer.py

import matplotlib.pyplot as plt
from scipy.io import netcdf_file

ncf = netcdf_file('KTLX_20100510_22Z.nc')
data = ncf.variables['Reflectivity']
lats = ncf.variables['lat']
lons = ncf.variables['lon']
i = 0

cmap = plt.get_cmap('gist_ncar')
cmap.set_under('lightgrey')

fig, ax = plt.subplots(1, 1)
im = ax.imshow(data[i], origin='lower',
               extent=(lons[0], lons[-1], lats[0], lats[-1]),
               vmin=0.1, vmax=80, cmap='gist_ncar')
cb = fig.colorbar(im)

cb.set_label('Reflectivity (dBZ)')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()

Running this script should result in a figure window that will display the first frame of our storms. The plot has a colorbar and the axes ticks label the latitudes and longitudes of our data. What is probably most important in this example is the imshow() call. Being an image, traditionally, the origin of the image data is shown in the upper-left corner and Matplotlib follows this tradition by default. However, this particular dataset was saved with its origin in the lower-left corner, so we need to state this with the origin parameter. The extent parameter is a tuple describing the data extent of the image. By default, it is assumed to be at (0, 0) and (N - 1, M - 1) for an MxN shaped image. The vmin and vmax parameters are a good way to ensure consistency of your colormap regardless of your input data. If these two parameters are not supplied, then imshow() will use the minimum and maximum of the input data to determine the colormap. This would be undesirable as we move towards displaying arbitrary frames of radar data. Finally, one can explicitly specify the colormap to use for the image. The gist_ncar colormap is very similar to the official NEXRAD colormap for radar data, so we will use it here.

Note that the gist_ncar colormap, along with some other colormaps packaged with Matplotlib such as the default jet colormap, is actually terrible for visualization. See the Choosing Colormaps page of the Matplotlib website for an explanation of why, and guidance on how to choose a better colormap.

The menagerie of artists

Whenever a plotting function is called, the input data and parameters are processed to produce new artists to represent the data. These artists are either primitives or collections thereof. They are called primitives because they represent basic drawing components such as lines, images, polygons, and text.
It is with these primitives that your data can be represented as bar charts, line plots, errorbars, or any other kinds of plots.

Primitives

There are four drawing primitives in Matplotlib: Line2D, AxesImage, Patch, and Text. It is from these primitive artists that all other artist objects are derived, and they comprise everything that can be drawn in a figure. A Line2D object uses a list of coordinates to draw line segments in between. Typically, the individual line segments are straight, and curves can be approximated with many vertices; however, curves can be specified to draw arcs, circles, or any other Bezier-approximated curves. An AxesImage class will take two-dimensional data and coordinates and display an image of that data with a colormap applied to it. There are actually other kinds of basic image artists available besides AxesImage, but they are typically for very special uses. AxesImage objects can be very tricky to deal with, so it is often best to use the imshow() plotting method to create and return these objects. A Patch object is an arbitrary two-dimensional object that has a single color for its "face." A polygon object is a specific instance of the slightly more general patch. These objects have a "path" (much like a Line2D object) that specifies segments that would enclose a face with a single color. The path is known as an "edge," and can have its own color as well. Besides the Polygons that one sees for bar plots and pie charts, Patch objects are also used to create arrows, legend boxes, and the markers used in scatter plots and elsewhere. Finally, the Text object takes a Python string, a point coordinate, and various font parameters to form the text that annotates plots. Matplotlib primarily uses TrueType fonts. It will search for fonts available on your system as well as ship with a few FreeType2 fonts, and it uses Bitstream Vera by default. Additionally, a Text object can defer to LaTeX to render its text, if desired.

While specific artist classes will have their own set of properties that make sense for the particular art object they represent, there are several common properties that can be set. The following is a listing of some of these properties:

alpha: 0 represents transparent and 1 represents opaque
color: Color name or other color specification
visible: Boolean flag for whether or not to draw the artist
zorder: Value of the draw order in the layering engine
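As a quick, hypothetical sketch (not from the book) of setting these common properties, here they are applied to the Line2D artist returned by a plot() call:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
line, = ax.plot([1, 2, 3, 2, 1])   # ax.plot() returns a list of Line2D artists
line.set_alpha(0.5)                # semi-transparent
line.set_color('green')
line.set_zorder(3)                 # drawn above artists with a lower zorder
line.set_visible(True)
plt.show()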
Let's extend the radar image example by loading up already saved polygons of storm cells in the tutorial.py file.

Code: chp1/simple_storm_cell_viewer.py

import matplotlib.pyplot as plt
from scipy.io import netcdf_file
from matplotlib.patches import Polygon
from tutorial import polygon_loader

ncf = netcdf_file('KTLX_20100510_22Z.nc')
data = ncf.variables['Reflectivity']
lats = ncf.variables['lat']
lons = ncf.variables['lon']
i = 0

cmap = plt.get_cmap('gist_ncar')
cmap.set_under('lightgrey')

fig, ax = plt.subplots(1, 1)
im = ax.imshow(data[i], origin='lower',
               extent=(lons[0], lons[-1], lats[0], lats[-1]),
               vmin=0.1, vmax=80, cmap='gist_ncar')
cb = fig.colorbar(im)

polygons = polygon_loader('polygons.shp')
for poly in polygons[i]:
    p = Polygon(poly, lw=3, fc='k', ec='w', alpha=0.45)
    ax.add_artist(p)

cb.set_label("Reflectivity (dBZ)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
plt.show()

The polygon data returned from polygon_loader() is a dictionary of lists keyed by a frame index. The list contains Nx2 numpy arrays of vertex coordinates in longitude and latitude. The vertices form the outline of a storm cell. The Polygon constructor, like those of all other artist objects, takes many optional keyword arguments. First, lw is short for linewidth (referring to the outline of the polygon), which we specify to be three points wide. Next is fc, which is short for facecolor, and is set to black ('k'). This is the color of the filled-in region of the polygon. Then edgecolor (ec) is set to white ('w') to help the polygons stand out against a dark background. Finally, we set the alpha argument to be slightly less than half to make the polygon fairly transparent so that one can still see the reflectivity data beneath the polygons.

Note a particular difference between how we plotted the image using imshow() and how we plotted the polygons using polygon artists. For polygons, we called a constructor and then explicitly called ax.add_artist() to add each polygon instance as a child of the axes. Meanwhile, imshow() is a plotting function that will do all of the hard work in validating the inputs, building the AxesImage instance, making all necessary modifications to the axes instance (such as setting the limits and aspect ratio), and most importantly, adding the artist object to the axes. Finally, all plotting functions in Matplotlib return artists or a list of the artist objects that they create. In most cases, you will not need to save this return value in a variable because there is nothing else to do with them. In this case, we only needed the returned AxesImage so that we could pass it to the fig.colorbar() method. This is so that it would know what to base the colorbar upon. The plotting functions in Matplotlib exist to provide convenience and simplicity to what can often be very tricky to get right by yourself. They are not magic! They use the same OO interface that is accessible to application developers. Therefore, anyone can write their own plotting functions to make complicated plots easier to perform.

Collections

Any artist that has child artists (such as a figure or an axes) is called a container. A special kind of container in Matplotlib is called a Collection. A collection usually contains a list of primitives of the same kind that should all be treated similarly. For example, a CircleCollection would have a list of Circle objects, all with the same color, size, and edge width. Individual values for artists in the collection can also be set. A collection makes management of many artists easier. This becomes especially important when considering the number of artist objects that may be needed for scatter plots, bar charts, or any other kind of plot or diagram. Some collections are not just simply a list of primitives, but are artists in their own right. These special kinds of collections take advantage of various optimizations that can be assumed when rendering similar or identical things. RegularPolyCollection, for example, just needs to know the points of a single polygon relative to its center (such as a star or box) and then just needs a list of all the center coordinates, avoiding the need to store all the vertices of every polygon in its collection in memory. In the following example, we will display storm tracks as a LineCollection. Note that instead of using ax.add_artist() (which would work), we will use ax.add_collection() instead.
This has the added benefit of performing special handling on the object to determine its bounding box, so that the axes object can incorporate the limits of this collection with any other plotted objects and automatically set its own limits, which we trigger with the ax.autoscale(True) call.

Code: chp1/linecoll_track_viewer.py

import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from tutorial import track_loader

tracks = track_loader('polygons.shp')
# Filter out non-tracks (unassociated polygons given trackID of -9)
tracks = {tid: t for tid, t in tracks.items() if tid != -9}

fig, ax = plt.subplots(1, 1)
lc = LineCollection(tracks.values(), color='b')
ax.add_collection(lc)
ax.autoscale(True)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
plt.show()

This was much easier than the radar images: Matplotlib took care of all the limit setting automatically. Such features are extremely useful for writing generic applications that do not wish to concern themselves with such details.

Summary

In this article, we introduced you to the foundational concepts of Matplotlib. Using show(), you showed your first plot with only three lines of Python. With this plot up on your screen, you learned some of the basic interactive features built into Matplotlib, such as panning, zooming, and the myriad of key bindings that are available. Then we discussed the difference between interactive and non-interactive plotting modes and the difference between scripted and interactive plotting. You now know where to go online for more information, examples, and forum discussions of Matplotlib when it comes time for you to work on your next Matplotlib project. Next, we discussed the architectural concepts of Matplotlib: backends, figures, axes, and artists. Then we started our construction project, an interactive storm cell tracking application. We saw how to plot a radar image using a pre-existing plotting function, as well as how to display polygons and lines as artists and collections. While creating these objects, we had a glimpse of how to customize the properties of these objects for our display needs, learning some of the property and styling names. We also learned some of the steps one needs to consider when creating their own plotting functions, such as autoscaling.

Resources for Article:

Further resources on this subject:
The plot function [article]
First Steps [article]
Machine Learning in IPython with scikit-learn [article]

Reconstructing 3D Scenes

Packt
13 Oct 2016
25 min read
This article is written by Robert Laganiere, the author of the book OpenCV 3 Computer Vision Application Programming Cookbook, Third Edition, and covers the following recipes:

Calibrating a camera
Recovering camera pose

Digital image formation

Let us now redraw a new version of the figure describing the pin-hole camera model. More specifically, we want to demonstrate the relation between a point in 3D at position (X,Y,Z) and its image (x,y) on a camera, specified in pixel coordinates. Note the changes that have been made to the original figure. First, a reference frame was positioned at the center of the projection; then, the Y-axis was aligned to point downwards to get a coordinate system that is compatible with the usual convention, which places the image origin at the upper-left corner of the image. Finally, we have identified a special point on the image plane, by considering the line coming from the focal point that is orthogonal to the image plane. The point (u0,v0) is the pixel position at which this line pierces the image plane and is called the principal point. It would be logical to assume that this principal point is at the center of the image plane, but in practice, this one might be off by a few pixels depending on the precision of the camera. Since we are dealing with digital images, the number of pixels on the image plane (its resolution) is another important characteristic of a camera.

We learned previously that a 3D point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z). Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by the pixel width (px) and then the height (py). Note that by dividing the focal length given in world units (generally given in millimeters) by px, we obtain the focal length expressed in (horizontal) pixels. We will then define this term as fx. Similarly, fy = f/py is defined as the focal length expressed in vertical pixel units. The complete projective equation is therefore as shown:

x = fx X/Z + u0
y = fy Y/Z + v0

We know that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. Also, the physical size of a pixel can be obtained by dividing the size of the image sensor (generally in millimeters) by the number of pixels (horizontally or vertically). In modern sensors, pixels are generally square, that is, they have the same horizontal and vertical size. The preceding equations can be rewritten in matrix form. Here is the complete projective equation in its most general form:

    [x]   [fx  0  u0] [r1 r2 r3 t1] [X]
  s [y] = [ 0 fy  v0] [r4 r5 r6 t2] [Y]
    [1]   [ 0  0   1] [r7 r8 r9 t3] [Z]
                                    [1]

Here, s is an arbitrary scale factor, reflecting the fact that the projection is defined only up to scale; the first matrix holds the intrinsic parameters, while the second holds the rotation (r1 to r9) and translation (t1, t2, t3) that express the scene points in the camera reference frame.
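As a worked illustration of this equation (using the intrinsic values that will be obtained from the calibration later in this article, and a made-up 3D point already expressed in the camera reference frame), a point at (X,Y,Z) = (0.5, 0.2, 2.0) in consistent world units would be imaged at:

x = 409 * 0.5 / 2.0 + 237 = 339.25 pixels
y = 408 * 0.2 / 2.0 + 171 = 211.8 pixels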
Calibrating a camera

Camera calibration is the process by which the different camera parameters (that is, the ones appearing in the projective equation) are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. However, accurate calibration information can be obtained by undertaking an appropriate camera calibration step. An active camera calibration procedure will proceed by showing known patterns to the camera and analyzing the obtained images. An optimization process will then determine the optimal parameter values that explain the observations. This is a complex process that has been made easy by the availability of OpenCV calibration functions.

How to do it...

To calibrate a camera, the idea is to show it a set of scene points for which the 3D positions are known. Then, you need to observe where these points project on the image. With the knowledge of a sufficient number of 3D points and associated 2D image points, the exact camera parameters can be inferred from the projective equation. Obviously, for accurate results, we need to observe as many points as possible. One way to achieve this would be to take a picture of a scene with known 3D points, but in practice, this is rarely feasible. A more convenient way is to take several images of a set of 3D points from different viewpoints. This approach is simpler, but it requires you to compute the position of each camera view in addition to the computation of the internal camera parameters, which is fortunately feasible.

OpenCV proposes that you use a chessboard pattern to generate the set of 3D scene points required for calibration. This pattern creates points at the corners of each square, and since this pattern is flat, we can freely assume that the board is located at Z=0, with the X and Y axes well-aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. The calibration pattern images captured for this recipe contain 7x5 inner corners. The good thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (the number of horizontal and vertical inner corner points). The function will return the position of these chessboard corners on the image. If the function fails to find the pattern, then it simply returns false, as shown:

//output vectors of image points
std::vector<cv::Point2f> imageCorners;
//number of inner corners on the chessboard
cv::Size boardSize(7,5);
//Get the chessboard corners
bool found = cv::findChessboardCorners(image,         // image of chessboard pattern
                                       boardSize,     // size of pattern
                                       imageCorners); // list of detected corners

The output parameter, imageCorners, will simply contain the pixel coordinates of the detected inner corners of the shown pattern. Note that this function accepts additional parameters if you need to tune the algorithm, which are not discussed here. There is also a special function that draws the detected corners on the chessboard image, with lines connecting them in a sequence:

//Draw the corners
cv::drawChessboardCorners(image, boardSize,
                          imageCorners,
                          found); // corners have been found

The lines that connect the points in the resulting image show the order in which the points are listed in the vector of detected image points. To perform a calibration, we now need to specify the corresponding 3D points. You can specify these points in the units of your choice (for example, in centimeters or in inches); however, the simplest is to assume that each square represents one unit. In that case, the coordinates of the first point would be (0,0,0) (assuming that the board is located at a depth of Z=0), the coordinates of the second point would be (1,0,0), and so on, the last point being located at (6,4,0). There are a total of 35 points in this pattern, which is too few to obtain an accurate calibration. To get more points, you need to show more images of the same calibration pattern from various points of view. To do so, you can either move the pattern in front of the camera or move the camera around the board; from a mathematical point of view, this is completely equivalent.
The OpenCV calibration function assumes that the reference frame is fixed on the calibration pattern and will calculate the rotation and translation of the camera with respect to the reference frame. Let's now encapsulate the calibration process in a CameraCalibrator class. The attributes of this class are as follows:

// input points:
// the points in world coordinates
// (each square is one unit)
std::vector<std::vector<cv::Point3f>> objectPoints;
// the image point positions in pixels
std::vector<std::vector<cv::Point2f>> imagePoints;
// output Matrices
cv::Mat cameraMatrix;
cv::Mat distCoeffs;
// flag to specify how calibration is done
int flag;

Note that the input vectors of the scene and image points are in fact made of std::vector of point instances; each vector element is a vector of the points from one view. Here, we decided to add the calibration points by specifying a vector of chessboard image filenames as input; the method will take care of extracting the point coordinates from the images:

// Open chessboard images and extract corner points
int CameraCalibrator::addChessboardPoints(
    const std::vector<std::string>& filelist,  // list of filenames
    cv::Size & boardSize) {                    // calibration board size
  // the points on the chessboard
  std::vector<cv::Point2f> imageCorners;
  std::vector<cv::Point3f> objectCorners;
  // 3D Scene Points:
  // Initialize the chessboard corners
  // in the chessboard reference frame
  // The corners are at 3D location (X,Y,Z) = (i,j,0)
  for (int i=0; i<boardSize.height; i++) {
    for (int j=0; j<boardSize.width; j++) {
      objectCorners.push_back(cv::Point3f(i, j, 0.0f));
    }
  }
  // 2D Image points:
  cv::Mat image; // to contain chessboard image
  int successes = 0;
  // for all viewpoints
  for (int i=0; i<filelist.size(); i++) {
    // Open the image
    image = cv::imread(filelist[i], 0);
    // Get the chessboard corners
    bool found = cv::findChessboardCorners(image,         // image of chessboard pattern
                                           boardSize,     // size of pattern
                                           imageCorners); // list of detected corners
    if (found) {
      // Get subpixel accuracy on the corners
      cv::cornerSubPix(image, imageCorners,
                       cv::Size(5, 5),   // half size of search window
                       cv::Size(-1, -1),
                       cv::TermCriteria(cv::TermCriteria::MAX_ITER +
                                        cv::TermCriteria::EPS,
                                        30,     // max number of iterations
                                        0.1));  // min accuracy
      // If we have a good board, add it to our data
      if (imageCorners.size() == boardSize.area()) {
        // Add image and scene points from one view
        addPoints(imageCorners, objectCorners);
        successes++;
      }
    }
  }
  return successes;
}

The first loop inputs the 3D coordinates of the chessboard, and the corresponding image points are the ones provided by the cv::findChessboardCorners function; this is done for all the available viewpoints. Moreover, in order to obtain a more accurate image point location, the cv::cornerSubPix function can be used, and as the name suggests, the image points will then be localized at subpixel accuracy. The termination criterion that is specified by the cv::TermCriteria object defines the maximum number of iterations and the minimum accuracy in subpixel coordinates. The first of these two conditions that is reached will stop the corner refinement process. When a set of chessboard corners have been successfully detected, these points are added to the vectors of the image and scene points using our addPoints method.
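The addPoints method itself is not shown in this excerpt; a minimal sketch of it would simply append one view's worth of correspondences to the class attributes:

// Hypothetical helper (not part of the excerpt above):
// store the scene/image point correspondences from one view.
void CameraCalibrator::addPoints(const std::vector<cv::Point2f>& imageCorners,
                                 const std::vector<cv::Point3f>& objectCorners) {
  // 2D image points from one view
  imagePoints.push_back(imageCorners);
  // corresponding 3D scene points
  objectPoints.push_back(objectCorners);
}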
Once a sufficient number of chessboard images have been processed (and consequently, a large number of 3D scene point / 2D image point correspondences are available), we can initiate the computation of the calibration parameters as shown:

// Calibrate the camera
// returns the re-projection error
double CameraCalibrator::calibrate(cv::Size &imageSize) {
  // Output rotations and translations
  std::vector<cv::Mat> rvecs, tvecs;
  // start calibration
  return calibrateCamera(objectPoints, // the 3D points
                         imagePoints,  // the image points
                         imageSize,    // image size
                         cameraMatrix, // output camera matrix
                         distCoeffs,   // output distortion matrix
                         rvecs, tvecs, // Rs, Ts
                         flag);        // set options
}

In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. These will be described in the next section.
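To tie the pieces of the class together, a hypothetical driver (not from the book; the filenames are made up and a default constructor is assumed) might run the whole calibration like this:

// Hypothetical driver: collect the chessboard views, then calibrate.
std::vector<std::string> filelist = {"chessboard01.jpg", "chessboard02.jpg",
                                     "chessboard03.jpg"};  // made-up filenames
cv::Size boardSize(7, 5);                 // inner corners of the pattern
CameraCalibrator calibrator;
calibrator.addChessboardPoints(filelist, boardSize);
cv::Mat sample = cv::imread(filelist[0], 0);
cv::Size imageSize = sample.size();
double error = calibrator.calibrate(imageSize);   // re-projection error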
How it works...

In order to explain the result of the calibration, we need to go back to the projective equation presented in the introduction of this article. This equation describes the transformation of a 3D point into a 2D point through the successive application of two matrices. The first matrix includes all of the camera parameters, which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that explicitly returns the value of the intrinsic parameters given by a calibration matrix. The second matrix is there to have the input points expressed in camera-centric coordinates. It is composed of a rotation component (a 3x3 matrix) and a translation component (a 3x1 vector). Remember that in our calibration example, the reference frame was placed on the chessboard. Therefore, there is a rigid transformation (made of a rotation component represented by the matrix entries r1 to r9 and a translation represented by t1, t2, and t3) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration, and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system.

The calibration results provided by cv::calibrateCamera are obtained through an optimization process. This process aims to find the intrinsic and extrinsic parameters that minimize the difference between the predicted image point position, as computed from the projection of the 3D scene points, and the actual image point position, as observed on the image. The sum of this difference for all the points specified during the calibration is called the re-projection error. The intrinsic parameters of our test camera obtained from a calibration based on 27 chessboard images are fx=409 pixels, fy=408 pixels, u0=237, and v0=171. Our calibration images have a size of 536x356 pixels. From the calibration results, you can see that, as expected, the principal point is close to the center of the image, but off by a few pixels. The calibration images were taken using a Nikon D500 camera with an 18mm lens. Looking at the manufacturer's specifications, we find that the sensor size of this camera is 23.5mm x 15.7mm, which gives us a pixel size of 0.0438mm (23.5mm divided by the 536 horizontal pixels). The estimated focal length is expressed in pixels, so multiplying the result by the pixel size gives us an estimated focal length of 17.8mm, which is consistent with the actual lens we used.

Let us now turn our attention to the distortion parameters. So far, we have mentioned that under the pin-hole camera model, we can neglect the effect of the lens. However, this is only possible if the lens that is used to capture an image does not introduce important optical distortions. Unfortunately, this is not the case with lower quality lenses or with lenses that have a very short focal length. Even the lens we used in this experiment introduced some distortion, that is, the edges of the rectangular board are curved in the image. Note that this distortion becomes more important as we move away from the center of the image. This is a typical distortion observed with a fish-eye lens and is called radial distortion. It is possible to compensate for these deformations by introducing an appropriate distortion model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be reverted in order to undo the distortions visible on the image. Fortunately, the exact parameters of the transformation, which will correct the distortions, can be obtained together with the other camera parameters during the calibration phase. Once this is done, any image from the newly calibrated camera will be undistorted. Therefore, we have added an additional method to our calibration class.

//remove distortion in an image (after calibration)
cv::Mat CameraCalibrator::remap(const cv::Mat &image) {
  cv::Mat undistorted;
  if (mustInitUndistort) { // called once per calibration
    cv::initUndistortRectifyMap(cameraMatrix,  // computed camera matrix
                                distCoeffs,    // computed distortion matrix
                                cv::Mat(),     // optional rectification (none)
                                cv::Mat(),     // camera matrix to generate undistorted
                                image.size(),  // size of undistorted
                                CV_32FC1,      // type of output map
                                map1, map2);   // the x and y mapping functions
    mustInitUndistort = false;
  }
  // Apply mapping functions
  cv::remap(image, undistorted, map1, map2,
            cv::INTER_LINEAR); // interpolation type
  return undistorted;
}

Running this code on one of our calibration images produces an undistorted version of it. To correct the distortion, OpenCV uses a polynomial function that is applied to the image points in order to move them to their undistorted positions. By default, five coefficients are used; a model made of eight coefficients is also available. Once these coefficients are obtained, it is possible to compute two cv::Mat mapping functions (one for the x coordinate and one for the y coordinate) that will give the new undistorted position of an image point on a distorted image. This is computed by the cv::initUndistortRectifyMap function, and the cv::remap function remaps all the points of an input image to a new image. Note that because of the nonlinear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you now obtain output pixels that have no values in the input image (they will then be displayed as black pixels).

There's more...

More options are available when it comes to camera calibration.

Calibration with known intrinsic parameters

When a good estimate of the camera's intrinsic parameters is known, it could be advantageous to input them to the cv::calibrateCamera function.
They will then be used as initial values in the optimization process. To do so, you just need to add the cv::CALIB_USE_INTRINSIC_GUESS flag and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (cv::CALIB_FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (cv::CALIB_FIX_ASPECT_RATIO); in which case, you assume that the pixels have a square shape.

Using a grid of circles for calibration

Instead of the usual chessboard pattern, OpenCV also offers the possibility to calibrate a camera by using a grid of circles. In this case, the centers of the circles are used as calibration points. The corresponding function is very similar to the function we used to locate the chessboard corners, for example:

cv::Size boardSize(7,7);
std::vector<cv::Point2f> centers;
bool found = cv::findCirclesGrid(image, boardSize, centers);

See also

The article A flexible new technique for camera calibration by Z. Zhang, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000, is a classic paper on the problem of camera calibration.

Recovering camera pose

When a camera is calibrated, it becomes possible to relate the captured images with the outside world. If the 3D structure of an object is known, then one can predict how the object will be imaged on the sensor of the camera. The process of image formation is indeed completely described by the projective equation that was presented at the beginning of this article. When most of the terms of this equation are known, it becomes possible to infer the value of the other elements (2D or 3D) through the observation of some images. In this recipe, we will look at the camera pose recovery problem when a known 3D structure is observed.

How to do it...

Let's consider a simple object here, a bench in a park. We took an image of it using the camera/lens system calibrated in the previous recipe. We have manually identified 8 distinct image points on the bench that we will use for our camera pose estimation. Having access to this object makes it possible to make some physical measurements. The bench is composed of a seat of size 242.5cm x 53.5cm x 9cm and a back of size 242.5cm x 24cm x 9cm that is fixed 12cm over the seat. Using this information, we can then easily derive the 3D coordinates of the eight identified points in an object-centric reference frame (here we fixed the origin at the left extremity of the intersection between the two planes). We can then create a vector of cv::Point3f containing these coordinates:

//Input object points
std::vector<cv::Point3f> objectPoints;
objectPoints.push_back(cv::Point3f(0, 45, 0));
objectPoints.push_back(cv::Point3f(242.5, 45, 0));
objectPoints.push_back(cv::Point3f(242.5, 21, 0));
objectPoints.push_back(cv::Point3f(0, 21, 0));
objectPoints.push_back(cv::Point3f(0, 9, -9));
objectPoints.push_back(cv::Point3f(242.5, 9, -9));
objectPoints.push_back(cv::Point3f(242.5, 9, 44.5));
objectPoints.push_back(cv::Point3f(0, 9, 44.5));

The question now is where the camera was with respect to these points when the shown picture was taken. Since the coordinates of the image of these known points on the 2D image plane are also known, it becomes easy to answer this question using the cv::solvePnP function.
Here, the correspondence between the 3D and the 2D points has been established manually, but as a reader of this book, you should be able to come up with methods that would allow you to obtain this information automatically.

//Input image points
std::vector<cv::Point2f> imagePoints;
imagePoints.push_back(cv::Point2f(136, 113));
imagePoints.push_back(cv::Point2f(379, 114));
imagePoints.push_back(cv::Point2f(379, 150));
imagePoints.push_back(cv::Point2f(138, 135));
imagePoints.push_back(cv::Point2f(143, 146));
imagePoints.push_back(cv::Point2f(381, 166));
imagePoints.push_back(cv::Point2f(345, 194));
imagePoints.push_back(cv::Point2f(103, 161));

// Get the camera pose from 3D/2D points
cv::Mat rvec, tvec;
cv::solvePnP(objectPoints, imagePoints,      // corresponding 3D/2D pts
             cameraMatrix, cameraDistCoeffs, // calibration
             rvec, tvec);                    // output pose
// Convert to 3D rotation matrix
cv::Mat rotation;
cv::Rodrigues(rvec, rotation);

This function computes the rigid transformation (rotation and translation) that brings the object coordinates into the camera-centric reference frame (that is, the one that has its origin at the focal point). It is also important to note that the rotation computed by this function is given in the form of a 3D vector. This is a compact representation in which the rotation to apply is described by a unit vector (an axis of rotation) around which the object is rotated by a certain angle. This axis-angle representation is also called Rodrigues' rotation formula. In OpenCV, the angle of rotation corresponds to the norm of the output rotation vector, which is itself aligned with the axis of rotation. This is why the cv::Rodrigues function is used to obtain the 3D matrix of rotation that appears in our projective equation.

The pose recovery procedure described here is simple, but how do we know we obtained the right camera/object pose information? We can visually assess the quality of the results using the cv::viz module, which gives us the ability to visualize 3D information. The use of this module is explained in the last section of this recipe. For now, let's display a simple 3D representation of our object and the camera that captured it. It might be difficult to judge the quality of the pose recovery just by looking at such an image, but if you test the example of this recipe on your computer, you will have the possibility to move this representation in 3D using your mouse, which should give you a better sense of the solution obtained.

How it works...

In this recipe, we assumed that the 3D structure of the object was known, as well as the correspondence between sets of object points and image points. The camera's intrinsic parameters were also known through calibration. If you look at our projective equation presented at the end of the Digital image formation section of the introduction of this article, this means that we have points for which the coordinates (X,Y,Z) and (x,y) are known. We also have the elements of the first matrix known (the intrinsic parameters). Only the second matrix is unknown; this is the one that contains the extrinsic parameters of the camera, that is, the camera/object pose information. Our objective is to recover these unknown parameters from the observation of 3D scene points. This problem is known as the Perspective-n-Point problem or PnP problem. Rotation has three degrees of freedom (for example, the angle of rotation around the three axes) and translation also has three degrees of freedom. We therefore have a total of 6 unknowns.
For each object point/image point correspondence, the projective equation gives us three algebraic equations, but since the projective equation is defined up to a scale factor, we only have two independent equations. A minimum of three points is therefore required to solve this system of equations. Obviously, more points provide a more reliable estimate. In practice, many different algorithms have been proposed to solve this problem, and OpenCV proposes a number of different implementations in its cv::solvePnP function. The default method consists of optimizing what is called the reprojection error. Minimizing this type of error is considered to be the best strategy to get accurate 3D information from camera images. In our problem, it corresponds to finding the optimal camera position that minimizes the 2D distance between the projected 3D points (as obtained by applying the projective equation) and the observed image points given as input.

Note that OpenCV also has a cv::solvePnPRansac function. As the name suggests, this function uses the RANSAC algorithm in order to solve the PnP problem. This means that some of the object point/image point correspondences may be wrong, and the function will return which ones have been identified as outliers. This is very useful when these correspondences have been obtained through an automatic process that can fail for some points.

There's more...

When working with 3D information, it is often difficult to validate the solutions obtained. To this end, OpenCV offers a simple yet powerful visualization module that facilitates the development and debugging of 3D vision algorithms. It allows the insertion of points, lines, cameras, and other objects in a virtual 3D environment that you can interactively visualize from various points of view.

cv::Viz, a 3D Visualizer module

cv::Viz is an extra module of the OpenCV library that is built on top of the VTK open source library. This Visualization Toolkit (VTK) is a powerful framework used for 3D computer graphics. With cv::viz, you create a 3D virtual environment to which you can add a variety of objects. A visualization window is created that displays the environment from a given point of view. You saw in this recipe an example of what can be displayed in a cv::viz window. This window responds to mouse events that are used to navigate inside the environment (through rotations and translations). This section describes the basic use of the cv::viz module. The first thing to do is to create the visualization window. Here, we use a white background:

// Create a viz window
cv::viz::Viz3d visualizer("Viz window");
visualizer.setBackgroundColor(cv::viz::Color::white());

Next, you create your virtual objects and insert them into the scene. There is a variety of predefined objects. One of them is particularly useful for us; it is the one that creates a virtual pin-hole camera:

// Create a virtual camera
cv::viz::WCameraPosition cam(cMatrix,  // matrix of intrinsics
                             image,    // image displayed on the plane
                             30.0,     // scale factor
                             cv::viz::Color::black());
// Add the virtual camera to the environment
visualizer.showWidget("Camera", cam);

The cMatrix variable is a cv::Matx33d (that is, a cv::Matx<double,3,3>) instance containing the intrinsic camera parameters as obtained from calibration. By default, this camera is inserted at the origin of the coordinate system. To represent the bench, we used two rectangular cuboid objects.
// Create a virtual bench from cuboids
cv::viz::WCube plane1(cv::Point3f(0.0, 45.0, 0.0),
                      cv::Point3f(242.5, 21.0, -9.0),
                      true, // show wire frame
                      cv::viz::Color::blue());
plane1.setRenderingProperty(cv::viz::LINE_WIDTH, 4.0);
cv::viz::WCube plane2(cv::Point3f(0.0, 9.0, -9.0),
                      cv::Point3f(242.5, 0.0, 44.5),
                      true, // show wire frame
                      cv::viz::Color::blue());
plane2.setRenderingProperty(cv::viz::LINE_WIDTH, 4.0);
// Add the virtual objects to the environment
visualizer.showWidget("top", plane1);
visualizer.showWidget("bottom", plane2);

This virtual bench is also added at the origin; it then needs to be moved to its camera-centric position as found by our cv::solvePnP function. It is the responsibility of the setWidgetPose method to perform this operation. It simply applies the rotation and translation components of the estimated motion:

cv::Mat rotation;
// convert vector-3 rotation
// to a 3x3 rotation matrix
cv::Rodrigues(rvec, rotation);
// Move the bench
cv::Affine3d pose(rotation, tvec);
visualizer.setWidgetPose("top", pose);
visualizer.setWidgetPose("bottom", pose);

The final step is to create a loop that keeps displaying the visualization window. The 1 ms pause is there to listen to mouse events:

// visualization loop
while(cv::waitKey(100)==-1 && !visualizer.wasStopped())
{
  visualizer.spinOnce(1,     // pause 1ms
                      true); // redraw
}

This loop will stop when the visualization window is closed or when a key is pressed over an OpenCV image window. Try applying some motion to an object inside this loop (using setWidgetPose); this is how animation can be created.

See also

Model-based object pose in 25 lines of code by D. DeMenthon and L. S. Davis, in European Conference on Computer Vision, 1992, pp. 335–343, is a famous method for recovering camera pose from scene points.

Summary

This article teaches us how, under specific conditions, the 3D structure of the scene and the 3D pose of the cameras that captured it can be recovered. We have seen how a good understanding of projective geometry concepts allows us to devise methods enabling 3D reconstruction.

Resources for Article:

Further resources on this subject: OpenCV: Image Processing using Morphological Filters [article] Learn computer vision applications in Open CV [article] Cardboard is Virtual Reality for Everyone [article]
Getting started with Haskell

Packt
30 Oct 2013
9 min read
(For more resources related to this topic, see here.)

So what is Haskell? It is a fast, type-safe, purely functional programming language with powerful type inference. Having said that, let us try to understand what it gives us. First, a purely functional programming language means that, in general, functions in Haskell don't have side effects. There is a special type for impure functions, or functions with side effects. Then, Haskell has a strong, static type system with automatic and robust type inference. This, in practice, means that you do not usually need to specify the types of functions, and also that the type checker does not allow passing incompatible types. In strongly typed languages, types are considered to be a specification, due to the Curry-Howard correspondence, the direct relationship between programs and mathematical proofs. Under this great simplification, the theorem states that if a value of the type exists (or is inhabited), then the corresponding mathematical proof is correct. Or, jokingly speaking, if a program compiles, then there is a 99 percent probability that it works according to the specification. The question of whether the types conform to the specification in natural language is still open, though; Haskell won't help you with it.

The Haskell platform

The glorious Glasgow Haskell Compilation System, or simply Glasgow Haskell Compiler (GHC), is the most widely used Haskell compiler. It is the current de facto standard. The compiler is packaged into the Haskell platform that follows Python's principle, "Batteries included". The platform is updated twice a year with new compilers and libraries. It usually includes a compiler, an interactive Read-Evaluate-Print Loop (REPL) interpreter, the Haskell 98/2010 libraries (the so-called Prelude) that include most of the common definitions and functions, and a set of commonly used libraries. If you are on Windows or Mac OS X, it is strongly recommended to use the prepackaged installers of the Haskell platform at http://www.haskell.org/platform/.

Quick tour of Haskell

To start with development, first we should be familiar with a few basic features of Haskell. We really need to know about laziness, data types, pattern matching, type classes, and the basic notion of monads to start with Haskell.

Laziness

Haskell is a language with lazy evaluation. From a programmer's point of view, that means that a value is evaluated if and only if it is really needed. Imperative languages usually have strict evaluation, that is, function arguments are evaluated before function application. To see the difference, let's take a look at a simple expression in Haskell:

let x = 2 + 3

In a strict or eager language, the 2 + 3 expression would be immediately evaluated to 5, whereas in Haskell, only a promise to do this evaluation will be created and the value will be evaluated only when it is needed. In other words, this statement just introduces a definition of x which might be used afterwards, unlike in a strict language, where it is an operator that assigns the computed value, 5, to a memory cell named x. Also, this strategy allows the sharing of evaluations, because laziness assumes that computations can be executed whenever needed, and therefore the results can be memoized. It might reduce the running time by an exponential factor over strict evaluation. Laziness also allows us to manipulate infinite data structures. For instance, we can construct an infinite list of natural numbers as follows:

let ns = [1..]
And moreover, we can manipulate it as if it were a normal list, even though some caution is needed, as you can get an infinite loop. We can take the first five elements of this infinite list by means of the built-in function take:

take 5 ns

By running this example in GHCi you will get [1,2,3,4,5].

Functions as first-class citizens

The notion of a function is one of the core ideas in functional languages, and Haskell is no exception at all. The definition of a function includes the body of the function and an optional type declaration. For instance, the take function is defined in Prelude as follows:

take :: Int -> [a] -> [a]
take = ...

Here, the type declaration says that the function takes an integer as the argument and a list of objects of the a type, and returns a new list of the same type. Haskell also allows partial application of a function. For example, you can construct a function that takes the first five elements of a list as follows:

take5 :: [a] -> [a]
take5 = take 5

Functions are themselves objects, that is, you may pass a function as an argument to another function. Prelude defines the map function as a function of a function:

map :: (a -> b) -> [a] -> [b]

map takes a function and applies it to each element of a list. Thus, functions are first-class citizens in the language and it is possible to manipulate them as if they were normal objects.

Data types

Data types are the core of a strongly typed language such as Haskell. The distinctive feature of Haskell data types is that they are all immutable; that is, after an object is constructed, it cannot be changed. It might seem weird at first sight, but in the long run it has a few positive consequences. First, it enables computation parallelization. Second, all data is referentially transparent; that is, there is no difference between a reference to an object and the object itself. These two properties allow the compiler to reason about code optimization at a higher level than a C/C++ compiler can. Let us consider the following data type definitions in Haskell:

type TimeStamp = Integer
newtype Price = Price Double
data Quote = AskQuote TimeStamp Price | BidQuote TimeStamp Price

There are three common ways to define data types in Haskell. The first declaration just creates a synonym for an existing data type, and the type checker won't prevent you from using Integer instead of TimeStamp. Also, you can use a value of type TimeStamp with any function that expects to work with an Integer. The second declaration creates a new type for prices, and you are not allowed to use Double instead of Price. The compiler will raise an error in such cases and thus it will enforce the correct usage. The last declaration introduces an Algebraic Data Type (ADT) and says that a value of type Quote might be constructed either by the data constructor AskQuote or by BidQuote, with a time stamp and a price as its parameters. Quote itself is called a type constructor. A type constructor might be parameterized by type variables. For example, the types Maybe and Either, two quite often used standard types, are defined as follows:

data Maybe a = Nothing | Just a
data Either a b = Left a | Right b

The type variables a and b can be substituted by any other type.

Type classes

Type classes in Haskell are not classes like in object-oriented languages. They are more like interfaces with optional implementations. You can find similar features in other languages under names such as traits, mixins, and so on, but unlike those, this feature in Haskell enables ad hoc polymorphism, that is, a function can be applied to arguments of different types.
It is also known as function overloading or operator overloading. A polymorphic function can specify different implementations for different types. In principle, a type class consists of function declarations over some objects. The Eq type class is a standard type class that specifies how to compare two objects of the same type:

class Eq a where
  (==) :: a -> a -> Bool
  (/=) :: a -> a -> Bool
  (/=) x y = not $ (==) x y

Here, Eq is the name of the type class, a is a type variable, and == and /= are the operations defined in the type class. This definition means that some type a is of class Eq if it has the operations == and /= defined. Moreover, the definition provides a default implementation of the /= operation. And if you decide to implement this class for some data type, you only need to provide the implementation of a single operation. There are a number of useful type classes defined in Prelude. For example, Eq is used to define equality between two objects; Ord specifies total ordering; Show and Read introduce a String representation of an object; the Enum type class describes enumerations, that is, data types with nullary constructors. It might be quite boring to implement these quite trivial but useful type classes for each data type, so Haskell supports automatic derivation of most of the standard type classes:

newtype Price = Price Double deriving (Eq, Show, Read, Ord)

Now objects of type Price can be converted from/to String by means of the read/show functions and compared with themselves using the equality operators. Those who want to know all the details can look them up in the Haskell language report.

IO monad

Being a purely functional language, Haskell requires a marker for impure functions, and the IO monad is that marker. The main function is the entry point for any Haskell program. It cannot be a pure function because it changes the state of the "world", at least by creating a new process in the OS. Let us take a look at its type:

main :: IO ()
main = do
  putStrLn "Hello, World!"

The IO () type signifies that there is a computation that performs some I/O operation and returns an empty result. There is really only one way to perform I/O in Haskell: use it in the main procedure of your program. Also, it is not possible to execute an I/O action from an arbitrary function, unless that function is in the IO monad and called from main, directly or indirectly.

Summary

In this article we walked through the main unique features of the Haskell language and learned a bit of its history.

Resources for Article:

Further resources on this subject: Core Data iOS: Designing a Data Model and Building Data Objects [Article] The OpenFlow Controllers [Article] Managing Network Layout [Article]
Oracle Wallet Manager

Packt
16 Oct 2009
9 min read
The Oracle Wallet Manager

Oracle Wallet Manager is a password-protected, stand-alone Java application tool used to maintain security credentials and store SSL-related information such as authentication and signing credentials, private keys, certificates, and trusted certificates. OWM uses the Public Key Cryptography Standards (PKCS) #12 specification for the Wallet format and PKCS #10 for certificate requests. Oracle Wallet Manager stores X.509 v3 certificates and private keys in industry-standard PKCS #12 formats, and generates certificate requests according to the PKCS #10 specification. This makes the Oracle Wallet structure interoperable with supported third-party PKI applications, and provides Wallet portability across operating systems. Additionally, Oracle Wallet Manager Wallets can be enabled to store credentials on hardware security modules that use APIs compliant with the PKCS #11 specification. The OWM creates Wallets, generates certificate requests, accesses Public Key Infrastructure-based services, saves credentials into cryptographic hardware such as smart cards, uploads and downloads Wallets to and from LDAP directories, and imports Wallets in PKCS #12 format. In a Windows environment, Oracle Wallet Manager can be accessed from the start menu. The following screenshot shows the Oracle Wallet Manager Properties: In a Unix-like environment, OWM can be accessed directly from the command line with the owm shell script located at $ORACLE_HOME/bin/owm; it requires a graphical environment so that it can be launched.

Creating the Oracle Wallet

If this is the first time the Wallet has been opened, then a Wallet file does not yet exist. A Wallet is physically created in a specified directory. The user can declare the path where the Oracle Wallet file should be created. The user may either specify a default location or declare a particular directory. A file named ewallet.p12 will be created in the specified location.

Enabling Auto Login

The Oracle Wallet Manager Auto Login feature creates an obfuscated copy of the Wallet and enables PKI-based access to the services without a password. When this feature is enabled, only the user who created the Wallet will have access to it. By default, Single Sign-On (SSO) access to a different database is disabled. The Auto Login feature must be enabled in order for you to have access to multiple databases using SSO. Checking and unchecking the Auto Login option will enable and disable this feature.

mkwallet, the CLI OWM version

Besides the Java client, there is a command-line interface version of the Wallet, which can be accessed by means of the mkwallet utility. This can also be used to generate a Wallet and have it configured in Auto Login mode. This is a fully featured tool that allows you to create Wallets, and to view and modify their content.
The options provided by the mkwallet tool are shown in the following list:

-R rootPwd rootWrl DN keySize expDate: Create the root Wallet
-e pwd wrl: Create an empty Wallet
-r pwd wrl DN keySize certReqLoc: Create a certificate request, add it to the Wallet, and export it to certReqLoc
-c rootPwd rootWrl certReqLoc certLoc: Create a certificate for a certificate request
-i pwd wrl certLoc NZDST_CERTIFICATE | NZDST_CLEAR_PTP: Install a certificate | trusted point
-d pwd wrl DN: Delete a certificate with matching DN
-s pwd wrl: Store an SSO Wallet
-p pwd wrl: Dump the content of the Wallet
-q certLoc: Dump the content of the certificate
-Lg pwd wrl crlLoc nextUpdate: Generate a CRL
-La pwd wrl crlLoc certtoRevoke: Revoke a certificate
-Ld crlLoc: Display a CRL
-Lv crlLoc cacert: Verify a CRL signature
-Ls crlLoc cert: Check certificate revocation status
-Ll oidHostname oidPortNumber cacert: Fetch a CRL from an LDAP directory
-Lc cert: Fetch a CRL from the CRLDP in the cert
-Lb b64CrlLoc derCrlLoc: Convert a CRL from B64 to DER format
-Pw pwd wrl pkcs11Lib tokenPassphrase: Create an empty Wallet and store PKCS11 info in it
-Pq pwd wrl DN keysize certreqLoc: Create a cert request and generate the key pair on the PKCS11 device
-Pl pwd wrl: Test PKCS11 device login using a Wallet containing PKCS11 info
-Px pwd wrl pkcs11Lib tokenPassphrase: Create a Wallet with PKCS11 info from a software Wallet

Managing Wallets with orapki

A CLI-based tool, orapki, is used to manage Public Key Infrastructure components such as Wallets and revocation lists. This tool eases the procedures related to PKI management and maintenance by allowing the user to include it in batch scripts. This tool can be used to create and view signed certificates for testing purposes, create Oracle Wallets, add and remove certificates and certificate requests, and manage Certificate Revocation Lists (CRLs), renaming them and managing them against the Oracle Internet Directory. The syntax for this tool is:

orapki module command -parameter <value>

module can have these values:

wallet: Oracle Wallet
crl: Certificate Revocation List
cert: The PKI Certificate

To create a Wallet you can issue this command:

orapki wallet create -wallet <Path to Wallet>

To create a Wallet with the auto login feature enabled, you can issue the command:

orapki wallet create -wallet <Path to Wallet> -auto_login

To add a certificate request to the Wallet you can use the command:

orapki wallet add -wallet <wallet_location> -dn <user_dn> -keySize <512|1024|2048>

To add a user certificate to an Oracle Wallet:

orapki wallet add -wallet <wallet_location> -user_cert -cert <certificate_location>

The options and values available for the orapki tool depend on the module to be configured:

orapki cert create: Creates a signed certificate for testing purposes.
orapki cert create [-wallet <wallet_location>] -request <certificate_request_location> -cert <certificate_location> -validity <number_of_days> [-summary]

orapki cert display: Displays details of a specific certificate.
orapki cert display -cert <certificate_location> [-summary|-complete]

orapki crl delete: Deletes CRLs from Oracle Internet Directory.
orapki crl delete -issuer <issuer_name> -ldap <hostname:ssl_port> -user <username> [-wallet <wallet_location>] [-summary]

orapki crl display: Displays specific CRLs that are stored in Oracle Internet Directory.
orapki crl display -crl <crl_location> [-wallet <wallet_location>] [-summary|-complete]

orapki crl hash: Generates a hash value of the certificate revocation list (CRL) issuer to identify the location of the CRL in your file system for certificate validation.
orapki crl hash -crl <crl_filename|URL> [-wallet <wallet_location>] [-symlink|-copy] <crl_directory> [-summary]

orapki crl list: Displays a list of CRLs stored in Oracle Internet Directory.
orapki crl list -ldap <hostname:ssl_port>

orapki crl upload: Uploads CRLs to the CRL subtree in Oracle Internet Directory.
orapki crl upload -crl <crl_location> -ldap <hostname:ssl_port> -user <username> [-wallet <wallet_location>] [-summary]

orapki wallet add: Adds certificate requests and certificates to an Oracle Wallet.
orapki wallet add -wallet <wallet_location> -dn <user_dn> -keySize <512|1024|2048>

orapki wallet create: Creates an Oracle Wallet or sets auto login on for an Oracle Wallet.
orapki wallet create -wallet <wallet_location> [-auto_login]

orapki wallet display: Displays the certificate requests, user certificates, and trusted certificates in an Oracle Wallet.
orapki wallet display -wallet <wallet_location>

orapki wallet export: Exports certificate requests and certificates from an Oracle Wallet.
orapki wallet export -wallet <wallet_location> -dn <certificate_dn> -cert <certificate_filename>

Oracle Wallet Manager CSR generation

Oracle Wallet Manager generates a certificate request in PKCS #10 format. This certificate request can be sent to a certificate authority of your choice. The procedure to generate this certificate request is as follows: From the main menu, choose the Operations menu and then select the Add Certificate Request submenu. As shown in the following screenshot, a form will be displayed where you can capture specific information. The parameters used to request a certificate are described next:

Common Name: This parameter is mandatory. This is the user's name or entity's name. If you are using a user's name, then enter it using the first name, last name format.
Organization Unit: This is the name of the identity's organization unit. It could be the name of the department where the entity belongs (optional parameter).
Organization: This is the company's name (optional).
Location/City: The location and the city where the entity resides (optional).
State/Province: This is the full name of the state where the entity resides. Do not use abbreviations (optional).
Country: This parameter is mandatory. It specifies the country where the entity is located.
Key Size: This parameter is mandatory. It defines the key size used when a public/private key pair is created. The key size can be as little as 512 bits and up to 4096 bits.
Advanced: When the parameters are introduced, a Distinguished Name (DN) is assembled. If you want to customize this DN, then you can use the advanced DN configuration mode.

Once the Certificate Request form has been completed, a PKCS #10 format certificate request is generated. The information that appears between the BEGIN and END keywords must be used to request a certificate from a Certificate Authority (CA); there are several well-known certificate authorities, and depending on the usage you plan for your certificate, you could address the request to a known CA (from the browser perspective) so that when an end user accesses your site, they don't get warned about the site's identity.
If the certificate will be targeted at a local community that doesn't mind the certificate warning, then you may generate your own certificate or ask a CA to issue a certificate for you. For demonstration purposes, we used the Oracle Certificate Authority (OCA) included with the Oracle Application Server. OCA will provide Certificate Authority capabilities to your site and it can issue standard certificates, suitable for intranet users. If you are planning to use OCA, then you should review the license agreements to determine if you are allowed to use it.
Amazon DynamoDB - Modelling relationships, Error handling

Packt
20 Aug 2014
7 min read
In this article by Tanmay Deshpande, author of Mastering DynamoDB, we are going to revise our concepts about DynamoDB and will try to discover more about its features and implementation. (For more resources related to this topic, see here.)

Amazon DynamoDB is a fully managed, cloud-hosted, NoSQL database. It provides fast and predictable performance with the ability to scale seamlessly. It allows you to store and retrieve any amount of data, serving any level of network traffic without having any operational burden. DynamoDB gives numerous other advantages such as consistent and predictable performance, flexible data modeling, and durability. With just a few clicks on the Amazon Web Services console, you are able to create your own DynamoDB database table and scale up or scale down provisioned throughput without taking down your application even for a millisecond. DynamoDB uses solid state disks (SSDs) to store the data, which ensures the durability of the work you are doing. It also automatically replicates the data across other AWS Availability Zones, which provides built-in high availability and reliability. Before we start discussing the details of DynamoDB, let's try to understand what NoSQL databases are and when to choose DynamoDB over an RDBMS.

With the rise in data volume, variety, and velocity, RDBMSes were neither designed to cope with the scale and flexibility challenges developers face when building modern-day applications, nor were they able to take advantage of cheap commodity hardware. Also, we need to provide a schema before we start adding data, which restricts developers from making their applications flexible. On the other hand, NoSQL databases are fast, provide flexible schema operations, and make effective use of cheap storage. Considering all these things, NoSQL is becoming popular very quickly amongst the developer community. But one has to be very cautious about when to go for NoSQL and when to stick to an RDBMS. Sticking to relational databases makes sense when you know the schema is more or less static, strong consistency is a must, and the data is not going to be that big in volume. But when you want to build an application that is Internet scalable, the schema is likely to evolve over time, the storage is going to be really big, and the operations involved can be eventually consistent, then NoSQL is the way to go. There are various types of NoSQL databases. The following is a list of NoSQL database types and popular examples:

Document Store – MongoDB, CouchDB, MarkLogic etc.
Column Store – Hbase, Cassandra etc.
Key Value Store – DynamoDB, Azure, Redis etc.
Graph Databases – Neo4J, DEX etc.

Most of these NoSQL solutions are open source, except a few like DynamoDB and Azure, which are available as services over the Internet. DynamoDB, being a key-value store, indexes data only upon primary keys, and one needs to go through the primary key to access a certain attribute. Let's start learning more about DynamoDB by having a look at its history.

DynamoDB's History

Amazon's e-commerce platform had a huge set of decoupled services developed and managed individually, and each and every service had an API to be used and consumed by others. Earlier, each service had direct database access, which was a major bottleneck. In terms of scalability, Amazon's requirements were more than any third-party vendor could provide at that time. DynamoDB was built to address Amazon's high availability, extreme scalability, and durability needs.
Earlier, Amazon used to store its production data in relational databases, and services had been provided for all required operations. But later they realized that most of the services accessed data only through their primary keys and did not need complex queries to fetch the required data; plus, maintaining these RDBMS systems required high-end hardware and skilled personnel. So, to overcome all such issues, Amazon's engineering team built a NoSQL database that addressed all the above-mentioned issues. In 2007, Amazon released a research paper on Dynamo, which combined the best ideas from the database and key-value store worlds and was an inspiration for many open source projects at the time. Cassandra, Voldemort, and Riak were some of them. You can find the above-mentioned paper at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Even though Dynamo had great features that would take care of all engineering needs, it was not widely accepted at that time within Amazon itself, as it was not a fully managed service. When Amazon released S3 and SimpleDB, engineering teams were quite excited to adopt those compared to Dynamo, as Dynamo was a bit expensive at that time due to SSDs. So finally, after rounds of improvement, Amazon released DynamoDB as a cloud-based service, and since then it has been one of the most widely used NoSQL databases. Before being released to the public cloud in 2012, DynamoDB had been the core storage service for Amazon's e-commerce platform, starting with the shopping cart and session management services. Any downtime or degradation in performance had a major impact on Amazon's business, any financial impact was strictly not acceptable, and DynamoDB proved itself to be the best choice in the end. Now let's try to understand DynamoDB in more detail.

What is DynamoDB?

DynamoDB is a fully managed, Internet scalable, easily administered, and cost-effective NoSQL database. It is a part of the database-as-a-service offering of Amazon Web Services. The above diagram shows how Amazon offers its various cloud services and where DynamoDB is exactly placed. AWS RDS is a relational database as a service over the Internet from Amazon, while SimpleDB and DynamoDB are NoSQL databases as services. Both SimpleDB and DynamoDB are fully managed, non-relational services. DynamoDB is built for fast, seamless scalability and high performance. It runs on SSDs to provide faster responses and has no limits on request capacity and storage. It automatically partitions your data throughout the cluster to meet expectations, while in SimpleDB we have a storage limit of 10 GB and can only take a limited number of requests per second. Also, in SimpleDB we have to manage our own partitions. So, depending upon your need, you have to choose the correct solution. To use DynamoDB, the first and foremost requirement is having an AWS account. Through the easy-to-use AWS Management Console, you can directly create new tables, providing the necessary information, and can start loading data into the tables in a few minutes.

Modelling Relationships

Like in any other database, modeling relationships is quite interesting, even though DynamoDB is a NoSQL database. Most of the time, people get confused about how to model the relationships between various tables; in this section, we make an effort to simplify this problem. To understand the relationships better, let's try to understand them using our example of a book store, where we have entities like Book, Author, Publisher, and so on.
One to One

In this type of relationship, one entity record of a table is related to only one entity record of another table. In our book store application, we have the BookInfo and BookDetails tables. The BookInfo table can have short information about the book, which can be used to display book information on a web page, whereas the BookDetails table would be used when someone explicitly needs to see all the details of the book. This design helps us keep our system healthy, as even if there are high requests on one table, the other table would always be up and running to fulfil what it is supposed to do. The following diagram shows what the table structure would look like.

One to many

In this type of relationship, one record from an entity is related to more than one record in another entity. In the book store application, we can have a PublisherBook table, which would keep information about the book and publisher relationship. Here we can have Publisher Id as the hash key and Book Id as the range key. The following diagram shows what the table structure would look like.

Many to many

A many-to-many relationship means many records from an entity are related to many records from another entity. In the case of the book store application, we can say that a book can be authored by multiple authors and an author can write multiple books. In this case, we should use two tables, each with both hash and range keys, as in the sketch that follows.
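To make these patterns concrete in code, the following is only a hedged sketch using the AWS SDK for Python (boto3), which is not necessarily the SDK used elsewhere in this excerpt; the table name, key names, region, and throughput values are assumptions made for illustration. It defines the one-to-many PublisherBook table described above, with Publisher Id as the hash key and Book Id as the range key:

# Hypothetical sketch: create the PublisherBook table with boto3
# (table name, attribute names, region, and throughput are assumed values).
import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')

dynamodb.create_table(
    TableName='PublisherBook',
    KeySchema=[
        {'AttributeName': 'PublisherId', 'KeyType': 'HASH'},   # hash (partition) key
        {'AttributeName': 'BookId', 'KeyType': 'RANGE'},       # range (sort) key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'PublisherId', 'AttributeType': 'S'},
        {'AttributeName': 'BookId', 'AttributeType': 'S'},
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
)

# The one-to-many relationship is exercised by querying on the hash key alone:
# all books belonging to one publisher come back from a single Query call.
books = dynamodb.query(
    TableName='PublisherBook',
    KeyConditionExpression='PublisherId = :p',
    ExpressionAttributeValues={':p': {'S': 'PUB-001'}},
)

For a many-to-many design such as books and authors, the same shape is simply repeated as a pair of tables with the keys swapped (for example, a hypothetical BookAuthor table with Book Id as the hash key and Author Id as the range key, and an AuthorBook table with Author Id as the hash key and Book Id as the range key), so the relationship can be queried efficiently from either side.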
Roger McNamee on Silicon Valley’s obsession for building “data voodoo dolls”

Savia Lobo
05 Jun 2019
5 min read
The Canadian Parliament's Standing Committee on Access to Information, Privacy and Ethics hosted the hearing of the International Grand Committee on Big Data, Privacy and Democracy from Monday, May 27 to Wednesday, May 29. Witnesses from at least 11 countries appeared before representatives to testify on how governments can protect democracy and citizen rights in the age of big data. This section of the hearing, which took place on May 28, includes Roger McNamee's take on why Silicon Valley wants to build data voodoo dolls for users. Roger McNamee is the author of Zucked: Waking up to the Facebook Catastrophe. His remarks in this section of the hearing build on previous hearing presentations by Professor Zuboff, Professor Park Ben Scott, and the previous talk by Jim Balsillie. He started off by saying, "Beginning in 2004, I noticed a transformation in the culture of Silicon Valley and over the course of a decade customer focused models were replaced by the relentless pursuit of global scale, monopoly, and massive wealth." McNamee says that Google wants to make the world more efficient; they want to eliminate user stress that results from too many choices. Now, Google knew that society would not permit a business model based on denying consumer choice and free will, so they covered their tracks. Beginning around 2012, Facebook adopted a similar strategy, later followed by Amazon, Microsoft, and others. For Google and Facebook, the business is behavioral prediction, using which they build a high-resolution data avatar of every consumer: a voodoo doll, if you will. They gather a tiny amount of data from user posts and queries, but the vast majority of their data comes from surveillance, web tracking, scanning emails and documents, data from apps and third parties, and ambient surveillance from products like Alexa, Google Assistant, Sidewalk Labs, and Pokemon Go. Google and Facebook used data voodoo dolls to provide their customers, who are marketers, with perfect information about every consumer. They use the same data to manipulate consumer choices, just as in China, where behavioral manipulation is the goal. The algorithms of Google and Facebook are tuned to keep users on site and active, preferably by pressing emotional buttons that reveal each user's true self. For most users, this means content that provokes fear or outrage. Hate speech, disinformation, and conspiracy theories are catnip for these algorithms. The design of these platforms treats all content precisely the same, whether it be hard news from a reliable site, a warning about an emergency, or a conspiracy theory. The platforms make no judgments; users choose, aided by algorithms that reinforce past behavior. The result is 2.5 billion Truman Shows on Facebook, each a unique world with its own facts. In the U.S., nearly 40% of the population identifies with at least one thing that is demonstrably false; this undermines democracy. "The people at Google and Facebook are not evil; they are the products of an American business culture with few rules, where misbehavior seldom results in punishment," he says. Unlike industrial businesses, internet platforms are highly adaptable, and this is the challenge. If you take away one opportunity, they will move on to the next one, and they are moving upmarket, getting rid of the middlemen.
Today, they apply behavioral prediction to advertising, but they have already set their sights on transportation and financial services. This is not an argument against undermining their advertising business, but rather a warning that it may be a Pyrrhic victory. If the goal is to protect democracy and personal liberty, McNamee tells the lawmakers, they have to be bold. They have to force a radical transformation of the business model of internet platforms. That would mean, at a minimum, banning web tracking, scanning of email and documents, third-party commerce and data, and ambient surveillance. A second option would be to tax micro-targeted advertising to make it economically unattractive. But you also need to create space for alternative business models built on trust that can last. Startups can happen anywhere; they can come from each of your countries. At the end of the day, though, the most effective path to reform would be to shut down the platforms, at least temporarily, as Sri Lanka did. Any country can go first. The platforms have left you no choice; the time has come to call their bluff. Companies with responsible business models will emerge overnight to fill the void.

McNamee explains, "When they (organizations) gather all of this data, the purpose of it is to create a high resolution avatar of each and every human being. It doesn't matter whether they use their systems or not, they collect it on absolutely everybody. In the Caribbean, voodoo was essentially this notion that you create a doll, an avatar, such that you can poke it with a pin and the person would experience that pain, right, and so it becomes literally a representation of the human being."

To know more, you can listen to the full hearing video titled "Meeting No. 152 ETHI - Standing Committee on Access to Information, Privacy and Ethics" on ParlVU.

Experts present most pressing issues facing global lawmakers on citizens' privacy, democracy and rights to freedom of speech
Time for data privacy: DuckDuckGo CEO Gabe Weinberg in an interview with Kara Swisher
Over 19 years of ANU (Australian National University) students' and staff data breached
Administration rights for Power BI users

Pravin Dhandre
02 Jul 2018
8 min read
In this tutorial, you will understand and learn administration rights/rules for Power BI users. This includes setting and monitoring rules like; who in the organization can utilize which feature, how Power BI Premium capacity is allocated and by whom, and other settings such as embed codes and custom visuals. This article is an excerpt from a book written by Brett Powell titled Mastering Microsoft Power BI. The admin portal is accessible to Office 365 Global Administrators and users mapped to the Power BI service administrator role. To open the admin portal, log in to the Power BI service and select the Admin portal item from the Settings (Gear icon) menu in the top right, as shown in the following screenshot: All Power BI users, including Power BI free users, are able to access the Admin portal. However, users who are not admins can only view the Capacity settings page. The Power BI service administrators and Office 365 global administrators have view and edit access to the following seven pages: Administrators of Power BI most commonly utilize the Tenant settings and Capacity settings as described in the Tenant Settings and Power BI Premium Capacities sections later in this tutorial. However, the admin portal can also be used to manage any approved custom visuals for the organization. Usage metrics The Usage metrics page of the Admin portal provides admins with a Power BI dashboard of several top metrics, such as the most consumed dashboards and the most consumed dashboards by workspace. However, the dashboard cannot be modified and the tiles of the dashboard are not linked to any underlying reports or separate dashboards to support further analysis. Given these limitations, alternative monitoring solutions are recommended, such as the Office 365 audit logs and usage metric datasets specific to Power BI apps. Details of both monitoring options are included in the app usage metrics and Power BI audit log activities sections later in this chapter. Users and Audit logs The Users and Audit logs pages only provide links to the Office 365 admin center. In the admin center, Power BI users can be added, removed and managed. If audit logging is enabled for the organization via the Create audit logs for internal activity and auditing and compliance tenant setting, this audit log data can be retrieved from the Office 365 Security & Compliance Center or via PowerShell. This setting is noted in the following section regarding the Tenant settings tab of the Power BI admin portal. An Office 365 license is not required to utilize the Office 365 admin center for Power BI license assignments or to retrieve Power BI audit log activity. Tenant settings The Tenant settings page of the Admin portal allows administrators to enable or disable various features of the Power BI web service. Likewise, the administrator could allow only a certain security group to embed Power BI content in SaaS applications such as SharePoint Online. The following diagram identifies the 18 tenant settings currently available in the admin portal and the scope available to administrators for configuring each setting: From a data security perspective, the first seven settings within the Export and Sharing and Content packs and apps groups are most important. For example, many organizations choose to disable the Publish to web feature for the entire organization. Additionally, only certain security groups may be allowed to export data or to print hard copies of reports and dashboards. 
As shown in the Scope column of the previous table and the following example, granular security group configurations are available to minimize risk and manage the overall deployment. Currently, only one tenant setting is available for custom visuals and this setting (Custom visuals settings) can be enabled or disabled for the entire organization only. For organizations that wish to restrict or prohibit custom visuals for security reasons, this setting can be used to eliminate the ability to add, view, share, or interact with custom visuals. More granular controls to this setting are expected later in 2018, such as the ability to define users or security groups of users who are allowed to use custom visuals. In the following screenshot from the Tenant settings page of the Admin portal, only the users within the BI Admin security group who are not also members of the BI Team security group are allowed to publish apps to the entire organization: For example, a report author who also helps administer the On-premises data gateway via the BI Admin security group would be denied the ability to publish apps to the organization given membership in the BI Team security group. Many of the tenant setting configurations will be more simple than this example, particularly for smaller organizations or at the beginning of Power BI deployments. However, as adoption grows and the team responsible for Power BI changes, it's important that the security groups created to help administer these settings are kept up to date. Embed Codes Embed Codes are created and stored in the Power BI service when the Publish to web feature is utilized. As described in the Publish to web section of the previous chapter, this feature allows a Power BI report to be embedded in any website or shared via URL on the public internet. Users with edit rights to the workspace of the published to web content are able to manage the embed codes themselves from within the workspace. However, the admin portal provides visibility and access to embed codes across all workspaces, as shown in the following screenshot: Via the Actions commands on the far right of the Embed Codes page, a Power BI Admin can view the report in a browser (diagonal arrow) or remove the embed code. The Embed Codes page can be helpful to periodically monitor the usage of the Publish to web feature and for scenarios in which data was included in a publish to web report that shouldn't have been, and thus needs to be removed. As shown in the Power BI Tenant settings table referenced in the previous section, this feature can be enabled or disabled for the entire organization or for specific users within security groups. Organizational Custom visuals The Custom Visuals page allows admins to upload and manage custom visuals (.pbiviz files) that have been approved for use within the organization. For example, an organization may have proprietary custom visuals developed internally, which it wishes to expose to business users. Alternatively, the organization may wish to define a set of approved custom visuals, such as only the custom visuals that have been certified by Microsoft. In the following screenshot, the Chiclet Slicer custom visual is added as an organizational custom visual from the Organizational visuals page of the Power BI admin portal: The Organizational visuals page provides a link (Add a custom visual) to launch the form and identifies all uploaded visuals, as well as their last update. 
Once a visual has been uploaded, it can be deleted but not updated or modified. Therefore, when a new version of an organizational visual becomes available, this visual can be added to the list of organizational visuals with a descriptive title (Chiclet Slicer v2.0). Deleting an organizational custom visual will cause any reports that use this visual to stop rendering. The following screenshot reflects the uploaded Chiclet Slicer custom visual on the Organization visuals page: Once the custom visual has been uploaded as an organizational custom visual, it will be accessible to users in Power BI Desktop. In the following screenshot from Power BI Desktop, the user has opened the MARKETPLACE of custom visuals and selected MY ORGANIZATION: In this screenshot, rather than searching through the MARKETPLACE, the user can go directly to visuals defined by the organization. The marketplace of custom visuals can be launched via either the Visualizations pane or the From Marketplace icon on the Home tab of the ribbon. Organizational custom visuals are not supported for reports or dashboards shared with external users. Additionally, organizational custom visuals used in reports that utilize the publish to web feature will not render outside the Power BI tenant. Moreover, Organizational custom visuals are currently a preview feature. Therefore, users must enable the My organization custom visuals feature via the Preview features tab of the Options window in Power BI Desktop. With this, we got you acquainted with features and processes applicable in administering Power BI for an organization. This includes the configuration of tenant settings in the Power BI admin portal, analyzing the usage of Power BI assets, and monitoring overall user activity via the Office 365 audit logs. If you found this tutorial useful, do check out the book Mastering Microsoft Power BI to develop visually rich, immersive, and interactive Power BI reports and dashboards. Unlocking the secrets of Microsoft Power BI A tale of two tools: Tableau and Power BI Building a Microsoft Power BI Data Model
article-image-building-linear-regression-model-python-developers
Pravin Dhandre
07 Feb 2018
7 min read
Save for later

Building a Linear Regression Model in Python for developers

Pravin Dhandre
07 Feb 2018
7 min read
This article is an excerpt from a book by Rodolfo Bonnin titled Machine Learning for Developers. This book is a systematic guide for developers to implement various Machine Learning techniques and develop efficient and intelligent applications.

Let's start using one of the most well-known toy datasets, explore it, and select one of the dimensions to learn how to build a linear regression model for its values. Let's start by importing all the libraries (scikit-learn, seaborn, and matplotlib); one of the excellent features of Seaborn is its ability to define very professional-looking style settings. In this case, we will use the whitegrid style:

import numpy as np
from sklearn import datasets
import seaborn.apionly as sns
%matplotlib inline
import matplotlib.pyplot as plt
sns.set(style='whitegrid', context='notebook')

The Iris Dataset

It's time to load the Iris dataset. This is one of the most well-known historical datasets. You will find it in many books and publications. Given the good properties of the data, it is useful for classification and regression examples. The Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) contains 50 records for each of the three types of iris, 150 lines in total, over five fields. Each line is a measurement of the following:

Sepal length in cm
Sepal width in cm
Petal length in cm
Petal width in cm

The final field is the type of flower (setosa, versicolor, or virginica). Let's use the load_dataset method to create a matrix of values from the dataset:

iris2 = sns.load_dataset('iris')

In order to understand the dependencies between variables, we will implement the covariance operation. It will receive two arrays as parameters and will return the covariance(x,y) value:

def covariance (X, Y):
    xhat=np.mean(X)
    yhat=np.mean(Y)
    epsilon=0
    for x,y in zip (X,Y):
        epsilon=epsilon+(x-xhat)*(y-yhat)
    return epsilon/(len(X)-1)

Let's try the implemented function and compare it with the NumPy function. Note that we calculated cov(a,b), and NumPy generated a matrix of all the combinations cov(a,a), cov(a,b), so our result should be equal to the values (1,0) and (0,1) of that matrix:

print (covariance ([1,3,4], [1,0,2]))
print (np.cov([1,3,4], [1,0,2]))

0.5
[[ 2.33333333  0.5       ]
 [ 0.5         1.        ]]

Now let's define the correlation function, which, like covariance, receives two arrays and uses them to get the final value:

def correlation (X, Y):
    return (covariance(X,Y)/(np.std(X, ddof=1)*np.std(Y, ddof=1)))  # We have to indicate ddof=1 for the unbiased std

Let's test this function with two sample arrays, and compare this with the (0,1) and (1,0) values of the correlation matrix from NumPy:

print (correlation ([1,1,4,3], [1,0,2,2]))
print (np.corrcoef ([1,1,4,3], [1,0,2,2]))

0.870388279778
[[ 1.          0.87038828]
 [ 0.87038828  1.        ]]

Getting an intuitive idea with Seaborn pairplot

A very good idea when starting work on a problem is to get a graphical representation of all the possible variable combinations. Seaborn's pairplot function provides a complete graphical summary of all the variable pairs, represented as scatterplots, and a representation of the univariate distribution for the matrix diagonal.
Let’s look at how this plot type shows all the variables dependencies, and try to look for a linear relationship as a base to test our regression methods: sns.pairplot(iris2, size=3.0) <seaborn.axisgrid.PairGrid at 0x7f8a2a30e828> Pairplot of all the variables in the dataset. Lets' select two variables that, from our initial analysis, have the property of being linearly dependent. They are petal_width and petal_length: X=iris2['petal_width'] Y=iris2['petal_length'] Let’s now take a look at this variable combination, which shows a clear linear tendency: plt.scatter(X,Y) This is the representation of the chosen variables, in a scatter type graph: This is the current distribution of data that we will try to model with our linear prediction function. Creating the prediction function First, let's define the function that will abstractedly represent the modeled data, in the form of a linear function, with the form y=beta*x+alpha: def predict(alpha, beta, x_i): return beta * x_i + alpha Defining the error function It’s now time to define the function that will show us the difference between predictions and the expected output during training. We have two main alternatives: measuring the absolute difference between the values (or L1), or measuring a variant of the square of the difference (or L2). Let’s define both versions, including the first formulation inside the second: def error(alpha, beta, x_i, y_i): #L1 return y_i - predict(alpha, beta, x_i) def sum_sq_e(alpha, beta, x, y): #L2 return sum(error(alpha, beta, x_i, y_i) ** 2 for x_i, y_i in zip(x, y)) Correlation fit Now, we will define a function implementing the correlation method to find the parameters for our regression: def correlation_fit(x, y): beta = correlation(x, y) * np.std(y, ddof=1) / np.std(x,ddof=1) alpha = np.mean(y) - beta * np.mean(x) return alpha, beta Let’s then run the fitting function and print the guessed parameters: alpha, beta = correlation_fit(X, Y) print(alpha) print(beta) 1.08355803285 2.22994049512 Let’s now graph the regressed line with the data in order to intuitively show the appropriateness of the solution: plt.scatter(X,Y) xr=np.arange(0,3.5) plt.plot(xr,(xr*beta)+alpha) This is the final plot we will get with our recently calculated slope and intercept: Final regressed line Polynomial regression and an introduction to underfitting and overfitting When looking for a model, one of the main characteristics we look for is the power of generalizing with a simple functional expression. When we increase the complexity of the model, it's possible that we are building a model that is good for the training data, but will be too optimized for that particular subset of data. Underfitting, on the other hand, applies to situations where the model is too simple, such as this case, which can be represented fairly well with a simple linear model. In the following example, we will work on the same problem as before, using the scikit- learn library to search higher-order polynomials to fit the incoming data with increasingly complex degrees. 
Going beyond the normal threshold of a quadratic function, we will see how the function looks to fit every wrinkle in the data, but when we extrapolate, the values outside the normal range are clearly out of range: from sklearn.linear_model import Ridge from sklearn.preprocessing import PolynomialFeatures from sklearn.pipeline import make_pipeline ix=iris2['petal_width'] iy=iris2['petal_length'] # generate points used to represent the fitted function x_plot = np.linspace(0, 2.6, 100) # create matrix versions of these arrays X = ix[:, np.newaxis] X_plot = x_plot[:, np.newaxis] plt.scatter(ix, iy, s=30, marker='o', label="training points") for count, degree in enumerate([3, 6, 20]): model = make_pipeline(PolynomialFeatures(degree), Ridge()) model.fit(X, iy) y_plot = model.predict(X_plot) plt.plot(x_plot, y_plot, label="degree %d" % degree) plt.legend(loc='upper left') plt.show() The combined graph shows how the different polynomials' coefficients describe the data population in different ways. The 20 degree polynomial shows clearly how it adjusts perfectly for the trained dataset, and after the known values, it diverges almost spectacularly, going against the goal of generalizing for future data. Curve fitting of the initial dataset, with polynomials of increasing values With this, we successfully explored how to develop an efficient linear regression model in Python and how you can make predictions using the designed model. We've reviewed ways to identify and optimize the correlation between the prediction and the expected output using simple and definite functions. If you enjoyed our post, you must check out Machine Learning for Developers to uncover advanced tools for building machine learning applications on your fingertips.  

Hadoop and MapReduce

Packt
05 Mar 2015
43 min read
In this article by the author, Thilina Gunarathne, of the book, Hadoop MapReduce v2 Cookbook - Second Edition, we will learn about Hadoop and MapReduce.

We are living in the era of big data, where the exponential growth of phenomena such as the web, social networking, smartphones, and so on is producing petabytes of data on a daily basis. Gaining insights from analyzing these very large amounts of data has become a must-have competitive advantage for many industries. However, the size and the possibly unstructured nature of these data sources make it impossible to use traditional solutions such as relational databases to store and analyze these datasets.

(For more resources related to this topic, see here.)

Storing, processing, and analyzing petabytes of data in a meaningful and timely manner require many compute nodes with thousands of disks and thousands of processors, together with the ability to efficiently communicate massive amounts of data among them. Such a scale makes failures such as disk failures, compute node failures, network failures, and so on a common occurrence, making fault tolerance a very important aspect of such systems. Other common challenges that arise include the significant cost of resources, handling communication latencies, handling heterogeneous compute resources, synchronization across nodes, and load balancing. As you can infer, developing and maintaining distributed parallel applications to process massive amounts of data while handling all these issues is not an easy task. This is where Apache Hadoop comes to our rescue.

Google is one of the first organizations to face the problem of processing massive amounts of data. Google built a framework for large-scale data processing borrowing the map and reduce paradigms from the functional programming world and named it MapReduce. At the foundation of Google MapReduce was the Google File System, which is a high-throughput parallel filesystem that enables the reliable storage of massive amounts of data using commodity computers. Seminal research publications that introduced the Google MapReduce and Google File System concepts can be found at http://research.google.com/archive/mapreduce.html and http://research.google.com/archive/gfs.html. Apache Hadoop MapReduce is the most widely known and widely used open source implementation of the Google MapReduce paradigm. Apache Hadoop Distributed File System (HDFS) provides an open source implementation of the Google File System concept.

Apache Hadoop MapReduce, HDFS, and YARN provide a scalable, fault-tolerant, distributed platform for storage and processing of very large datasets across clusters of commodity computers. Unlike traditional High Performance Computing (HPC) clusters, Hadoop uses the same set of compute nodes for data storage as well as to perform the computations, allowing Hadoop to improve the performance of large-scale computations by collocating computations with the storage. Also, the hardware cost of a Hadoop cluster is orders of magnitude cheaper than HPC clusters and database appliances due to the usage of commodity hardware and commodity interconnects. Together, Hadoop-based frameworks have become the de-facto standard for storing and processing big data.

Hadoop Distributed File System – HDFS

HDFS is a block-structured distributed filesystem that is designed to store petabytes of data reliably on compute clusters made out of commodity hardware.
HDFS overlays on top of the existing filesystem of the compute nodes and stores files by breaking them into coarser grained blocks (for example, 128 MB). HDFS performs better with large files. HDFS distributes the data blocks of large files across to all the nodes of the cluster to facilitate the very high parallel aggregate read bandwidth when processing the data. HDFS also stores redundant copies of these data blocks in multiple nodes to ensure reliability and fault tolerance. Data processing frameworks such as MapReduce exploit these distributed sets of data blocks and the redundancy to maximize the data local processing of large datasets, where most of the data blocks would get processed locally in the same physical node as they are stored. HDFS consists of NameNode and DataNode services providing the basis for the distributed filesystem. NameNode stores, manages, and serves the metadata of the filesystem. NameNode does not store any real data blocks. DataNode is a per node service that manages the actual data block storage in the DataNodes. When retrieving data, client applications first contact the NameNode to get the list of locations the requested data resides in and then contact the DataNodes directly to retrieve the actual data. The following diagram depicts a high-level overview of the structure of HDFS: Hadoop v2 brings in several performance, scalability, and reliability improvements to HDFS. One of the most important among those is the High Availability (HA) support for the HDFS NameNode, which provides manual and automatic failover capabilities for the HDFS NameNode service. This solves the widely known NameNode single point of failure weakness of HDFS. Automatic NameNode high availability of Hadoop v2 uses Apache ZooKeeper for failure detection and for active NameNode election. Another important new feature is the support for HDFS federation. HDFS federation enables the usage of multiple independent HDFS namespaces in a single HDFS cluster. These namespaces would be managed by independent NameNodes, but share the DataNodes of the cluster to store the data. The HDFS federation feature improves the horizontal scalability of HDFS by allowing us to distribute the workload of NameNodes. Other important improvements of HDFS in Hadoop v2 include the support for HDFS snapshots, heterogeneous storage hierarchy support (Hadoop 2.3 or higher), in-memory data caching support (Hadoop 2.3 or higher), and many performance improvements. Almost all the Hadoop ecosystem data processing technologies utilize HDFS as the primary data storage. HDFS can be considered as the most important component of the Hadoop ecosystem due to its central nature in the Hadoop architecture. Hadoop YARN YARN (Yet Another Resource Negotiator) is the major new improvement introduced in Hadoop v2. YARN is a resource management system that allows multiple distributed processing frameworks to effectively share the compute resources of a Hadoop cluster and to utilize the data stored in HDFS. YARN is a central component in the Hadoop v2 ecosystem and provides a common platform for many different types of distributed applications. The batch processing based MapReduce framework was the only natively supported data processing framework in Hadoop v1. 
While MapReduce works well for analyzing large amounts of data, MapReduce by itself is not sufficient enough to support the growing number of other distributed processing use cases such as real-time data computations, graph computations, iterative computations, and real-time data queries. The goal of YARN is to allow users to utilize multiple distributed application frameworks that provide such capabilities side by side sharing a single cluster and the HDFS filesystem. Some examples of the current YARN applications include the MapReduce framework, Tez high performance processing framework, Spark processing engine, and the Storm real-time stream processing framework. The following diagram depicts the high-level architecture of the YARN ecosystem: The YARN ResourceManager process is the central resource scheduler that manages and allocates resources to the different applications (also known as jobs) submitted to the cluster. YARN NodeManager is a per node process that manages the resources of a single compute node. Scheduler component of the ResourceManager allocates resources in response to the resource requests made by the applications, taking into consideration the cluster capacity and the other scheduling policies that can be specified through the YARN policy plugin framework. YARN has a concept called containers, which is the unit of resource allocation. Each allocated container has the rights to a certain amount of CPU and memory in a particular compute node. Applications can request resources from YARN by specifying the required number of containers and the CPU and memory required by each container. ApplicationMaster is a per-application process that coordinates the computations for a single application. The first step of executing a YARN application is to deploy the ApplicationMaster. After an application is submitted by a YARN client, the ResourceManager allocates a container and deploys the ApplicationMaster for that application. Once deployed, the ApplicationMaster is responsible for requesting and negotiating the necessary resource containers from the ResourceManager. Once the resources are allocated by the ResourceManager, ApplicationMaster coordinates with the NodeManagers to launch and monitor the application containers in the allocated resources. The shifting of application coordination responsibilities to the ApplicationMaster reduces the burden on the ResourceManager and allows it to focus solely on managing the cluster resources. Also having separate ApplicationMasters for each submitted application improves the scalability of the cluster as opposed to having a single process bottleneck to coordinate all the application instances. The following diagram depicts the interactions between various YARN components, when a MapReduce application is submitted to the cluster: While YARN supports many different distributed application execution frameworks, our focus in this article is mostly on traditional MapReduce and related technologies. Hadoop MapReduce Hadoop MapReduce is a data processing framework that can be utilized to process massive amounts of data stored in HDFS. As we mentioned earlier, distributed processing of a massive amount of data in a reliable and efficient manner is not an easy task. Hadoop MapReduce aims to make it easy for users by providing a clean abstraction for programmers by providing automatic parallelization of the programs and by providing framework managed fault tolerance support. MapReduce programming model consists of Map and Reduce functions. 
The Map function receives each record of the input data (lines of a file, rows of a database, and so on) as key-value pairs and outputs key-value pairs as the result. By design, each Map function invocation is independent of each other allowing the framework to use divide and conquer to execute the computation in parallel. This also allows duplicate executions or re-executions of the Map tasks in case of failures or load imbalances without affecting the results of the computation. Typically, Hadoop creates a single Map task instance for each HDFS data block of the input data. The number of Map function invocations inside a Map task instance is equal to the number of data records in the input data block of the particular Map task instance. Hadoop MapReduce groups the output key-value records of all the Map tasks of a computation by the key and distributes them to the Reduce tasks. This distribution and transmission of data to the Reduce tasks is called the Shuffle phase of the MapReduce computation. Input data to each Reduce task would also be sorted and grouped by the key. The Reduce function gets invoked for each key and the group of values of that key (reduce <key, list_of_values>) in the sorted order of the keys. In a typical MapReduce program, users only have to implement the Map and Reduce functions and Hadoop takes care of scheduling and executing them in parallel. Hadoop will rerun any failed tasks and also provide measures to mitigate any unbalanced computations. Have a look at the following diagram for a better understanding of the MapReduce data and computational flows: In Hadoop 1.x, the MapReduce (MR1) components consisted of the JobTracker process, which ran on a master node managing the cluster and coordinating the jobs, and TaskTrackers, which ran on each compute node launching and coordinating the tasks executing in that node. Neither of these processes exist in Hadoop 2.x MapReduce (MR2). In MR2, the job coordinating responsibility of JobTracker is handled by an ApplicationMaster that will get deployed on-demand through YARN. The cluster management and job scheduling responsibilities of JobTracker are handled in MR2 by the YARN ResourceManager. JobHistoryServer has taken over the responsibility of providing information about the completed MR2 jobs. YARN NodeManagers provide the functionality that is somewhat similar to MR1 TaskTrackers by managing resources and launching containers (which in the case of MapReduce 2 houses Map or Reduce tasks) in the compute nodes. Hadoop installation modes Hadoop v2 provides three installation choices: Local mode: The local mode allows us to run MapReduce computation using just the unzipped Hadoop distribution. This nondistributed mode executes all parts of Hadoop MapReduce within a single Java process and uses the local filesystem as the storage. The local mode is very useful for testing/debugging the MapReduce applications locally. Pseudo distributed mode: Using this mode, we can run Hadoop on a single machine emulating a distributed cluster. This mode runs the different services of Hadoop as different Java processes, but within a single machine. This mode is good to let you play and experiment with Hadoop. Distributed mode: This is the real distributed mode that supports clusters that span from a few nodes to thousands of nodes. 
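Before moving on to cluster deployment options, the following is a hedged sketch of what the Map and Reduce functions described earlier look like in code: the classic WordCount program written against the Hadoop v2 org.apache.hadoop.mapreduce API. The class names are illustrative (they are not taken from the book's sample code), but the program is self-contained and can be compiled and run in the local mode just described:

// WordCountSketch.java -- illustrative example, not the book's sample code
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // One map() invocation per input record (a line of text);
      // emit a <word, 1> pair for every token in the line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // One reduce() invocation per key, receiving all values emitted
      // for that key during the Shuffle phase.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation of map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

This is essentially the same structure as the WordCountWithTools program that the MRUnit recipe later in this article tests.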
For production clusters, we recommend using one of the many packaged Hadoop distributions as opposed to installing Hadoop from scratch using the Hadoop release binaries, unless you have a specific use case that requires a vanilla Hadoop installation. Refer to the Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe for more information on Hadoop distributions. Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution The Hadoop YARN ecosystem now contains many useful components providing a wide range of data processing, storing, and querying functionalities for the data stored in HDFS. However, manually installing and configuring all of these components to work together correctly using individual release artifacts is quite a challenging task. Other challenges of such an approach include the monitoring and maintenance of the cluster and the multiple Hadoop components. Luckily, there exist several commercial software vendors that provide well integrated packaged Hadoop distributions to make it much easier to provision and maintain a Hadoop YARN ecosystem in our clusters. These distributions often come with easy GUI-based installers that guide you through the whole installation process and allow you to select and install the components that you require in your Hadoop cluster. They also provide tools to easily monitor the cluster and to perform maintenance operations. For regular production clusters, we recommend using a packaged Hadoop distribution from one of the well-known vendors to make your Hadoop journey much easier. Some of these commercial Hadoop distributions (or editions of the distribution) have licenses that allow us to use them free of charge with optional paid support agreements. Hortonworks Data Platform (HDP) is one such well-known Hadoop YARN distribution that is available free of charge. All the components of HDP are available as free and open source software. You can download HDP from http://hortonworks.com/hdp/downloads/. Refer to the installation guides available in the download page for instructions on the installation. Cloudera CDH is another well-known Hadoop YARN distribution. The Express edition of CDH is available free of charge. Some components of the Cloudera distribution are proprietary and available only for paying clients. You can download Cloudera Express from http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-express.html. Refer to the installation guides available on the download page for instructions on the installation. Hortonworks HDP, Cloudera CDH, and some of the other vendors provide fully configured quick start virtual machine images that you can download and run on your local machine using a virtualization software product. These virtual machines are an excellent resource to learn and try the different Hadoop components as well as for evaluation purposes before deciding on a Hadoop distribution for your cluster. Apache Bigtop is an open source project that aims to provide packaging and integration/interoperability testing for the various Hadoop ecosystem components. Bigtop also provides a vendor neutral packaged Hadoop distribution. While it is not as sophisticated as the commercial distributions, Bigtop is easier to install and maintain than using release binaries of each of the Hadoop components. In this recipe, we provide steps to use Apache Bigtop to install Hadoop ecosystem in your local machine. 
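The detailed Bigtop installation steps are not included in this excerpt. Whichever installation route you choose (Bigtop, a vendor distribution, or the release binaries), a quick smoke test of the resulting cluster can save a lot of debugging later. The commands below are standard Hadoop CLI commands; only the example JAR path is an assumption and may differ between distributions:

$ hadoop version                      # confirm the client tools are on the PATH
$ hdfs dfsadmin -report               # list live DataNodes and HDFS capacity
$ yarn node -list                     # list NodeManagers registered with YARN
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 4 1000

If the pi estimation job completes, HDFS, YARN, and MapReduce are all wired together correctly.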
Benchmarking Hadoop MapReduce using TeraSort

Hadoop TeraSort is a well-known benchmark that aims to sort 1 TB of data as fast as possible using Hadoop MapReduce. The TeraSort benchmark stresses almost every part of the Hadoop MapReduce framework as well as the HDFS filesystem, making it an ideal choice to fine-tune the configuration of a Hadoop cluster. The original TeraSort benchmark sorts 10 billion 100-byte records, making the total data size 1 TB. However, we can specify the number of records, making it possible to configure the total size of data.

Getting ready

You must set up and deploy HDFS and Hadoop v2 YARN MapReduce prior to running these benchmarks, and locate the hadoop-mapreduce-examples-*.jar file in your Hadoop installation.

How to do it...

The following steps will show you how to run the TeraSort benchmark on the Hadoop cluster:

The first step of the TeraSort benchmark is the data generation. You can use the teragen command to generate the input data for the TeraSort benchmark. The first parameter of teragen is the number of records and the second parameter is the HDFS directory to generate the data. The following command generates 1 GB of data consisting of 10 million records to the tera-in directory in HDFS. Change the location of the hadoop-mapreduce-examples-*.jar file in the following commands according to your Hadoop installation:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000000 tera-in

It's a good idea to specify the number of Map tasks to the teragen computation to speed up the data generation. This can be done by specifying the -Dmapred.map.tasks parameter. Also, you can increase the HDFS block size for the generated data so that the Map tasks of the TeraSort computation would be coarser grained (the number of Map tasks for a Hadoop computation typically equals the number of input data blocks). This can be done by specifying the -Ddfs.block.size parameter.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Ddfs.block.size=536870912 -Dmapred.map.tasks=256 10000000 tera-in

The second step of the TeraSort benchmark is the execution of the TeraSort MapReduce computation on the data generated in step 1 using the following command. The first parameter of the terasort command is the input HDFS data directory, and the second parameter is the output HDFS data directory.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort tera-in tera-out

It's a good idea to specify the number of Reduce tasks to the TeraSort computation to speed up the Reducer part of the computation. This can be done by specifying the -Dmapred.reduce.tasks parameter as follows:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort -Dmapred.reduce.tasks=32 tera-in tera-out

The last step of the TeraSort benchmark is the validation of the results. This can be done using the teravalidate application as follows. The first parameter is the directory with the sorted data and the second parameter is the directory to store the report containing the results.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teravalidate tera-out tera-validate

How it works...

TeraSort uses the sorting capability of the MapReduce framework together with a custom range Partitioner to divide the Map output among the Reduce tasks, ensuring the global sorted order.
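When using TeraSort to tune a cluster, you will typically run the three steps many times with different configurations. The following wrapper script is our own convenience sketch (not part of the original recipe); the examples JAR path and the record count are assumptions you should adjust for your cluster:

#!/bin/bash
# Hedged convenience wrapper around the teragen/terasort/teravalidate steps.
EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
RECORDS=10000000        # ~1 GB of input; use 10000000000 for the full 1 TB run

hdfs dfs -rm -r -f tera-in tera-out tera-validate    # clean up previous runs

time hadoop jar $EXAMPLES_JAR teragen $RECORDS tera-in
time hadoop jar $EXAMPLES_JAR terasort tera-in tera-out
time hadoop jar $EXAMPLES_JAR teravalidate tera-out tera-validate

hdfs dfs -cat tera-validate/*                        # print the validation report

Recording the elapsed time of the terasort step for each configuration change gives you a simple, repeatable way to compare the effect of the YARN and MapReduce settings discussed in the next recipe.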
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments In this recipe, we explore some of the important configuration options of Hadoop YARN and Hadoop MapReduce. Commercial Hadoop distributions typically provide a GUI-based approach to specify Hadoop configurations. YARN allocates resource containers to the applications based on the resource requests made by the applications and the available resource capacity of the cluster. A resource request by an application would consist of the number of containers required and the resource requirement of each container. Currently, most container resource requirements are specified using the amount of memory. Hence, our focus in this recipe will be mainly on configuring the memory allocation of a YARN cluster. Getting ready Set up a Hadoop cluster by following the earlier recipes. How to do it... The following instructions will show you how to configure the memory allocation in a YARN cluster. The number of tasks per node is derived using this configuration: The following property specifies the amount of memory (RAM) that can be used by YARN containers in a worker node. It's advisable to set this slightly less than the amount of physical RAM present in the node, leaving some memory for the OS and other non-Hadoop processes. Add or modify the following lines in the yarn-site.xml file: <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>100240</value> </property> The following property specifies the minimum amount of memory (RAM) that can be allocated to a YARN container in a worker node. Add or modify the following lines in the yarn-site.xml file to configure this property. If we assume that all the YARN resource-requests request containers with only the minimum amount of memory, the maximum number of concurrent resource containers that can be executed in a node equals (YARN memory per node specified in step 1)/(YARN minimum allocation configured below). Based on this relationship, we can use the value of the following property to achieve the desired number of resource containers per node.The number of resource containers per node is recommended to be less than or equal to the minimum of (2*number CPU cores) or (2* number of disks). <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>3072</value> </property> Restart the YARN ResourceManager and NodeManager services by running sbin/stop-yarn.sh and sbin/start-yarn.sh from the HADOOP_HOME directory. The following instructions will show you how to configure the memory requirements of the MapReduce applications. The following properties define the maximum amount of memory (RAM) that will be available to each Map and Reduce task. These memory values will be used when MapReduce applications request resources from YARN for Map and Reduce task containers. Add the following lines to the mapred-site.xml file: <property> <name>mapreduce.map.memory.mb</name> <value>3072</value> </property> <property> <name>mapreduce.reduce.memory.mb</name> <value>6144</value> </property> The following properties define the JVM heap size of the Map and Reduce tasks respectively. Set these values to be slightly less than the corresponding values in step 4, so that they won't exceed the resource limits of the YARN containers. Add the following lines to the mapred-site.xml file: <property> <name>mapreduce.map.java.opts</name> <value>-Xmx2560m</value> </property> <property> <name>mapreduce.reduce.java.opts</name> <value>-Xmx5120m</value> </property> How it works... 
We can control Hadoop configurations through the following four configuration files. Hadoop reloads the configurations from these configuration files after a cluster restart:

core-site.xml: Contains the configurations common to the whole Hadoop distribution
hdfs-site.xml: Contains configurations for HDFS
mapred-site.xml: Contains configurations for MapReduce
yarn-site.xml: Contains configurations for the YARN ResourceManager and NodeManager processes

Each configuration file has name-value pairs expressed in XML format, defining the configurations of different aspects of Hadoop. The following is an example of a property in a configuration file. The <configuration> tag is the top-level parent XML container and <property> tags, which define individual properties, are specified as child tags inside the <configuration> tag:

<configuration>
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>20</value>
  </property>
  ...
</configuration>

Some configurations can be configured on a per-job basis using the job.getConfiguration().set(name, value) method from the Hadoop MapReduce job driver code.

There's more...

There are many similar important configuration properties defined in Hadoop. The following are some of them:

conf/core-site.xml
- fs.inmemory.size.mb (default value: 200): Amount of memory allocated to the in-memory filesystem that is used to merge map outputs at reducers, in MBs
- io.file.buffer.size (default value: 131072): Size of the read/write buffer used by sequence files

conf/mapred-site.xml
- mapreduce.reduce.shuffle.parallelcopies (default value: 20): Maximum number of parallel copies the reduce step will execute to fetch output from many parallel jobs
- mapreduce.task.io.sort.factor (default value: 50): Maximum number of streams merged while sorting files
- mapreduce.task.io.sort.mb (default value: 200): Memory limit while sorting data in MBs

conf/hdfs-site.xml
- dfs.blocksize (default value: 134217728): HDFS block size
- dfs.namenode.handler.count (default value: 200): Number of server threads to handle RPC calls in NameNodes

You can find a list of deprecated properties in the latest version of Hadoop and the new replacement properties for them at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html. The following documents provide the list of properties, their default values, and the descriptions of each of the configuration files mentioned earlier:

Common configuration: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
HDFS configuration: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
YARN configuration: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
MapReduce configuration: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

Unit testing Hadoop MapReduce applications using MRUnit

MRUnit is a JUnit-based Java library that allows us to unit test Hadoop MapReduce programs. This makes it easy to develop as well as to maintain Hadoop MapReduce code bases. MRUnit supports testing Mappers and Reducers separately as well as testing MapReduce computations as a whole. In this recipe, we'll be exploring all three testing scenarios.

Getting ready

We use Gradle as the build tool for our sample code base.

How to do it...
The following steps show you how to perform unit testing of a Mapper using MRUnit: In the setUp method of the test class, initialize an MRUnit MapDriver instance with the Mapper class you want to test. In this example, we are going to test the Mapper of the WordCount MapReduce application we discussed in earlier recipes: public class WordCountWithToolsTest {   MapDriver<Object, Text, Text, IntWritable> mapDriver;   @Before public void setUp() {    WordCountWithTools.TokenizerMapper mapper =       new WordCountWithTools.TokenizerMapper();    mapDriver = MapDriver.newMapDriver(mapper); } …… } Write a test function to test the Mapper logic. Provide the test input to the Mapper using the MapDriver.withInput method. Then, provide the expected result of the Mapper execution using the MapDriver.withOutput method. Now, invoke the test using the MapDriver.runTest method. The MapDriver.withAll and MapDriver.withAllOutput methods allow us to provide a list of test inputs and a list of expected outputs, rather than adding them individually. @Test public void testWordCountMapper() throws IOException {    IntWritable inKey = new IntWritable(0);    mapDriver.withInput(inKey, new Text("Test Quick"));    ….    mapDriver.withOutput(new Text("Test"),new     IntWritable(1));    mapDriver.withOutput(new Text("Quick"),new     IntWritable(1));    …    mapDriver.runTest(); } The following step shows you how to perform unit testing of a Reducer using MRUnit. Similar to step 1 and 2, initialize a ReduceDriver by providing the Reducer class under test and then configure the ReduceDriver with the test input and the expected output. The input to the reduce function should conform to a key with a list of values. Also, in this test, we use the ReduceDriver.withAllOutput method to provide a list of expected outputs. public class WordCountWithToolsTest { ReduceDriver<Text,IntWritable,Text,IntWritable>   reduceDriver;   @Before public void setUp() {    WordCountWithTools.IntSumReducer reducer =       new WordCountWithTools.IntSumReducer();    reduceDriver = ReduceDriver.newReduceDriver(reducer); }   @Test public void testWordCountReduce() throws IOException {    ArrayList<IntWritable> reduceInList =       new ArrayList<IntWritable>();    reduceInList.add(new IntWritable(1));    reduceInList.add(new IntWritable(2));      reduceDriver.withInput(new Text("Quick"),     reduceInList);    ...    ArrayList<Pair<Text, IntWritable>> reduceOutList =       new ArrayList<Pair<Text,IntWritable>>();    reduceOutList.add(new Pair<Text, IntWritable>     (new Text("Quick"),new IntWritable(3)));    ...    reduceDriver.withAllOutput(reduceOutList);    reduceDriver.runTest(); } } The following steps show you how to perform unit testing on a whole MapReduce computation using MRUnit. In this step, initialize a MapReduceDriver by providing the Mapper class and Reducer class of the MapReduce program that you want to test. Then, configure the MapReduceDriver with the test input data and the expected output data. When executed, this test will execute the MapReduce execution flow starting from the Map input stage to the Reduce output stage. It's possible to provide a combiner implementation to this test as well. public class WordCountWithToolsTest { …… MapReduceDriver<Object, Text, Text, IntWritable, Text,IntWritable> mapReduceDriver; @Before public void setUp() { .... mapReduceDriver = MapReduceDriver. 
newMapReduceDriver(mapper, reducer); } @Test public void testWordCountMapReduce() throws IOException { IntWritable inKey = new IntWritable(0); mapReduceDriver.withInput(inKey, new Text ("Test Quick")); …… ArrayList<Pair<Text, IntWritable>> reduceOutList = new ArrayList<Pair<Text,IntWritable>>(); reduceOutList.add(new Pair<Text, IntWritable> (new Text("Quick"),new IntWritable(2))); …… mapReduceDriver.withAllOutput(reduceOutList); mapReduceDriver.runTest(); } } The Gradle build script (or any other Java build mechanism) can be configured to execute these unit tests with every build. We can add the MRUnit dependency to the Gradle build file as follows: dependencies { testCompile group: 'org.apache.mrunit', name: 'mrunit',   version: '1.1.+',classifier: 'hadoop2' …… } Use the following Gradle command to execute only the WordCountWithToolsTest unit test. This command executes any test class that matches the pattern **/WordCountWith*.class: $ gradle –Dtest.single=WordCountWith test :chapter3:compileJava UP-TO-DATE :chapter3:processResources UP-TO-DATE :chapter3:classes UP-TO-DATE :chapter3:compileTestJava UP-TO-DATE :chapter3:processTestResources UP-TO-DATE :chapter3:testClasses UP-TO-DATE :chapter3:test BUILD SUCCESSFUL Total time: 27.193 secs You can also execute MRUnit-based unit tests in your IDE. You can use the gradle eclipse or gradle idea commands to generate the project files for the Eclipse and IDEA IDE respectively. Generating an inverted index using Hadoop MapReduce Simple text searching systems rely on inverted index to look up the set of documents that contain a given word or a term. In this recipe, we implement a simple inverted index building application that computes a list of terms in the documents, the set of documents that contains each term, and the term frequency in each of the documents. Retrieval of results from an inverted index can be as simple as returning the set of documents that contains the given terms or can involve much more complex operations such as returning the set of documents ordered based on a particular ranking. Getting ready You must have Apache Hadoop v2 configured and installed to follow this recipe. Gradle is needed for the compiling and building of the source code. How to do it... In the following steps, we use a MapReduce program to build an inverted index for a text dataset: Create a directory in HDFS and upload a text dataset. This dataset should consist of one or more text files. $ hdfs dfs -mkdir input_dir $ hdfs dfs -put *.txt input_dir You can download the text versions of the Project Gutenberg books by following the instructions given at http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages. Make sure to provide the filetypes query parameter of the download request as txt. Unzip the downloaded files. You can use the unzipped text files as the text dataset for this recipe. Compile the source by running the gradle build command from the chapter 8 folder of the source repository. Run the inverted indexing MapReduce job using the following command.Provide the HDFS directory where you uploaded the input data in step 2 as the first argument and provide an HDFS path to store the output as the second argument: $ hadoop jar hcb-c8-samples.jar chapter8.invertindex.TextOutInvertedIndexMapReduce input_dir output_dir Check the output directory for the results by running the following command. 
The output of this program will consist of the term followed by a comma-separated list of filename and frequency:

$ hdfs dfs -cat output_dir/*
ARE three.txt:1,one.txt:1,four.txt:1,two.txt:1,
AS three.txt:2,one.txt:2,four.txt:2,two.txt:2,
AUGUSTA three.txt:1,
About three.txt:1,two.txt:1,
Abroad three.txt:2,

We used the text-outputting inverted indexing MapReduce program in step 3 for clarity in understanding the algorithm. Run the SequenceFile-outputting version of the program by substituting the command in step 3 with the following command:

$ hadoop jar hcb-c8-samples.jar chapter8.invertindex.InvertedIndexMapReduce input_dir seq_output_dir

How it works...

The Map function receives a chunk of an input document as the input and outputs the term and <docid, 1> pair for each word. In the Map function, we first replace all the non-alphanumeric characters from the input text value before tokenizing it as follows:

public void map(Object key, Text value, ……… {
  String valString = value.toString().replaceAll("[^a-zA-Z0-9]+", " ");
  StringTokenizer itr = new StringTokenizer(valString);

  FileSplit fileSplit = (FileSplit) context.getInputSplit();
  String fileName = fileSplit.getPath().getName();

  while (itr.hasMoreTokens()) {
    term.set(itr.nextToken());
    docFrequency.set(fileName, 1);
    context.write(term, docFrequency);
  }
}

We use the getInputSplit() method of MapContext to obtain a reference to the InputSplit assigned to the current Map task. The InputSplit instances for this computation are of the FileSplit type due to the usage of a FileInputFormat based InputFormat. Then we use the getPath() method of FileSplit to obtain the path of the file containing the current split and extract the filename from it. We use this extracted filename as the document ID when constructing the inverted index.

The Reduce function receives the IDs and frequencies of all the documents that contain the term (Key) as the input. The Reduce function then outputs the term and a list of document IDs and the number of occurrences of the term in each document as the output:

public void reduce(Text key, Iterable<TermFrequencyWritable> values, Context context) ………… {
  HashMap<Text, IntWritable> map = new HashMap<Text, IntWritable>();

  for (TermFrequencyWritable val : values) {
    Text docID = new Text(val.getDocumentID());
    int freq = val.getFreq().get();
    if (map.get(docID) != null) {
      map.put(docID, new IntWritable(map.get(docID).get() + freq));
    } else {
      map.put(docID, new IntWritable(freq));
    }
  }

  MapWritable outputMap = new MapWritable();
  outputMap.putAll(map);
  context.write(key, outputMap);
}

In the preceding model, we output a record for each word, generating a large amount of intermediate data between Map tasks and Reduce tasks. We use the following combiner to aggregate the terms emitted by the Map tasks, reducing the amount of intermediate data that needs to be transferred between Map and Reduce tasks:

public void reduce(Text key, Iterable<TermFrequencyWritable> values ……… {
  int count = 0;
  String id = "";

  for (TermFrequencyWritable val : values) {
    count++;
    if (count == 1) {
      id = val.getDocumentID().toString();
    }
  }

  TermFrequencyWritable writable = new TermFrequencyWritable();
  writable.set(id, count);
  context.write(key, writable);
}

In the driver program, we set the Mapper, Reducer, and Combiner classes. Also, we specify both the Output Value and the MapOutput Value properties as we use different value types for the Map tasks and the Reduce tasks.
… job.setMapperClass(IndexingMapper.class); job.setReducerClass(IndexingReducer.class); job.setCombinerClass(IndexingCombiner.class); … job.setMapOutputValueClass(TermFrequencyWritable.class); job.setOutputValueClass(MapWritable.class); job.setOutputFormatClass(SequenceFileOutputFormat.class); There's more... We can improve this indexing program by performing optimizations such as filtering stop words, substituting words with word stems, storing more information about the context of the word, and so on, making indexing a much more complex problem. Luckily, there exist several open source indexing frameworks that we can use for indexing purposes. The later recipes of this article will explore indexing using Apache Solr and Elasticsearch, which are based on the Apache Lucene indexing engine. The upcoming section introduces the usage of MapFileOutputFormat to store InvertedIndex in an indexed random accessible manner. Outputting a random accessible indexed InvertedIndex Apache Hadoop supports a file format called MapFile that can be used to store an index into the data stored in SequenceFiles. MapFile is very useful when we need to random access records stored in a large SequenceFile. You can use the MapFileOutputFormat format to output MapFiles, which would consist of a SequenceFile containing the actual data and another file containing the index into the SequenceFile. The chapter8/invertindex/MapFileOutInvertedIndexMR.java MapReduce program in the source folder of chapter8 utilizes MapFiles to store a secondary index into our inverted index. You can execute that program by using the following command. The third parameter (sample_lookup_term) should be a word that is present in your input dataset: $ hadoop jar hcb-c8-samples.jar      chapter8.invertindex.MapFileOutInvertedIndexMR      input_dir indexed_output_dir sample_lookup_term If you check indexed_output_dir, you will be able to see folders named as part-r-xxxxx with each containing a data and an index file. We can load these indexes to MapFileOutputFormat and perform random lookups for the data. An example of a simple lookup using this method is given in the MapFileOutInvertedIndexMR.java program as follows: MapFile.Reader[] indexReaders = MapFileOutputFormat.getReaders(new Path(args[1]), getConf());MapWritable value = new MapWritable();Text lookupKey = new Text(args[2]);// Performing the lookup for the values if the lookupKeyWritable map = MapFileOutputFormat.getEntry(indexReaders,new HashPartitioner<Text, MapWritable>(), lookupKey, value); In order to use this feature, you need to make sure to disable Hadoop from writing a _SUCCESS file in the output folder by setting the following property. The presence of the _SUCCESS file might cause an error when using MapFileOutputFormat to lookup the values in the index: job.getConfiguration().setBoolean     ("mapreduce.fileoutputcommitter.marksuccessfuljobs", false); Data preprocessing using Hadoop streaming and Python Data preprocessing is an important and often required component in data analytics. Data preprocessing becomes even more important when consuming unstructured text data generated from multiple different sources. Data preprocessing steps include operations such as cleaning the data, extracting important features from data, removing duplicate items from the datasets, converting data formats, and many more. Hadoop MapReduce provides an ideal environment to perform these tasks in parallel when processing massive datasets. 
Apart from using Java MapReduce programs or Pig scripts or Hive scripts to preprocess the data, Hadoop also contains several other tools and features that are useful in performing these data preprocessing operations. One such feature is the InputFormats, which provides us with the ability to support custom data formats by implementing custom InputFormats. Another feature is the Hadoop Streaming support, which allows us to use our favorite scripting languages to perform the actual data cleansing and extraction, while Hadoop will parallelize the computation to hundreds of compute and storage resources. In this recipe, we are going to use Hadoop Streaming with a Python script-based Mapper to perform data extraction and format conversion. Getting ready Check whether Python is already installed on the Hadoop worker nodes. If not, install Python on all the Hadoop worker nodes. How to do it... The following steps show how to clean and extract data from the 20news dataset and store the data as a tab-separated file: Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz: $ wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz$ tar –xzf 20news-19997.tar.gz Upload the extracted data to the HDFS. In order to save the compute time and resources, you can use only a subset of the dataset: $ hdfs dfs -mkdir 20news-all$ hdfs dfs –put <extracted_folder> 20news-all Extract the resource package and locate the MailPreProcessor.py Python script. Locate the hadoop-streaming.jar JAR file of the Hadoop installation in your machine. Run the following Hadoop Streaming command using that JAR. /usr/lib/hadoop-mapreduce/ is the hadoop-streaming JAR file's location for the BigTop-based Hadoop installations: $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input 20news-all/*/* -output 20news-cleaned -mapper MailPreProcessor.py -file MailPreProcessor.py Inspect the results using the following command: > hdfs dfs –cat 20news-cleaned/part-* | more How it works... Hadoop uses the default TextInputFormat as the input specification for the previous computation. Usage of the TextInputFormat generates a Map task for each file in the input dataset and generates a Map input record for each line. Hadoop streaming provides the input to the Map application through the standard input: line = sys.stdin.readline(); while line: …. if (doneHeaders):    list.append( line ) elif line.find( "Message-ID:" ) != -1:    messageID = line[ len("Message-ID:"):] …. elif line == "":    doneHeaders = True      line = sys.stdin.readline(); The preceding Python code reads the input lines from the standard input until it reaches the end of the file. We parse the headers of the newsgroup file till we encounter the empty line that demarcates the headers from the message contents. The message content will be read in to a list line by line: value = ' '.join( list ) value = fromAddress + "t" ……"t" + value print '%st%s' % (messageID, value) The preceding code segment merges the message content to a single string and constructs the output value of the streaming application as a tab-delimited set of selected headers, followed by the message content. The output key value is the Message-ID header extracted from the input file. The output is written to the standard output by using a tab to delimit the key and the value. There's more... 
We can generate the output of the preceding computation in the Hadoop SequenceFile format by specifying SequenceFileOutputFormat as the OutputFormat of the streaming computations: $ hadoop jar /usr/lib/Hadoop-mapreduce/hadoop-streaming.jar -input 20news-all/*/* -output 20news-cleaned -mapper MailPreProcessor.py -file MailPreProcessor.py -outputformat          org.apache.hadoop.mapred.SequenceFileOutputFormat -file MailPreProcessor.py It is a good practice to store the data as SequenceFiles (or other Hadoop binary file formats such as Avro) after the first pass of the input data because SequenceFiles takes up less space and supports compression. You can use hdfs dfs -text <path_to_sequencefile> to output the contents of a SequenceFile to text: $ hdfs dfs –text 20news-seq/part-* | more However, for the preceding command to work, any Writable classes that are used in the SequenceFile should be available in the Hadoop classpath. Loading large datasets to an Apache HBase data store – importtsv and bulkload The Apache HBase data store is very useful when storing large-scale data in a semi-structured manner, so that it can be used for further processing using Hadoop MapReduce programs or to provide a random access data storage for client applications. In this recipe, we are going to import a large text dataset to HBase using the importtsv and bulkload tools. Getting ready Install and deploy Apache HBase in your Hadoop cluster. Make sure Python is installed in your Hadoop compute nodes. How to do it… The following steps show you how to load the TSV (tab-separated value) converted 20news dataset in to an HBase table: Follow the Data preprocessing using Hadoop streaming and Python recipe to perform the preprocessing of data for this recipe. We assume that the output of the following step 4 of that recipe is stored in an HDFS folder named "20news-cleaned": $ hadoop jar    /usr/lib/hadoop-mapreduce/hadoop-streaming.jar    -input 20news-all/*/*    -output 20news-cleaned    -mapper MailPreProcessor.py -file MailPreProcessor.py Start the HBase shell: $ hbase shell Create a table named 20news-data by executing the following command in the HBase shell. Older versions of the importtsv (used in the next step) command can handle only a single column family. Hence, we are using only a single column family when creating the HBase table: hbase(main):001:0> create '20news-data','h' Execute the following command to import the preprocessed data to the HBase table created earlier: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg 20news-data 20news-cleaned Start the HBase Shell and use the count and scan commands of the HBase shell to verify the contents of the table: hbase(main):010:0> count '20news-data'           12xxx row(s) in 0.0250 seconds hbase(main):010:0> scan '20news-data', {LIMIT => 10} ROW                                       COLUMN+CELL                                                                           <1993Apr29.103624.1383@cronkite.ocis.te column=h:c1,       timestamp=1354028803355, value= katop@astro.ocis.temple.edu   (Chris Katopis)> <1993Apr29.103624.1383@cronkite.ocis.te column=h:c2,     timestamp=1354028803355, value= sci.electronics   ...... 
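Before moving on to the bulkload variant below, you can also sanity-check a single row directly (this step is our own addition, not part of the original recipe). The row key used here is the Message-ID shown in the scan output above; substitute any Message-ID present in your own dataset:

hbase(main):011:0> get '20news-data', '<1993Apr29.103624.1383@cronkite.ocis.temple.edu>'

The get command returns all the stored column values for that single row, which is a quick way to confirm that the importtsv column mapping behaved as expected.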
The following are the steps to load the 20news dataset to an HBase table using the bulkload feature: Follow steps 1 to 3, but create the table with a different name: hbase(main):001:0> create '20news-bulk','h' Use the following command to generate an HBase bulkload datafile: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg -Dimporttsv.bulk.output=hbaseloaddir 20news-bulk–source 20news-cleaned List the files to verify that the bulkload datafiles are generated: $ hadoop fs -ls 20news-bulk-source ...... drwxr-xr-x   - thilina supergroup         0 2014-04-27 10:06 /user/thilina/20news-bulk-source/h   $ hadoop fs -ls 20news-bulk-source/h -rw-r--r--   1 thilina supergroup     19110 2014-04-27 10:06 /user/thilina/20news-bulk-source/h/4796511868534757870 The following command loads the data to the HBase table by moving the output files to the correct location: $ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles 20news-bulk-source 20news-bulk......14/04/27 10:10:00 INFO mapreduce.LoadIncrementalHFiles: Tryingto load hfile=hdfs://127.0.0.1:9000/user/thilina/20news-bulksource/h/4796511868534757870 first= <1993Apr29.103624.1383@cronkite.ocis.temple.edu>last= <stephens.736002130@ngis>...... Start the HBase Shell and use the count and scan commands of the HBase shell to verify the contents of the table: hbase(main):010:0> count '20news-bulk'             hbase(main):010:0> scan '20news-bulk', {LIMIT => 10} How it works... The MailPreProcessor.py Python script extracts a selected set of data fields from the newsboard message and outputs them as a tab-separated dataset: value = fromAddress + "t" + newsgroup +"t" + subject +"t" + value print '%st%s' % (messageID, value) We import the tab-separated dataset generated by the Streaming MapReduce computations to HBase using the importtsv tool. The importtsv tool requires the data to have no other tab characters except for the tab characters that separate the data fields. Hence, we remove any tab characters that may be present in the input data by using the following snippet of the Python script: line = line.strip() line = re.sub('t',' ',line) The importtsv tool supports the loading of data into HBase directly using the Put operations as well as by generating the HBase internal HFiles as well. The following command loads the data to HBase directly using the Put operations. Our generated dataset contains a Key and four fields in the values. We specify the data fields to the table column name mapping for the dataset using the -Dimporttsv.columns parameter. This mapping consists of listing the respective table column names in the order of the tab-separated data fields in the input dataset: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<data field to table column mappings>    <HBase tablename> <HDFS input directory> We can use the following command to generate HBase HFiles for the dataset. These HFiles can be directly loaded to HBase without going through the HBase APIs, thereby reducing the amount of CPU and network resources needed: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<filed to column mappings> -Dimporttsv.bulk.output=<path for hfile output> <HBase tablename> <HDFS input directory> These generated HFiles can be loaded into HBase tables by simply moving the files to the right location. 
This moving can be performed by using the completebulkload command: $ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <HDFS path for hfiles> <table name> There's more... You can use the importtsv tool that has datasets with other data-filed separator characters as well by specifying the '-Dimporttsv.separator' parameter. The following is an example of using a comma as the separator character to import a comma-separated dataset in to an HBase table: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.columns=<data field to table column mappings>    <HBase tablename> <HDFS input directory> Look out for Bad Lines in the MapReduce job console output or in the Hadoop monitoring console. One reason for Bad Lines is to have unwanted delimiter characters. The Python script we used in the data-cleaning step removes any extra tabs in the message: 14/03/27 00:38:10 INFO mapred.JobClient:   ImportTsv 14/03/27 00:38:10 INFO mapred.JobClient:     Bad Lines=2 Data de-duplication using HBase HBase supports the storing of multiple versions of column values for each record. When querying, HBase returns the latest version of values, unless we specifically mention a time period. This feature of HBase can be used to perform automatic de-duplication by making sure we use the same RowKey for duplicate values. In our 20news example, we use MessageID as the RowKey for the records, ensuring duplicate messages will appear as different versions of the same data record. HBase allows us to configure the maximum or minimum number of versions per column family. Setting the maximum number of versions to a low value will reduce the data usage by discarding the old versions. Refer to http://hbase.apache.org/book/schema.versions.html for more information on setting the maximum or minimum number of versions. Summary In this article, we have learned about getting started with Hadoop, Benchmarking Hadoop MapReduce, optimizing Hadoop YARN, unit testing, generating an inverted index, data processing, and loading large datasets to an Apache HBase data store. Resources for Article: Further resources on this subject: Hive in Hadoop [article] Evolution of Hadoop [article] Learning Data Analytics with R and Hadoop [article]


Adjusting Key Parameters in R

Packt
24 Jan 2011
6 min read
R Graph Cookbook

Detailed hands-on recipes for creating the most useful types of graphs in R – starting from the simplest versions to more advanced applications
Learn to draw any type of graph or visual data representation in R
Filled with practical tips and techniques for creating any type of graph you need; not just theoretical explanations
All examples are accompanied with the corresponding graph images, so you know what the results look like
Each recipe is independent and contains the complete explanation and code to perform the task as efficiently as possible

Setting colors of points, lines, and bars

In this recipe we will learn the simplest way to change the colors of points, lines, and bars in scatter plots, line plots, histograms, and bar plots.

Getting ready

All you need to try out this recipe is to run R and type the recipe at the command prompt. You can also choose to save the recipe as a script so that you can use it again later on.

How to do it...

The simplest way to change the color of any graph element is by using the col argument. For example, the plot() function takes the col argument:

plot(rnorm(1000), col="red")

The code file can be downloaded from here.

If we choose the plot type as line, then the color is applied to the plotted line. Let's use the dailysales.csv example dataset. First, we need to load it:

sales <- read.csv("dailysales.csv", header=TRUE)
plot(sales$units~as.Date(sales$date,"%d/%m/%y"),
     type="l", #Specify type of plot as l for line
     col="blue")

Similarly, the points() and lines() functions apply the col argument's value to the plotted points and lines respectively. barplot() and hist() also take the col argument and apply it to the bars. So the following code would produce a bar plot with blue bars:

barplot(sales$ProductA~sales$City, col="blue")

The col argument for boxplot() is applied to the color of the boxes plotted.

How it works...

The col argument automatically applies the specified color to the elements being plotted, based on the plot type. So, if we do not specify a plot type or choose points, then the color is applied to points. Similarly, if we choose the plot type as line then the color is applied to the plotted line, and if we use the col argument in the barplot() or hist() commands, then the color is applied to the bars. col accepts names of colors such as red, blue, and black. The colors() (or colours()) function lists all the built-in colors (more than 650) available in R. We can also specify colors as hexadecimal codes such as #FF0000 (for red), #0000FF (for blue), and #000000 (for black). If you have ever made any web pages, you would know that these hex codes are used in HTML to represent colors. col can also take numeric values. When it is set to a numeric value, the color corresponding to that index in the current color palette is used. For example, in the default color palette the first color is black and the second color is red, so col=1 and col=2 refer to black and red respectively. Index 0 corresponds to the background color.

There's more...

In many settings, col can also take a vector of multiple colors, instead of a single color. This is useful if you wish to use more than one color in a graph. The heat.colors() function takes a number as an argument and returns a vector of that many colors. So heat.colors(5) produces a vector of five colors.
There's more...
In many settings, col can also take a vector of multiple colors, instead of a single color. This is useful if you wish to use more than one color in a graph. The heat.colors() function takes a number as an argument and returns a vector of that many colors. So heat.colors(5) produces a vector of five colors. Type the following at the R prompt:

heat.colors(5)

You should get the following output:

[1] "#FF0000FF" "#FF5500FF" "#FFAA00FF" "#FFFF00FF" "#FFFF80FF"

Those are five colors in hexadecimal format. Another way of specifying a vector of colors is to construct one:

barplot(as.matrix(sales[,2:4]),
        beside=T, legend=sales$City,
        col=c("red","blue","green","orange","pink"),
        border="white")

In the example, we set the value of col to c("red","blue","green","orange","pink"), which is a vector of five colors. We have to take care to make a vector whose length matches the number of elements we are plotting, in this case the bars. If the two numbers don't match, R will "recycle" values by repeating colors from the beginning of the vector. For example, if we had fewer colors in the vector than the number of elements, say four colors in the previous plot, then R would apply the four colors to the first four bars and then apply the first color to the fifth bar. This is called recycling in R (a self-contained example appears at the end of this recipe):

barplot(as.matrix(sales[,2:4]),
        beside=T, legend=sales$City,
        col=c("red","blue","green","orange"),
        border="white")

In this example, the bars for both the first and last data rows (Seattle and Mumbai) would be of the same color (red), making it difficult to distinguish one from the other.
One good way to ensure that you always have the correct number of colors is to find out the number of elements first and pass that as an argument to one of the color palette functions. For example, if we did not know the number of cities in the example we have just seen, we could do the following to make sure the number of colors matches the number of bars plotted:

barplot(as.matrix(sales[,2:4]),
        beside=T, legend=sales$City,
        col=heat.colors(length(sales$City)),
        border="white")

We used the length() function to find out the number of elements in the vector sales$City and passed that as the argument to heat.colors(). So, regardless of the number of cities, we will always have the right number of colors.

See also
In the next four recipes, we will see how to change the colors of other elements. The fourth recipe is especially useful, where we look at color combinations and palettes.
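To make the recycling behaviour from this recipe easy to reproduce without the dailysales.csv file, here is a small self-contained sketch; the data frame below is invented purely for illustration:

# Toy data standing in for dailysales.csv (made-up numbers)
sales <- data.frame(
  City = c("Seattle", "London", "Tokyo", "Berlin", "Mumbai"),
  ProductA = c(120, 90, 150, 80, 110),
  ProductB = c(100, 95, 130, 70, 105),
  ProductC = c(80, 60, 120, 90, 100)
)

# Five rows (bars per group) but only four colours: "red" is recycled for Mumbai
barplot(as.matrix(sales[,2:4]), beside=TRUE, legend=sales$City,
        col=c("red","blue","green","orange"), border="white")

# Matching the palette length to the number of rows avoids the clash
barplot(as.matrix(sales[,2:4]), beside=TRUE, legend=sales$City,
        col=heat.colors(length(sales$City)), border="white")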

Exporting SAP BusinessObjects Dashboards into Different Environments

Packt
02 Jun 2011
7 min read
SAP BusinessObjects Dashboards 4.0 Cookbook

Introduction
First, the visual model is compiled to a SWF file format. Compiling to a SWF file format ensures that the dashboard plays smoothly on different screen sizes and across different platforms. It also ensures that the users aren't given huge 10+ megabyte files. After compilation of the visual model to a SWF file, developers can then publish it to a format of their choice. The available choices are Flash (SWF), AIR, SAP BusinessObjects Platform, HTML, PDF, PPT, Outlook, and Word. Once publishing is complete, the dashboard is ready to share!

Exporting to a standard SWF, PPT, PDF, and so on
After developing a visual model in Dashboard Design, we will need to somehow share it with users. We want to put it into a format that everyone can see on their machines. The simplest way is to export to a standard SWF file.
One of the great features Dashboard Design has is the ability to embed dashboards into different office file formats. For example, a presenter could have a PowerPoint deck and, in the middle of the presentation, have a working dashboard that presents an important set of data values to the audience. Another example could be an executive-level user who is viewing a Word document created by an analyst. The analyst could create a written document in Word and then embed a working dashboard with the most up-to-date data to present important data values to the executive-level user.
You can choose to embed a dashboard in the following file types:
PowerPoint
Word
PDF
Outlook
HTML

Getting ready
Make sure your visual model is complete and ready for sharing.

How to do it...
In the menu toolbar, go to File | Export | Flash (SWF).
Select the directory in which you want the SWF to go and the name of your SWF file.

How it works...
Xcelsius compiles the visual model into an SWF file that everyone is able to see. Once the SWF file has been compiled, the dashboard is ready for sharing. It is mandatory that anyone viewing the dashboard has Adobe Flash installed. If not, they can download and install it from http://www.adobe.com/products/flashplayer/.
If we export to PPT, we can then edit the PowerPoint file however we desire. If you have an existing PowerPoint presentation deck and want to append the dashboard to it, the easiest way is to first embed the dashboard SWF in a temporary PowerPoint file and then copy that slide to your existing PowerPoint file.

There's more...
Exporting to an SWF file makes distribution very easy, which makes the presentation of mockups great at a business level. Developers are able to work very closely with the business and iteratively come up with a visual model closest to the business goals. It is important, though, when distributing SWF files, that everyone viewing the dashboards has the same version, otherwise confusion may occur. Thus, as a best practice, versioning every SWF that is distributed is very important.
It is important to note that when the much anticipated Adobe Flash 10.1 was released, there were problems with embedding Dashboard Design dashboards in DOC, PPT, PDF, and so on. However, with the 10.1.82.76 Adobe Flash Player update, this has been fixed. Thus, it is important that if users have Adobe Flash Player 10.1+ installed, the version is higher than or equal to 10.1.82.76.
When exporting to PDF, please take the following into account: in Dashboard Design 2008, the default format for exporting to PDF is Acrobat 9.0 (PDF 1.8).
If Acrobat Reader 8.0 is installed, the default exported PDF cannot be opened. If you are using Acrobat Reader 8.0 or older, change the format to "Acrobat 6.0 (PDF 1.5)" before exporting to PDF.

Exporting to SAP BusinessObjects Enterprise
After Dashboard Design became a part of BusinessObjects, it was important to be able to export dashboards into the BusinessObjects Enterprise system. Once a dashboard is exported to BusinessObjects Enterprise, users can easily access their dashboards through InfoView (now BI launch pad). On top of that, administrators are able to control dashboard security.

Getting ready
Make sure your visual model is complete and ready for sharing.

How to do it...
From the menu toolbar, go to File | Export | Export to SAP BusinessObjects Platform.
Enter your BusinessObjects login credentials and then select the location in the SAP BusinessObjects Enterprise system where you want to store the SWF file.
Log into BI launch pad (formerly known as InfoView) and verify that you can access the dashboard.

How it works...
When we export a dashboard to SAP BusinessObjects Enterprise, we basically place it in the SAP BusinessObjects Enterprise content management system. From there, we can control accessibility to the dashboard and make sure that we have one source of truth, instead of sending out multiple dashboards through e-mail and possibly getting mixed up about which is the latest version. When we log into BI launch pad (formerly known as InfoView), it also passes the login token to the dashboard, so we don't have to enter our credentials again when connecting to SAP BusinessObjects Enterprise data. This is important because we don't have to manually create and pass any additional tokens once we have logged in.

There's more...
To give a true website-type feel, developers can house their dashboards in a website-type format using Dashboard Builder. This in turn provides a better experience for users, as they don't have to navigate through many folders in order to access the dashboard they are looking for.

Publishing to SAP BW
This recipe shows you how to publish Dashboard Design dashboards to a SAP BW system. Once a dashboard is saved to the SAP BW system, it can be published within a SAP Enterprise Portal iView and made available to users.

Getting ready
For this recipe, you will need a Dashboard Design dashboard model. This dashboard does not necessarily have to include a data connection to SAP BW.

How to do it...
Select Publish in the SAP menu. If you want to save the Xcelsius model with a different name, select the Publish As... option.
If you are not yet connected to the SAP BW system, a pop-up will appear. Select the appropriate system and fill in your username and password in the dialog box. If you want to disconnect from the SAP BW system and connect to a different system, select the Disconnect option from the SAP menu.
Enter the Description and Technical Name of the dashboard.
Select the location you want to save the dashboard to and click on Save. The dashboard is now published to the SAP BW system.
To launch the dashboard and view it from the SAP BW environment, select the Launch option from the SAP menu. You will be asked to log in to the SAP BW system before you are able to view the dashboard.

How it works...
As we have seen in this recipe, publishing a Dashboard Design dashboard to SAP BW is quite straightforward.
As the dashboard is part of the SAP BW environment after publishing, the model can be transported between SAP BW systems like all other SAP BW objects.

There's more...
After launching the dashboard in step 5, the Dashboard Design dashboard will load in your browser from the SAP BW server. You can add the displayed URL to an SAP Enterprise Portal iView to make the dashboard accessible to portal users.


How to use LabVIEW for data acquisition

Fatema Patrawala
17 Feb 2018
14 min read
This article is an excerpt taken from the book Data Acquisition Using LabVIEW, written by Behzad Ehsani. In this book you will learn to transform physical phenomena into computer-acceptable data using an object-oriented language.

Today we will discuss the basics of LabVIEW, focus on its installation, and walk through an example of a LabVIEW program, generally known as a Virtual Instrument (VI).

Introduction to LabVIEW
LabVIEW is a graphical development and testing environment unlike any other test and development tool available in the industry. LabVIEW sets itself apart from traditional programming environments by its completely graphical approach to programming. As an example, while the representation of a while loop in a text-based language such as C consists of several predefined, extremely compact, and sometimes extremely cryptic lines of text, a while loop in LabVIEW is actually a graphical loop. The environment is extremely intuitive and powerful, which makes for a short learning curve for the beginner. LabVIEW is based on what is called the G language, but there are still other languages, especially C, under the hood.
However, the ease of use and power of LabVIEW is somewhat deceiving to a novice user. Many people have attempted to start projects in LabVIEW only because, at first glance, the graphical nature of the interface and the concept of drag and drop used in LabVIEW appear to do away with the required basics of programming concepts and a classical education in programming science and engineering. This is far from the reality of using LabVIEW as the predominant development environment. While it is true that, in many higher-level development and testing environments (especially when using complicated test equipment, performing complex mathematical calculations, or even creating embedded software), LabVIEW's approach is much more time-efficient and less bug-prone than a traditional text-based programming environment that would otherwise require many lines of code, one must be aware of LabVIEW's strengths and possible weaknesses. LabVIEW does not completely replace the need for traditional text-based languages and, depending on the entire nature of a project, LabVIEW or a traditional text-based language such as C may be the most suitable programming or test environment.

Installing LabVIEW
Installation of LabVIEW is very simple; it is just as routine as any modern-day program installation: insert DVD 1 and follow the onscreen guided installation steps. LabVIEW comes on one DVD for the Mac and Linux versions but on four or more DVDs for the Windows edition (depending on additional software, different licensing, and additional libraries and packages purchased). In this book, we will use the LabVIEW 2013 Professional Development version for Windows. Given the target audience of this book, we assume the user is fully capable of installing the program. Installation is also well documented by National Instruments (NI), and the mandatory 1-year support purchase with each copy of LabVIEW is a valuable source of live and e-mail help. Also, the NI website (www.ni.com) has many user support groups that are a great source of support, example code, discussion groups, local group events and meetings of fellow LabVIEW developers, and so on.
It's worth noting for those who are new to the installation of LabVIEW that the installation DVDs include much more than what an average user would need and pay for.
We do strongly suggest that you install additional software beyond what has been purchased and licensed or is immediately needed. This additional software is fully functional in demo mode for 7 days, which may be extended to about a month with online registration. This is a very good opportunity to get hands-on experience with even more of the power and functionality that LabVIEW is capable of offering. The additional information gained by installing the other software available on the DVDs may help in the further development of a given project. Just imagine: if the current development of a robot only encompasses mechanical movements and sensors today, optical recognition is probably going to follow sooner than one may think. If data acquisition using expensive hardware and software is possible in one location, the need for web sharing and remote control of the setup is just around the corner. It is very helpful to at least be aware of what packages are currently available and to be able to install and test them prior to a full purchase and implementation. The following screenshot shows what may be installed if almost all the software on all the DVDs is selected.
When installing a fresh version of LabVIEW, if you do decide to follow this advice, make sure to click on the + sign next to each package you decide to install and prevent any installation of LabWindows/CVI... and Measurement Studio... for Visual Studio. LabWindows, according to NI, is an ANSI C integrated development environment. Also note that, by default, NI device drivers are not selected to be installed. Device drivers are an essential part of any data acquisition, and the appropriate drivers for communications and instrument control must be installed before LabVIEW can interact with external equipment. Also note that device drivers (on Windows installations) come on a separate DVD, which means that one does not have to install device drivers at the same time that the main application and other modules are installed; they can be installed at any time later on. Almost all well-established vendors package their products with LabVIEW drivers and example code. If a driver is not readily available, NI has programmers who will write one, but this comes at a cost to the user.
VI Package Manager, now installed as part of the standard installation, is also a must these days. NI distributes third-party software, drivers, and public domain packages via VI Package Manager. We are going to use examples involving Arduino (http://www.arduino.cc) microcontrollers in later chapters of this book, and the appropriate software and drivers for these microcontrollers are installed via VI Package Manager. You can install many public domain packages that add further useful LabVIEW toolkits to a LabVIEW installation, and these can be used just like those that are delivered professionally by NI.
Finally, note that the more modules, packages, and software are selected to be installed, the longer it will take to complete the installation. This may sound like an obvious point but, surprisingly enough, installation of all the software on the three DVDs (for Windows) took over 5 hours on the standard laptop we used! Obviously, a more powerful PC (such as one with a solid state hard drive) may not take such a long time.

Basic LabVIEW VI
Once the LabVIEW application is launched, by default two blank windows open simultaneously, a Front Panel and a Block Diagram window, and a VI is created. VIs are the heart and soul of LabVIEW.
They are what separate LabVIEW from all other text-based development environments. In LabVIEW, everything is an object which is represented graphically. A VI may consist of only a few objects or of hundreds of objects embedded in many subVIs. These graphical representations of a thing, be it a simple while loop, a complex mathematical concept such as polynomial interpolation, or simply a Boolean constant, are all represented graphically.
To use an object, right-click inside the Block Diagram or Front Panel window and a palette list appears. Follow the arrow, pick an object from the subsequent palette, and place it on the appropriate window. The selected object can now be dragged to different locations on the appropriate window and is ready to be wired. Depending on what kind of object is selected, a graphical representation of the object appears on both windows. Of course, there are many exceptions to this rule. For example, a while loop can only be selected in the Block Diagram and, by itself, a while loop does not have a graphical representation on the Front Panel window. Needless to say, LabVIEW also has keyboard combinations that expedite selecting and placing any given toolkit object onto the appropriate window.
Each object has one (or several) wire connections going into it as input(s) and coming out of it as output(s). A VI becomes functional when a minimum number of wires are appropriately connected to the inputs and outputs of one or more objects.

Example 1 – counter with a gauge
This is a fairly simple program with simple user interaction. Once the program has been launched, it uses a while loop to wait for user input. This is the typical behavior of almost any user-friendly program. For example, if the user launches Microsoft Office, the program launches and waits for the user to pick a menu item, click on a button, or perform any other action that the program may provide. Similarly, this program starts execution but waits in a loop for the user to choose a command. In this case, only a simple Start or Stop is available. If the Start button is clicked, the program uses a for loop to simply count from 0 to 10 in intervals of 200 milliseconds. After each count is completed, the gauge on the Front Panel, the GUI part of the program, is updated to show the current count. The counter is then set to the zero position of the gauge and the program awaits subsequent user input. If the Start button is clicked again, this action is repeated, and, obviously, if the Stop button is clicked, the program exits.
Although very simple, this example contains many of the concepts that are often used in much more elaborate programs. Let's walk through the code and point out some of these concepts. The following steps not only walk the reader through the example code but also serve as a brief tutorial on how to use LabVIEW, how to utilize each working window, and how to wire objects.
Launch LabVIEW and, from the File menu, choose New VI and follow these steps:
1. Right-click on the Block Diagram window. From Programming Functions, choose Structures and select While Loop. Click (and hold) and drag the cursor to create a (resizable) rectangle.
2. On the bottom-left corner, right-click on the wire to the loop's stop terminal and choose Create a control. Note that a Stop button appears on both the Block Diagram and Front Panel windows.
3. Inside the while loop box, right-click on the Block Diagram window and from Programming Functions, choose Structures and select Case Structure. Click (and hold) and drag the cursor to create a (resizable) rectangle.
4. On the Front Panel window, next to the Stop button you just created, right-click and from Modern Controls, choose Boolean and select an OK button.
5. Double-click on the text label of the OK button and replace the OK button text with Start. Note that an OK button is also created on the Block Diagram window, and the text label on that button also changes when you change the text label on the Front Panel window.
6. On the Front Panel window, drag and drop the newly created Start button next to the tiny green question mark on the left-hand side of the Case Structure box, outside of the case structure but inside the while loop.
7. Wire the Start button to the Case Structure.
8. Inside the Case Structure box, right-click on the Block Diagram window and from Programming Functions, choose Structures and select For Loop. Click (and hold) and drag the cursor to create a (resizable) rectangle.
9. Right-click on the N terminal on the top-left side of the For Loop and choose Create Constant. An integer blue box with a value of 0 will be connected to the For Loop; this is the number of iterations the for loop is going to have. Change 0 to 11.
10. Inside the For Loop box, right-click on the Block Diagram window and from Programming Functions, choose Timing and select Wait (ms).
11. Right-click on the Wait function created in step 10 and connect an integer value of 200, similar to step 9.
12. On the Front Panel window, right-click and from Modern controls, choose Gauge. Note that a Gauge function will appear on the Block Diagram window too.
13. If the Gauge is not inside the For Loop, drag and drop it inside the For Loop.
14. Inside the For Loop, on the Block Diagram window, connect the iteration count i to the Gauge.
15. On the Block Diagram, right-click on the Gauge and, under the Create submenu, choose Local variable.
16. If the local variable is not already inside the while loop, drag and drop it inside the while loop but outside of the case structure.
17. Right-click on the local variable created in step 15 and connect a zero to its input.
18. Click on the Clean Up icon on the main menu bar of the Block Diagram window and drag and move items on the Front Panel window so that both windows look similar to the screenshots shown in the book.

Creating a project is a must
When LabVIEW is launched, a default start screen appears.
The most common way of using LabVIEW, at least at the beginning of a small project or test program, is to create a new VI. A common rule of programming is that each function, or in this case each VI, should not be larger than a page. Keep in mind that, by nature, LabVIEW will have two windows to begin with and, being a purely graphical programming environment, each VI may require more screen space than the equivalent in a text-based development environment. To start development, and in order to set up all the devices and connections required for tasks such as data acquisition, a developer may get the job done by simply creating one, and more likely several, VIs. Speaking from experience among engineers and other developers (in other words, in situations where R&D looms more heavily on the project than collecting raw data), quick VIs are more efficient initially, but almost all projects that start in this fashion end up growing very quickly, and other people and other departments will need to be involved and/or be fed the gathered data.
In most cases, within a short time from the beginning of the project, technicians from the same department or from related teams may need to be trained to use the software under development. This is why it is best to develop the habit of creating a new project from the very beginning.
Note the center button on the left-hand window of the start screen. Creating a new project (as opposed to creating VIs and subVIs) has many advantages, and it is a must if the program created will have to run as an executable on computers that do not have LabVIEW installed on them. Later versions of LabVIEW have streamlined the creation of a project and have added many templates and starting points to it. Although, for the sake of simplicity, we created our first example with a simple VI, one could almost as easily create a project and choose from the many starting points, templates, and other concepts (such as architectures) available in LabVIEW. The most useful starting point for a complete and user-friendly data acquisition application is a state machine.
Throughout the book, we will create simple VIs as a quick and simple way to illustrate a point but, by the end of the book, we will collect all of the VIs, icons, drivers, and subVIs in one complete state machine, all gathered in one complete project. From that project, we will create a standalone application that will not need the LabVIEW environment to execute and that can run on any computer that has the LabVIEW runtime engine installed.
To summarize, we went through the basics of LabVIEW and the main functionality of each of its icons by way of an actual user-interactive example. LabVIEW is capable of developing embedded systems, fuzzy logic, and almost everything in between! If you are interested to know more about LabVIEW, check out the book Data Acquisition Using LabVIEW.


How to store and access social media data in MongoDB

Amey Varangaonkar
26 Dec 2017
6 min read
The following excerpt is taken from the book Python Social Media Analytics, co-authored by Siddhartha Chatterjee and Michal Krystyanczuk. Our article explains how to perform different operations using MongoDB and Python to effectively access and modify the data.

According to the official MongoDB page, MongoDB is a free and open-source, distributed database which allows ad hoc queries, indexing, and real-time aggregation to access and analyze your data. It is published under the GNU Affero General Public License and stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time.
Along with ease of use, MongoDB is recognized for the following advantages:
Schema-less design: Unlike traditional relational databases, which require the data to fit their schema, MongoDB provides a flexible, schema-less data model. The data model is based on documents and collections. A document is essentially a JSON structure and a collection is a group of documents. One links data within collections using specific identifiers. The document model is quite useful in this subject, as most social media APIs provide their data in JSON format.
High performance: The indexing and GridFS features of MongoDB provide fast access and storage.
High availability: The duplication (replication) feature, which allows us to keep copies of databases on different nodes, provides high availability in the case of node failures.
Automatic scaling: The sharding feature of MongoDB scales large data sets automatically. You can access information on the implementation of sharding in the official documentation of MongoDB: https://docs.mongodb.com/v3.0/sharding/

Installing MongoDB
MongoDB can be downloaded and installed from the following link: http://www.mongodb.org/downloads.

Setting up the environment
MongoDB requires a data directory to store all the data. The directory can be created in your working directory:

md data\db

Starting MongoDB
We need to go to the folder where mongod.exe is stored and run the following command from the command prompt (cmd):

bin\mongod.exe

Once the MongoDB server is running in the background, we can switch to our Python environment to connect and start working.

MongoDB using Python
MongoDB can be used directly from the shell or through programming languages. For the sake of this book, we'll explain how it works using Python. MongoDB is accessed from Python through a driver module named PyMongo. We will not go into the detailed usage of MongoDB, which is beyond the scope of this book; we will see the most common functionality required for analysis projects. We highly recommend reading the official MongoDB documentation.
PyMongo can be installed using the following command:

pip install pymongo

Then the following commands import it in the Python script and connect to the local server:

from pymongo import MongoClient
client = MongoClient('localhost:27017')

The database structure of MongoDB is similar to that of SQL languages, where you have databases, and inside databases you have tables. In MongoDB you have databases, and inside them you have collections. Collections are where you store the data, and databases store multiple collections. As MongoDB is a NoSQL database, your collections do not need to have a predefined structure; you can add documents of any composition as long as they are JSON objects. But by convention it is best practice to have a common general structure for documents in the same collection.
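As a quick aside that is not part of the book excerpt, the snippet below lists the databases and collections visible on a local server and adds a single-field index on the author field used in the examples that follow; list_database_names() and list_collection_names() assume PyMongo 3.6 or later:

from pymongo import MongoClient

client = MongoClient('localhost:27017')

# Show every database and the collections inside it (names vary per installation)
for db_name in client.list_database_names():
    print(db_name, client[db_name].list_collection_names())

# A single-field index on 'author' speeds up the author-based queries shown below
db_scrapper = client.scrapper
db_scrapper.articles.create_index('author')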
To access a database named scrapper we simply have to do the following:

db_scrapper = client.scrapper

To access a collection named articles in the database scrapper we do this:

db_scrapper = client.scrapper
collection_articles = db_scrapper.articles

Once you have the client object initiated you can access all the databases and collections very easily. Now, we will see how to perform different operations:

Insert: To insert documents into a collection we build a list of new documents to insert into the database:

docs = []
for _ in range(0, 10):
    # each document must be of the python type dict
    docs.append({
        "author": "...",
        "content": "...",
        "comment": ["...", ...]
    })

Inserting all the docs at once:

db.collection.insert_many(docs)

Or you can insert them one by one:

for doc in docs:
    db.collection.insert_one(doc)

You can find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/insert-documents/

Find: To fetch all documents within a collection:

# as the find function returns a cursor, we iterate over
# the cursor to actually fetch the data from the database
docs = [d for d in db.collection.find()]

To fetch all documents in batches of 100 documents:

batch_size = 100
iteration = 0
# getting the total number of documents in the collection
count = db.collection.count()
while iteration * batch_size < count:
    docs = [d for d in db.collection.find().skip(batch_size * iteration).limit(batch_size)]
    iteration += 1

To fetch documents using search queries, where the author is Jean Francois:

query = {'author': 'Jean Francois'}
docs = [d for d in db.collection.find(query)]

Where the author field exists and is not null:

query = {'author': {'$exists': True, '$ne': None}}
docs = [d for d in db.collection.find(query)]

There are many other filtering methods that provide a wide variety of flexibility and precision; we highly recommend taking your time going through the different search operators. You can find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.find/

Update: To update documents where the author is Jean Francois and set the attribute published to True:

query_search = {'author': 'Jean Francois'}
query_update = {'$set': {'published': True}}
db.collection.update_many(query_search, query_update)

Or you can update just the first matching document:

db.collection.update_one(query_search, query_update)

Find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.update/

Remove: Remove all documents where the author is Jean Francois:

query_search = {'author': 'Jean Francois'}
db.collection.delete_many(query_search)

Or remove the first matching document:

db.collection.delete_one(query_search)

Find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/remove-documents/

Drop: You can drop a collection with the following:

db.collection.drop()

Or you can drop the whole database from the client:

client.drop_database('scrapper')

We saw how to store and access data from MongoDB. MongoDB has gained a lot of popularity and is the preferred database choice for many, especially when it comes to working with social media data. If you found our post to be useful, do make sure to check out Python Social Media Analytics, which contains useful tips and tricks on leveraging the power of Python for effective data analysis from various social media sites such as YouTube, GitHub, and Twitter.

3 great ways to leverage Structures for Machine Learning problems by Lise Getoor at NIPS 2017

Sugandha Lahoti
08 Dec 2017
11 min read
Lise Getoor is a professor in the Computer Science Department at the University of California, Santa Cruz. She has a PhD in Computer Science from Stanford University and has spent a lot of time studying machine learning, reasoning under uncertainty, databases, data science for social good, and artificial intelligence.
This article attempts to bring our readers to Lise's keynote speech at NIPS 2017. It highlights how structures can be unreasonably effective and the ways to leverage structures in machine learning problems. After reading this article, head over to the NIPS Facebook page for the complete keynote. All images in this article come from Lise's presentation slides and do not belong to us.
Our ability to collect, manipulate, analyze, and act on vast amounts of data is having a profound impact on all aspects of society. Much of this data is heterogeneous in nature and interlinked in a myriad of complex ways. This data is multimodal (it has different kinds of entities), multi-relational (it has different links between things), and spatio-temporal (it involves space and time parameters).
This keynote explores how we can exploit the structure that is in the input as well as the output of machine learning algorithms. A large number of structured problems exist in the fields of NLP and computer vision, computational biology, computational social sciences, knowledge graph extraction, and so on. According to Dan Roth, all interesting decisions are structured, i.e. there are dependencies between the predictions.
Most ML algorithms take this nicely structured data, flatten it, and put it in a matrix form, which is convenient for our algorithms. However, there are several issues with this. The most fundamental issue with the matrix form is that it assumes incorrect independence. Furthermore, in the context of structured outputs, we are unable to apply collective reasoning about the predictions we made for different entries in the matrix. Therefore we need ways in which we can declaratively talk about how to transform the structure into features. This talk provides us with patterns, tools, and templates for dealing with structures in both inputs and outputs.
Lise covered three topics for solving structured problems: patterns, tools, and templates. Patterns are used for simple structured problems. Tools help in getting patterns to work and in creating tractable structured problems. Templates build on patterns and tools to solve bigger computational problems.

1. Patterns
Patterns are used for naively simple structured problems, but encoding them repeatedly can increase performance by 5 or 10%. We use logical rules to capture structure in patterns. These logical rules give an easy way of talking about entities and the links between entities, and they also tend to be interpretable. There are three basic patterns for structured prediction problems: collective classification, link prediction, and entity resolution.

Collective classification
Collective classification is used for inferring the labels of nodes in a graph. The pattern for expressing this in logical rules is:

local-predictor(x, l) → label(x, l)
label(x, l) & link(x, y) → label(y, l)

It is called collective classification because the thing to predict, i.e. the label, occurs on both sides of the rule.
Let us consider a toy problem: we have to predict the unknown labels, marked in grey on the slides, that is, which political party each unknown person will vote for. We apply logical rules to the problem.

Local rules:
"If X donates to party P, X votes for P"
"If X tweets party P slogans, X votes for P"

Relational rules:
"If X is linked to Y, and X votes for P, Y votes for P"
Votes(X, P) & Friends(X, Y) → Votes(Y, P)
Votes(X, P) & Spouse(X, Y) → Votes(Y, P)

The above example shows the local and relational rules applied to a problem based on collective classification. Adding a collective classifier like this to other problems yields significant improvement.

Link prediction
Link prediction is used for predicting links or edges in a graph. The pattern for expressing this in logical rules is:

link(x, y) & similar(y, z) → link(x, z)

For example, consider a basic recommendation system. We apply the logical rules of link prediction to express likes and similarities, so inferring one link gives you information about another link. The rules express:

"If user U likes item1, and item2 is similar to item1, user U likes item2"
Likes(U, I1) & SimilarItem(I1, I2) → Likes(U, I2)
"If user1 likes item I, and user2 is similar to user1, user2 likes item I"
Likes(U1, I) & SimilarUser(U1, U2) → Likes(U2, I)

Entity resolution
Entity resolution is used for determining which nodes refer to the same underlying entity. Here we use local rules describing how similar things are, for instance how similar their names or their links are:

similar-name(x, y) → same(x, y)
similar-links(x, y) → same(x, y)

There are two collective rules. One is based on transitivity:

similar-name(x, y) → same(x, y)
similar-links(x, y) → same(x, y)
same(x, y) & same(y, z) → same(x, z)

The other is based on matching, i.e. dependence on both sides of the rule:

similar-name(x, y) → same(x, y)
similar-links(x, y) → same(x, y)
same(x, y) & !same(y, z) → !same(x, z)

The logical rules described above, though quite helpful, have certain disadvantages: they are intractable, can't handle inconsistencies, and can't represent degrees of similarity.
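Before moving on to the tools, here is a minimal Python sketch of the collective classification pattern on the voter toy problem above. It is not from the keynote and is far simpler than PSL; it just mixes each person's local evidence with the average score of their friends until the scores settle, and every name and number in it is invented for illustration:

# Toy collective classification: local evidence plus neighbour smoothing.
# Scores live in [0, 1]; a score above 0.5 is read as "votes for party P".
friends = {
    "ann": ["bob"],
    "bob": ["ann", "carl"],
    "carl": ["bob"],
}
local_score = {"ann": 0.9, "bob": 0.5, "carl": 0.5}  # only Ann has local evidence

scores = dict(local_score)
for _ in range(20):  # simple fixed-point iteration
    scores = {
        person: 0.5 * local_score[person]
        + 0.5 * sum(scores[f] for f in friends[person]) / len(friends[person])
        for person in friends
    }

for person, score in sorted(scores.items()):
    print(person, round(score, 3), "votes P" if score > 0.5 else "unknown")

Ann's strong local evidence pulls Bob, and through Bob also Carl, above the 0.5 threshold, which is exactly the collective effect the relational rules are meant to capture.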
2. Tools
Tools help in making these structured problems tractable and in getting patterns to work. The tools come from the statistical relational learning community, and Lise adds another one to this mix of languages: PSL. PSL is probabilistic logical programming, a declarative language for expressing collective inference problems. To know more, visit psl.linqs.org.
Predicate = relationship or property
Ground atom = (continuous) random variable
Weighted rules = capture dependency or constraint
PSL program = rules + input DB
PSL makes reasoning scalable by mapping logical inference to convex optimization. The language takes logical rules, assigns weights to them, and then uses them to define a distribution over the unknown variables. One of the striking features here is that the random variables have continuous values. The work done on the PSL language turns the disadvantages of logical rules into advantages: the models become tractable, can handle inconsistencies, and can represent similarity.
The key idea is to convert the clauses into concave functions. To be tractable, we relax the problem to a concave maximization. PSL has semantics from three different worlds: randomized algorithms from the computer science community, probabilistic graphical models from the machine learning community, and soft logic from the AI community.

Randomized algorithms
In this setting, we have weighted rules: non-negative weights and a set of weighted logical rules in clausal form. Weighted MAX-SAT is a classical problem where we attempt to find the assignment to the random variables that maximizes the weights of the satisfied rules. However, this problem is NP-hard. To overcome this, the randomized algorithms community converts the combinatorial optimization into a continuous optimization by introducing random variables that denote rounding probabilities.

Probabilistic graphical models
Graphical models represent problems as a factor graph, where we have random variables and rules that are essentially the potential functions. However, this problem is also NP-hard. We use the variational inference approximation technique to solve it. Here we introduce marginal distributions (μ) for the variables. We can then express a solution if we can find a set of globally consistent assignments for these marginal distributions. The problem is that, although we can express it as a linear program, there is an exponential number of constraints. We use techniques from the graphical models community, particularly local consistency relaxation, to convert this to a simpler problem. The idea is to relax the search over consistent marginals to a simpler set: we introduce local pseudo-marginals over joint potential states. Using the KKT conditions we can optimize out θ to derive a simplified projected LCR over μ. This approach shows a 16% improvement over canonical dual decomposition (MPLP).

Soft logic
In the soft logic technique for convex optimization, we have random variables that denote a degree of truth or similarity, and we are essentially trying to minimize the amount of dissatisfaction in the rules. Hence, with the three different interpretations, i.e. randomized algorithms, graphical models, and soft logic, we get the same convex optimization (a rough numerical illustration of this relaxation appears at the end of this section).
A PSL system essentially takes a PSL program and some input data and defines a convex optimization. PSL is open source; the code, data, and tutorials are available online at psl.linqs.org.
MAP inference in PSL translates into a convex optimization problem.
Inference is further enhanced with state-of-the-art optimization and distributed graph processing paradigms.
There are learning methods for rule weights and latent variables.
Using PSL gives fast as well as accurate results in comparison with other approaches.
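To make the soft logic relaxation above slightly more concrete, here is a rough, purely illustrative Python calculation for a single relaxed rule. This is not PSL code and the numbers are invented; it only shows the arithmetic of a Lukasiewicz-style relaxation of one grounded rule:

# Truth values in [0, 1] for: Votes(Ann, P), Friends(Ann, Bob), Votes(Bob, P)
votes_ann, friends_ann_bob, votes_bob = 0.9, 1.0, 0.2

# Lukasiewicz-style relaxation of the conjunction in the rule body
body = max(0.0, votes_ann + friends_ann_bob - 1.0)   # 0.9

# Distance to satisfaction of the rule  body -> Votes(Bob, P)
penalty = max(0.0, body - votes_bob)                  # 0.7

# A weighted rule contributes weight * penalty (or weight * penalty**2) to the
# convex objective that MAP inference minimizes over the unknown truth values.
weight = 5.0
print(weight * penalty)                               # 3.5

MAP inference would then lower this penalty by nudging the unknown value of Votes(Bob, P) upwards, trading that off against all the other grounded rules that mention Bob.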
3. Templates
Templates build on patterns and tools to solve problems in bigger areas such as computational social sciences, knowledge discovery, and responsible data science and machine learning.

Computational social sciences
To explore this area, we apply a PSL model to debate stance classification. Let us consider a scenario of an online debate where the topic is climate change. We can use information in the text to figure out whether the people participating in the debate are pro or anti the topic. We can also use information about the dialogue in the discourse, and we can build both into a PSL model based on the collective classification problem we saw earlier in the post. We get a significant rise in accuracy by using a PSL program; the keynote slides show the detailed results.

Knowledge discovery
Using structure and making use of patterns in knowledge discovery really pays off. Although we have information extractors that can pull facts about entities and relationships from the web and other sources, they are usually noisy, so it is difficult to reason about them collectively and decide which facts we actually want to add to our knowledge base. We can add structure to knowledge graph construction by:
Performing collective classification, link prediction, and entity resolution
Enforcing ontological constraints
Integrating knowledge source confidences
Using PSL to make it scalable
The keynote slides show the PSL program for knowledge graph identification. These ideas were evaluated on three real-world knowledge graphs: NELL, MusicBrainz, and Freebase. Both statistical features and semantic constraints help, but combining them always wins.

Responsible machine learning
Understanding structure can be key to mitigating negative effects and can lead to responsible machine learning. The perils of ignoring structure in the machine learning space include overlooking privacy: many approaches consider only an individual's attribute data, and some do not take into account what can be inferred from the relational context. The other area is fairness. The structure here is often outside the data; it can be in the organizational or socio-economic structure. To enable fairness, we need to implement impartial decision making without bias and take structural patterns into account. Algorithmic discrimination is another area that can make use of structure. The fundamental structural pattern here is a feedback loop, and having a way of encoding this feedback loop is important for eliminating algorithmic discrimination.

Conclusion
In this article, we saw ways of exploiting structure while keeping problems tractable, along with tools and templates for doing so. The keynote also pointed to opportunities for machine learning methods that can mix:
Structured and unstructured approaches
Probabilistic and logical inference
Data-driven and knowledge-driven modeling
AI and machine learning developers need to build on the approaches described above to discover and exploit new structure and to create compelling commercial, scientific, and societal applications.


Getting Started with Apache Kafka Clusters

Amarabha Banerjee
23 Feb 2018
10 min read
The following article is a book excerpt from Apache Kafka 1.0 Cookbook, written by Raúl Estrada. The book contains easy-to-follow recipes to help you set up, configure, and use Apache Kafka in the best possible manner.

Here in this article, we are going to talk about how you can get started with Apache Kafka clusters and implement them seamlessly.
In Apache Kafka there are three types of clusters:
Single-node single-broker
Single-node multiple-broker
Multiple-node multiple-broker
The following four recipes show how to run Apache Kafka in these clusters.

Configuring a single-node single-broker cluster – SNSB
The first cluster configuration is single-node single-broker (SNSB). This cluster is very useful when a single point of entry is needed; its architecture resembles the singleton design pattern. An SNSB cluster usually satisfies three requirements:
Controls concurrent access to a unique shared broker
Access to the broker is requested from multiple, disparate producers
There can be only one broker
If the proposed design has only one or two of these requirements, a redesign is almost always the correct option. Sometimes the single broker can become a bottleneck or a single point of failure, but it is useful when a single point of communication is needed.

Getting ready
Go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users):

> cd /usr/local/kafka

How to do it...
Starting ZooKeeper
Kafka provides a simple ZooKeeper configuration file to launch a single ZooKeeper instance. To start the ZooKeeper instance, use this command:

> bin/zookeeper-server-start.sh config/zookeeper.properties

The main properties specified in the zookeeper.properties file are:
clientPort: This is the listening port for client requests. By default, ZooKeeper listens on TCP port 2181:
clientPort=2181
dataDir: This is the directory where ZooKeeper data is stored:
dataDir=/tmp/zookeeper
maxClientCnxns: This is the maximum number of concurrent connections that a single client may make (0 means unbounded):
maxClientCnxns=0
For more information about Apache ZooKeeper visit the project home page at http://zookeeper.apache.org/.

Starting the broker
After ZooKeeper is started, start the Kafka broker with this command:

> bin/kafka-server-start.sh config/server.properties

The main properties specified in the server.properties file are:
broker.id: The unique positive integer identifier for each broker:
broker.id=0
log.dir: The directory in which to store log files:
log.dir=/tmp/kafka10-logs
num.partitions: The number of log partitions per topic:
num.partitions=2
port: The port that the socket server listens on:
port=9092
zookeeper.connect: The ZooKeeper URL connection:
zookeeper.connect=localhost:2181

How it works...
Kafka uses ZooKeeper for storing metadata information about the brokers, topics, and partitions. Writes to ZooKeeper are performed only on changes to consumer group membership or to the Kafka cluster itself. This amount of traffic is minimal, and there is no need for a dedicated ZooKeeper ensemble for a single Kafka cluster. Actually, many deployments use a single ZooKeeper ensemble to control multiple Kafka clusters (using a chroot ZooKeeper path for each cluster).

SNSB – creating a topic, producer, and consumer
The SNSB Kafka cluster is running; now let's create a topic, a producer, and a consumer.
Getting ready
We need the previous recipe executed: Kafka already installed, ZooKeeper up and running, and a Kafka server up and running. Now, go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users):

> cd /usr/local/kafka

How to do it...
The following steps will show you how to create an SNSB topic, producer, and consumer.

Creating a topic
As we know, Kafka has a command to create topics. Here we create a topic called SNSBTopic with one partition and one replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic SNSBTopic

We obtain the following output:

Created topic "SNSBTopic".

The command parameters are:
--replication-factor 1: This indicates just one replica
--partitions 1: This indicates just one partition
--zookeeper localhost:2181: This indicates the ZooKeeper URL
As we know, to get the list of topics on a Kafka server we use the following command:

> bin/kafka-topics.sh --list --zookeeper localhost:2181

We obtain the following output:

SNSBTopic

Starting the producer
Kafka has a command to start producers that accepts input from the command line and publishes each input line as a message. By default, each new line is considered a message:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic SNSBTopic

This command requires two parameters:
broker-list: The broker URL to connect to
topic: The topic name (to send a message to the topic subscribers)
Now, type the following in the command line:

The best thing about a boolean is [Enter]
even if you are wrong [Enter]
you are only off by a bit. [Enter]

This output is obtained (as expected):

The best thing about a boolean is
even if you are wrong
you are only off by a bit.

The producer.properties file has the producer configuration. Some important properties defined in the producer.properties file are:
metadata.broker.list: The list of brokers used for bootstrapping information on the rest of the cluster, in the format host1:port1, host2:port2:
metadata.broker.list=localhost:9092
compression.codec: The compression codec used, for example, none, gzip, or snappy:
compression.codec=none

Starting the consumer
Kafka has a command to start a message consumer client. It shows the output in the command line as soon as it has subscribed to the topic:

> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic SNSBTopic --from-beginning

Note that the parameter from-beginning is used to show the entire log:

The best thing about a boolean is
even if you are wrong
you are only off by a bit.

One important property defined in the consumer.properties file is:
group.id: This string identifies the consumers in the same group:
group.id=test-consumer-group

There's more...
It is time to play with this technology. Open a new command-line window each for ZooKeeper, a broker, two producers, and two consumers. Type some messages in the producers and watch them get displayed in the consumers. If you don't know or don't remember how to run the commands, run them with no arguments to display the possible values for the parameters.
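As an aside that is not part of the original recipe, the same produce-and-consume exchange can be scripted from Python with the third-party kafka-python package; this is only a sketch, assuming pip install kafka-python has been run and the SNSB broker from this recipe is listening on localhost:9092:

from kafka import KafkaProducer, KafkaConsumer

# Publish one message to the topic created above
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("SNSBTopic", b"The best thing about a boolean is that you are only off by a bit.")
producer.flush()

# Read the topic from the beginning, like --from-beginning on the console consumer
consumer = KafkaConsumer(
    "SNSBTopic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 seconds without new messages
)
for message in consumer:
    print(message.value.decode("utf-8"))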
Configuring a single-node multiple-broker cluster – SNMB
The second cluster configuration is single-node multiple-broker (SNMB). This cluster is used when there is just one node but inner redundancy is needed. When a topic is created in Kafka, the system determines how each replica of a partition is mapped to each broker. In general, Kafka tries to spread the replicas across all available brokers.
The messages are first sent to the first replica of a partition (to the current broker leader of that partition) before they are replicated to the remaining brokers. The producers may choose from different strategies for sending messages (synchronous or asynchronous mode). Producers discover the available brokers in a cluster and the partitions on each broker (all this by registering watchers in ZooKeeper).
In practice, some of the high-volume topics are configured with more than one partition per broker. Remember that having more partitions increases the I/O parallelism for writes, and this increases the degree of parallelism for consumers (the partition is the unit for distributing data to consumers). On the other hand, increasing the number of partitions increases the overhead because:
There are more files, so more open file handles
There are more offsets to be checked by consumers, so the ZooKeeper load is increased
The art of this is to balance these tradeoffs.

Getting ready
Go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users):

> cd /usr/local/kafka

How to do it...
Begin by starting the ZooKeeper server as follows:

> bin/zookeeper-server-start.sh config/zookeeper.properties

A different server.properties file is needed for each broker. Let's call them server-1.properties, server-2.properties, server-3.properties, and so on (original, isn't it?). Each file is a copy of the original server.properties file.
In the server-1.properties file set the following properties:

broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1

Similarly, in the server-2.properties file set the following properties:

broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2

Finally, in the server-3.properties file set the following properties:

broker.id=3
port=9095
log.dir=/tmp/kafka-logs-3

With ZooKeeper running, start the Kafka brokers with these commands:

> bin/kafka-server-start.sh config/server-1.properties
> bin/kafka-server-start.sh config/server-2.properties
> bin/kafka-server-start.sh config/server-3.properties

How it works...
Now the SNMB cluster is running. The brokers are running on the same Kafka node, on ports 9093, 9094, and 9095.

SNMB – creating a topic, producer, and consumer
The SNMB Kafka cluster is running; now let's create a topic, a producer, and a consumer.

Getting ready
We need the previous recipe executed: Kafka already installed, ZooKeeper up and running, and the Kafka brokers up and running. Now, go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users):

> cd /usr/local/kafka

How to do it...
The following steps will show you how to create an SNMB topic, producer, and consumer.

Creating a topic
Using the command to create topics, let's create a topic called SNMBTopic with three partitions and two replicas:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic SNMBTopic

The following output is displayed:

Created topic "SNMBTopic"

This command has the following effects:
Kafka will create three logical partitions for the topic.
Kafka will create two replicas (copies) per partition. This means that, for each partition, it will pick two brokers that will host those replicas.
For each partition, Kafka will randomly choose a broker leader.
Now ask Kafka for the list of available topics.
The list now includes the new SNMBTopic:

> bin/kafka-topics.sh --zookeeper localhost:2181 --list
SNMBTopic

Starting a producer
Now, start the producer; indicating more brokers in the broker-list is easy:

> bin/kafka-console-producer.sh --broker-list localhost:9093,localhost:9094,localhost:9095 --topic SNMBTopic

If it's necessary to run multiple producers connecting to different brokers, specify a different broker list for each producer.

Starting a consumer
To start a consumer, use the following command:

> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic SNMBTopic

How it works...
The first important fact is the two parameters: replication-factor and partitions. The replication-factor is the number of replicas each partition will have in the topic created. The partitions parameter is the number of partitions for the topic created.

There's more...
If you don't know the cluster configuration or don't remember it, there is a useful option for the kafka-topics command, the describe parameter:

> bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic SNMBTopic

The output is something similar to:

Topic:SNMBTopic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: SNMBTopic Partition: 0 Leader: 2 Replicas: 2,3 Isr: 3,2
Topic: SNMBTopic Partition: 1 Leader: 3 Replicas: 3,1 Isr: 1,3
Topic: SNMBTopic Partition: 2 Leader: 1 Replicas: 1,2 Isr: 1,2

An explanation of the output: the first line gives a summary of all the partitions; each subsequent line gives information about one partition. Since we have three partitions for this topic, there are three such lines:
Leader: This node is responsible for all reads and writes for a particular partition. Each node is the leader for a randomly selected portion of the partitions.
Replicas: This is the list of nodes that duplicate the log for a particular partition, irrespective of whether they are currently alive.
Isr: This is the set of in-sync replicas. It is the subset of the replicas that are currently alive and following the leader.
In order to see the options for creating, deleting, describing, or changing a topic, type this command without parameters:

> bin/kafka-topics.sh

We discussed how to implement Apache Kafka clusters effectively. If you liked this post, be sure to check out Apache Kafka 1.0 Cookbook, which consists of useful recipes to work with Apache Kafka installation.