Introducing Interactive Plotting

Packt
20 Mar 2015
29 min read
This article is written by Benjamin V. Root, the author of Interactive Applications using Matplotlib. The goal of any interactive application is to provide as much information as possible while minimizing complexity. If it can't provide the information the users need, then it is useless to them. However, if the application is too complex, then the information's signal gets lost in the noise of the complexity. A graphical presentation often strikes the right balance. The Matplotlib library can help you present your data as graphs in your application. Anybody can make a simple interactive application without knowing anything about draw buffers, event loops, or even what a GUI toolkit is. And yet, the Matplotlib library will cede as much control as desired to allow even the most savvy GUI developer to create a masterful application from scratch. Like much of the Python language, Matplotlib's philosophy is to give the developer full control, but without being stupidly unhelpful and tedious.

Installing Matplotlib

There are many ways to install Matplotlib on your system. While the library used to have a reputation for being difficult to install on non-Linux systems, it has come a long way since then, along with the rest of the Python ecosystem. Refer to the following command:

$ pip install matplotlib

Most likely, the preceding command would work just fine from the command line. Python Wheels (the next-generation Python package format that has replaced "eggs") for Matplotlib are now available from PyPI for Windows and Mac OS X systems. This method would also work for Linux users; however, it might be more favorable to install it via the system's built-in package manager.

While the core Matplotlib library can be installed with few dependencies, it is part of a much larger scientific computing ecosystem known as SciPy. Displaying your data is often the easiest part of your application. Processing it is much more difficult, and the SciPy ecosystem most likely has the packages you need to do that. For basic numerical processing and N-dimensional data arrays, there is NumPy. For more advanced but general data processing tools, there is the SciPy package (the name was so catchy, it ended up being used to refer to many different things in the community). For more domain-specific needs, there are "Sci-Kits" such as scikit-learn for machine learning, scikit-image for image processing, and statsmodels for statistical modeling. Another very useful library for data processing is pandas.

This was just a short summary of the packages available in the SciPy ecosystem. Manually managing all of their installations, updates, and dependencies would be difficult for many who simply want to use the tools. Luckily, there are several distributions of the SciPy Stack available that can keep the menagerie under control. The following are Python distributions that include the SciPy Stack along with many other popular Python packages, or that make the packages easily available through package management software:

Anaconda from Continuum Analytics
Canopy from Enthought
SciPy Superpack
Python(x, y) (Windows only)
WinPython (Windows only)
Pyzo (Python 3 only)
Algorete Loopy from Dartmouth College

Show() your work

With Matplotlib installed, you are now ready to make your first simple plot. Matplotlib has multiple layers. Pylab is the topmost layer, often used for quick one-off plotting from within a live Python session.
Start up your favorite Python interpreter and type the following:

>>> from pylab import *
>>> plot([1, 2, 3, 2, 1])

Nothing happened! This is because Matplotlib, by default, will not display anything until you explicitly tell it to do so. The Matplotlib library is often used for automated image generation from within Python scripts, with no need for any interactivity. Also, most users would not be done with their plotting yet and would find it distracting to have a plot come up automatically. When you are ready to see your plot, use the following command:

>>> show()

Interactive navigation

A figure window should now appear, and the Python interpreter is not available for any additional commands. By default, showing a figure will block the execution of your scripts and interpreter. However, this does not mean that the figure is not interactive. As you mouse over the plot, you will see the plot coordinates in the lower right-hand corner. The figure window will also have a toolbar. From left to right, the following are the tools:

Home, Back, and Forward: These are similar to those of a web browser. These buttons help you navigate through the previous views of your plot. The "Home" button will take you back to the first view when the figure was opened. "Back" will take you to the previous view, while "Forward" will move you ahead again through the views you have already visited.

Pan (and zoom): This button has two modes: pan and zoom. Press the left mouse button and hold it to pan the figure. If you press x or y while panning, the motion will be constrained to just the x or y axis, respectively. Press the right mouse button to zoom. The plot will be zoomed in or out proportionate to the right/left and up/down movements. Use the X, Y, or Ctrl key to constrain the zoom to the x axis or the y axis or preserve the aspect ratio, respectively.

Zoom-to-rectangle: Press the left mouse button and drag the cursor to a new location and release. The axes view limits will be zoomed to the rectangle you just drew. Zoom out using your right mouse button, placing the current view into the region defined by the rectangle you just drew.

Subplot configuration: This button brings up a tool to modify plot spacing.

Save: This button brings up a dialog that allows you to save the current figure.

The figure window is also responsive to the keyboard. The default keymap is fairly extensive (and will be covered fully later), but some of the basic hot keys are the Home key for resetting the plot view, the left and right keys for back and forward actions, p for pan/zoom mode, o for zoom-to-rectangle mode, and Ctrl + s to trigger a file save. When you are done viewing your figure, close the window as you would close any other application window, or use Ctrl + w.

Interactive plotting

When we did the previous example, no plots appeared until show() was called. Furthermore, no new commands could be entered into the Python interpreter until all the figures were closed. As you will soon learn, once a figure is closed, the plot it contains is lost, which means that you would have to repeat all the commands again in order to show() it again, perhaps with some modification or additional plot. Matplotlib ships with its interactive plotting mode off by default. There are a couple of ways to turn the interactive plotting mode on. The main way is by calling the ion() function (for Interactive ON). Interactive plotting mode can be turned on at any time and turned off with ioff(). Once this mode is turned on, the next plotting command will automatically trigger an implicit show() command. Furthermore, you can continue typing commands into the Python interpreter. You can modify the current figure, create new figures, and close existing ones at any time, all from the current Python session.
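For example, continuing the pylab session from earlier, a minimal sketch of an interactive session might look like the following (the exact window behavior depends on your backend):

>>> ion()                          # turn interactive plotting mode on
>>> plot([1, 2, 3, 2, 1])          # a figure window appears immediately
>>> title("An interactive plot")   # the open figure updates in place
>>> ioff()                         # turn interactive plotting mode back off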
Scripted plotting

Python is known for more than just its interactive interpreters; it is also a fully fledged programming language that allows its users to easily create programs. Having a script to display plots from daily reports can greatly improve your productivity. Alternatively, you perhaps need a tool that can produce some simple plots of the data from whatever mystery data file you have come across on the network share. Here is a simple example of how to use Matplotlib's pyplot API and the argparse Python standard library tool to create a simple CSV plotting script called plotfile.py.

Code: chp1/plotfile.py

#!/usr/bin/env python
from argparse import ArgumentParser
import matplotlib.pyplot as plt

if __name__ == '__main__':
    parser = ArgumentParser(description="Plot a CSV file")
    parser.add_argument("datafile", help="The CSV File")
    # Require at least one column name
    parser.add_argument("columns", nargs='+',
                        help="Names of columns to plot")
    parser.add_argument("--save", help="Save the plot as...")
    parser.add_argument("--no-show", action="store_true",
                        help="Don't show the plot")
    args = parser.parse_args()

    plt.plotfile(args.datafile, args.columns)
    if args.save:
        plt.savefig(args.save)
    if not args.no_show:
        plt.show()

Note the two optional command-line arguments: --save and --no-show. With the --save option, the user can have the plot automatically saved (the graphics format is determined automatically from the filename extension). Also, the user can choose not to display the plot, which when coupled with the --save option might be desirable if the user is trying to plot several CSV files. When calling this script to show a plot, the execution of the script will stop at the call to plt.show(). If the interactive plotting mode was on, then the execution of the script would continue past show(), terminating the script, thus automatically closing out any figures before the user has had a chance to view them. This is why the interactive plotting mode is turned off by default in Matplotlib. Also note that the call to plt.savefig() is before the call to plt.show(). As mentioned before, when the figure window is closed, the plot is lost. You cannot save a plot after it has been closed.
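As a hypothetical illustration (the CSV filename and column names below are made up, not from the book), the script might then be invoked like this:

$ python plotfile.py measurements.csv temperature pressure --save report.png --no-show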
Getting help

We have covered how to install Matplotlib and went over how to make very simple plots from a Python session or a Python script. Most likely, this went very smoothly for you. You may be very curious and want to learn more about the many kinds of plots this library has to offer, or maybe you want to learn how to make new kinds of plots. Help comes in many forms. The Matplotlib website (http://matplotlib.org) is the primary online resource for Matplotlib. It contains examples, FAQs, API documentation, and, most importantly, the gallery.

Gallery

Many users of Matplotlib are often faced with the question, "I want to make a plot that has this data along with that data in the same figure, but it needs to look like this other plot I have seen." Text-based searches on graphing concepts are difficult, especially if you are unfamiliar with the terminology. The gallery showcases the variety of ways in which one can make plots, all using the Matplotlib library. Browse through the gallery, click on any figure that has pieces of what you want in your plot, and see the code that generated it. Soon enough, you will be like a chef, mixing and matching components to produce that perfect graph.

Mailing lists and forums

When you are simply stuck and cannot figure out how to get something to work, or just need some hints on how to get started, you will find much of the community at the Matplotlib-users mailing list. This mailing list is an excellent resource of information with many friendly members who just love to help out newcomers. Be persistent! While many questions do get answered fairly quickly, some will fall through the cracks. Try rephrasing your question, or include a plot showing your attempts so far. The people at Matplotlib-users love plots, so an image that shows what is wrong often gets the quickest response. A newer community resource is StackOverflow, which has many very knowledgeable users who are able to answer difficult questions.

From front to backend

So far, we have shown you bits and pieces of two of Matplotlib's topmost abstraction layers: pylab and pyplot. The layer below them is the object-oriented layer (the OO layer). To develop any type of application, you will want to use this layer. Mixing the pylab/pyplot layers with the OO layer will lead to very confusing behaviors when dealing with multiple plots and figures. Below the OO layer is the backend interface. Everything above this interface level in Matplotlib is completely platform-agnostic. It will work the same regardless of whether it is in an interactive GUI or comes from a driver script running on a headless server. The backend interface abstracts away all those considerations so that you can focus on what is most important: writing code to visualize your data. There are several backend implementations that are shipped with Matplotlib. These backends are responsible for taking the figures represented by the OO layer and interpreting them for whichever "display device" they implement. The backends are chosen automatically but can be explicitly set, if needed.

Interactive versus non-interactive

There are two main classes of backends: ones that provide interactive figures and ones that don't. Interactive backends are ones that support a particular GUI, such as Tcl/Tkinter, GTK, Qt, Cocoa/Mac OS X, wxWidgets, and Cairo. With the exception of the Cocoa/Mac OS X backend, all interactive backends can be used on Windows, Linux, and Mac OS X. Therefore, when you make an interactive Matplotlib application that you wish to distribute to users of any of those platforms, unless you are embedding Matplotlib, you will not have to concern yourself with writing a single line of code for any of these toolkits; it has already been done for you! Non-interactive backends are used to produce image files. There are backends to produce Postscript/EPS, Adobe PDF, and Scalable Vector Graphics (SVG) as well as rasterized image files such as PNG, BMP, and JPEGs.

Anti-grain geometry

The open secret behind the high quality of Matplotlib's rasterized images is its use of the Anti-Grain Geometry (AGG) library (http://agg.sourceforge.net/antigrain.com/index.html). The quality of the graphics generated from AGG is far superior to that of most other toolkits available. Therefore, not only is AGG used to produce rasterized image files, but it is also utilized in most of the interactive backends as well.
Matplotlib maintains and ships with its own fork of the library in order to ensure you have consistent, high quality image products across all platforms and toolkits. What you see on your screen in your interactive figure window will be the same as the PNG file that is produced when you call savefig().

Selecting your backend

When you install Matplotlib, a default backend is chosen for you based upon your OS and the available GUI toolkits. For example, on Mac OS X systems, your installation of the library will most likely set the default interactive backend to MacOSX or CocoaAgg for older Macs. Meanwhile, Windows users will most likely have a default of TkAgg or Qt5Agg. In most situations, the choice of interactive backends will not matter. However, in certain situations, it may be necessary to force a particular backend to be used. For example, on a headless server without an active graphics session, you would most likely need to force the use of the non-interactive Agg backend:

import matplotlib
matplotlib.use("Agg")

When done prior to any plotting commands, this will avoid loading any GUI toolkits, thereby bypassing problems that occur when a GUI fails on a headless server. Any call to show() effectively becomes a no-op (and the execution of the script is not blocked).
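As a minimal sketch of this headless workflow (not from the book; the data and output filename are made up), a script could select the Agg backend and write straight to a file:

import matplotlib
matplotlib.use("Agg")          # must be done before pyplot is imported
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
ax.plot([1, 2, 3, 2, 1])
fig.savefig("report.png")      # no GUI toolkit is ever loaded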
Another purpose of setting your backend is for scenarios when you want to embed your plot in a native GUI application. Therefore, you will need to explicitly state which GUI toolkit you are using. Finally, some users simply like the look and feel of some GUI toolkits better than others. They may wish to change the default backend via the backend parameter in the matplotlibrc configuration file. Most likely, your rc file can be found in the .matplotlib directory or the .config/matplotlib directory under your home folder. If you can't find it, then use the following set of commands:

>>> import matplotlib
>>> matplotlib.matplotlib_fname()
u'/home/ben/.config/matplotlib/matplotlibrc'

Here is an example of the relevant section in my matplotlibrc file:

#### CONFIGURATION BEGINS HERE

# the default backend; one of GTK GTKAgg GTKCairo GTK3Agg
# GTK3Cairo CocoaAgg MacOSX QtAgg Qt4Agg TkAgg WX WXAgg Agg Cairo
# PS PDF SVG
# You can also deploy your own backend outside of matplotlib by
# referring to the module name (which must be in the PYTHONPATH)
# as 'module://my_backend'
#backend     : GTKAgg
#backend     : QT4Agg
backend     : TkAgg

# If you are using the Qt4Agg backend, you can choose here
# to use the PyQt4 bindings or the newer PySide bindings to
# the underlying Qt4 toolkit.
#backend.qt4 : PyQt4       # PyQt4 | PySide

This is the global configuration file that is used if one isn't found in the current working directory when Matplotlib is imported. The settings contained in this configuration serve as default values for many parts of Matplotlib. In particular, we see that the choice of backends can be easily set without having to use a single line of code.

The Matplotlib figure-artist hierarchy

Everything that can be drawn in Matplotlib is called an artist. Any artist can have child artists that are also drawable. This forms the basis of a hierarchy of artist objects that Matplotlib sends to a backend for rendering. At the root of this artist tree is the figure. In the examples so far, we have not explicitly created any figures. The pylab and pyplot interfaces will create the figures for us. However, when creating advanced interactive applications, it is highly recommended that you explicitly create your figures. You will especially want to do this if you have multiple figures being displayed at the same time. This is the entry into the OO layer of Matplotlib:

fig = plt.figure()

Canvassing the figure

The figure is, quite literally, your canvas. Its primary component is the FigureCanvas instance upon which all drawing occurs. Unless you are embedding your Matplotlib figures into a GUI application, it is very unlikely that you will need to interact with this object directly. Instead, as plotting commands are issued, artist objects are added to the canvas automatically. While any artist can be added directly to the figure, usually only Axes objects are added. A figure can have many axes objects, typically called subplots. Much like the figure object, our examples so far have not explicitly created any axes objects to use. This is because the pylab and pyplot interfaces will also automatically create and manage axes objects for a figure if needed. For the same reason as for figures, you will want to explicitly create these objects when building your interactive applications. If an axes or a figure is not provided, then the pyplot layer will have to make assumptions about which axes or figure you mean to apply a plotting command to. While this might be fine for simple situations, these assumptions get hairy very quickly in non-trivial applications. Luckily, it is easy to create both your figure and its axes using a single command:

fig, axes = plt.subplots(2, 1)   # 2x1 grid of subplots

These objects are highly advanced, complex units that most developers will utilize for their plotting needs. Once placed on the figure canvas, the axes object will provide the ticks, axis labels, axes title(s), and the plotting area. An axes is an artist that manages all of its scale and coordinate transformations (for example, log scaling and polar coordinates), automated tick labeling, and automated axis limits. In addition to these responsibilities, an axes object provides a wide assortment of plotting functions. A sampling of plotting functions is as follows:

bar: Make a bar plot
barbs: Plot a two-dimensional field of barbs
boxplot: Make a box and whisker plot
cohere: Plot the coherence between x and y
contour: Plot contours
errorbar: Plot an errorbar graph
hexbin: Make a hexagonal binning plot
hist: Plot a histogram
imshow: Display an image on the axes
pcolor: Create a pseudocolor plot of a two-dimensional array
pcolormesh: Plot a quadrilateral mesh
pie: Plot a pie chart
plot: Plot lines and/or markers
quiver: Plot a two-dimensional field of arrows
sankey: Create a Sankey flow diagram
scatter: Make a scatter plot of x versus y
stem: Create a stem plot
streamplot: Draw streamlines of a vector flow
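To give a feel for this axes-level API, here is a small sketch (not from the book, using made-up random data) that exercises two of the methods listed above on a pair of subplots:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)           # made-up sample data
fig, (ax1, ax2) = plt.subplots(2, 1)   # 2x1 grid of subplots
ax1.hist(data, bins=30)                # histogram on the first axes
ax1.set_title("hist")
ax2.scatter(data[:-1], data[1:])       # scatter plot on the second axes
ax2.set_title("scatter")
plt.show()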
The running example that we will start building in this article is a storm track editing application. Given a series of radar images, the user can circle each storm cell they see in the radar image and link those storm cells across time. The application will need the ability to save and load track data and provide the user with mechanisms to edit the data. Along the way, we will learn about Matplotlib's structure, its artists, the callback system, doing animations, and finally, embedding this application within a larger GUI application. So, to begin, we first need to be able to view a radar image. There are many ways to load data into a Python program, but one particular favorite among meteorologists is the Network Common Data Form (NetCDF) file. The SciPy package has built-in support for NetCDF version 3, so we will be using an hour's worth of radar reflectivity data prepared using this format from a NEXRAD site near Oklahoma City, OK on the evening of May 10, 2010, which produced numerous tornadoes and severe storms. The NetCDF binary file is particularly nice to work with because it can hold multiple data variables in a single file, with each variable having an arbitrary number of dimensions. Furthermore, metadata can be attached to each variable and to the dataset itself, allowing you to self-document data files. This particular data file has three variables, namely Reflectivity, lat, and lon, to record the radar reflectivity values and the latitude and longitude coordinates of each pixel in the reflectivity data. The reflectivity data is three-dimensional, with the first dimension as time and the other two dimensions as latitude and longitude. The following code example shows how easy it is to load this data and display the first image frame using SciPy and Matplotlib.

Code: chp1/simple_radar_viewer.py

import matplotlib.pyplot as plt
from scipy.io import netcdf_file

ncf = netcdf_file('KTLX_20100510_22Z.nc')
data = ncf.variables['Reflectivity']
lats = ncf.variables['lat']
lons = ncf.variables['lon']
i = 0

cmap = plt.get_cmap('gist_ncar')
cmap.set_under('lightgrey')

fig, ax = plt.subplots(1, 1)
im = ax.imshow(data[i], origin='lower',
               extent=(lons[0], lons[-1], lats[0], lats[-1]),
               vmin=0.1, vmax=80, cmap='gist_ncar')
cb = fig.colorbar(im)

cb.set_label('Reflectivity (dBZ)')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()

Running this script should result in a figure window that will display the first frame of our storms. The plot has a colorbar and the axes ticks label the latitudes and longitudes of our data. What is probably most important in this example is the imshow() call. Being an image, traditionally, the origin of the image data is shown in the upper-left corner and Matplotlib follows this tradition by default. However, this particular dataset was saved with its origin in the lower-left corner, so we need to state this with the origin parameter. The extent parameter is a tuple describing the data extent of the image. By default, it is assumed to be at (0, 0) and (N - 1, M - 1) for an MxN shaped image. The vmin and vmax parameters are a good way to ensure consistency of your colormap regardless of your input data. If these two parameters are not supplied, then imshow() will use the minimum and maximum of the input data to determine the colormap. This would be undesirable as we move towards displaying arbitrary frames of radar data. Finally, one can explicitly specify the colormap to use for the image. The gist_ncar colormap is very similar to the official NEXRAD colormap for radar data, so we will use it here.

Note that the gist_ncar colormap, along with some other colormaps packaged with Matplotlib such as the default jet colormap, is actually terrible for visualization. See the Choosing Colormaps page of the Matplotlib website for an explanation of why, and guidance on how to choose a better colormap.

The menagerie of artists

Whenever a plotting function is called, the input data and parameters are processed to produce new artists to represent the data. These artists are either primitives or collections thereof. They are called primitives because they represent basic drawing components such as lines, images, polygons, and text.
It is with these primitives that your data can be represented as bar charts, line plots, errorbars, or any other kinds of plots.

Primitives

There are four drawing primitives in Matplotlib: Line2D, AxesImage, Patch, and Text. It is from these primitive artists that all other artist objects are derived, and they comprise everything that can be drawn in a figure. A Line2D object uses a list of coordinates to draw line segments in between. Typically, the individual line segments are straight, and curves can be approximated with many vertices; however, curves can be specified to draw arcs, circles, or any other Bezier-approximated curves. An AxesImage class will take two-dimensional data and coordinates and display an image of that data with a colormap applied to it. There are actually other kinds of basic image artists available besides AxesImage, but they are typically for very special uses. AxesImage objects can be very tricky to deal with, so it is often best to use the imshow() plotting method to create and return these objects. A Patch object is an arbitrary two-dimensional object that has a single color for its "face." A polygon object is a specific instance of the slightly more general patch. These objects have a "path" (much like a Line2D object) that specifies segments that would enclose a face with a single color. The path is known as an "edge," and can have its own color as well. Besides the Polygons that one sees for bar plots and pie charts, Patch objects are also used to create arrows, legend boxes, and the markers used in scatter plots and elsewhere. Finally, the Text object takes a Python string, a point coordinate, and various font parameters to form the text that annotates plots. Matplotlib primarily uses TrueType fonts. It will search for fonts available on your system as well as ship with a few FreeType2 fonts, and it uses Bitstream Vera by default. Additionally, a Text object can defer to LaTeX to render its text, if desired.

While specific artist classes will have their own set of properties that make sense for the particular art object they represent, there are several common properties that can be set. The following is a listing of some of these properties:

alpha: 0 represents transparent and 1 represents opaque
color: Color name or other color specification
visible: Boolean flag for whether or not to draw the artist
zorder: Value of the draw order in the layering engine
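As a quick, hypothetical sketch (not from the book) of setting these common properties, here they are applied to the Line2D artist returned by a plot() call:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
line, = ax.plot([1, 2, 3, 2, 1])   # ax.plot() returns a list of Line2D artists
line.set_alpha(0.5)                # semi-transparent
line.set_color('green')
line.set_zorder(3)                 # drawn above artists with a lower zorder
line.set_visible(True)
plt.show()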
Let's extend the radar image example by loading up already saved polygons of storm cells in the tutorial.py file.

Code: chp1/simple_storm_cell_viewer.py

import matplotlib.pyplot as plt
from scipy.io import netcdf_file
from matplotlib.patches import Polygon
from tutorial import polygon_loader

ncf = netcdf_file('KTLX_20100510_22Z.nc')
data = ncf.variables['Reflectivity']
lats = ncf.variables['lat']
lons = ncf.variables['lon']
i = 0

cmap = plt.get_cmap('gist_ncar')
cmap.set_under('lightgrey')

fig, ax = plt.subplots(1, 1)
im = ax.imshow(data[i], origin='lower',
               extent=(lons[0], lons[-1], lats[0], lats[-1]),
               vmin=0.1, vmax=80, cmap='gist_ncar')
cb = fig.colorbar(im)

polygons = polygon_loader('polygons.shp')
for poly in polygons[i]:
    p = Polygon(poly, lw=3, fc='k', ec='w', alpha=0.45)
    ax.add_artist(p)

cb.set_label("Reflectivity (dBZ)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
plt.show()

The polygon data returned from polygon_loader() is a dictionary of lists keyed by a frame index. The list contains Nx2 numpy arrays of vertex coordinates in longitude and latitude. The vertices form the outline of a storm cell. The Polygon constructor, like those of all other artist objects, takes many optional keyword arguments. First, lw is short for linewidth (referring to the outline of the polygon), which we specify to be three points wide. Next is fc, which is short for facecolor, and is set to black ('k'). This is the color of the filled-in region of the polygon. Then edgecolor (ec) is set to white ('w') to help the polygons stand out against a dark background. Finally, we set the alpha argument to be slightly less than half to make the polygon fairly transparent so that one can still see the reflectivity data beneath the polygons.

Note a particular difference between how we plotted the image using imshow() and how we plotted the polygons using polygon artists. For polygons, we called a constructor and then explicitly called ax.add_artist() to add each polygon instance as a child of the axes. Meanwhile, imshow() is a plotting function that will do all of the hard work in validating the inputs, building the AxesImage instance, making all necessary modifications to the axes instance (such as setting the limits and aspect ratio), and most importantly, adding the artist object to the axes. Finally, all plotting functions in Matplotlib return artists or a list of the artist objects that they create. In most cases, you will not need to save this return value in a variable because there is nothing else to do with them. In this case, we only needed the returned AxesImage so that we could pass it to the fig.colorbar() method. This is so that it would know what to base the colorbar upon. The plotting functions in Matplotlib exist to provide convenience and simplicity to what can often be very tricky to get right by yourself. They are not magic! They use the same OO interface that is accessible to application developers. Therefore, anyone can write their own plotting functions to make complicated plots easier to perform.

Collections

Any artist that has child artists (such as a figure or an axes) is called a container. A special kind of container in Matplotlib is called a Collection. A collection usually contains a list of primitives of the same kind that should all be treated similarly. For example, a CircleCollection would have a list of Circle objects, all with the same color, size, and edge width. Individual values for artists in the collection can also be set. A collection makes management of many artists easier. This becomes especially important when considering the number of artist objects that may be needed for scatter plots, bar charts, or any other kind of plot or diagram. Some collections are not just simply a list of primitives, but are artists in their own right. These special kinds of collections take advantage of various optimizations that can be assumed when rendering similar or identical things. RegularPolyCollection, for example, just needs to know the points of a single polygon relative to its center (such as a star or box) and then just needs a list of all the center coordinates, avoiding the need to store all the vertices of every polygon in its collection in memory. In the following example, we will display storm tracks as a LineCollection. Note that instead of using ax.add_artist() (which would work), we will use ax.add_collection() instead.
This has the added benefit of performing special handling on the object to determine its bounding box, so that the axes object can incorporate the limits of this collection with any other plotted objects and automatically set its own limits, which we trigger with the ax.autoscale(True) call.

Code: chp1/linecoll_track_viewer.py

import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from tutorial import track_loader

tracks = track_loader('polygons.shp')
# Filter out non-tracks (unassociated polygons given trackID of -9)
tracks = {tid: t for tid, t in tracks.items() if tid != -9}

fig, ax = plt.subplots(1, 1)
lc = LineCollection(tracks.values(), color='b')
ax.add_collection(lc)
ax.autoscale(True)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
plt.show()

This was much easier than the radar images: Matplotlib took care of all the limit setting automatically. Such features are extremely useful for writing generic applications that do not wish to concern themselves with such details.

Summary

In this article, we introduced you to the foundational concepts of Matplotlib. Using show(), you showed your first plot with only three lines of Python. With this plot up on your screen, you learned some of the basic interactive features built into Matplotlib, such as panning, zooming, and the myriad of key bindings that are available. Then we discussed the difference between interactive and non-interactive plotting modes and the difference between scripted and interactive plotting. You now know where to go online for more information, examples, and forum discussions of Matplotlib when it comes time for you to work on your next Matplotlib project. Next, we discussed the architectural concepts of Matplotlib: backends, figures, axes, and artists. Then we started our construction project, an interactive storm cell tracking application. We saw how to plot a radar image using a pre-existing plotting function, as well as how to display polygons and lines as artists and collections. While creating these objects, we had a glimpse of how to customize the properties of these objects for our display needs, learning some of the property and styling names. We also learned some of the steps one needs to consider when creating their own plotting functions, such as autoscaling.

Resources for Article:

Further resources on this subject:
The plot function [article]
First Steps [article]
Machine Learning in IPython with scikit-learn [article]

Reconstructing 3D Scenes

Packt
13 Oct 2016
25 min read
This article is written by Robert Laganiere, the author of the book OpenCV 3 Computer Vision Application Programming Cookbook, Third Edition, and covers the following recipes:

Calibrating a camera
Recovering camera pose

Digital image formation

Let us now redraw a new version of the figure describing the pin-hole camera model. More specifically, we want to demonstrate the relation between a point in 3D at position (X,Y,Z) and its image (x,y) on a camera, specified in pixel coordinates. Note the changes that have been made to the original figure. First, a reference frame was positioned at the center of the projection; then, the Y-axis was aligned to point downwards to get a coordinate system that is compatible with the usual convention, which places the image origin at the upper-left corner of the image. Finally, we have identified a special point on the image plane, by considering the line coming from the focal point that is orthogonal to the image plane. The point (u0,v0) is the pixel position at which this line pierces the image plane and is called the principal point. It would be logical to assume that this principal point is at the center of the image plane, but in practice, this one might be off by a few pixels depending on the precision of the camera. Since we are dealing with digital images, the number of pixels on the image plane (its resolution) is another important characteristic of a camera.

We learned previously that a 3D point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z). Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by the pixel width (px) and then the height (py). Note that by dividing the focal length given in world units (generally given in millimeters) by px, we obtain the focal length expressed in (horizontal) pixels. We will then define this term as fx. Similarly, fy = f/py is defined as the focal length expressed in vertical pixel units. The complete projective equation is therefore as shown:

x = fx X/Z + u0
y = fy Y/Z + v0

We know that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. Also, the physical size of a pixel can be obtained by dividing the size of the image sensor (generally in millimeters) by the number of pixels (horizontally or vertically). In modern sensors, pixels are generally square, that is, they have the same horizontal and vertical size. The preceding equations can be rewritten in matrix form. Here is the complete projective equation in its most general form:

    [x]   [fx  0  u0] [r1 r2 r3 t1] [X]
  s [y] = [ 0 fy  v0] [r4 r5 r6 t2] [Y]
    [1]   [ 0  0   1] [r7 r8 r9 t3] [Z]
                                    [1]

Here, s is an arbitrary scale factor, reflecting the fact that the projection is defined only up to scale; the first matrix holds the intrinsic parameters, while the second holds the rotation (r1 to r9) and translation (t1, t2, t3) that express the scene points in the camera reference frame.
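As a worked illustration of this equation (using the intrinsic values that will be obtained from the calibration later in this article, and a made-up 3D point already expressed in the camera reference frame), a point at (X,Y,Z) = (0.5, 0.2, 2.0) in consistent world units would be imaged at:

x = 409 * 0.5 / 2.0 + 237 = 339.25 pixels
y = 408 * 0.2 / 2.0 + 171 = 211.8 pixels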
Calibrating a camera

Camera calibration is the process by which the different camera parameters (that is, the ones appearing in the projective equation) are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. However, accurate calibration information can be obtained by undertaking an appropriate camera calibration step. An active camera calibration procedure will proceed by showing known patterns to the camera and analyzing the obtained images. An optimization process will then determine the optimal parameter values that explain the observations. This is a complex process that has been made easy by the availability of OpenCV calibration functions.

How to do it...

To calibrate a camera, the idea is to show it a set of scene points for which the 3D positions are known. Then, you need to observe where these points project on the image. With the knowledge of a sufficient number of 3D points and associated 2D image points, the exact camera parameters can be inferred from the projective equation. Obviously, for accurate results, we need to observe as many points as possible. One way to achieve this would be to take a picture of a scene with known 3D points, but in practice, this is rarely feasible. A more convenient way is to take several images of a set of 3D points from different viewpoints. This approach is simpler, but it requires you to compute the position of each camera view in addition to the computation of the internal camera parameters, which is fortunately feasible.

OpenCV proposes that you use a chessboard pattern to generate the set of 3D scene points required for calibration. This pattern creates points at the corners of each square, and since this pattern is flat, we can freely assume that the board is located at Z=0, with the X and Y axes well-aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. The calibration pattern images captured for this recipe contain 7x5 inner corners. The good thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (the number of horizontal and vertical inner corner points). The function will return the position of these chessboard corners on the image. If the function fails to find the pattern, then it simply returns false, as shown:

//output vectors of image points
std::vector<cv::Point2f> imageCorners;
//number of inner corners on the chessboard
cv::Size boardSize(7,5);
//Get the chessboard corners
bool found = cv::findChessboardCorners(image,         // image of chessboard pattern
                                       boardSize,     // size of pattern
                                       imageCorners); // list of detected corners

The output parameter, imageCorners, will simply contain the pixel coordinates of the detected inner corners of the shown pattern. Note that this function accepts additional parameters if you need to tune the algorithm, which are not discussed here. There is also a special function that draws the detected corners on the chessboard image, with lines connecting them in a sequence:

//Draw the corners
cv::drawChessboardCorners(image, boardSize,
                          imageCorners,
                          found); // corners have been found

The lines that connect the points in the resulting image show the order in which the points are listed in the vector of detected image points. To perform a calibration, we now need to specify the corresponding 3D points. You can specify these points in the units of your choice (for example, in centimeters or in inches); however, the simplest is to assume that each square represents one unit. In that case, the coordinates of the first point would be (0,0,0) (assuming that the board is located at a depth of Z=0), the coordinates of the second point would be (1,0,0), and so on, the last point being located at (6,4,0). There are a total of 35 points in this pattern, which is too few to obtain an accurate calibration. To get more points, you need to show more images of the same calibration pattern from various points of view. To do so, you can either move the pattern in front of the camera or move the camera around the board; from a mathematical point of view, this is completely equivalent.
The OpenCV calibration function assumes that the reference frame is fixed on the calibration pattern and will calculate the rotation and translation of the camera with respect to the reference frame. Let's now encapsulate the calibration process in a CameraCalibrator class. The attributes of this class are as follows:

// input points:
// the points in world coordinates
// (each square is one unit)
std::vector<std::vector<cv::Point3f>> objectPoints;
// the image point positions in pixels
std::vector<std::vector<cv::Point2f>> imagePoints;
// output Matrices
cv::Mat cameraMatrix;
cv::Mat distCoeffs;
// flag to specify how calibration is done
int flag;

Note that the input vectors of the scene and image points are in fact made of std::vector of point instances; each vector element is a vector of the points from one view. Here, we decided to add the calibration points by specifying a vector of chessboard image filenames as input; the method will take care of extracting the point coordinates from the images:

// Open chessboard images and extract corner points
int CameraCalibrator::addChessboardPoints(
    const std::vector<std::string>& filelist,  // list of filenames
    cv::Size & boardSize) {                    // calibration board size
  // the points on the chessboard
  std::vector<cv::Point2f> imageCorners;
  std::vector<cv::Point3f> objectCorners;
  // 3D Scene Points:
  // Initialize the chessboard corners
  // in the chessboard reference frame
  // The corners are at 3D location (X,Y,Z) = (i,j,0)
  for (int i=0; i<boardSize.height; i++) {
    for (int j=0; j<boardSize.width; j++) {
      objectCorners.push_back(cv::Point3f(i, j, 0.0f));
    }
  }
  // 2D Image points:
  cv::Mat image; // to contain chessboard image
  int successes = 0;
  // for all viewpoints
  for (int i=0; i<filelist.size(); i++) {
    // Open the image
    image = cv::imread(filelist[i], 0);
    // Get the chessboard corners
    bool found = cv::findChessboardCorners(image,         // image of chessboard pattern
                                           boardSize,     // size of pattern
                                           imageCorners); // list of detected corners
    if (found) {
      // Get subpixel accuracy on the corners
      cv::cornerSubPix(image, imageCorners,
                       cv::Size(5, 5),   // half size of search window
                       cv::Size(-1, -1),
                       cv::TermCriteria(cv::TermCriteria::MAX_ITER +
                                        cv::TermCriteria::EPS,
                                        30,     // max number of iterations
                                        0.1));  // min accuracy
      // If we have a good board, add it to our data
      if (imageCorners.size() == boardSize.area()) {
        // Add image and scene points from one view
        addPoints(imageCorners, objectCorners);
        successes++;
      }
    }
  }
  return successes;
}

The first loop inputs the 3D coordinates of the chessboard, and the corresponding image points are the ones provided by the cv::findChessboardCorners function; this is done for all the available viewpoints. Moreover, in order to obtain a more accurate image point location, the cv::cornerSubPix function can be used, and as the name suggests, the image points will then be localized at subpixel accuracy. The termination criterion that is specified by the cv::TermCriteria object defines the maximum number of iterations and the minimum accuracy in subpixel coordinates. The first of these two conditions that is reached will stop the corner refinement process. When a set of chessboard corners have been successfully detected, these points are added to the vectors of the image and scene points using our addPoints method.
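The addPoints method itself is not shown in this excerpt; a minimal sketch of it would simply append one view's worth of correspondences to the class attributes:

// Hypothetical helper (not part of the excerpt above):
// store the scene/image point correspondences from one view.
void CameraCalibrator::addPoints(const std::vector<cv::Point2f>& imageCorners,
                                 const std::vector<cv::Point3f>& objectCorners) {
  // 2D image points from one view
  imagePoints.push_back(imageCorners);
  // corresponding 3D scene points
  objectPoints.push_back(objectCorners);
}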
Once a sufficient number of chessboard images have been processed (and consequently, a large number of 3D scene point / 2D image point correspondences are available), we can initiate the computation of the calibration parameters as shown:

// Calibrate the camera
// returns the re-projection error
double CameraCalibrator::calibrate(cv::Size &imageSize) {
  // Output rotations and translations
  std::vector<cv::Mat> rvecs, tvecs;
  // start calibration
  return calibrateCamera(objectPoints, // the 3D points
                         imagePoints,  // the image points
                         imageSize,    // image size
                         cameraMatrix, // output camera matrix
                         distCoeffs,   // output distortion matrix
                         rvecs, tvecs, // Rs, Ts
                         flag);        // set options
}

In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. These will be described in the next section.
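To tie the pieces of the class together, a hypothetical driver (not from the book; the filenames are made up and a default constructor is assumed) might run the whole calibration like this:

// Hypothetical driver: collect the chessboard views, then calibrate.
std::vector<std::string> filelist = {"chessboard01.jpg", "chessboard02.jpg",
                                     "chessboard03.jpg"};  // made-up filenames
cv::Size boardSize(7, 5);                 // inner corners of the pattern
CameraCalibrator calibrator;
calibrator.addChessboardPoints(filelist, boardSize);
cv::Mat sample = cv::imread(filelist[0], 0);
cv::Size imageSize = sample.size();
double error = calibrator.calibrate(imageSize);   // re-projection error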
How it works...

In order to explain the result of the calibration, we need to go back to the projective equation presented in the introduction of this article. This equation describes the transformation of a 3D point into a 2D point through the successive application of two matrices. The first matrix includes all of the camera parameters, which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that explicitly returns the value of the intrinsic parameters given by a calibration matrix. The second matrix is there to have the input points expressed in camera-centric coordinates. It is composed of a rotation component (a 3x3 matrix) and a translation component (a 3x1 vector). Remember that in our calibration example, the reference frame was placed on the chessboard. Therefore, there is a rigid transformation (made of a rotation component represented by the matrix entries r1 to r9 and a translation represented by t1, t2, and t3) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration, and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system.

The calibration results provided by cv::calibrateCamera are obtained through an optimization process. This process aims to find the intrinsic and extrinsic parameters that minimize the difference between the predicted image point position, as computed from the projection of the 3D scene points, and the actual image point position, as observed on the image. The sum of this difference for all the points specified during the calibration is called the re-projection error. The intrinsic parameters of our test camera obtained from a calibration based on 27 chessboard images are fx=409 pixels, fy=408 pixels, u0=237, and v0=171. Our calibration images have a size of 536x356 pixels. From the calibration results, you can see that, as expected, the principal point is close to the center of the image, but off by a few pixels. The calibration images were taken using a Nikon D500 camera with an 18mm lens. Looking at the manufacturer's specifications, we find that the sensor size of this camera is 23.5mm x 15.7mm, which gives us a pixel size of 0.0438mm (23.5mm divided by the 536 horizontal pixels). The estimated focal length is expressed in pixels, so multiplying the result by the pixel size gives us an estimated focal length of 17.8mm, which is consistent with the actual lens we used.

Let us now turn our attention to the distortion parameters. So far, we have mentioned that under the pin-hole camera model, we can neglect the effect of the lens. However, this is only possible if the lens that is used to capture an image does not introduce important optical distortions. Unfortunately, this is not the case with lower quality lenses or with lenses that have a very short focal length. Even the lens we used in this experiment introduced some distortion, that is, the edges of the rectangular board are curved in the image. Note that this distortion becomes more important as we move away from the center of the image. This is a typical distortion observed with a fish-eye lens and is called radial distortion. It is possible to compensate for these deformations by introducing an appropriate distortion model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be reverted in order to undo the distortions visible on the image. Fortunately, the exact parameters of the transformation, which will correct the distortions, can be obtained together with the other camera parameters during the calibration phase. Once this is done, any image from the newly calibrated camera will be undistorted. Therefore, we have added an additional method to our calibration class.

//remove distortion in an image (after calibration)
cv::Mat CameraCalibrator::remap(const cv::Mat &image) {
  cv::Mat undistorted;
  if (mustInitUndistort) { // called once per calibration
    cv::initUndistortRectifyMap(cameraMatrix,  // computed camera matrix
                                distCoeffs,    // computed distortion matrix
                                cv::Mat(),     // optional rectification (none)
                                cv::Mat(),     // camera matrix to generate undistorted
                                image.size(),  // size of undistorted
                                CV_32FC1,      // type of output map
                                map1, map2);   // the x and y mapping functions
    mustInitUndistort = false;
  }
  // Apply mapping functions
  cv::remap(image, undistorted, map1, map2,
            cv::INTER_LINEAR); // interpolation type
  return undistorted;
}

Running this code on one of our calibration images produces an undistorted version of it. To correct the distortion, OpenCV uses a polynomial function that is applied to the image points in order to move them to their undistorted positions. By default, five coefficients are used; a model made of eight coefficients is also available. Once these coefficients are obtained, it is possible to compute two cv::Mat mapping functions (one for the x coordinate and one for the y coordinate) that will give the new undistorted position of an image point on a distorted image. This is computed by the cv::initUndistortRectifyMap function, and the cv::remap function remaps all the points of an input image to a new image. Note that because of the nonlinear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you now obtain output pixels that have no values in the input image (they will then be displayed as black pixels).

There's more...

More options are available when it comes to camera calibration.

Calibration with known intrinsic parameters

When a good estimate of the camera's intrinsic parameters is known, it could be advantageous to input them to the cv::calibrateCamera function.
They will then be used as initial values in the optimization process. To do so, you just need to add the cv::CALIB_USE_INTRINSIC_GUESS flag and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (cv::CALIB_FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (cv::CALIB_FIX_ASPECT_RATIO); in which case, you assume that the pixels have a square shape.

Using a grid of circles for calibration

Instead of the usual chessboard pattern, OpenCV also offers the possibility to calibrate a camera by using a grid of circles. In this case, the centers of the circles are used as calibration points. The corresponding function is very similar to the function we used to locate the chessboard corners, for example:

cv::Size boardSize(7,7);
std::vector<cv::Point2f> centers;
bool found = cv::findCirclesGrid(image, boardSize, centers);

See also

The article A flexible new technique for camera calibration by Z. Zhang, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000, is a classic paper on the problem of camera calibration.

Recovering camera pose

When a camera is calibrated, it becomes possible to relate the captured images with the outside world. If the 3D structure of an object is known, then one can predict how the object will be imaged on the sensor of the camera. The process of image formation is indeed completely described by the projective equation that was presented at the beginning of this article. When most of the terms of this equation are known, it becomes possible to infer the value of the other elements (2D or 3D) through the observation of some images. In this recipe, we will look at the camera pose recovery problem when a known 3D structure is observed.

How to do it...

Let's consider a simple object here, a bench in a park. We took an image of it using the camera/lens system calibrated in the previous recipe. We have manually identified 8 distinct image points on the bench that we will use for our camera pose estimation. Having access to this object makes it possible to make some physical measurements. The bench is composed of a seat of size 242.5cm x 53.5cm x 9cm and a back of size 242.5cm x 24cm x 9cm that is fixed 12cm over the seat. Using this information, we can then easily derive the 3D coordinates of the eight identified points in an object-centric reference frame (here we fixed the origin at the left extremity of the intersection between the two planes). We can then create a vector of cv::Point3f containing these coordinates:

//Input object points
std::vector<cv::Point3f> objectPoints;
objectPoints.push_back(cv::Point3f(0, 45, 0));
objectPoints.push_back(cv::Point3f(242.5, 45, 0));
objectPoints.push_back(cv::Point3f(242.5, 21, 0));
objectPoints.push_back(cv::Point3f(0, 21, 0));
objectPoints.push_back(cv::Point3f(0, 9, -9));
objectPoints.push_back(cv::Point3f(242.5, 9, -9));
objectPoints.push_back(cv::Point3f(242.5, 9, 44.5));
objectPoints.push_back(cv::Point3f(0, 9, 44.5));

The question now is where the camera was with respect to these points when the shown picture was taken. Since the coordinates of the image of these known points on the 2D image plane are also known, it becomes easy to answer this question using the cv::solvePnP function.
Here, the correspondence between the 3D and the 2D points has been established manually, but as a reader of this book, you should be able to come up with methods that would allow you to obtain this information automatically.

//Input image points
std::vector<cv::Point2f> imagePoints;
imagePoints.push_back(cv::Point2f(136, 113));
imagePoints.push_back(cv::Point2f(379, 114));
imagePoints.push_back(cv::Point2f(379, 150));
imagePoints.push_back(cv::Point2f(138, 135));
imagePoints.push_back(cv::Point2f(143, 146));
imagePoints.push_back(cv::Point2f(381, 166));
imagePoints.push_back(cv::Point2f(345, 194));
imagePoints.push_back(cv::Point2f(103, 161));

// Get the camera pose from 3D/2D points
cv::Mat rvec, tvec;
cv::solvePnP(objectPoints, imagePoints,      // corresponding 3D/2D pts
             cameraMatrix, cameraDistCoeffs, // calibration
             rvec, tvec);                    // output pose
// Convert to 3D rotation matrix
cv::Mat rotation;
cv::Rodrigues(rvec, rotation);

This function computes the rigid transformation (rotation and translation) that brings the object coordinates into the camera-centric reference frame (that is, the one that has its origin at the focal point). It is also important to note that the rotation computed by this function is given in the form of a 3D vector. This is a compact representation in which the rotation to apply is described by a unit vector (an axis of rotation) around which the object is rotated by a certain angle. This axis-angle representation is also called Rodrigues' rotation formula. In OpenCV, the angle of rotation corresponds to the norm of the output rotation vector, which is itself aligned with the axis of rotation. This is why the cv::Rodrigues function is used to obtain the 3D matrix of rotation that appears in our projective equation.

The pose recovery procedure described here is simple, but how do we know we obtained the right camera/object pose information? We can visually assess the quality of the results using the cv::viz module, which gives us the ability to visualize 3D information. The use of this module is explained in the last section of this recipe. For now, let's display a simple 3D representation of our object and the camera that captured it. It might be difficult to judge the quality of the pose recovery just by looking at such an image, but if you test the example of this recipe on your computer, you will have the possibility to move this representation in 3D using your mouse, which should give you a better sense of the solution obtained.

How it works...

In this recipe, we assumed that the 3D structure of the object was known, as well as the correspondence between sets of object points and image points. The camera's intrinsic parameters were also known through calibration. If you look at our projective equation presented at the end of the Digital image formation section of the introduction of this article, this means that we have points for which the coordinates (X,Y,Z) and (x,y) are known. We also have the elements of the first matrix known (the intrinsic parameters). Only the second matrix is unknown; this is the one that contains the extrinsic parameters of the camera, that is, the camera/object pose information. Our objective is to recover these unknown parameters from the observation of 3D scene points. This problem is known as the Perspective-n-Point problem or PnP problem. Rotation has three degrees of freedom (for example, the angle of rotation around the three axes) and translation also has three degrees of freedom. We therefore have a total of 6 unknowns.
For each object point/image point correspondence, the projective equation gives us three algebraic equations, but since the projective equation is defined up to a scale factor, we only have two independent equations. A minimum of three points is therefore required to solve this system of equations. Obviously, more points provide a more reliable estimate. In practice, many different algorithms have been proposed to solve this problem, and OpenCV proposes a number of different implementations in its cv::solvePnP function. The default method consists of optimizing what is called the reprojection error. Minimizing this type of error is considered to be the best strategy to get accurate 3D information from camera images. In our problem, it corresponds to finding the optimal camera position that minimizes the 2D distance between the projected 3D points (as obtained by applying the projective equation) and the observed image points given as input.

Note that OpenCV also has a cv::solvePnPRansac function. As the name suggests, this function uses the RANSAC algorithm in order to solve the PnP problem. This means that some of the object point/image point correspondences may be wrong, and the function will return which ones have been identified as outliers. This is very useful when these correspondences have been obtained through an automatic process that can fail for some points.

There's more...

When working with 3D information, it is often difficult to validate the solutions obtained. To this end, OpenCV offers a simple yet powerful visualization module that facilitates the development and debugging of 3D vision algorithms. It allows the insertion of points, lines, cameras, and other objects in a virtual 3D environment that you can interactively visualize from various points of view.

cv::Viz, a 3D Visualizer module

cv::Viz is an extra module of the OpenCV library that is built on top of the VTK open source library. This Visualization Toolkit (VTK) is a powerful framework used for 3D computer graphics. With cv::viz, you create a 3D virtual environment to which you can add a variety of objects. A visualization window is created that displays the environment from a given point of view. You saw in this recipe an example of what can be displayed in a cv::viz window. This window responds to mouse events that are used to navigate inside the environment (through rotations and translations). This section describes the basic use of the cv::viz module. The first thing to do is to create the visualization window. Here, we use a white background:

// Create a viz window
cv::viz::Viz3d visualizer("Viz window");
visualizer.setBackgroundColor(cv::viz::Color::white());

Next, you create your virtual objects and insert them into the scene. There is a variety of predefined objects. One of them is particularly useful for us; it is the one that creates a virtual pin-hole camera:

// Create a virtual camera
cv::viz::WCameraPosition cam(cMatrix,  // matrix of intrinsics
                             image,    // image displayed on the plane
                             30.0,     // scale factor
                             cv::viz::Color::black());
// Add the virtual camera to the environment
visualizer.showWidget("Camera", cam);

The cMatrix variable is a cv::Matx33d (that is, a cv::Matx<double,3,3>) instance containing the intrinsic camera parameters as obtained from calibration. By default, this camera is inserted at the origin of the coordinate system. To represent the bench, we used two rectangular cuboid objects.
// Create a virtual bench from cuboids
cv::viz::WCube plane1(cv::Point3f(0.0, 45.0, 0.0),
                      cv::Point3f(242.5, 21.0, -9.0),
                      true, // show wire frame
                      cv::viz::Color::blue());
plane1.setRenderingProperty(cv::viz::LINE_WIDTH, 4.0);
cv::viz::WCube plane2(cv::Point3f(0.0, 9.0, -9.0),
                      cv::Point3f(242.5, 0.0, 44.5),
                      true, // show wire frame
                      cv::viz::Color::blue());
plane2.setRenderingProperty(cv::viz::LINE_WIDTH, 4.0);
// Add the virtual objects to the environment
visualizer.showWidget("top", plane1);
visualizer.showWidget("bottom", plane2);

This virtual bench is also added at the origin; it then needs to be moved to its camera-centric position as found by our cv::solvePnP function. It is the responsibility of the setWidgetPose method to perform this operation. It simply applies the rotation and translation components of the estimated motion:

cv::Mat rotation;
// convert vector-3 rotation
// to a 3x3 rotation matrix
cv::Rodrigues(rvec, rotation);
// Move the bench
cv::Affine3d pose(rotation, tvec);
visualizer.setWidgetPose("top", pose);
visualizer.setWidgetPose("bottom", pose);

The final step is to create a loop that keeps displaying the visualization window. The 1 ms pause is there to listen to mouse events:

// visualization loop
while(cv::waitKey(100)==-1 && !visualizer.wasStopped())
{
  visualizer.spinOnce(1,     // pause 1ms
                      true); // redraw
}

This loop will stop when the visualization window is closed or when a key is pressed over an OpenCV image window. Try applying some motion to an object inside this loop (using setWidgetPose); this is how animation can be created.

See also

Model-based object pose in 25 lines of code by D. DeMenthon and L. S. Davis, in European Conference on Computer Vision, 1992, pp. 335–343, is a famous method for recovering camera pose from scene points.

Summary

This article teaches us how, under specific conditions, the 3D structure of the scene and the 3D pose of the cameras that captured it can be recovered. We have seen how a good understanding of projective geometry concepts allows us to devise methods enabling 3D reconstruction.

Resources for Article:

Further resources on this subject: OpenCV: Image Processing using Morphological Filters [article] Learn computer vision applications in Open CV [article] Cardboard is Virtual Reality for Everyone [article]
Getting started with Haskell

Packt
30 Oct 2013
9 min read
(For more resources related to this topic, see here.)

So what is Haskell? It is a fast, type-safe, purely functional programming language with powerful type inference. Having said that, let us try to understand what it gives us. First, a purely functional programming language means that, in general, functions in Haskell don't have side effects. There is a special type for impure functions, or functions with side effects. Then, Haskell has a strong, static type system with automatic and robust type inference. This, in practice, means that you do not usually need to specify the types of functions, and also that the type checker does not allow passing incompatible types. In strongly typed languages, types are considered to be a specification, due to the Curry-Howard correspondence, the direct relationship between programs and mathematical proofs. Under this great simplification, the theorem states that if a value of the type exists (or is inhabited), then the corresponding mathematical proof is correct. Or, jokingly speaking, if a program compiles, then there is a 99 percent probability that it works according to the specification. The question of whether the types conform to the specification in natural language is still open, though; Haskell won't help you with it.

The Haskell platform

The glorious Glasgow Haskell Compilation System, or simply Glasgow Haskell Compiler (GHC), is the most widely used Haskell compiler. It is the current de facto standard. The compiler is packaged into the Haskell platform that follows Python's principle, "Batteries included". The platform is updated twice a year with new compilers and libraries. It usually includes a compiler, an interactive Read-Evaluate-Print Loop (REPL) interpreter, the Haskell 98/2010 libraries (the so-called Prelude) that include most of the common definitions and functions, and a set of commonly used libraries. If you are on Windows or Mac OS X, it is strongly recommended to use the prepackaged installers of the Haskell platform at http://www.haskell.org/platform/.

Quick tour of Haskell

To start with development, first we should be familiar with a few basic features of Haskell. We really need to know about laziness, data types, pattern matching, type classes, and the basic notion of monads to start with Haskell.

Laziness

Haskell is a language with lazy evaluation. From a programmer's point of view, that means that a value is evaluated if and only if it is really needed. Imperative languages usually have strict evaluation, that is, function arguments are evaluated before function application. To see the difference, let's take a look at a simple expression in Haskell:

let x = 2 + 3

In a strict or eager language, the 2 + 3 expression would be immediately evaluated to 5, whereas in Haskell, only a promise to do this evaluation will be created and the value will be evaluated only when it is needed. In other words, this statement just introduces a definition of x which might be used afterwards, unlike in a strict language, where it is an operator that assigns the computed value, 5, to a memory cell named x. Also, this strategy allows the sharing of evaluations, because laziness assumes that computations can be executed whenever needed, and therefore the results can be memoized. It might reduce the running time by an exponential factor over strict evaluation. Laziness also allows us to manipulate infinite data structures. For instance, we can construct an infinite list of natural numbers as follows:

let ns = [1..]
And moreover, we can manipulate it as if it were a normal list, even though some caution is needed, as you can get an infinite loop. We can take the first five elements of this infinite list by means of the built-in function take:

take 5 ns

By running this example in GHCi you will get [1,2,3,4,5].

Functions as first-class citizens

The notion of a function is one of the core ideas in functional languages, and Haskell is no exception at all. The definition of a function includes the body of the function and an optional type declaration. For instance, the take function is defined in Prelude as follows:

take :: Int -> [a] -> [a]
take = ...

Here, the type declaration says that the function takes an integer as the argument and a list of objects of the a type, and returns a new list of the same type. Haskell also allows partial application of a function. For example, you can construct a function that takes the first five elements of a list as follows:

take5 :: [a] -> [a]
take5 = take 5

Functions are themselves objects, that is, you may pass a function as an argument to another function. Prelude defines the map function as a function of a function:

map :: (a -> b) -> [a] -> [b]

map takes a function and applies it to each element of a list. Thus, functions are first-class citizens in the language and it is possible to manipulate them as if they were normal objects.

Data types

Data types are the core of a strongly typed language such as Haskell. The distinctive feature of Haskell data types is that they are all immutable; that is, after an object is constructed, it cannot be changed. It might seem weird at first sight, but in the long run it has a few positive consequences. First, it enables computation parallelization. Second, all data is referentially transparent; that is, there is no difference between a reference to an object and the object itself. These two properties allow the compiler to reason about code optimization at a higher level than a C/C++ compiler can. Let us consider the following data type definitions in Haskell:

type TimeStamp = Integer
newtype Price = Price Double
data Quote = AskQuote TimeStamp Price | BidQuote TimeStamp Price

There are three common ways to define data types in Haskell. The first declaration just creates a synonym for an existing data type, and the type checker won't prevent you from using Integer instead of TimeStamp. Also, you can use a value of type TimeStamp with any function that expects to work with an Integer. The second declaration creates a new type for prices, and you are not allowed to use Double instead of Price. The compiler will raise an error in such cases and thus it will enforce the correct usage. The last declaration introduces an Algebraic Data Type (ADT) and says that a value of type Quote might be constructed either by the data constructor AskQuote or by BidQuote, with a time stamp and a price as its parameters. Quote itself is called a type constructor. A type constructor might be parameterized by type variables. For example, the types Maybe and Either, two quite often used standard types, are defined as follows:

data Maybe a = Nothing | Just a
data Either a b = Left a | Right b

The type variables a and b can be substituted by any other type.

Type classes

Type classes in Haskell are not classes like in object-oriented languages. They are more like interfaces with optional implementations. You can find similar features in other languages under names such as traits, mixins, and so on, but unlike those, this feature in Haskell enables ad hoc polymorphism, that is, a function can be applied to arguments of different types.
It is also known as function overloading or operator overloading. A polymorphic function can specify different implementations for different types. In principle, a type class consists of function declarations over some objects. The Eq type class is a standard type class that specifies how to compare two objects of the same type:

class Eq a where
  (==) :: a -> a -> Bool
  (/=) :: a -> a -> Bool
  (/=) x y = not $ (==) x y

Here, Eq is the name of the type class, a is a type variable, and == and /= are the operations defined in the type class. This definition means that some type a is of class Eq if it has the operations == and /= defined. Moreover, the definition provides a default implementation of the /= operation. And if you decide to implement this class for some data type, you only need to provide the implementation of a single operation. There are a number of useful type classes defined in Prelude. For example, Eq is used to define equality between two objects; Ord specifies total ordering; Show and Read introduce a String representation of an object; the Enum type class describes enumerations, that is, data types with nullary constructors. It might be quite boring to implement these quite trivial but useful type classes for each data type, so Haskell supports automatic derivation of most of the standard type classes:

newtype Price = Price Double deriving (Eq, Show, Read, Ord)

Now objects of type Price can be converted from/to String by means of the read/show functions and compared with themselves using the equality operators. Those who want to know all the details can look them up in the Haskell language report.

IO monad

Being a purely functional language, Haskell requires a marker for impure functions, and the IO monad is that marker. The main function is the entry point for any Haskell program. It cannot be a pure function because it changes the state of the "world", at least by creating a new process in the OS. Let us take a look at its type:

main :: IO ()
main = do
  putStrLn "Hello, World!"

The IO () type signifies that there is a computation that performs some I/O operation and returns an empty result. There is really only one way to perform I/O in Haskell: use it in the main procedure of your program. Also, it is not possible to execute an I/O action from an arbitrary function, unless that function is in the IO monad and called from main, directly or indirectly.

Summary

In this article we walked through the main unique features of the Haskell language and learned a bit of its history.

Resources for Article:

Further resources on this subject: Core Data iOS: Designing a Data Model and Building Data Objects [Article] The OpenFlow Controllers [Article] Managing Network Layout [Article]
Oracle Wallet Manager

Packt
16 Oct 2009
9 min read
The Oracle Wallet Manager

Oracle Wallet Manager is a password-protected, stand-alone Java application tool used to maintain security credentials and store SSL-related information such as authentication and signing credentials, private keys, certificates, and trusted certificates. OWM uses the Public Key Cryptography Standards (PKCS) #12 specification for the Wallet format and PKCS #10 for certificate requests. Oracle Wallet Manager stores X.509 v3 certificates and private keys in industry-standard PKCS #12 formats, and generates certificate requests according to the PKCS #10 specification. This makes the Oracle Wallet structure interoperable with supported third-party PKI applications, and provides Wallet portability across operating systems. Additionally, Oracle Wallet Manager Wallets can be enabled to store credentials on hardware security modules that use APIs compliant with the PKCS #11 specification. The OWM creates Wallets, generates certificate requests, accesses Public Key Infrastructure-based services, saves credentials into cryptographic hardware such as smart cards, uploads and downloads Wallets to and from LDAP directories, and imports Wallets in PKCS #12 format. In a Windows environment, Oracle Wallet Manager can be accessed from the start menu. The following screenshot shows the Oracle Wallet Manager Properties: In a Unix-like environment, OWM can be accessed directly from the command line with the owm shell script located at $ORACLE_HOME/bin/owm; it requires a graphical environment so that it can be launched.

Creating the Oracle Wallet

If this is the first time the Wallet has been opened, then a Wallet file does not yet exist. A Wallet is physically created in a specified directory. The user can declare the path where the Oracle Wallet file should be created. The user may either specify a default location or declare a particular directory. A file named ewallet.p12 will be created in the specified location.

Enabling Auto Login

The Oracle Wallet Manager Auto Login feature creates an obfuscated copy of the Wallet and enables PKI-based access to the services without a password. When this feature is enabled, only the user who created the Wallet will have access to it. By default, Single Sign-On (SSO) access to a different database is disabled. The Auto Login feature must be enabled in order for you to have access to multiple databases using SSO. Checking and unchecking the Auto Login option will enable and disable this feature.

mkwallet, the CLI OWM version

Besides the Java client, there is a command-line interface version of the Wallet, which can be accessed by means of the mkwallet utility. This can also be used to generate a Wallet and have it configured in Auto Login mode. This is a fully featured tool that allows you to create Wallets, and to view and modify their content.
The options provided by the mkwallet tool are shown in the following list:

-R rootPwd rootWrl DN keySize expDate: Create the root Wallet
-e pwd wrl: Create an empty Wallet
-r pwd wrl DN keySize certReqLoc: Create a certificate request, add it to the Wallet, and export it to certReqLoc
-c rootPwd rootWrl certReqLoc certLoc: Create a certificate for a certificate request
-i pwd wrl certLoc NZDST_CERTIFICATE | NZDST_CLEAR_PTP: Install a certificate | trusted point
-d pwd wrl DN: Delete a certificate with matching DN
-s pwd wrl: Store an SSO Wallet
-p pwd wrl: Dump the content of the Wallet
-q certLoc: Dump the content of the certificate
-Lg pwd wrl crlLoc nextUpdate: Generate a CRL
-La pwd wrl crlLoc certtoRevoke: Revoke a certificate
-Ld crlLoc: Display a CRL
-Lv crlLoc cacert: Verify a CRL signature
-Ls crlLoc cert: Check certificate revocation status
-Ll oidHostname oidPortNumber cacert: Fetch a CRL from an LDAP directory
-Lc cert: Fetch a CRL from the CRLDP in the cert
-Lb b64CrlLoc derCrlLoc: Convert a CRL from B64 to DER format
-Pw pwd wrl pkcs11Lib tokenPassphrase: Create an empty Wallet and store PKCS11 info in it
-Pq pwd wrl DN keysize certreqLoc: Create a cert request and generate the key pair on the PKCS11 device
-Pl pwd wrl: Test PKCS11 device login using a Wallet containing PKCS11 info
-Px pwd wrl pkcs11Lib tokenPassphrase: Create a Wallet with PKCS11 info from a software Wallet

Managing Wallets with orapki

A CLI-based tool, orapki, is used to manage Public Key Infrastructure components such as Wallets and revocation lists. This tool eases the procedures related to PKI management and maintenance by allowing the user to include it in batch scripts. This tool can be used to create and view signed certificates for testing purposes, create Oracle Wallets, add and remove certificates and certificate requests, and manage Certificate Revocation Lists (CRLs), renaming them and managing them against the Oracle Internet Directory. The syntax for this tool is:

orapki module command -parameter <value>

module can have these values:

wallet: Oracle Wallet
crl: Certificate Revocation List
cert: The PKI Certificate

To create a Wallet you can issue this command:

orapki wallet create -wallet <Path to Wallet>

To create a Wallet with the auto login feature enabled, you can issue the command:

orapki wallet create -wallet <Path to Wallet> -auto_login

To add a certificate request to the Wallet you can use the command:

orapki wallet add -wallet <wallet_location> -dn <user_dn> -keySize <512|1024|2048>

To add a user certificate to an Oracle Wallet:

orapki wallet add -wallet <wallet_location> -user_cert -cert <certificate_location>

The options and values available for the orapki tool depend on the module to be configured:

orapki cert create: Creates a signed certificate for testing purposes.
orapki cert create [-wallet <wallet_location>] -request <certificate_request_location> -cert <certificate_location> -validity <number_of_days> [-summary]

orapki cert display: Displays details of a specific certificate.
orapki cert display -cert <certificate_location> [-summary|-complete]

orapki crl delete: Deletes CRLs from Oracle Internet Directory.
orapki crl delete -issuer <issuer_name> -ldap <hostname:ssl_port> -user <username> [-wallet <wallet_location>] [-summary]

orapki crl display: Displays specific CRLs that are stored in Oracle Internet Directory.
orapki crl display -crl <crl_location> [-wallet <wallet_location>] [-summary|-complete]

orapki crl hash: Generates a hash value of the certificate revocation list (CRL) issuer to identify the location of the CRL in your file system for certificate validation.
orapki crl hash -crl <crl_filename|URL> [-wallet <wallet_location>] [-symlink|-copy] <crl_directory> [-summary]

orapki crl list: Displays a list of CRLs stored in Oracle Internet Directory.
orapki crl list -ldap <hostname:ssl_port>

orapki crl upload: Uploads CRLs to the CRL subtree in Oracle Internet Directory.
orapki crl upload -crl <crl_location> -ldap <hostname:ssl_port> -user <username> [-wallet <wallet_location>] [-summary]

orapki wallet add: Adds certificate requests and certificates to an Oracle Wallet.
orapki wallet add -wallet <wallet_location> -dn <user_dn> -keySize <512|1024|2048>

orapki wallet create: Creates an Oracle Wallet or sets auto login on for an Oracle Wallet.
orapki wallet create -wallet <wallet_location> [-auto_login]

orapki wallet display: Displays the certificate requests, user certificates, and trusted certificates in an Oracle Wallet.
orapki wallet display -wallet <wallet_location>

orapki wallet export: Exports certificate requests and certificates from an Oracle Wallet.
orapki wallet export -wallet <wallet_location> -dn <certificate_dn> -cert <certificate_filename>

Oracle Wallet Manager CSR generation

Oracle Wallet Manager generates a certificate request in PKCS #10 format. This certificate request can be sent to a certificate authority of your choice. The procedure to generate this certificate request is as follows: From the main menu, choose the Operations menu and then select the Add Certificate Request submenu. As shown in the following screenshot, a form will be displayed where you can capture specific information. The parameters used to request a certificate are described next:

Common Name: This parameter is mandatory. This is the user's name or entity's name. If you are using a user's name, then enter it using the first name, last name format.
Organization Unit: This is the name of the identity's organization unit. It could be the name of the department where the entity belongs (optional parameter).
Organization: This is the company's name (optional).
Location/City: The location and the city where the entity resides (optional).
State/Province: This is the full name of the state where the entity resides. Do not use abbreviations (optional).
Country: This parameter is mandatory. It specifies the country where the entity is located.
Key Size: This parameter is mandatory. It defines the key size used when a public/private key pair is created. The key size can be as little as 512 bits and up to 4096 bits.
Advanced: When the parameters are introduced, a Distinguished Name (DN) is assembled. If you want to customize this DN, then you can use the advanced DN configuration mode.

Once the Certificate Request form has been completed, a PKCS #10 format certificate request is generated. The information that appears between the BEGIN and END keywords must be used to request a certificate from a Certificate Authority (CA); there are several well-known certificate authorities, and depending on the usage you plan for your certificate, you could address the request to a known CA (from the browser perspective) so that when an end user accesses your site, they don't get warned about the site's identity.
If the certificate will be targeted at a local community that doesn't mind the certificate warning, then you may generate your own certificate or ask a CA to issue a certificate for you. For demonstration purposes, we used the Oracle Certificate Authority (OCA) included with the Oracle Application Server. OCA will provide Certificate Authority capabilities to your site and it can issue standard certificates, suitable for intranet users. If you are planning to use OCA, then you should review the license agreements to determine if you are allowed to use it.
Amazon DynamoDB - Modelling relationships, Error handling

Packt
20 Aug 2014
7 min read
In this article by Tanmay Deshpande, author of Mastering DynamoDB, we are going to revise our concepts about DynamoDB and will try to discover more about its features and implementation. (For more resources related to this topic, see here.)

Amazon DynamoDB is a fully managed, cloud-hosted, NoSQL database. It provides fast and predictable performance with the ability to scale seamlessly. It allows you to store and retrieve any amount of data, serving any level of network traffic without having any operational burden. DynamoDB gives numerous other advantages such as consistent and predictable performance, flexible data modeling, and durability. With just a few clicks on the Amazon Web Services console, you are able to create your own DynamoDB database table and scale up or scale down provisioned throughput without taking down your application even for a millisecond. DynamoDB uses solid state disks (SSDs) to store the data, which ensures the durability of the work you are doing. It also automatically replicates the data across other AWS Availability Zones, which provides built-in high availability and reliability. Before we start discussing the details of DynamoDB, let's try to understand what NoSQL databases are and when to choose DynamoDB over an RDBMS.

With the rise in data volume, variety, and velocity, RDBMSes were neither designed to cope with the scale and flexibility challenges developers face when building modern-day applications, nor were they able to take advantage of cheap commodity hardware. Also, we need to provide a schema before we start adding data, which restricts developers from making their applications flexible. On the other hand, NoSQL databases are fast, provide flexible schema operations, and make effective use of cheap storage. Considering all these things, NoSQL is becoming popular very quickly amongst the developer community. But one has to be very cautious about when to go for NoSQL and when to stick to an RDBMS. Sticking to relational databases makes sense when you know the schema is more or less static, strong consistency is a must, and the data is not going to be that big in volume. But when you want to build an application that is Internet scalable, the schema is likely to evolve over time, the storage is going to be really big, and the operations involved can be eventually consistent, then NoSQL is the way to go. There are various types of NoSQL databases. The following is a list of NoSQL database types and popular examples:

Document Store – MongoDB, CouchDB, MarkLogic etc.
Column Store – Hbase, Cassandra etc.
Key Value Store – DynamoDB, Azure, Redis etc.
Graph Databases – Neo4J, DEX etc.

Most of these NoSQL solutions are open source, except a few like DynamoDB and Azure, which are available as services over the Internet. DynamoDB, being a key-value store, indexes data only upon primary keys, and one needs to go through the primary key to access a certain attribute. Let's start learning more about DynamoDB by having a look at its history.

DynamoDB's History

Amazon's e-commerce platform had a huge set of decoupled services developed and managed individually, and each and every service had an API to be used and consumed by others. Earlier, each service had direct database access, which was a major bottleneck. In terms of scalability, Amazon's requirements were more than any third-party vendor could provide at that time. DynamoDB was built to address Amazon's high availability, extreme scalability, and durability needs.
Earlier, Amazon used to store its production data in relational databases, and services had been provided for all required operations. But later they realized that most of the services accessed data only through their primary keys and did not need complex queries to fetch the required data; plus, maintaining these RDBMS systems required high-end hardware and skilled personnel. So, to overcome all such issues, Amazon's engineering team built a NoSQL database that addressed all the above-mentioned issues. In 2007, Amazon released a research paper on Dynamo, which combined the best ideas from the database and key-value store worlds and was an inspiration for many open source projects at the time. Cassandra, Voldemort, and Riak were some of them. You can find the above-mentioned paper at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Even though Dynamo had great features that would take care of all engineering needs, it was not widely accepted at that time within Amazon itself, as it was not a fully managed service. When Amazon released S3 and SimpleDB, engineering teams were quite excited to adopt those compared to Dynamo, as Dynamo was a bit expensive at that time due to SSDs. So finally, after rounds of improvement, Amazon released DynamoDB as a cloud-based service, and since then it has been one of the most widely used NoSQL databases. Before being released to the public cloud in 2012, DynamoDB had been the core storage service for Amazon's e-commerce platform, starting with the shopping cart and session management services. Any downtime or degradation in performance had a major impact on Amazon's business, any financial impact was strictly not acceptable, and DynamoDB proved itself to be the best choice in the end. Now let's try to understand DynamoDB in more detail.

What is DynamoDB?

DynamoDB is a fully managed, Internet scalable, easily administered, and cost-effective NoSQL database. It is a part of the database-as-a-service offering of Amazon Web Services. The above diagram shows how Amazon offers its various cloud services and where DynamoDB is exactly placed. AWS RDS is a relational database as a service over the Internet from Amazon, while SimpleDB and DynamoDB are NoSQL databases as services. Both SimpleDB and DynamoDB are fully managed, non-relational services. DynamoDB is built for fast, seamless scalability and high performance. It runs on SSDs to provide faster responses and has no limits on request capacity and storage. It automatically partitions your data throughout the cluster to meet expectations, while in SimpleDB we have a storage limit of 10 GB and can only take a limited number of requests per second. Also, in SimpleDB we have to manage our own partitions. So, depending upon your need, you have to choose the correct solution. To use DynamoDB, the first and foremost requirement is having an AWS account. Through the easy-to-use AWS Management Console, you can directly create new tables, providing the necessary information, and can start loading data into the tables in a few minutes.

Modelling Relationships

Like in any other database, modeling relationships is quite interesting, even though DynamoDB is a NoSQL database. Most of the time, people get confused about how to model the relationships between various tables; in this section, we make an effort to simplify this problem. To understand the relationships better, let's try to understand them using our example of a book store, where we have entities like Book, Author, Publisher, and so on.
One to One

In this type of relationship, one entity record of a table is related to only one entity record of another table. In our book store application, we have the BookInfo and BookDetails tables. The BookInfo table can have short information about the book, which can be used to display book information on a web page, whereas the BookDetails table would be used when someone explicitly needs to see all the details of the book. This design helps us keep our system healthy, as even if there are high requests on one table, the other table would always be up and running to fulfil what it is supposed to do. The following diagram shows what the table structure would look like.

One to many

In this type of relationship, one record from an entity is related to more than one record in another entity. In the book store application, we can have a PublisherBook table, which would keep information about the book and publisher relationship. Here we can have Publisher Id as the hash key and Book Id as the range key. The following diagram shows what the table structure would look like.

Many to many

A many-to-many relationship means many records from an entity are related to many records from another entity. In the case of the book store application, we can say that a book can be authored by multiple authors and an author can write multiple books. In this case, we should use two tables, each with both hash and range keys, as in the sketch that follows.
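To make these patterns concrete in code, the following is only a hedged sketch using the AWS SDK for Python (boto3), which is not necessarily the SDK used elsewhere in this excerpt; the table name, key names, region, and throughput values are assumptions made for illustration. It defines the one-to-many PublisherBook table described above, with Publisher Id as the hash key and Book Id as the range key:

# Hypothetical sketch: create the PublisherBook table with boto3
# (table name, attribute names, region, and throughput are assumed values).
import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')

dynamodb.create_table(
    TableName='PublisherBook',
    KeySchema=[
        {'AttributeName': 'PublisherId', 'KeyType': 'HASH'},   # hash (partition) key
        {'AttributeName': 'BookId', 'KeyType': 'RANGE'},       # range (sort) key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'PublisherId', 'AttributeType': 'S'},
        {'AttributeName': 'BookId', 'AttributeType': 'S'},
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
)

# The one-to-many relationship is exercised by querying on the hash key alone:
# all books belonging to one publisher come back from a single Query call.
books = dynamodb.query(
    TableName='PublisherBook',
    KeyConditionExpression='PublisherId = :p',
    ExpressionAttributeValues={':p': {'S': 'PUB-001'}},
)

For a many-to-many design such as books and authors, the same shape is simply repeated as a pair of tables with the keys swapped (for example, a hypothetical BookAuthor table with Book Id as the hash key and Author Id as the range key, and an AuthorBook table with Author Id as the hash key and Book Id as the range key), so the relationship can be queried efficiently from either side.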
Roger McNamee on Silicon Valley’s obsession for building “data voodoo dolls”

Savia Lobo
05 Jun 2019
5 min read
The Canadian Parliament's Standing Committee on Access to Information, Privacy and Ethics hosted the hearing of the International Grand Committee on Big Data, Privacy and Democracy from Monday, May 27 to Wednesday, May 29. Witnesses from at least 11 countries appeared before representatives to testify on how governments can protect democracy and citizen rights in the age of big data. This section of the hearing, which took place on May 28, includes Roger McNamee's take on why Silicon Valley wants to build data voodoo dolls for users. Roger McNamee is the author of Zucked: Waking up to the Facebook Catastrophe. His remarks in this section of the hearing build on previous hearing presentations by Professor Zuboff, Professor Park Ben Scott, and the previous talk by Jim Balsillie. He started off by saying, "Beginning in 2004, I noticed a transformation in the culture of Silicon Valley and over the course of a decade customer focused models were replaced by the relentless pursuit of global scale, monopoly, and massive wealth." McNamee says that Google wants to make the world more efficient; they want to eliminate user stress that results from too many choices. Now, Google knew that society would not permit a business model based on denying consumer choice and free will, so they covered their tracks. Beginning around 2012, Facebook adopted a similar strategy, later followed by Amazon, Microsoft, and others. For Google and Facebook, the business is behavioral prediction, using which they build a high-resolution data avatar of every consumer: a voodoo doll, if you will. They gather a tiny amount of data from user posts and queries, but the vast majority of their data comes from surveillance, web tracking, scanning emails and documents, data from apps and third parties, and ambient surveillance from products like Alexa, Google Assistant, Sidewalk Labs, and Pokemon Go. Google and Facebook used data voodoo dolls to provide their customers, who are marketers, with perfect information about every consumer. They use the same data to manipulate consumer choices, just as in China, where behavioral manipulation is the goal. The algorithms of Google and Facebook are tuned to keep users on site and active, preferably by pressing emotional buttons that reveal each user's true self. For most users, this means content that provokes fear or outrage. Hate speech, disinformation, and conspiracy theories are catnip for these algorithms. The design of these platforms treats all content precisely the same, whether it be hard news from a reliable site, a warning about an emergency, or a conspiracy theory. The platforms make no judgments; users choose, aided by algorithms that reinforce past behavior. The result is 2.5 billion Truman Shows on Facebook, each a unique world with its own facts. In the U.S., nearly 40% of the population identifies with at least one thing that is demonstrably false; this undermines democracy. "The people at Google and Facebook are not evil; they are the products of an American business culture with few rules, where misbehavior seldom results in punishment," he says. Unlike industrial businesses, internet platforms are highly adaptable, and this is the challenge. If you take away one opportunity, they will move on to the next one, and they are moving upmarket, getting rid of the middlemen.
Today, they apply behavioral prediction to advertising, but they have already set their sights on transportation and financial services. This is not an argument against undermining their advertising business, but rather a warning that it may be a Pyrrhic victory. If the goal is to protect democracy and personal liberty, McNamee tells the lawmakers, they have to be bold. They have to force a radical transformation of the business model of internet platforms. That would mean, at a minimum, banning web tracking, scanning of email and documents, third-party commerce and data, and ambient surveillance. A second option would be to tax micro-targeted advertising to make it economically unattractive. But you also need to create space for alternative business models built on trust that can last. Startups can happen anywhere; they can come from each of your countries. At the end of the day, though, the most effective path to reform would be to shut down the platforms, at least temporarily, as Sri Lanka did. Any country can go first. The platforms have left you no choice; the time has come to call their bluff. Companies with responsible business models will emerge overnight to fill the void.

McNamee explains, "When they (organizations) gather all of this data, the purpose of it is to create a high resolution avatar of each and every human being. It doesn't matter whether they use their systems or not, they collect it on absolutely everybody. In the Caribbean, voodoo was essentially this notion that you create a doll, an avatar, such that you can poke it with a pin and the person would experience that pain, right, and so it becomes literally a representation of the human being."

To know more, you can listen to the full hearing video titled "Meeting No. 152 ETHI - Standing Committee on Access to Information, Privacy and Ethics" on ParlVU.

Experts present most pressing issues facing global lawmakers on citizens' privacy, democracy and rights to freedom of speech
Time for data privacy: DuckDuckGo CEO Gabe Weinberg in an interview with Kara Swisher
Over 19 years of ANU (Australian National University) students' and staff data breached
Administration rights for Power BI users

Pravin Dhandre
02 Jul 2018
8 min read
In this tutorial, you will understand and learn administration rights/rules for Power BI users. This includes setting and monitoring rules like; who in the organization can utilize which feature, how Power BI Premium capacity is allocated and by whom, and other settings such as embed codes and custom visuals. This article is an excerpt from a book written by Brett Powell titled Mastering Microsoft Power BI. The admin portal is accessible to Office 365 Global Administrators and users mapped to the Power BI service administrator role. To open the admin portal, log in to the Power BI service and select the Admin portal item from the Settings (Gear icon) menu in the top right, as shown in the following screenshot: All Power BI users, including Power BI free users, are able to access the Admin portal. However, users who are not admins can only view the Capacity settings page. The Power BI service administrators and Office 365 global administrators have view and edit access to the following seven pages: Administrators of Power BI most commonly utilize the Tenant settings and Capacity settings as described in the Tenant Settings and Power BI Premium Capacities sections later in this tutorial. However, the admin portal can also be used to manage any approved custom visuals for the organization. Usage metrics The Usage metrics page of the Admin portal provides admins with a Power BI dashboard of several top metrics, such as the most consumed dashboards and the most consumed dashboards by workspace. However, the dashboard cannot be modified and the tiles of the dashboard are not linked to any underlying reports or separate dashboards to support further analysis. Given these limitations, alternative monitoring solutions are recommended, such as the Office 365 audit logs and usage metric datasets specific to Power BI apps. Details of both monitoring options are included in the app usage metrics and Power BI audit log activities sections later in this chapter. Users and Audit logs The Users and Audit logs pages only provide links to the Office 365 admin center. In the admin center, Power BI users can be added, removed and managed. If audit logging is enabled for the organization via the Create audit logs for internal activity and auditing and compliance tenant setting, this audit log data can be retrieved from the Office 365 Security & Compliance Center or via PowerShell. This setting is noted in the following section regarding the Tenant settings tab of the Power BI admin portal. An Office 365 license is not required to utilize the Office 365 admin center for Power BI license assignments or to retrieve Power BI audit log activity. Tenant settings The Tenant settings page of the Admin portal allows administrators to enable or disable various features of the Power BI web service. Likewise, the administrator could allow only a certain security group to embed Power BI content in SaaS applications such as SharePoint Online. The following diagram identifies the 18 tenant settings currently available in the admin portal and the scope available to administrators for configuring each setting: From a data security perspective, the first seven settings within the Export and Sharing and Content packs and apps groups are most important. For example, many organizations choose to disable the Publish to web feature for the entire organization. Additionally, only certain security groups may be allowed to export data or to print hard copies of reports and dashboards. 
As shown in the Scope column of the previous table and the following example, granular security group configurations are available to minimize risk and manage the overall deployment. Currently, only one tenant setting is available for custom visuals and this setting (Custom visuals settings) can be enabled or disabled for the entire organization only. For organizations that wish to restrict or prohibit custom visuals for security reasons, this setting can be used to eliminate the ability to add, view, share, or interact with custom visuals. More granular controls to this setting are expected later in 2018, such as the ability to define users or security groups of users who are allowed to use custom visuals. In the following screenshot from the Tenant settings page of the Admin portal, only the users within the BI Admin security group who are not also members of the BI Team security group are allowed to publish apps to the entire organization: For example, a report author who also helps administer the On-premises data gateway via the BI Admin security group would be denied the ability to publish apps to the organization given membership in the BI Team security group. Many of the tenant setting configurations will be more simple than this example, particularly for smaller organizations or at the beginning of Power BI deployments. However, as adoption grows and the team responsible for Power BI changes, it's important that the security groups created to help administer these settings are kept up to date. Embed Codes Embed Codes are created and stored in the Power BI service when the Publish to web feature is utilized. As described in the Publish to web section of the previous chapter, this feature allows a Power BI report to be embedded in any website or shared via URL on the public internet. Users with edit rights to the workspace of the published to web content are able to manage the embed codes themselves from within the workspace. However, the admin portal provides visibility and access to embed codes across all workspaces, as shown in the following screenshot: Via the Actions commands on the far right of the Embed Codes page, a Power BI Admin can view the report in a browser (diagonal arrow) or remove the embed code. The Embed Codes page can be helpful to periodically monitor the usage of the Publish to web feature and for scenarios in which data was included in a publish to web report that shouldn't have been, and thus needs to be removed. As shown in the Power BI Tenant settings table referenced in the previous section, this feature can be enabled or disabled for the entire organization or for specific users within security groups. Organizational Custom visuals The Custom Visuals page allows admins to upload and manage custom visuals (.pbiviz files) that have been approved for use within the organization. For example, an organization may have proprietary custom visuals developed internally, which it wishes to expose to business users. Alternatively, the organization may wish to define a set of approved custom visuals, such as only the custom visuals that have been certified by Microsoft. In the following screenshot, the Chiclet Slicer custom visual is added as an organizational custom visual from the Organizational visuals page of the Power BI admin portal: The Organizational visuals page provides a link (Add a custom visual) to launch the form and identifies all uploaded visuals, as well as their last update. 
Once a visual has been uploaded, it can be deleted but not updated or modified. Therefore, when a new version of an organizational visual becomes available, this visual can be added to the list of organizational visuals with a descriptive title (Chiclet Slicer v2.0). Deleting an organizational custom visual will cause any reports that use this visual to stop rendering. The following screenshot reflects the uploaded Chiclet Slicer custom visual on the Organization visuals page: Once the custom visual has been uploaded as an organizational custom visual, it will be accessible to users in Power BI Desktop. In the following screenshot from Power BI Desktop, the user has opened the MARKETPLACE of custom visuals and selected MY ORGANIZATION: In this screenshot, rather than searching through the MARKETPLACE, the user can go directly to visuals defined by the organization. The marketplace of custom visuals can be launched via either the Visualizations pane or the From Marketplace icon on the Home tab of the ribbon. Organizational custom visuals are not supported for reports or dashboards shared with external users. Additionally, organizational custom visuals used in reports that utilize the publish to web feature will not render outside the Power BI tenant. Moreover, Organizational custom visuals are currently a preview feature. Therefore, users must enable the My organization custom visuals feature via the Preview features tab of the Options window in Power BI Desktop. With this, we got you acquainted with features and processes applicable in administering Power BI for an organization. This includes the configuration of tenant settings in the Power BI admin portal, analyzing the usage of Power BI assets, and monitoring overall user activity via the Office 365 audit logs. If you found this tutorial useful, do check out the book Mastering Microsoft Power BI to develop visually rich, immersive, and interactive Power BI reports and dashboards. Unlocking the secrets of Microsoft Power BI A tale of two tools: Tableau and Power BI Building a Microsoft Power BI Data Model
article-image-building-linear-regression-model-python-developers
Pravin Dhandre
07 Feb 2018
7 min read
Save for later

Building a Linear Regression Model in Python for developers

Pravin Dhandre
07 Feb 2018
7 min read
This article is an excerpt from a book by Rodolfo Bonnin titled Machine Learning for Developers. This book is a systematic guide for developers to implement various Machine Learning techniques and develop efficient and intelligent applications.

Let's start using one of the most well-known toy datasets, explore it, and select one of the dimensions to learn how to build a linear regression model for its values. Let's start by importing all the libraries (scikit-learn, seaborn, and matplotlib); one of the excellent features of Seaborn is its ability to define very professional-looking style settings. In this case, we will use the whitegrid style:

import numpy as np
from sklearn import datasets
import seaborn.apionly as sns
%matplotlib inline
import matplotlib.pyplot as plt
sns.set(style='whitegrid', context='notebook')

The Iris Dataset

It's time to load the Iris dataset. This is one of the most well-known historical datasets. You will find it in many books and publications. Given the good properties of the data, it is useful for classification and regression examples. The Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) contains 50 records for each of the three types of iris, 150 lines in total, over five fields. Each line is a measurement of the following:

Sepal length in cm
Sepal width in cm
Petal length in cm
Petal width in cm

The final field is the type of flower (setosa, versicolor, or virginica). Let's use the load_dataset method to create a matrix of values from the dataset:

iris2 = sns.load_dataset('iris')

In order to understand the dependencies between variables, we will implement the covariance operation. It will receive two arrays as parameters and will return the covariance(x,y) value:

def covariance (X, Y):
    xhat=np.mean(X)
    yhat=np.mean(Y)
    epsilon=0
    for x,y in zip (X,Y):
        epsilon=epsilon+(x-xhat)*(y-yhat)
    return epsilon/(len(X)-1)

Let's try the implemented function and compare it with the NumPy function. Note that we calculated cov(a,b), and NumPy generated a matrix of all the combinations cov(a,a), cov(a,b), so our result should be equal to the values (1,0) and (0,1) of that matrix:

print (covariance ([1,3,4], [1,0,2]))
print (np.cov([1,3,4], [1,0,2]))

0.5
[[ 2.33333333  0.5       ]
 [ 0.5         1.        ]]

Now let's define the correlation function, which, like covariance, receives two arrays and uses them to get the final value:

def correlation (X, Y):
    return (covariance(X,Y)/(np.std(X, ddof=1)*np.std(Y, ddof=1)))  # We have to indicate ddof=1 for the unbiased std

Let's test this function with two sample arrays, and compare this with the (0,1) and (1,0) values of the correlation matrix from NumPy:

print (correlation ([1,1,4,3], [1,0,2,2]))
print (np.corrcoef ([1,1,4,3], [1,0,2,2]))

0.870388279778
[[ 1.          0.87038828]
 [ 0.87038828  1.        ]]

Getting an intuitive idea with Seaborn pairplot

A very good idea when starting work on a problem is to get a graphical representation of all the possible variable combinations. Seaborn's pairplot function provides a complete graphical summary of all the variable pairs, represented as scatterplots, and a representation of the univariate distribution for the matrix diagonal.
Let’s look at how this plot type shows all the variables dependencies, and try to look for a linear relationship as a base to test our regression methods: sns.pairplot(iris2, size=3.0) <seaborn.axisgrid.PairGrid at 0x7f8a2a30e828> Pairplot of all the variables in the dataset. Lets' select two variables that, from our initial analysis, have the property of being linearly dependent. They are petal_width and petal_length: X=iris2['petal_width'] Y=iris2['petal_length'] Let’s now take a look at this variable combination, which shows a clear linear tendency: plt.scatter(X,Y) This is the representation of the chosen variables, in a scatter type graph: This is the current distribution of data that we will try to model with our linear prediction function. Creating the prediction function First, let's define the function that will abstractedly represent the modeled data, in the form of a linear function, with the form y=beta*x+alpha: def predict(alpha, beta, x_i): return beta * x_i + alpha Defining the error function It’s now time to define the function that will show us the difference between predictions and the expected output during training. We have two main alternatives: measuring the absolute difference between the values (or L1), or measuring a variant of the square of the difference (or L2). Let’s define both versions, including the first formulation inside the second: def error(alpha, beta, x_i, y_i): #L1 return y_i - predict(alpha, beta, x_i) def sum_sq_e(alpha, beta, x, y): #L2 return sum(error(alpha, beta, x_i, y_i) ** 2 for x_i, y_i in zip(x, y)) Correlation fit Now, we will define a function implementing the correlation method to find the parameters for our regression: def correlation_fit(x, y): beta = correlation(x, y) * np.std(y, ddof=1) / np.std(x,ddof=1) alpha = np.mean(y) - beta * np.mean(x) return alpha, beta Let’s then run the fitting function and print the guessed parameters: alpha, beta = correlation_fit(X, Y) print(alpha) print(beta) 1.08355803285 2.22994049512 Let’s now graph the regressed line with the data in order to intuitively show the appropriateness of the solution: plt.scatter(X,Y) xr=np.arange(0,3.5) plt.plot(xr,(xr*beta)+alpha) This is the final plot we will get with our recently calculated slope and intercept: Final regressed line Polynomial regression and an introduction to underfitting and overfitting When looking for a model, one of the main characteristics we look for is the power of generalizing with a simple functional expression. When we increase the complexity of the model, it's possible that we are building a model that is good for the training data, but will be too optimized for that particular subset of data. Underfitting, on the other hand, applies to situations where the model is too simple, such as this case, which can be represented fairly well with a simple linear model. In the following example, we will work on the same problem as before, using the scikit- learn library to search higher-order polynomials to fit the incoming data with increasingly complex degrees. 
Going beyond the normal threshold of a quadratic function, we will see how the function looks to fit every wrinkle in the data, but when we extrapolate, the values outside the normal range are clearly out of range: from sklearn.linear_model import Ridge from sklearn.preprocessing import PolynomialFeatures from sklearn.pipeline import make_pipeline ix=iris2['petal_width'] iy=iris2['petal_length'] # generate points used to represent the fitted function x_plot = np.linspace(0, 2.6, 100) # create matrix versions of these arrays X = ix[:, np.newaxis] X_plot = x_plot[:, np.newaxis] plt.scatter(ix, iy, s=30, marker='o', label="training points") for count, degree in enumerate([3, 6, 20]): model = make_pipeline(PolynomialFeatures(degree), Ridge()) model.fit(X, iy) y_plot = model.predict(X_plot) plt.plot(x_plot, y_plot, label="degree %d" % degree) plt.legend(loc='upper left') plt.show() The combined graph shows how the different polynomials' coefficients describe the data population in different ways. The 20 degree polynomial shows clearly how it adjusts perfectly for the trained dataset, and after the known values, it diverges almost spectacularly, going against the goal of generalizing for future data. Curve fitting of the initial dataset, with polynomials of increasing values With this, we successfully explored how to develop an efficient linear regression model in Python and how you can make predictions using the designed model. We've reviewed ways to identify and optimize the correlation between the prediction and the expected output using simple and definite functions. If you enjoyed our post, you must check out Machine Learning for Developers to uncover advanced tools for building machine learning applications on your fingertips.  

Hadoop and MapReduce

Packt
05 Mar 2015
43 min read
In this article by the author, Thilina Gunarathne, of the book, Hadoop MapReduce v2 Cookbook - Second Edition, we will learn about Hadoop and MapReduce.

We are living in the era of big data, where the exponential growth of phenomena such as the web, social networking, smartphones, and so on is producing petabytes of data on a daily basis. Gaining insights from analyzing these very large amounts of data has become a must-have competitive advantage for many industries. However, the size and the possibly unstructured nature of these data sources make it impossible to use traditional solutions such as relational databases to store and analyze these datasets.

(For more resources related to this topic, see here.)

Storing, processing, and analyzing petabytes of data in a meaningful and timely manner require many compute nodes with thousands of disks and thousands of processors, together with the ability to efficiently communicate massive amounts of data among them. Such a scale makes failures such as disk failures, compute node failures, network failures, and so on a common occurrence, making fault tolerance a very important aspect of such systems. Other common challenges that arise include the significant cost of resources, handling communication latencies, handling heterogeneous compute resources, synchronization across nodes, and load balancing. As you can infer, developing and maintaining distributed parallel applications to process massive amounts of data while handling all these issues is not an easy task. This is where Apache Hadoop comes to our rescue.

Google is one of the first organizations to face the problem of processing massive amounts of data. Google built a framework for large-scale data processing borrowing the map and reduce paradigms from the functional programming world and named it MapReduce. At the foundation of Google MapReduce was the Google File System, which is a high-throughput parallel filesystem that enables the reliable storage of massive amounts of data using commodity computers. Seminal research publications that introduced the Google MapReduce and Google File System concepts can be found at http://research.google.com/archive/mapreduce.html and http://research.google.com/archive/gfs.html. Apache Hadoop MapReduce is the most widely known and widely used open source implementation of the Google MapReduce paradigm. Apache Hadoop Distributed File System (HDFS) provides an open source implementation of the Google File System concept.

Apache Hadoop MapReduce, HDFS, and YARN provide a scalable, fault-tolerant, distributed platform for storage and processing of very large datasets across clusters of commodity computers. Unlike traditional High Performance Computing (HPC) clusters, Hadoop uses the same set of compute nodes for data storage as well as to perform the computations, allowing Hadoop to improve the performance of large-scale computations by collocating computations with the storage. Also, the hardware cost of a Hadoop cluster is orders of magnitude cheaper than HPC clusters and database appliances due to the usage of commodity hardware and commodity interconnects. Together, Hadoop-based frameworks have become the de-facto standard for storing and processing big data.

Hadoop Distributed File System – HDFS

HDFS is a block-structured distributed filesystem that is designed to store petabytes of data reliably on compute clusters made out of commodity hardware.
HDFS overlays on top of the existing filesystem of the compute nodes and stores files by breaking them into coarser grained blocks (for example, 128 MB). HDFS performs better with large files. HDFS distributes the data blocks of large files across to all the nodes of the cluster to facilitate the very high parallel aggregate read bandwidth when processing the data. HDFS also stores redundant copies of these data blocks in multiple nodes to ensure reliability and fault tolerance. Data processing frameworks such as MapReduce exploit these distributed sets of data blocks and the redundancy to maximize the data local processing of large datasets, where most of the data blocks would get processed locally in the same physical node as they are stored. HDFS consists of NameNode and DataNode services providing the basis for the distributed filesystem. NameNode stores, manages, and serves the metadata of the filesystem. NameNode does not store any real data blocks. DataNode is a per node service that manages the actual data block storage in the DataNodes. When retrieving data, client applications first contact the NameNode to get the list of locations the requested data resides in and then contact the DataNodes directly to retrieve the actual data. The following diagram depicts a high-level overview of the structure of HDFS: Hadoop v2 brings in several performance, scalability, and reliability improvements to HDFS. One of the most important among those is the High Availability (HA) support for the HDFS NameNode, which provides manual and automatic failover capabilities for the HDFS NameNode service. This solves the widely known NameNode single point of failure weakness of HDFS. Automatic NameNode high availability of Hadoop v2 uses Apache ZooKeeper for failure detection and for active NameNode election. Another important new feature is the support for HDFS federation. HDFS federation enables the usage of multiple independent HDFS namespaces in a single HDFS cluster. These namespaces would be managed by independent NameNodes, but share the DataNodes of the cluster to store the data. The HDFS federation feature improves the horizontal scalability of HDFS by allowing us to distribute the workload of NameNodes. Other important improvements of HDFS in Hadoop v2 include the support for HDFS snapshots, heterogeneous storage hierarchy support (Hadoop 2.3 or higher), in-memory data caching support (Hadoop 2.3 or higher), and many performance improvements. Almost all the Hadoop ecosystem data processing technologies utilize HDFS as the primary data storage. HDFS can be considered as the most important component of the Hadoop ecosystem due to its central nature in the Hadoop architecture. Hadoop YARN YARN (Yet Another Resource Negotiator) is the major new improvement introduced in Hadoop v2. YARN is a resource management system that allows multiple distributed processing frameworks to effectively share the compute resources of a Hadoop cluster and to utilize the data stored in HDFS. YARN is a central component in the Hadoop v2 ecosystem and provides a common platform for many different types of distributed applications. The batch processing based MapReduce framework was the only natively supported data processing framework in Hadoop v1. 
While MapReduce works well for analyzing large amounts of data, MapReduce by itself is not sufficient enough to support the growing number of other distributed processing use cases such as real-time data computations, graph computations, iterative computations, and real-time data queries. The goal of YARN is to allow users to utilize multiple distributed application frameworks that provide such capabilities side by side sharing a single cluster and the HDFS filesystem. Some examples of the current YARN applications include the MapReduce framework, Tez high performance processing framework, Spark processing engine, and the Storm real-time stream processing framework. The following diagram depicts the high-level architecture of the YARN ecosystem: The YARN ResourceManager process is the central resource scheduler that manages and allocates resources to the different applications (also known as jobs) submitted to the cluster. YARN NodeManager is a per node process that manages the resources of a single compute node. Scheduler component of the ResourceManager allocates resources in response to the resource requests made by the applications, taking into consideration the cluster capacity and the other scheduling policies that can be specified through the YARN policy plugin framework. YARN has a concept called containers, which is the unit of resource allocation. Each allocated container has the rights to a certain amount of CPU and memory in a particular compute node. Applications can request resources from YARN by specifying the required number of containers and the CPU and memory required by each container. ApplicationMaster is a per-application process that coordinates the computations for a single application. The first step of executing a YARN application is to deploy the ApplicationMaster. After an application is submitted by a YARN client, the ResourceManager allocates a container and deploys the ApplicationMaster for that application. Once deployed, the ApplicationMaster is responsible for requesting and negotiating the necessary resource containers from the ResourceManager. Once the resources are allocated by the ResourceManager, ApplicationMaster coordinates with the NodeManagers to launch and monitor the application containers in the allocated resources. The shifting of application coordination responsibilities to the ApplicationMaster reduces the burden on the ResourceManager and allows it to focus solely on managing the cluster resources. Also having separate ApplicationMasters for each submitted application improves the scalability of the cluster as opposed to having a single process bottleneck to coordinate all the application instances. The following diagram depicts the interactions between various YARN components, when a MapReduce application is submitted to the cluster: While YARN supports many different distributed application execution frameworks, our focus in this article is mostly on traditional MapReduce and related technologies. Hadoop MapReduce Hadoop MapReduce is a data processing framework that can be utilized to process massive amounts of data stored in HDFS. As we mentioned earlier, distributed processing of a massive amount of data in a reliable and efficient manner is not an easy task. Hadoop MapReduce aims to make it easy for users by providing a clean abstraction for programmers by providing automatic parallelization of the programs and by providing framework managed fault tolerance support. MapReduce programming model consists of Map and Reduce functions. 
The Map function receives each record of the input data (lines of a file, rows of a database, and so on) as key-value pairs and outputs key-value pairs as the result. By design, each Map function invocation is independent of each other allowing the framework to use divide and conquer to execute the computation in parallel. This also allows duplicate executions or re-executions of the Map tasks in case of failures or load imbalances without affecting the results of the computation. Typically, Hadoop creates a single Map task instance for each HDFS data block of the input data. The number of Map function invocations inside a Map task instance is equal to the number of data records in the input data block of the particular Map task instance. Hadoop MapReduce groups the output key-value records of all the Map tasks of a computation by the key and distributes them to the Reduce tasks. This distribution and transmission of data to the Reduce tasks is called the Shuffle phase of the MapReduce computation. Input data to each Reduce task would also be sorted and grouped by the key. The Reduce function gets invoked for each key and the group of values of that key (reduce <key, list_of_values>) in the sorted order of the keys. In a typical MapReduce program, users only have to implement the Map and Reduce functions and Hadoop takes care of scheduling and executing them in parallel. Hadoop will rerun any failed tasks and also provide measures to mitigate any unbalanced computations. Have a look at the following diagram for a better understanding of the MapReduce data and computational flows: In Hadoop 1.x, the MapReduce (MR1) components consisted of the JobTracker process, which ran on a master node managing the cluster and coordinating the jobs, and TaskTrackers, which ran on each compute node launching and coordinating the tasks executing in that node. Neither of these processes exist in Hadoop 2.x MapReduce (MR2). In MR2, the job coordinating responsibility of JobTracker is handled by an ApplicationMaster that will get deployed on-demand through YARN. The cluster management and job scheduling responsibilities of JobTracker are handled in MR2 by the YARN ResourceManager. JobHistoryServer has taken over the responsibility of providing information about the completed MR2 jobs. YARN NodeManagers provide the functionality that is somewhat similar to MR1 TaskTrackers by managing resources and launching containers (which in the case of MapReduce 2 houses Map or Reduce tasks) in the compute nodes. Hadoop installation modes Hadoop v2 provides three installation choices: Local mode: The local mode allows us to run MapReduce computation using just the unzipped Hadoop distribution. This nondistributed mode executes all parts of Hadoop MapReduce within a single Java process and uses the local filesystem as the storage. The local mode is very useful for testing/debugging the MapReduce applications locally. Pseudo distributed mode: Using this mode, we can run Hadoop on a single machine emulating a distributed cluster. This mode runs the different services of Hadoop as different Java processes, but within a single machine. This mode is good to let you play and experiment with Hadoop. Distributed mode: This is the real distributed mode that supports clusters that span from a few nodes to thousands of nodes. 
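Before moving on to cluster deployment options, the following is a hedged sketch of what the Map and Reduce functions described earlier look like in code: the classic WordCount program written against the Hadoop v2 org.apache.hadoop.mapreduce API. The class names are illustrative (they are not taken from the book's sample code), but the program is self-contained and can be compiled and run in the local mode just described:

// WordCountSketch.java -- illustrative example, not the book's sample code
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // One map() invocation per input record (a line of text);
      // emit a <word, 1> pair for every token in the line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // One reduce() invocation per key, receiving all values emitted
      // for that key during the Shuffle phase.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation of map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

This is essentially the same structure as the WordCountWithTools program that the MRUnit recipe later in this article tests.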
For production clusters, we recommend using one of the many packaged Hadoop distributions as opposed to installing Hadoop from scratch using the Hadoop release binaries, unless you have a specific use case that requires a vanilla Hadoop installation. Refer to the Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe for more information on Hadoop distributions. Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution The Hadoop YARN ecosystem now contains many useful components providing a wide range of data processing, storing, and querying functionalities for the data stored in HDFS. However, manually installing and configuring all of these components to work together correctly using individual release artifacts is quite a challenging task. Other challenges of such an approach include the monitoring and maintenance of the cluster and the multiple Hadoop components. Luckily, there exist several commercial software vendors that provide well integrated packaged Hadoop distributions to make it much easier to provision and maintain a Hadoop YARN ecosystem in our clusters. These distributions often come with easy GUI-based installers that guide you through the whole installation process and allow you to select and install the components that you require in your Hadoop cluster. They also provide tools to easily monitor the cluster and to perform maintenance operations. For regular production clusters, we recommend using a packaged Hadoop distribution from one of the well-known vendors to make your Hadoop journey much easier. Some of these commercial Hadoop distributions (or editions of the distribution) have licenses that allow us to use them free of charge with optional paid support agreements. Hortonworks Data Platform (HDP) is one such well-known Hadoop YARN distribution that is available free of charge. All the components of HDP are available as free and open source software. You can download HDP from http://hortonworks.com/hdp/downloads/. Refer to the installation guides available in the download page for instructions on the installation. Cloudera CDH is another well-known Hadoop YARN distribution. The Express edition of CDH is available free of charge. Some components of the Cloudera distribution are proprietary and available only for paying clients. You can download Cloudera Express from http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-express.html. Refer to the installation guides available on the download page for instructions on the installation. Hortonworks HDP, Cloudera CDH, and some of the other vendors provide fully configured quick start virtual machine images that you can download and run on your local machine using a virtualization software product. These virtual machines are an excellent resource to learn and try the different Hadoop components as well as for evaluation purposes before deciding on a Hadoop distribution for your cluster. Apache Bigtop is an open source project that aims to provide packaging and integration/interoperability testing for the various Hadoop ecosystem components. Bigtop also provides a vendor neutral packaged Hadoop distribution. While it is not as sophisticated as the commercial distributions, Bigtop is easier to install and maintain than using release binaries of each of the Hadoop components. In this recipe, we provide steps to use Apache Bigtop to install Hadoop ecosystem in your local machine. 
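The detailed Bigtop installation steps are not included in this excerpt. Whichever installation route you choose (Bigtop, a vendor distribution, or the release binaries), a quick smoke test of the resulting cluster can save a lot of debugging later. The commands below are standard Hadoop CLI commands; only the example JAR path is an assumption and may differ between distributions:

$ hadoop version                      # confirm the client tools are on the PATH
$ hdfs dfsadmin -report               # list live DataNodes and HDFS capacity
$ yarn node -list                     # list NodeManagers registered with YARN
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 4 1000

If the pi estimation job completes, HDFS, YARN, and MapReduce are all wired together correctly.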
Benchmarking Hadoop MapReduce using TeraSort

Hadoop TeraSort is a well-known benchmark that aims to sort 1 TB of data as fast as possible using Hadoop MapReduce. The TeraSort benchmark stresses almost every part of the Hadoop MapReduce framework as well as the HDFS filesystem, making it an ideal choice to fine-tune the configuration of a Hadoop cluster. The original TeraSort benchmark sorts 10 billion 100-byte records, making the total data size 1 TB. However, we can specify the number of records, making it possible to configure the total size of data.

Getting ready

You must set up and deploy HDFS and Hadoop v2 YARN MapReduce prior to running these benchmarks, and locate the hadoop-mapreduce-examples-*.jar file in your Hadoop installation.

How to do it...

The following steps will show you how to run the TeraSort benchmark on the Hadoop cluster:

The first step of the TeraSort benchmark is the data generation. You can use the teragen command to generate the input data for the TeraSort benchmark. The first parameter of teragen is the number of records and the second parameter is the HDFS directory to generate the data. The following command generates 1 GB of data consisting of 10 million records to the tera-in directory in HDFS. Change the location of the hadoop-mapreduce-examples-*.jar file in the following commands according to your Hadoop installation:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000000 tera-in

It's a good idea to specify the number of Map tasks to the teragen computation to speed up the data generation. This can be done by specifying the -Dmapred.map.tasks parameter. Also, you can increase the HDFS block size for the generated data so that the Map tasks of the TeraSort computation would be coarser grained (the number of Map tasks for a Hadoop computation typically equals the number of input data blocks). This can be done by specifying the -Ddfs.block.size parameter.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Ddfs.block.size=536870912 -Dmapred.map.tasks=256 10000000 tera-in

The second step of the TeraSort benchmark is the execution of the TeraSort MapReduce computation on the data generated in step 1 using the following command. The first parameter of the terasort command is the input HDFS data directory, and the second parameter is the output HDFS data directory.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort tera-in tera-out

It's a good idea to specify the number of Reduce tasks to the TeraSort computation to speed up the Reducer part of the computation. This can be done by specifying the -Dmapred.reduce.tasks parameter as follows:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort -Dmapred.reduce.tasks=32 tera-in tera-out

The last step of the TeraSort benchmark is the validation of the results. This can be done using the teravalidate application as follows. The first parameter is the directory with the sorted data and the second parameter is the directory to store the report containing the results.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teravalidate tera-out tera-validate

How it works...

TeraSort uses the sorting capability of the MapReduce framework together with a custom range Partitioner to divide the Map output among the Reduce tasks, ensuring the global sorted order.
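When using TeraSort to tune a cluster, you will typically run the three steps many times with different configurations. The following wrapper script is our own convenience sketch (not part of the original recipe); the examples JAR path and the record count are assumptions you should adjust for your cluster:

#!/bin/bash
# Hedged convenience wrapper around the teragen/terasort/teravalidate steps.
EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
RECORDS=10000000        # ~1 GB of input; use 10000000000 for the full 1 TB run

hdfs dfs -rm -r -f tera-in tera-out tera-validate    # clean up previous runs

time hadoop jar $EXAMPLES_JAR teragen $RECORDS tera-in
time hadoop jar $EXAMPLES_JAR terasort tera-in tera-out
time hadoop jar $EXAMPLES_JAR teravalidate tera-out tera-validate

hdfs dfs -cat tera-validate/*                        # print the validation report

Recording the elapsed time of the terasort step for each configuration change gives you a simple, repeatable way to compare the effect of the YARN and MapReduce settings discussed in the next recipe.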
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments In this recipe, we explore some of the important configuration options of Hadoop YARN and Hadoop MapReduce. Commercial Hadoop distributions typically provide a GUI-based approach to specify Hadoop configurations. YARN allocates resource containers to the applications based on the resource requests made by the applications and the available resource capacity of the cluster. A resource request by an application would consist of the number of containers required and the resource requirement of each container. Currently, most container resource requirements are specified using the amount of memory. Hence, our focus in this recipe will be mainly on configuring the memory allocation of a YARN cluster. Getting ready Set up a Hadoop cluster by following the earlier recipes. How to do it... The following instructions will show you how to configure the memory allocation in a YARN cluster. The number of tasks per node is derived using this configuration: The following property specifies the amount of memory (RAM) that can be used by YARN containers in a worker node. It's advisable to set this slightly less than the amount of physical RAM present in the node, leaving some memory for the OS and other non-Hadoop processes. Add or modify the following lines in the yarn-site.xml file: <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>100240</value> </property> The following property specifies the minimum amount of memory (RAM) that can be allocated to a YARN container in a worker node. Add or modify the following lines in the yarn-site.xml file to configure this property. If we assume that all the YARN resource-requests request containers with only the minimum amount of memory, the maximum number of concurrent resource containers that can be executed in a node equals (YARN memory per node specified in step 1)/(YARN minimum allocation configured below). Based on this relationship, we can use the value of the following property to achieve the desired number of resource containers per node.The number of resource containers per node is recommended to be less than or equal to the minimum of (2*number CPU cores) or (2* number of disks). <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>3072</value> </property> Restart the YARN ResourceManager and NodeManager services by running sbin/stop-yarn.sh and sbin/start-yarn.sh from the HADOOP_HOME directory. The following instructions will show you how to configure the memory requirements of the MapReduce applications. The following properties define the maximum amount of memory (RAM) that will be available to each Map and Reduce task. These memory values will be used when MapReduce applications request resources from YARN for Map and Reduce task containers. Add the following lines to the mapred-site.xml file: <property> <name>mapreduce.map.memory.mb</name> <value>3072</value> </property> <property> <name>mapreduce.reduce.memory.mb</name> <value>6144</value> </property> The following properties define the JVM heap size of the Map and Reduce tasks respectively. Set these values to be slightly less than the corresponding values in step 4, so that they won't exceed the resource limits of the YARN containers. Add the following lines to the mapred-site.xml file: <property> <name>mapreduce.map.java.opts</name> <value>-Xmx2560m</value> </property> <property> <name>mapreduce.reduce.java.opts</name> <value>-Xmx5120m</value> </property> How it works... 
We can control Hadoop configurations through the following four configuration files. Hadoop reloads the configurations from these configuration files after a cluster restart:

core-site.xml: Contains the configurations common to the whole Hadoop distribution
hdfs-site.xml: Contains configurations for HDFS
mapred-site.xml: Contains configurations for MapReduce
yarn-site.xml: Contains configurations for the YARN ResourceManager and NodeManager processes

Each configuration file has name-value pairs expressed in XML format, defining the configurations of different aspects of Hadoop. The following is an example of a property in a configuration file. The <configuration> tag is the top-level parent XML container and <property> tags, which define individual properties, are specified as child tags inside the <configuration> tag:

<configuration>
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>20</value>
  </property>
  ...
</configuration>

Some configurations can be configured on a per-job basis using the job.getConfiguration().set(name, value) method from the Hadoop MapReduce job driver code.

There's more...

There are many similar important configuration properties defined in Hadoop. The following are some of them:

conf/core-site.xml
- fs.inmemory.size.mb (default value: 200): Amount of memory allocated to the in-memory filesystem that is used to merge map outputs at reducers, in MBs
- io.file.buffer.size (default value: 131072): Size of the read/write buffer used by sequence files

conf/mapred-site.xml
- mapreduce.reduce.shuffle.parallelcopies (default value: 20): Maximum number of parallel copies the reduce step will execute to fetch output from many parallel jobs
- mapreduce.task.io.sort.factor (default value: 50): Maximum number of streams merged while sorting files
- mapreduce.task.io.sort.mb (default value: 200): Memory limit while sorting data in MBs

conf/hdfs-site.xml
- dfs.blocksize (default value: 134217728): HDFS block size
- dfs.namenode.handler.count (default value: 200): Number of server threads to handle RPC calls in NameNodes

You can find a list of deprecated properties in the latest version of Hadoop and the new replacement properties for them at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html. The following documents provide the list of properties, their default values, and the descriptions of each of the configuration files mentioned earlier:

Common configuration: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
HDFS configuration: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
YARN configuration: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
MapReduce configuration: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

Unit testing Hadoop MapReduce applications using MRUnit

MRUnit is a JUnit-based Java library that allows us to unit test Hadoop MapReduce programs. This makes it easy to develop as well as to maintain Hadoop MapReduce code bases. MRUnit supports testing Mappers and Reducers separately as well as testing MapReduce computations as a whole. In this recipe, we'll be exploring all three testing scenarios.

Getting ready

We use Gradle as the build tool for our sample code base.

How to do it...
The following steps show you how to perform unit testing of a Mapper using MRUnit: In the setUp method of the test class, initialize an MRUnit MapDriver instance with the Mapper class you want to test. In this example, we are going to test the Mapper of the WordCount MapReduce application we discussed in earlier recipes: public class WordCountWithToolsTest {   MapDriver<Object, Text, Text, IntWritable> mapDriver;   @Before public void setUp() {    WordCountWithTools.TokenizerMapper mapper =       new WordCountWithTools.TokenizerMapper();    mapDriver = MapDriver.newMapDriver(mapper); } …… } Write a test function to test the Mapper logic. Provide the test input to the Mapper using the MapDriver.withInput method. Then, provide the expected result of the Mapper execution using the MapDriver.withOutput method. Now, invoke the test using the MapDriver.runTest method. The MapDriver.withAll and MapDriver.withAllOutput methods allow us to provide a list of test inputs and a list of expected outputs, rather than adding them individually. @Test public void testWordCountMapper() throws IOException {    IntWritable inKey = new IntWritable(0);    mapDriver.withInput(inKey, new Text("Test Quick"));    ….    mapDriver.withOutput(new Text("Test"),new     IntWritable(1));    mapDriver.withOutput(new Text("Quick"),new     IntWritable(1));    …    mapDriver.runTest(); } The following step shows you how to perform unit testing of a Reducer using MRUnit. Similar to step 1 and 2, initialize a ReduceDriver by providing the Reducer class under test and then configure the ReduceDriver with the test input and the expected output. The input to the reduce function should conform to a key with a list of values. Also, in this test, we use the ReduceDriver.withAllOutput method to provide a list of expected outputs. public class WordCountWithToolsTest { ReduceDriver<Text,IntWritable,Text,IntWritable>   reduceDriver;   @Before public void setUp() {    WordCountWithTools.IntSumReducer reducer =       new WordCountWithTools.IntSumReducer();    reduceDriver = ReduceDriver.newReduceDriver(reducer); }   @Test public void testWordCountReduce() throws IOException {    ArrayList<IntWritable> reduceInList =       new ArrayList<IntWritable>();    reduceInList.add(new IntWritable(1));    reduceInList.add(new IntWritable(2));      reduceDriver.withInput(new Text("Quick"),     reduceInList);    ...    ArrayList<Pair<Text, IntWritable>> reduceOutList =       new ArrayList<Pair<Text,IntWritable>>();    reduceOutList.add(new Pair<Text, IntWritable>     (new Text("Quick"),new IntWritable(3)));    ...    reduceDriver.withAllOutput(reduceOutList);    reduceDriver.runTest(); } } The following steps show you how to perform unit testing on a whole MapReduce computation using MRUnit. In this step, initialize a MapReduceDriver by providing the Mapper class and Reducer class of the MapReduce program that you want to test. Then, configure the MapReduceDriver with the test input data and the expected output data. When executed, this test will execute the MapReduce execution flow starting from the Map input stage to the Reduce output stage. It's possible to provide a combiner implementation to this test as well. public class WordCountWithToolsTest { …… MapReduceDriver<Object, Text, Text, IntWritable, Text,IntWritable> mapReduceDriver; @Before public void setUp() { .... mapReduceDriver = MapReduceDriver. 
newMapReduceDriver(mapper, reducer); } @Test public void testWordCountMapReduce() throws IOException { IntWritable inKey = new IntWritable(0); mapReduceDriver.withInput(inKey, new Text ("Test Quick")); …… ArrayList<Pair<Text, IntWritable>> reduceOutList = new ArrayList<Pair<Text,IntWritable>>(); reduceOutList.add(new Pair<Text, IntWritable> (new Text("Quick"),new IntWritable(2))); …… mapReduceDriver.withAllOutput(reduceOutList); mapReduceDriver.runTest(); } } The Gradle build script (or any other Java build mechanism) can be configured to execute these unit tests with every build. We can add the MRUnit dependency to the Gradle build file as follows: dependencies { testCompile group: 'org.apache.mrunit', name: 'mrunit',   version: '1.1.+',classifier: 'hadoop2' …… } Use the following Gradle command to execute only the WordCountWithToolsTest unit test. This command executes any test class that matches the pattern **/WordCountWith*.class: $ gradle –Dtest.single=WordCountWith test :chapter3:compileJava UP-TO-DATE :chapter3:processResources UP-TO-DATE :chapter3:classes UP-TO-DATE :chapter3:compileTestJava UP-TO-DATE :chapter3:processTestResources UP-TO-DATE :chapter3:testClasses UP-TO-DATE :chapter3:test BUILD SUCCESSFUL Total time: 27.193 secs You can also execute MRUnit-based unit tests in your IDE. You can use the gradle eclipse or gradle idea commands to generate the project files for the Eclipse and IDEA IDE respectively. Generating an inverted index using Hadoop MapReduce Simple text searching systems rely on inverted index to look up the set of documents that contain a given word or a term. In this recipe, we implement a simple inverted index building application that computes a list of terms in the documents, the set of documents that contains each term, and the term frequency in each of the documents. Retrieval of results from an inverted index can be as simple as returning the set of documents that contains the given terms or can involve much more complex operations such as returning the set of documents ordered based on a particular ranking. Getting ready You must have Apache Hadoop v2 configured and installed to follow this recipe. Gradle is needed for the compiling and building of the source code. How to do it... In the following steps, we use a MapReduce program to build an inverted index for a text dataset: Create a directory in HDFS and upload a text dataset. This dataset should consist of one or more text files. $ hdfs dfs -mkdir input_dir $ hdfs dfs -put *.txt input_dir You can download the text versions of the Project Gutenberg books by following the instructions given at http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages. Make sure to provide the filetypes query parameter of the download request as txt. Unzip the downloaded files. You can use the unzipped text files as the text dataset for this recipe. Compile the source by running the gradle build command from the chapter 8 folder of the source repository. Run the inverted indexing MapReduce job using the following command.Provide the HDFS directory where you uploaded the input data in step 2 as the first argument and provide an HDFS path to store the output as the second argument: $ hadoop jar hcb-c8-samples.jar chapter8.invertindex.TextOutInvertedIndexMapReduce input_dir output_dir Check the output directory for the results by running the following command. 
The output of this program will consist of the term followed by a comma-separated list of filename and frequency:

$ hdfs dfs -cat output_dir/*
ARE three.txt:1,one.txt:1,four.txt:1,two.txt:1,
AS three.txt:2,one.txt:2,four.txt:2,two.txt:2,
AUGUSTA three.txt:1,
About three.txt:1,two.txt:1,
Abroad three.txt:2,

We used the text-outputting inverted indexing MapReduce program in step 3 for clarity in understanding the algorithm. Run the SequenceFile-outputting version of the program by substituting the command in step 3 with the following command:

$ hadoop jar hcb-c8-samples.jar chapter8.invertindex.InvertedIndexMapReduce input_dir seq_output_dir

How it works...

The Map function receives a chunk of an input document as the input and outputs the term and <docid, 1> pair for each word. In the Map function, we first replace all the non-alphanumeric characters from the input text value before tokenizing it as follows:

public void map(Object key, Text value, ……… {
  String valString = value.toString().replaceAll("[^a-zA-Z0-9]+", " ");
  StringTokenizer itr = new StringTokenizer(valString);

  FileSplit fileSplit = (FileSplit) context.getInputSplit();
  String fileName = fileSplit.getPath().getName();

  while (itr.hasMoreTokens()) {
    term.set(itr.nextToken());
    docFrequency.set(fileName, 1);
    context.write(term, docFrequency);
  }
}

We use the getInputSplit() method of MapContext to obtain a reference to the InputSplit assigned to the current Map task. The InputSplit instances for this computation are of the FileSplit type due to the usage of a FileInputFormat based InputFormat. Then we use the getPath() method of FileSplit to obtain the path of the file containing the current split and extract the filename from it. We use this extracted filename as the document ID when constructing the inverted index.

The Reduce function receives the IDs and frequencies of all the documents that contain the term (Key) as the input. The Reduce function then outputs the term and a list of document IDs and the number of occurrences of the term in each document as the output:

public void reduce(Text key, Iterable<TermFrequencyWritable> values, Context context) ………… {
  HashMap<Text, IntWritable> map = new HashMap<Text, IntWritable>();

  for (TermFrequencyWritable val : values) {
    Text docID = new Text(val.getDocumentID());
    int freq = val.getFreq().get();
    if (map.get(docID) != null) {
      map.put(docID, new IntWritable(map.get(docID).get() + freq));
    } else {
      map.put(docID, new IntWritable(freq));
    }
  }

  MapWritable outputMap = new MapWritable();
  outputMap.putAll(map);
  context.write(key, outputMap);
}

In the preceding model, we output a record for each word, generating a large amount of intermediate data between Map tasks and Reduce tasks. We use the following combiner to aggregate the terms emitted by the Map tasks, reducing the amount of intermediate data that needs to be transferred between Map and Reduce tasks:

public void reduce(Text key, Iterable<TermFrequencyWritable> values ……… {
  int count = 0;
  String id = "";

  for (TermFrequencyWritable val : values) {
    count++;
    if (count == 1) {
      id = val.getDocumentID().toString();
    }
  }

  TermFrequencyWritable writable = new TermFrequencyWritable();
  writable.set(id, count);
  context.write(key, writable);
}

In the driver program, we set the Mapper, Reducer, and Combiner classes. Also, we specify both the Output Value and the MapOutput Value properties as we use different value types for the Map tasks and the Reduce tasks.
… job.setMapperClass(IndexingMapper.class); job.setReducerClass(IndexingReducer.class); job.setCombinerClass(IndexingCombiner.class); … job.setMapOutputValueClass(TermFrequencyWritable.class); job.setOutputValueClass(MapWritable.class); job.setOutputFormatClass(SequenceFileOutputFormat.class); There's more... We can improve this indexing program by performing optimizations such as filtering stop words, substituting words with word stems, storing more information about the context of the word, and so on, making indexing a much more complex problem. Luckily, there exist several open source indexing frameworks that we can use for indexing purposes. The later recipes of this article will explore indexing using Apache Solr and Elasticsearch, which are based on the Apache Lucene indexing engine. The upcoming section introduces the usage of MapFileOutputFormat to store InvertedIndex in an indexed random accessible manner. Outputting a random accessible indexed InvertedIndex Apache Hadoop supports a file format called MapFile that can be used to store an index into the data stored in SequenceFiles. MapFile is very useful when we need to random access records stored in a large SequenceFile. You can use the MapFileOutputFormat format to output MapFiles, which would consist of a SequenceFile containing the actual data and another file containing the index into the SequenceFile. The chapter8/invertindex/MapFileOutInvertedIndexMR.java MapReduce program in the source folder of chapter8 utilizes MapFiles to store a secondary index into our inverted index. You can execute that program by using the following command. The third parameter (sample_lookup_term) should be a word that is present in your input dataset: $ hadoop jar hcb-c8-samples.jar      chapter8.invertindex.MapFileOutInvertedIndexMR      input_dir indexed_output_dir sample_lookup_term If you check indexed_output_dir, you will be able to see folders named as part-r-xxxxx with each containing a data and an index file. We can load these indexes to MapFileOutputFormat and perform random lookups for the data. An example of a simple lookup using this method is given in the MapFileOutInvertedIndexMR.java program as follows: MapFile.Reader[] indexReaders = MapFileOutputFormat.getReaders(new Path(args[1]), getConf());MapWritable value = new MapWritable();Text lookupKey = new Text(args[2]);// Performing the lookup for the values if the lookupKeyWritable map = MapFileOutputFormat.getEntry(indexReaders,new HashPartitioner<Text, MapWritable>(), lookupKey, value); In order to use this feature, you need to make sure to disable Hadoop from writing a _SUCCESS file in the output folder by setting the following property. The presence of the _SUCCESS file might cause an error when using MapFileOutputFormat to lookup the values in the index: job.getConfiguration().setBoolean     ("mapreduce.fileoutputcommitter.marksuccessfuljobs", false); Data preprocessing using Hadoop streaming and Python Data preprocessing is an important and often required component in data analytics. Data preprocessing becomes even more important when consuming unstructured text data generated from multiple different sources. Data preprocessing steps include operations such as cleaning the data, extracting important features from data, removing duplicate items from the datasets, converting data formats, and many more. Hadoop MapReduce provides an ideal environment to perform these tasks in parallel when processing massive datasets. 
Apart from using Java MapReduce programs or Pig scripts or Hive scripts to preprocess the data, Hadoop also contains several other tools and features that are useful in performing these data preprocessing operations. One such feature is the InputFormats, which provides us with the ability to support custom data formats by implementing custom InputFormats. Another feature is the Hadoop Streaming support, which allows us to use our favorite scripting languages to perform the actual data cleansing and extraction, while Hadoop will parallelize the computation to hundreds of compute and storage resources. In this recipe, we are going to use Hadoop Streaming with a Python script-based Mapper to perform data extraction and format conversion. Getting ready Check whether Python is already installed on the Hadoop worker nodes. If not, install Python on all the Hadoop worker nodes. How to do it... The following steps show how to clean and extract data from the 20news dataset and store the data as a tab-separated file: Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz: $ wget http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz$ tar –xzf 20news-19997.tar.gz Upload the extracted data to the HDFS. In order to save the compute time and resources, you can use only a subset of the dataset: $ hdfs dfs -mkdir 20news-all$ hdfs dfs –put <extracted_folder> 20news-all Extract the resource package and locate the MailPreProcessor.py Python script. Locate the hadoop-streaming.jar JAR file of the Hadoop installation in your machine. Run the following Hadoop Streaming command using that JAR. /usr/lib/hadoop-mapreduce/ is the hadoop-streaming JAR file's location for the BigTop-based Hadoop installations: $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input 20news-all/*/* -output 20news-cleaned -mapper MailPreProcessor.py -file MailPreProcessor.py Inspect the results using the following command: > hdfs dfs –cat 20news-cleaned/part-* | more How it works... Hadoop uses the default TextInputFormat as the input specification for the previous computation. Usage of the TextInputFormat generates a Map task for each file in the input dataset and generates a Map input record for each line. Hadoop streaming provides the input to the Map application through the standard input: line = sys.stdin.readline(); while line: …. if (doneHeaders):    list.append( line ) elif line.find( "Message-ID:" ) != -1:    messageID = line[ len("Message-ID:"):] …. elif line == "":    doneHeaders = True      line = sys.stdin.readline(); The preceding Python code reads the input lines from the standard input until it reaches the end of the file. We parse the headers of the newsgroup file till we encounter the empty line that demarcates the headers from the message contents. The message content will be read in to a list line by line: value = ' '.join( list ) value = fromAddress + "t" ……"t" + value print '%st%s' % (messageID, value) The preceding code segment merges the message content to a single string and constructs the output value of the streaming application as a tab-delimited set of selected headers, followed by the message content. The output key value is the Message-ID header extracted from the input file. The output is written to the standard output by using a tab to delimit the key and the value. There's more... 
We can generate the output of the preceding computation in the Hadoop SequenceFile format by specifying SequenceFileOutputFormat as the OutputFormat of the streaming computations: $ hadoop jar /usr/lib/Hadoop-mapreduce/hadoop-streaming.jar -input 20news-all/*/* -output 20news-cleaned -mapper MailPreProcessor.py -file MailPreProcessor.py -outputformat          org.apache.hadoop.mapred.SequenceFileOutputFormat -file MailPreProcessor.py It is a good practice to store the data as SequenceFiles (or other Hadoop binary file formats such as Avro) after the first pass of the input data because SequenceFiles takes up less space and supports compression. You can use hdfs dfs -text <path_to_sequencefile> to output the contents of a SequenceFile to text: $ hdfs dfs –text 20news-seq/part-* | more However, for the preceding command to work, any Writable classes that are used in the SequenceFile should be available in the Hadoop classpath. Loading large datasets to an Apache HBase data store – importtsv and bulkload The Apache HBase data store is very useful when storing large-scale data in a semi-structured manner, so that it can be used for further processing using Hadoop MapReduce programs or to provide a random access data storage for client applications. In this recipe, we are going to import a large text dataset to HBase using the importtsv and bulkload tools. Getting ready Install and deploy Apache HBase in your Hadoop cluster. Make sure Python is installed in your Hadoop compute nodes. How to do it… The following steps show you how to load the TSV (tab-separated value) converted 20news dataset in to an HBase table: Follow the Data preprocessing using Hadoop streaming and Python recipe to perform the preprocessing of data for this recipe. We assume that the output of the following step 4 of that recipe is stored in an HDFS folder named "20news-cleaned": $ hadoop jar    /usr/lib/hadoop-mapreduce/hadoop-streaming.jar    -input 20news-all/*/*    -output 20news-cleaned    -mapper MailPreProcessor.py -file MailPreProcessor.py Start the HBase shell: $ hbase shell Create a table named 20news-data by executing the following command in the HBase shell. Older versions of the importtsv (used in the next step) command can handle only a single column family. Hence, we are using only a single column family when creating the HBase table: hbase(main):001:0> create '20news-data','h' Execute the following command to import the preprocessed data to the HBase table created earlier: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg 20news-data 20news-cleaned Start the HBase Shell and use the count and scan commands of the HBase shell to verify the contents of the table: hbase(main):010:0> count '20news-data'           12xxx row(s) in 0.0250 seconds hbase(main):010:0> scan '20news-data', {LIMIT => 10} ROW                                       COLUMN+CELL                                                                           <1993Apr29.103624.1383@cronkite.ocis.te column=h:c1,       timestamp=1354028803355, value= katop@astro.ocis.temple.edu   (Chris Katopis)> <1993Apr29.103624.1383@cronkite.ocis.te column=h:c2,     timestamp=1354028803355, value= sci.electronics   ...... 
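Before moving on to the bulkload variant below, you can also sanity-check a single row directly (this step is our own addition, not part of the original recipe). The row key used here is the Message-ID shown in the scan output above; substitute any Message-ID present in your own dataset:

hbase(main):011:0> get '20news-data', '<1993Apr29.103624.1383@cronkite.ocis.temple.edu>'

The get command returns all the stored column values for that single row, which is a quick way to confirm that the importtsv column mapping behaved as expected.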
The following are the steps to load the 20news dataset to an HBase table using the bulkload feature: Follow steps 1 to 3, but create the table with a different name: hbase(main):001:0> create '20news-bulk','h' Use the following command to generate an HBase bulkload datafile: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg -Dimporttsv.bulk.output=hbaseloaddir 20news-bulk–source 20news-cleaned List the files to verify that the bulkload datafiles are generated: $ hadoop fs -ls 20news-bulk-source ...... drwxr-xr-x   - thilina supergroup         0 2014-04-27 10:06 /user/thilina/20news-bulk-source/h   $ hadoop fs -ls 20news-bulk-source/h -rw-r--r--   1 thilina supergroup     19110 2014-04-27 10:06 /user/thilina/20news-bulk-source/h/4796511868534757870 The following command loads the data to the HBase table by moving the output files to the correct location: $ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles 20news-bulk-source 20news-bulk......14/04/27 10:10:00 INFO mapreduce.LoadIncrementalHFiles: Tryingto load hfile=hdfs://127.0.0.1:9000/user/thilina/20news-bulksource/h/4796511868534757870 first= <1993Apr29.103624.1383@cronkite.ocis.temple.edu>last= <stephens.736002130@ngis>...... Start the HBase Shell and use the count and scan commands of the HBase shell to verify the contents of the table: hbase(main):010:0> count '20news-bulk'             hbase(main):010:0> scan '20news-bulk', {LIMIT => 10} How it works... The MailPreProcessor.py Python script extracts a selected set of data fields from the newsboard message and outputs them as a tab-separated dataset: value = fromAddress + "t" + newsgroup +"t" + subject +"t" + value print '%st%s' % (messageID, value) We import the tab-separated dataset generated by the Streaming MapReduce computations to HBase using the importtsv tool. The importtsv tool requires the data to have no other tab characters except for the tab characters that separate the data fields. Hence, we remove any tab characters that may be present in the input data by using the following snippet of the Python script: line = line.strip() line = re.sub('t',' ',line) The importtsv tool supports the loading of data into HBase directly using the Put operations as well as by generating the HBase internal HFiles as well. The following command loads the data to HBase directly using the Put operations. Our generated dataset contains a Key and four fields in the values. We specify the data fields to the table column name mapping for the dataset using the -Dimporttsv.columns parameter. This mapping consists of listing the respective table column names in the order of the tab-separated data fields in the input dataset: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<data field to table column mappings>    <HBase tablename> <HDFS input directory> We can use the following command to generate HBase HFiles for the dataset. These HFiles can be directly loaded to HBase without going through the HBase APIs, thereby reducing the amount of CPU and network resources needed: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=<filed to column mappings> -Dimporttsv.bulk.output=<path for hfile output> <HBase tablename> <HDFS input directory> These generated HFiles can be loaded into HBase tables by simply moving the files to the right location. 
This moving can be performed by using the completebulkload command: $ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <HDFS path for hfiles> <table name> There's more... You can use the importtsv tool that has datasets with other data-filed separator characters as well by specifying the '-Dimporttsv.separator' parameter. The following is an example of using a comma as the separator character to import a comma-separated dataset in to an HBase table: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.columns=<data field to table column mappings>    <HBase tablename> <HDFS input directory> Look out for Bad Lines in the MapReduce job console output or in the Hadoop monitoring console. One reason for Bad Lines is to have unwanted delimiter characters. The Python script we used in the data-cleaning step removes any extra tabs in the message: 14/03/27 00:38:10 INFO mapred.JobClient:   ImportTsv 14/03/27 00:38:10 INFO mapred.JobClient:     Bad Lines=2 Data de-duplication using HBase HBase supports the storing of multiple versions of column values for each record. When querying, HBase returns the latest version of values, unless we specifically mention a time period. This feature of HBase can be used to perform automatic de-duplication by making sure we use the same RowKey for duplicate values. In our 20news example, we use MessageID as the RowKey for the records, ensuring duplicate messages will appear as different versions of the same data record. HBase allows us to configure the maximum or minimum number of versions per column family. Setting the maximum number of versions to a low value will reduce the data usage by discarding the old versions. Refer to http://hbase.apache.org/book/schema.versions.html for more information on setting the maximum or minimum number of versions. Summary In this article, we have learned about getting started with Hadoop, Benchmarking Hadoop MapReduce, optimizing Hadoop YARN, unit testing, generating an inverted index, data processing, and loading large datasets to an Apache HBase data store. Resources for Article: Further resources on this subject: Hive in Hadoop [article] Evolution of Hadoop [article] Learning Data Analytics with R and Hadoop [article]


Adjusting Key Parameters in R

Packt
24 Jan 2011
6 min read
R Graph Cookbook

Detailed hands-on recipes for creating the most useful types of graphs in R – starting from the simplest versions to more advanced applications
Learn to draw any type of graph or visual data representation in R
Filled with practical tips and techniques for creating any type of graph you need; not just theoretical explanations
All examples are accompanied with the corresponding graph images, so you know what the results look like
Each recipe is independent and contains the complete explanation and code to perform the task as efficiently as possible

Setting colors of points, lines, and bars

In this recipe we will learn the simplest way to change the colors of points, lines, and bars in scatter plots, line plots, histograms, and bar plots.

Getting ready

All you need to try out this recipe is to run R and type the recipe at the command prompt. You can also choose to save the recipe as a script so that you can use it again later on.

How to do it...

The simplest way to change the color of any graph element is by using the col argument. For example, the plot() function takes the col argument:

plot(rnorm(1000), col="red")

The code file can be downloaded from here.

If we choose the plot type as line, then the color is applied to the plotted line. Let's use the dailysales.csv example dataset. First, we need to load it:

sales <- read.csv("dailysales.csv", header=TRUE)
plot(sales$units~as.Date(sales$date,"%d/%m/%y"),
     type="l", #Specify type of plot as l for line
     col="blue")

Similarly, the points() and lines() functions apply the col argument's value to the plotted points and lines respectively. barplot() and hist() also take the col argument and apply it to the bars. So the following code would produce a bar plot with blue bars:

barplot(sales$ProductA~sales$City, col="blue")

The col argument for boxplot() is applied to the color of the boxes plotted.

How it works...

The col argument automatically applies the specified color to the elements being plotted, based on the plot type. So, if we do not specify a plot type or choose points, then the color is applied to points. Similarly, if we choose the plot type as line then the color is applied to the plotted line, and if we use the col argument in the barplot() or hist() commands, then the color is applied to the bars. col accepts names of colors such as red, blue, and black. The colors() (or colours()) function lists all the built-in colors (more than 650) available in R. We can also specify colors as hexadecimal codes such as #FF0000 (for red), #0000FF (for blue), and #000000 (for black). If you have ever made any web pages, you would know that these hex codes are used in HTML to represent colors. col can also take numeric values. When it is set to a numeric value, the color corresponding to that index in the current color palette is used. For example, in the default color palette the first color is black and the second color is red, so col=1 and col=2 refer to black and red respectively. Index 0 corresponds to the background color.

There's more...

In many settings, col can also take a vector of multiple colors, instead of a single color. This is useful if you wish to use more than one color in a graph. The heat.colors() function takes a number as an argument and returns a vector of that many colors. So heat.colors(5) produces a vector of five colors.
There's more...
In many settings, col can also take a vector of multiple colors, instead of a single color. This is useful if you wish to use more than one color in a graph. The heat.colors() function takes a number as an argument and returns a vector of that many colors. So heat.colors(5) produces a vector of five colors. Type the following at the R prompt:

heat.colors(5)

You should get the following output:

[1] "#FF0000FF" "#FF5500FF" "#FFAA00FF" "#FFFF00FF" "#FFFF80FF"

Those are five colors in hexadecimal format. Another way of specifying a vector of colors is to construct one:

barplot(as.matrix(sales[,2:4]),
        beside=T, legend=sales$City,
        col=c("red","blue","green","orange","pink"),
        border="white")

In the example, we set the value of col to c("red","blue","green","orange","pink"), which is a vector of five colors. We have to take care to make a vector whose length matches the number of elements we are plotting, in this case the bars. If the two numbers don't match, R will "recycle" values by repeating colors from the beginning of the vector. For example, if we had fewer colors in the vector than the number of elements, say four colors in the previous plot, then R would apply the four colors to the first four bars and then apply the first color to the fifth bar. This is called recycling in R (a self-contained example appears at the end of this recipe):

barplot(as.matrix(sales[,2:4]),
        beside=T, legend=sales$City,
        col=c("red","blue","green","orange"),
        border="white")

In this example, the bars for both the first and last data rows (Seattle and Mumbai) would be of the same color (red), making it difficult to distinguish one from the other.
One good way to ensure that you always have the correct number of colors is to find out the number of elements first and pass that as an argument to one of the color palette functions. For example, if we did not know the number of cities in the example we have just seen, we could do the following to make sure the number of colors matches the number of bars plotted:

barplot(as.matrix(sales[,2:4]),
        beside=T, legend=sales$City,
        col=heat.colors(length(sales$City)),
        border="white")

We used the length() function to find out the number of elements in the vector sales$City and passed that as the argument to heat.colors(). So, regardless of the number of cities, we will always have the right number of colors.

See also
In the next four recipes, we will see how to change the colors of other elements. The fourth recipe is especially useful, where we look at color combinations and palettes.
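To make the recycling behaviour from this recipe easy to reproduce without the dailysales.csv file, here is a small self-contained sketch; the data frame below is invented purely for illustration:

# Toy data standing in for dailysales.csv (made-up numbers)
sales <- data.frame(
  City = c("Seattle", "London", "Tokyo", "Berlin", "Mumbai"),
  ProductA = c(120, 90, 150, 80, 110),
  ProductB = c(100, 95, 130, 70, 105),
  ProductC = c(80, 60, 120, 90, 100)
)

# Five rows (bars per group) but only four colours: "red" is recycled for Mumbai
barplot(as.matrix(sales[,2:4]), beside=TRUE, legend=sales$City,
        col=c("red","blue","green","orange"), border="white")

# Matching the palette length to the number of rows avoids the clash
barplot(as.matrix(sales[,2:4]), beside=TRUE, legend=sales$City,
        col=heat.colors(length(sales$City)), border="white")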

Exporting SAP BusinessObjects Dashboards into Different Environments

Packt
02 Jun 2011
7 min read
SAP BusinessObjects Dashboards 4.0 Cookbook

Introduction
First, the visual model is compiled to a SWF file format. Compiling to a SWF file format ensures that the dashboard plays smoothly on different screen sizes and across different platforms. It also ensures that the users aren't given huge 10+ megabyte files. After compilation of the visual model to a SWF file, developers can then publish it to a format of their choice. The available choices are Flash (SWF), AIR, SAP BusinessObjects Platform, HTML, PDF, PPT, Outlook, and Word. Once publishing is complete, the dashboard is ready to share!

Exporting to a standard SWF, PPT, PDF, and so on
After developing a visual model in Dashboard Design, we will need to somehow share it with users. We want to put it into a format that everyone can see on their machines. The simplest way is to export to a standard SWF file.
One of the great features Dashboard Design has is the ability to embed dashboards into different office file formats. For example, a presenter could have a PowerPoint deck and, in the middle of the presentation, have a working dashboard that presents an important set of data values to the audience. Another example could be an executive-level user who is viewing a Word document created by an analyst. The analyst could create a written document in Word and then embed a working dashboard with the most up-to-date data to present important data values to the executive-level user.
You can choose to embed a dashboard in the following file types:
PowerPoint
Word
PDF
Outlook
HTML

Getting ready
Make sure your visual model is complete and ready for sharing.

How to do it...
In the menu toolbar, go to File | Export | Flash (SWF).
Select the directory in which you want the SWF to go and the name of your SWF file.

How it works...
Xcelsius compiles the visual model into an SWF file that everyone is able to see. Once the SWF file has been compiled, the dashboard is ready for sharing. It is mandatory that anyone viewing the dashboard has Adobe Flash installed. If not, they can download and install it from http://www.adobe.com/products/flashplayer/.
If we export to PPT, we can then edit the PowerPoint file however we desire. If you have an existing PowerPoint presentation deck and want to append the dashboard to it, the easiest way is to first embed the dashboard SWF in a temporary PowerPoint file and then copy that slide to your existing PowerPoint file.

There's more...
Exporting to an SWF file makes distribution very easy, which makes the presentation of mockups great at a business level. Developers are able to work very closely with the business and iteratively come up with a visual model closest to the business goals. It is important, though, when distributing SWF files, that everyone viewing the dashboards has the same version, otherwise confusion may occur. Thus, as a best practice, versioning every SWF that is distributed is very important.
It is important to note that when the much anticipated Adobe Flash 10.1 was released, there were problems with embedding Dashboard Design dashboards in DOC, PPT, PDF, and so on. However, with the 10.1.82.76 Adobe Flash Player update, this has been fixed. Thus, it is important that if users have Adobe Flash Player 10.1+ installed, the version is higher than or equal to 10.1.82.76.
When exporting to PDF, please take the following into account: in Dashboard Design 2008, the default format for exporting to PDF is Acrobat 9.0 (PDF 1.8).
If Acrobat Reader 8.0 is installed, the default exported PDF cannot be opened. If you are using Acrobat Reader 8.0 or older, change the format to "Acrobat 6.0 (PDF 1.5)" before exporting to PDF.

Exporting to SAP BusinessObjects Enterprise
After Dashboard Design became a part of BusinessObjects, it was important to be able to export dashboards into the BusinessObjects Enterprise system. Once a dashboard is exported to BusinessObjects Enterprise, users can easily access their dashboards through InfoView (now BI launch pad). On top of that, administrators are able to control dashboard security.

Getting ready
Make sure your visual model is complete and ready for sharing.

How to do it...
From the menu toolbar, go to File | Export | Export to SAP BusinessObjects Platform.
Enter your BusinessObjects login credentials and then select the location in the SAP BusinessObjects Enterprise system where you want to store the SWF file.
Log into BI launch pad (formerly known as InfoView) and verify that you can access the dashboard.

How it works...
When we export a dashboard to SAP BusinessObjects Enterprise, we basically place it in the SAP BusinessObjects Enterprise content management system. From there, we can control accessibility to the dashboard and make sure that we have one source of truth, instead of sending out multiple dashboards through e-mail and possibly getting mixed up about which is the latest version. When we log into BI launch pad (formerly known as InfoView), it also passes the login token to the dashboard, so we don't have to enter our credentials again when connecting to SAP BusinessObjects Enterprise data. This is important because we don't have to manually create and pass any additional tokens once we have logged in.

There's more...
To give a true website-type feel, developers can house their dashboards in a website-type format using Dashboard Builder. This in turn provides a better experience for users, as they don't have to navigate through many folders in order to access the dashboard they are looking for.

Publishing to SAP BW
This recipe shows you how to publish Dashboard Design dashboards to a SAP BW system. Once a dashboard is saved to the SAP BW system, it can be published within a SAP Enterprise Portal iView and made available to users.

Getting ready
For this recipe, you will need a Dashboard Design dashboard model. This dashboard does not necessarily have to include a data connection to SAP BW.

How to do it...
Select Publish in the SAP menu. If you want to save the Xcelsius model with a different name, select the Publish As... option.
If you are not yet connected to the SAP BW system, a pop-up will appear. Select the appropriate system and fill in your username and password in the dialog box. If you want to disconnect from the SAP BW system and connect to a different system, select the Disconnect option from the SAP menu.
Enter the Description and Technical Name of the dashboard.
Select the location you want to save the dashboard to and click on Save. The dashboard is now published to the SAP BW system.
To launch the dashboard and view it from the SAP BW environment, select the Launch option from the SAP menu. You will be asked to log in to the SAP BW system before you are able to view the dashboard.

How it works...
As we have seen in this recipe, publishing a Dashboard Design dashboard to SAP BW is quite straightforward.
As the dashboard is part of the SAP BW environment after publishing, the model can be transported between SAP BW systems like all other SAP BW objects.

There's more...
After launching the dashboard in step 5, the Dashboard Design dashboard will load in your browser from the SAP BW server. You can add the displayed URL to an SAP Enterprise Portal iView to make the dashboard accessible to portal users.


How to use LabVIEW for data acquisition

Fatema Patrawala
17 Feb 2018
14 min read
This article is an excerpt taken from the book Data Acquisition Using LabVIEW, written by Behzad Ehsani. In this book you will learn to transform physical phenomena into computer-acceptable data using an object-oriented language.

Today we will discuss the basics of LabVIEW, focus on its installation, and walk through an example of a LabVIEW program, generally known as a Virtual Instrument (VI).

Introduction to LabVIEW
LabVIEW is a graphical development and testing environment unlike any other test and development tool available in the industry. LabVIEW sets itself apart from traditional programming environments by its completely graphical approach to programming. As an example, while the representation of a while loop in a text-based language such as C consists of several predefined, extremely compact, and sometimes extremely cryptic lines of text, a while loop in LabVIEW is actually a graphical loop. The environment is extremely intuitive and powerful, which makes for a short learning curve for the beginner. LabVIEW is based on what is called the G language, but there are still other languages, especially C, under the hood.
However, the ease of use and power of LabVIEW is somewhat deceiving to a novice user. Many people have attempted to start projects in LabVIEW only because, at first glance, the graphical nature of the interface and the concept of drag and drop used in LabVIEW appear to do away with the required basics of programming concepts and a classical education in programming science and engineering. This is far from the reality of using LabVIEW as the predominant development environment. While it is true that, in many higher-level development and testing environments (especially when using complicated test equipment, performing complex mathematical calculations, or even creating embedded software), LabVIEW's approach is much more time-efficient and less bug-prone than a traditional text-based programming environment that would otherwise require many lines of code, one must be aware of LabVIEW's strengths and possible weaknesses. LabVIEW does not completely replace the need for traditional text-based languages and, depending on the entire nature of a project, LabVIEW or a traditional text-based language such as C may be the most suitable programming or test environment.

Installing LabVIEW
Installation of LabVIEW is very simple; it is just as routine as any modern-day program installation: insert DVD 1 and follow the onscreen guided installation steps. LabVIEW comes on one DVD for the Mac and Linux versions but on four or more DVDs for the Windows edition (depending on additional software, different licensing, and additional libraries and packages purchased). In this book, we will use the LabVIEW 2013 Professional Development version for Windows. Given the target audience of this book, we assume the user is fully capable of installing the program. Installation is also well documented by National Instruments (NI), and the mandatory 1-year support purchase with each copy of LabVIEW is a valuable source of live and e-mail help. Also, the NI website (www.ni.com) has many user support groups that are a great source of support, example code, discussion groups, local group events and meetings of fellow LabVIEW developers, and so on.
It's worth noting for those who are new to the installation of LabVIEW that the installation DVDs include much more than what an average user would need and pay for.
We do strongly suggest that you install additional software beyond what has been purchased and licensed or is immediately needed. This additional software is fully functional in demo mode for 7 days, which may be extended to about a month with online registration. This is a very good opportunity to get hands-on experience with even more of the power and functionality that LabVIEW is capable of offering. The additional information gained by installing the other software available on the DVDs may help in the further development of a given project. Just imagine: if the current development of a robot only encompasses mechanical movements and sensors today, optical recognition is probably going to follow sooner than one may think. If data acquisition using expensive hardware and software is possible in one location, the need for web sharing and remote control of the setup is just around the corner. It is very helpful to at least be aware of what packages are currently available and to be able to install and test them prior to a full purchase and implementation. The following screenshot shows what may be installed if almost all the software on all the DVDs is selected.
When installing a fresh version of LabVIEW, if you do decide to follow this advice, make sure to click on the + sign next to each package you decide to install and prevent any installation of LabWindows/CVI... and Measurement Studio... for Visual Studio. LabWindows, according to NI, is an ANSI C integrated development environment. Also note that, by default, NI device drivers are not selected to be installed. Device drivers are an essential part of any data acquisition, and the appropriate drivers for communications and instrument control must be installed before LabVIEW can interact with external equipment. Also note that device drivers (on Windows installations) come on a separate DVD, which means that one does not have to install device drivers at the same time that the main application and other modules are installed; they can be installed at any time later on. Almost all well-established vendors package their products with LabVIEW drivers and example code. If a driver is not readily available, NI has programmers who will write one, but this comes at a cost to the user.
VI Package Manager, now installed as part of the standard installation, is also a must these days. NI distributes third-party software, drivers, and public domain packages via VI Package Manager. We are going to use examples involving Arduino (http://www.arduino.cc) microcontrollers in later chapters of this book, and the appropriate software and drivers for these microcontrollers are installed via VI Package Manager. You can install many public domain packages that add further useful LabVIEW toolkits to a LabVIEW installation, and these can be used just like those that are delivered professionally by NI.
Finally, note that the more modules, packages, and software are selected to be installed, the longer it will take to complete the installation. This may sound like an obvious point but, surprisingly enough, installation of all the software on the three DVDs (for Windows) took over 5 hours on the standard laptop we used! Obviously, a more powerful PC (such as one with a solid state hard drive) may not take such a long time.

Basic LabVIEW VI
Once the LabVIEW application is launched, by default two blank windows open simultaneously, a Front Panel and a Block Diagram window, and a VI is created. VIs are the heart and soul of LabVIEW.
They are what separate LabVIEW from all other text-based development environments. In LabVIEW, everything is an object which is represented graphically. A VI may consist of only a few objects or of hundreds of objects embedded in many subVIs. These graphical representations of a thing, be it a simple while loop, a complex mathematical concept such as polynomial interpolation, or simply a Boolean constant, are all represented graphically.
To use an object, right-click inside the Block Diagram or Front Panel window and a palette list appears. Follow the arrow, pick an object from the subsequent palette, and place it on the appropriate window. The selected object can now be dragged to different locations on the appropriate window and is ready to be wired. Depending on what kind of object is selected, a graphical representation of the object appears on both windows. Of course, there are many exceptions to this rule. For example, a while loop can only be selected in the Block Diagram and, by itself, a while loop does not have a graphical representation on the Front Panel window. Needless to say, LabVIEW also has keyboard combinations that expedite selecting and placing any given toolkit object onto the appropriate window.
Each object has one (or several) wire connections going into it as input(s) and coming out of it as output(s). A VI becomes functional when a minimum number of wires are appropriately connected to the inputs and outputs of one or more objects.

Example 1 – counter with a gauge
This is a fairly simple program with simple user interaction. Once the program has been launched, it uses a while loop to wait for user input. This is the typical behavior of almost any user-friendly program. For example, if the user launches Microsoft Office, the program launches and waits for the user to pick a menu item, click on a button, or perform any other action that the program may provide. Similarly, this program starts execution but waits in a loop for the user to choose a command. In this case, only a simple Start or Stop is available. If the Start button is clicked, the program uses a for loop to simply count from 0 to 10 in intervals of 200 milliseconds. After each count is completed, the gauge on the Front Panel, the GUI part of the program, is updated to show the current count. The counter is then set to the zero position of the gauge and the program awaits subsequent user input. If the Start button is clicked again, this action is repeated, and, obviously, if the Stop button is clicked, the program exits.
Although very simple, this example contains many of the concepts that are often used in much more elaborate programs. Let's walk through the code and point out some of these concepts. The following steps not only walk the reader through the example code but also serve as a brief tutorial on how to use LabVIEW, how to utilize each working window, and how to wire objects.
Launch LabVIEW and, from the File menu, choose New VI and follow these steps:
1. Right-click on the Block Diagram window. From Programming Functions, choose Structures and select While Loop. Click (and hold) and drag the cursor to create a (resizable) rectangle.
2. On the bottom-left corner, right-click on the wire to the loop's stop terminal and choose Create a control. Note that a Stop button appears on both the Block Diagram and Front Panel windows.
3. Inside the while loop box, right-click on the Block Diagram window and from Programming Functions, choose Structures and select Case Structure. Click (and hold) and drag the cursor to create a (resizable) rectangle.
4. On the Front Panel window, next to the Stop button you just created, right-click and from Modern Controls, choose Boolean and select an OK button.
5. Double-click on the text label of the OK button and replace the OK button text with Start. Note that an OK button is also created on the Block Diagram window, and the text label on that button also changes when you change the text label on the Front Panel window.
6. On the Front Panel window, drag and drop the newly created Start button next to the tiny green question mark on the left-hand side of the Case Structure box, outside of the case structure but inside the while loop.
7. Wire the Start button to the Case Structure.
8. Inside the Case Structure box, right-click on the Block Diagram window and from Programming Functions, choose Structures and select For Loop. Click (and hold) and drag the cursor to create a (resizable) rectangle.
9. Right-click on the N terminal on the top-left side of the For Loop and choose Create Constant. An integer blue box with a value of 0 will be connected to the For Loop; this is the number of iterations the for loop is going to have. Change 0 to 11.
10. Inside the For Loop box, right-click on the Block Diagram window and from Programming Functions, choose Timing and select Wait (ms).
11. Right-click on the Wait function created in step 10 and connect an integer value of 200, similar to step 9.
12. On the Front Panel window, right-click and from Modern controls, choose Gauge. Note that a Gauge function will appear on the Block Diagram window too.
13. If the Gauge is not inside the For Loop, drag and drop it inside the For Loop.
14. Inside the For Loop, on the Block Diagram window, connect the iteration count i to the Gauge.
15. On the Block Diagram, right-click on the Gauge and, under the Create submenu, choose Local variable.
16. If the local variable is not already inside the while loop, drag and drop it inside the while loop but outside of the case structure.
17. Right-click on the local variable created in step 15 and connect a zero to its input.
18. Click on the Clean Up icon on the main menu bar of the Block Diagram window and drag and move items on the Front Panel window so that both windows look similar to the screenshots shown in the book.

Creating a project is a must
When LabVIEW is launched, a default start screen appears.
The most common way of using LabVIEW, at least at the beginning of a small project or test program, is to create a new VI. A common rule of programming is that each function, or in this case each VI, should not be larger than a page. Keep in mind that, by nature, LabVIEW will have two windows to begin with and, being a purely graphical programming environment, each VI may require more screen space than the equivalent in a text-based development environment. To start development, and in order to set up all the devices and connections required for tasks such as data acquisition, a developer may get the job done by simply creating one, and more likely several, VIs. Speaking from experience among engineers and other developers (in other words, in situations where R&D looms more heavily on the project than collecting raw data), quick VIs are more efficient initially, but almost all projects that start in this fashion end up growing very quickly, and other people and other departments will need to be involved and/or be fed the gathered data.
In most cases, within a short time from the beginning of the project, technicians from the same department or from related teams may need to be trained to use the software under development. This is why it is best to develop the habit of creating a new project from the very beginning.
Note the center button on the left-hand window of the start screen. Creating a new project (as opposed to creating VIs and subVIs) has many advantages, and it is a must if the program created will have to run as an executable on computers that do not have LabVIEW installed on them. Later versions of LabVIEW have streamlined the creation of a project and have added many templates and starting points to it. Although, for the sake of simplicity, we created our first example with a simple VI, one could almost as easily create a project and choose from the many starting points, templates, and other concepts (such as architectures) available in LabVIEW. The most useful starting point for a complete and user-friendly data acquisition application is a state machine.
Throughout the book, we will create simple VIs as a quick and simple way to illustrate a point but, by the end of the book, we will collect all of the VIs, icons, drivers, and subVIs in one complete state machine, all gathered in one complete project. From that project, we will create a standalone application that will not need the LabVIEW environment to execute and that can run on any computer that has the LabVIEW runtime engine installed.
To summarize, we went through the basics of LabVIEW and the main functionality of each of its icons by way of an actual user-interactive example. LabVIEW is capable of developing embedded systems, fuzzy logic, and almost everything in between! If you are interested to know more about LabVIEW, check out the book Data Acquisition Using LabVIEW.


How to store and access social media data in MongoDB

Amey Varangaonkar
26 Dec 2017
6 min read
The following excerpt is taken from the book Python Social Media Analytics, co-authored by Siddhartha Chatterjee and Michal Krystyanczuk. Our article explains how to perform different operations using MongoDB and Python to effectively access and modify the data.

According to the official MongoDB page, MongoDB is a free and open-source, distributed database which allows ad hoc queries, indexing, and real-time aggregation to access and analyze your data. It is published under the GNU Affero General Public License and stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time.
Along with ease of use, MongoDB is recognized for the following advantages:
Schema-less design: Unlike traditional relational databases, which require the data to fit their schema, MongoDB provides a flexible, schema-less data model. The data model is based on documents and collections. A document is essentially a JSON structure and a collection is a group of documents. One links data within collections using specific identifiers. The document model is quite useful in this subject, as most social media APIs provide their data in JSON format.
High performance: The indexing and GridFS features of MongoDB provide fast access and storage.
High availability: The duplication (replication) feature, which allows us to keep copies of databases on different nodes, provides high availability in the case of node failures.
Automatic scaling: The sharding feature of MongoDB scales large data sets automatically. You can access information on the implementation of sharding in the official documentation of MongoDB: https://docs.mongodb.com/v3.0/sharding/

Installing MongoDB
MongoDB can be downloaded and installed from the following link: http://www.mongodb.org/downloads.

Setting up the environment
MongoDB requires a data directory to store all the data. The directory can be created in your working directory:

md data\db

Starting MongoDB
We need to go to the folder where mongod.exe is stored and run the following command from the command prompt (cmd):

bin\mongod.exe

Once the MongoDB server is running in the background, we can switch to our Python environment to connect and start working.

MongoDB using Python
MongoDB can be used directly from the shell or through programming languages. For the sake of this book, we'll explain how it works using Python. MongoDB is accessed from Python through a driver module named PyMongo. We will not go into the detailed usage of MongoDB, which is beyond the scope of this book; we will see the most common functionality required for analysis projects. We highly recommend reading the official MongoDB documentation.
PyMongo can be installed using the following command:

pip install pymongo

Then the following commands import it in the Python script and connect to the local server:

from pymongo import MongoClient
client = MongoClient('localhost:27017')

The database structure of MongoDB is similar to that of SQL languages, where you have databases, and inside databases you have tables. In MongoDB you have databases, and inside them you have collections. Collections are where you store the data, and databases store multiple collections. As MongoDB is a NoSQL database, your collections do not need to have a predefined structure; you can add documents of any composition as long as they are JSON objects. But by convention it is best practice to have a common general structure for documents in the same collection.
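As a quick aside that is not part of the book excerpt, the snippet below lists the databases and collections visible on a local server and adds a single-field index on the author field used in the examples that follow; list_database_names() and list_collection_names() assume PyMongo 3.6 or later:

from pymongo import MongoClient

client = MongoClient('localhost:27017')

# Show every database and the collections inside it (names vary per installation)
for db_name in client.list_database_names():
    print(db_name, client[db_name].list_collection_names())

# A single-field index on 'author' speeds up the author-based queries shown below
db_scrapper = client.scrapper
db_scrapper.articles.create_index('author')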
To access a database named scrapper we simply have to do the following:

db_scrapper = client.scrapper

To access a collection named articles in the database scrapper we do this:

db_scrapper = client.scrapper
collection_articles = db_scrapper.articles

Once you have the client object initiated you can access all the databases and collections very easily. Now, we will see how to perform different operations:

Insert: To insert documents into a collection we build a list of new documents to insert into the database:

docs = []
for _ in range(0, 10):
    # each document must be of the python type dict
    docs.append({
        "author": "...",
        "content": "...",
        "comment": ["...", ...]
    })

Inserting all the docs at once:

db.collection.insert_many(docs)

Or you can insert them one by one:

for doc in docs:
    db.collection.insert_one(doc)

You can find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/insert-documents/

Find: To fetch all documents within a collection:

# as the find function returns a cursor, we iterate over
# the cursor to actually fetch the data from the database
docs = [d for d in db.collection.find()]

To fetch all documents in batches of 100 documents:

batch_size = 100
iteration = 0
# getting the total number of documents in the collection
count = db.collection.count()
while iteration * batch_size < count:
    docs = [d for d in db.collection.find().skip(batch_size * iteration).limit(batch_size)]
    iteration += 1

To fetch documents using search queries, where the author is Jean Francois:

query = {'author': 'Jean Francois'}
docs = [d for d in db.collection.find(query)]

Where the author field exists and is not null:

query = {'author': {'$exists': True, '$ne': None}}
docs = [d for d in db.collection.find(query)]

There are many other filtering methods that provide a wide variety of flexibility and precision; we highly recommend taking your time going through the different search operators. You can find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.find/

Update: To update documents where the author is Jean Francois and set the attribute published to True:

query_search = {'author': 'Jean Francois'}
query_update = {'$set': {'published': True}}
db.collection.update_many(query_search, query_update)

Or you can update just the first matching document:

db.collection.update_one(query_search, query_update)

Find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.update/

Remove: Remove all documents where the author is Jean Francois:

query_search = {'author': 'Jean Francois'}
db.collection.delete_many(query_search)

Or remove the first matching document:

db.collection.delete_one(query_search)

Find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/remove-documents/

Drop: You can drop a collection with the following:

db.collection.drop()

Or you can drop the whole database from the client:

client.drop_database('scrapper')

We saw how to store and access data from MongoDB. MongoDB has gained a lot of popularity and is the preferred database choice for many, especially when it comes to working with social media data. If you found our post to be useful, do make sure to check out Python Social Media Analytics, which contains useful tips and tricks on leveraging the power of Python for effective data analysis from various social media sites such as YouTube, GitHub, and Twitter.

3 great ways to leverage Structures for Machine Learning problems by Lise Getoor at NIPS 2017

Sugandha Lahoti
08 Dec 2017
11 min read
Lise Getoor is a professor in the Computer Science Department at the University of California, Santa Cruz. She has a PhD in Computer Science from Stanford University and has spent a lot of time studying machine learning, reasoning under uncertainty, databases, data science for social good, and artificial intelligence.
This article attempts to bring our readers to Lise's keynote speech at NIPS 2017. It highlights how structures can be unreasonably effective and the ways to leverage structures in machine learning problems. After reading this article, head over to the NIPS Facebook page for the complete keynote. All images in this article come from Lise's presentation slides and do not belong to us.
Our ability to collect, manipulate, analyze, and act on vast amounts of data is having a profound impact on all aspects of society. Much of this data is heterogeneous in nature and interlinked in a myriad of complex ways. This data is multimodal (it has different kinds of entities), multi-relational (it has different links between things), and spatio-temporal (it involves space and time parameters).
This keynote explores how we can exploit the structure that is in the input as well as the output of machine learning algorithms. A large number of structured problems exist in the fields of NLP and computer vision, computational biology, computational social sciences, knowledge graph extraction, and so on. According to Dan Roth, all interesting decisions are structured, i.e. there are dependencies between the predictions.
Most ML algorithms take this nicely structured data, flatten it, and put it in a matrix form, which is convenient for our algorithms. However, there are several issues with this. The most fundamental issue with the matrix form is that it assumes incorrect independence. Furthermore, in the context of structured outputs, we are unable to apply collective reasoning about the predictions we made for different entries in the matrix. Therefore we need ways in which we can declaratively talk about how to transform the structure into features. This talk provides us with patterns, tools, and templates for dealing with structures in both inputs and outputs.
Lise covered three topics for solving structured problems: patterns, tools, and templates. Patterns are used for simple structured problems. Tools help in getting patterns to work and in creating tractable structured problems. Templates build on patterns and tools to solve bigger computational problems.

1. Patterns
Patterns are used for naively simple structured problems, but encoding them repeatedly can increase performance by 5 or 10%. We use logical rules to capture structure in patterns. These logical rules give an easy way of talking about entities and the links between entities, and they also tend to be interpretable. There are three basic patterns for structured prediction problems: collective classification, link prediction, and entity resolution.

Collective classification
Collective classification is used for inferring the labels of nodes in a graph. The pattern for expressing this in logical rules is:

local-predictor(x, l) → label(x, l)
label(x, l) & link(x, y) → label(y, l)

It is called collective classification because the thing to predict, i.e. the label, occurs on both sides of the rule.
Let us consider a toy problem: we have to predict the unknown labels, marked in grey on the slides, that is, which political party each unknown person will vote for. We apply logical rules to the problem.

Local rules:
"If X donates to party P, X votes for P"
"If X tweets party P slogans, X votes for P"

Relational rules:
"If X is linked to Y, and X votes for P, Y votes for P"
Votes(X, P) & Friends(X, Y) → Votes(Y, P)
Votes(X, P) & Spouse(X, Y) → Votes(Y, P)

The above example shows the local and relational rules applied to a problem based on collective classification. Adding a collective classifier like this to other problems yields significant improvement.

Link prediction
Link prediction is used for predicting links or edges in a graph. The pattern for expressing this in logical rules is:

link(x, y) & similar(y, z) → link(x, z)

For example, consider a basic recommendation system. We apply the logical rules of link prediction to express likes and similarities, so inferring one link gives you information about another link. The rules express:

"If user U likes item1, and item2 is similar to item1, user U likes item2"
Likes(U, I1) & SimilarItem(I1, I2) → Likes(U, I2)
"If user1 likes item I, and user2 is similar to user1, user2 likes item I"
Likes(U1, I) & SimilarUser(U1, U2) → Likes(U2, I)

Entity resolution
Entity resolution is used for determining which nodes refer to the same underlying entity. Here we use local rules describing how similar things are, for instance how similar their names or their links are:

similar-name(x, y) → same(x, y)
similar-links(x, y) → same(x, y)

There are two collective rules. One is based on transitivity:

similar-name(x, y) → same(x, y)
similar-links(x, y) → same(x, y)
same(x, y) & same(y, z) → same(x, z)

The other is based on matching, i.e. dependence on both sides of the rule:

similar-name(x, y) → same(x, y)
similar-links(x, y) → same(x, y)
same(x, y) & !same(y, z) → !same(x, z)

The logical rules described above, though quite helpful, have certain disadvantages: they are intractable, can't handle inconsistencies, and can't represent degrees of similarity.
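Before moving on to the tools, here is a minimal Python sketch of the collective classification pattern on the voter toy problem above. It is not from the keynote and is far simpler than PSL; it just mixes each person's local evidence with the average score of their friends until the scores settle, and every name and number in it is invented for illustration:

# Toy collective classification: local evidence plus neighbour smoothing.
# Scores live in [0, 1]; a score above 0.5 is read as "votes for party P".
friends = {
    "ann": ["bob"],
    "bob": ["ann", "carl"],
    "carl": ["bob"],
}
local_score = {"ann": 0.9, "bob": 0.5, "carl": 0.5}  # only Ann has local evidence

scores = dict(local_score)
for _ in range(20):  # simple fixed-point iteration
    scores = {
        person: 0.5 * local_score[person]
        + 0.5 * sum(scores[f] for f in friends[person]) / len(friends[person])
        for person in friends
    }

for person, score in sorted(scores.items()):
    print(person, round(score, 3), "votes P" if score > 0.5 else "unknown")

Ann's strong local evidence pulls Bob, and through Bob also Carl, above the 0.5 threshold, which is exactly the collective effect the relational rules are meant to capture.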
2. Tools
Tools help in making these structured problems tractable and in getting patterns to work. The tools come from the statistical relational learning community, and Lise adds another one to this mix of languages: PSL. PSL is probabilistic logical programming, a declarative language for expressing collective inference problems. To know more, visit psl.linqs.org.
Predicate = relationship or property
Ground atom = (continuous) random variable
Weighted rules = capture dependency or constraint
PSL program = rules + input DB
PSL makes reasoning scalable by mapping logical inference to convex optimization. The language takes logical rules, assigns weights to them, and then uses them to define a distribution over the unknown variables. One of the striking features here is that the random variables have continuous values. The work done on the PSL language turns the disadvantages of logical rules into advantages: the models become tractable, can handle inconsistencies, and can represent similarity.
The key idea is to convert the clauses into concave functions. To be tractable, we relax the problem to a concave maximization. PSL has semantics from three different worlds: randomized algorithms from the computer science community, probabilistic graphical models from the machine learning community, and soft logic from the AI community.

Randomized algorithms
In this setting, we have weighted rules: non-negative weights and a set of weighted logical rules in clausal form. Weighted MAX-SAT is a classical problem where we attempt to find the assignment to the random variables that maximizes the weights of the satisfied rules. However, this problem is NP-hard. To overcome this, the randomized algorithms community converts the combinatorial optimization into a continuous optimization by introducing random variables that denote rounding probabilities.

Probabilistic graphical models
Graphical models represent problems as a factor graph, where we have random variables and rules that are essentially the potential functions. However, this problem is also NP-hard. We use the variational inference approximation technique to solve it. Here we introduce marginal distributions (μ) for the variables. We can then express a solution if we can find a set of globally consistent assignments for these marginal distributions. The problem is that, although we can express it as a linear program, there is an exponential number of constraints. We use techniques from the graphical models community, particularly local consistency relaxation, to convert this to a simpler problem. The idea is to relax the search over consistent marginals to a simpler set: we introduce local pseudo-marginals over joint potential states. Using the KKT conditions we can optimize out θ to derive a simplified projected LCR over μ. This approach shows a 16% improvement over canonical dual decomposition (MPLP).

Soft logic
In the soft logic technique for convex optimization, we have random variables that denote a degree of truth or similarity, and we are essentially trying to minimize the amount of dissatisfaction in the rules. Hence, with the three different interpretations, i.e. randomized algorithms, graphical models, and soft logic, we get the same convex optimization (a rough numerical illustration of this relaxation appears at the end of this section).
A PSL system essentially takes a PSL program and some input data and defines a convex optimization. PSL is open source; the code, data, and tutorials are available online at psl.linqs.org.
MAP inference in PSL translates into a convex optimization problem.
Inference is further enhanced with state-of-the-art optimization and distributed graph processing paradigms.
There are learning methods for rule weights and latent variables.
Using PSL gives fast as well as accurate results in comparison with other approaches.
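To make the soft logic relaxation above slightly more concrete, here is a rough, purely illustrative Python calculation for a single relaxed rule. This is not PSL code and the numbers are invented; it only shows the arithmetic of a Lukasiewicz-style relaxation of one grounded rule:

# Truth values in [0, 1] for: Votes(Ann, P), Friends(Ann, Bob), Votes(Bob, P)
votes_ann, friends_ann_bob, votes_bob = 0.9, 1.0, 0.2

# Lukasiewicz-style relaxation of the conjunction in the rule body
body = max(0.0, votes_ann + friends_ann_bob - 1.0)   # 0.9

# Distance to satisfaction of the rule  body -> Votes(Bob, P)
penalty = max(0.0, body - votes_bob)                  # 0.7

# A weighted rule contributes weight * penalty (or weight * penalty**2) to the
# convex objective that MAP inference minimizes over the unknown truth values.
weight = 5.0
print(weight * penalty)                               # 3.5

MAP inference would then lower this penalty by nudging the unknown value of Votes(Bob, P) upwards, trading that off against all the other grounded rules that mention Bob.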
3. Templates
Templates build on patterns and tools to solve problems in bigger areas such as computational social sciences, knowledge discovery, and responsible data science and machine learning.

Computational social sciences
To explore this area, we apply a PSL model to debate stance classification. Let us consider a scenario of an online debate where the topic is climate change. We can use information in the text to figure out whether the people participating in the debate are pro or anti the topic. We can also use information about the dialogue in the discourse, and we can build both into a PSL model based on the collective classification problem we saw earlier in the post. We get a significant rise in accuracy by using a PSL program; the keynote slides show the detailed results.

Knowledge discovery
Using structure and making use of patterns in knowledge discovery really pays off. Although we have information extractors that can pull facts about entities and relationships from the web and other sources, they are usually noisy, so it is difficult to reason about them collectively and decide which facts we actually want to add to our knowledge base. We can add structure to knowledge graph construction by:
Performing collective classification, link prediction, and entity resolution
Enforcing ontological constraints
Integrating knowledge source confidences
Using PSL to make it scalable
The keynote slides show the PSL program for knowledge graph identification. These ideas were evaluated on three real-world knowledge graphs: NELL, MusicBrainz, and Freebase. Both statistical features and semantic constraints help, but combining them always wins.

Responsible machine learning
Understanding structure can be key to mitigating negative effects and can lead to responsible machine learning. The perils of ignoring structure in the machine learning space include overlooking privacy: many approaches consider only an individual's attribute data, and some do not take into account what can be inferred from the relational context. The other area is fairness. The structure here is often outside the data; it can be in the organizational or socio-economic structure. To enable fairness, we need to implement impartial decision making without bias and take structural patterns into account. Algorithmic discrimination is another area that can make use of structure. The fundamental structural pattern here is a feedback loop, and having a way of encoding this feedback loop is important for eliminating algorithmic discrimination.

Conclusion
In this article, we saw ways of exploiting structure while keeping problems tractable, along with tools and templates for doing so. The keynote also pointed to opportunities for machine learning methods that can mix:
Structured and unstructured approaches
Probabilistic and logical inference
Data-driven and knowledge-driven modeling
AI and machine learning developers need to build on the approaches described above to discover and exploit new structure and to create compelling commercial, scientific, and societal applications.


Getting Started with Apache Kafka Clusters

Amarabha Banerjee
23 Feb 2018
10 min read
The following article is a book excerpt from Apache Kafka 1.0 Cookbook, written by Raúl Estrada. The book contains easy-to-follow recipes to help you set up, configure, and use Apache Kafka in the best possible manner.

Here in this article, we are going to talk about how you can get started with Apache Kafka clusters and implement them seamlessly.
In Apache Kafka there are three types of clusters:
Single-node single-broker
Single-node multiple-broker
Multiple-node multiple-broker
The following four recipes show how to run Apache Kafka in these clusters.

Configuring a single-node single-broker cluster – SNSB
The first cluster configuration is single-node single-broker (SNSB). This cluster is very useful when a single point of entry is needed; its architecture resembles the singleton design pattern. An SNSB cluster usually satisfies three requirements:
Controls concurrent access to a unique shared broker
Access to the broker is requested from multiple, disparate producers
There can be only one broker
If the proposed design has only one or two of these requirements, a redesign is almost always the correct option. Sometimes the single broker can become a bottleneck or a single point of failure, but it is useful when a single point of communication is needed.

Getting ready
Go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users):

> cd /usr/local/kafka

How to do it...
Starting ZooKeeper
Kafka provides a simple ZooKeeper configuration file to launch a single ZooKeeper instance. To start the ZooKeeper instance, use this command:

> bin/zookeeper-server-start.sh config/zookeeper.properties

The main properties specified in the zookeeper.properties file are:
clientPort: This is the listening port for client requests. By default, ZooKeeper listens on TCP port 2181:
clientPort=2181
dataDir: This is the directory where ZooKeeper data is stored:
dataDir=/tmp/zookeeper
maxClientCnxns: This is the maximum number of concurrent connections that a single client may make (0 means unbounded):
maxClientCnxns=0
For more information about Apache ZooKeeper visit the project home page at http://zookeeper.apache.org/.

Starting the broker
After ZooKeeper is started, start the Kafka broker with this command:

> bin/kafka-server-start.sh config/server.properties

The main properties specified in the server.properties file are:
broker.id: The unique positive integer identifier for each broker:
broker.id=0
log.dir: The directory in which to store log files:
log.dir=/tmp/kafka10-logs
num.partitions: The number of log partitions per topic:
num.partitions=2
port: The port that the socket server listens on:
port=9092
zookeeper.connect: The ZooKeeper URL connection:
zookeeper.connect=localhost:2181

How it works...
Kafka uses ZooKeeper for storing metadata information about the brokers, topics, and partitions. Writes to ZooKeeper are performed only on changes to consumer group membership or to the Kafka cluster itself. This amount of traffic is minimal, and there is no need for a dedicated ZooKeeper ensemble for a single Kafka cluster. Actually, many deployments use a single ZooKeeper ensemble to control multiple Kafka clusters (using a chroot ZooKeeper path for each cluster).

SNSB – creating a topic, producer, and consumer
The SNSB Kafka cluster is running; now let's create a topic, a producer, and a consumer.
Getting ready
We need the previous recipe executed: Kafka already installed, ZooKeeper up and running, and a Kafka server up and running. Now, go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users):

> cd /usr/local/kafka

How to do it...
The following steps will show you how to create an SNSB topic, producer, and consumer.

Creating a topic
As we know, Kafka has a command to create topics. Here we create a topic called SNSBTopic with one partition and one replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic SNSBTopic

We obtain the following output:

Created topic "SNSBTopic".

The command parameters are:
--replication-factor 1: This indicates just one replica
--partitions 1: This indicates just one partition
--zookeeper localhost:2181: This indicates the ZooKeeper URL
As we know, to get the list of topics on a Kafka server we use the following command:

> bin/kafka-topics.sh --list --zookeeper localhost:2181

We obtain the following output:

SNSBTopic

Starting the producer
Kafka has a command to start producers that accepts input from the command line and publishes each input line as a message. By default, each new line is considered a message:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic SNSBTopic

This command requires two parameters:
broker-list: The broker URL to connect to
topic: The topic name (to send a message to the topic subscribers)
Now, type the following in the command line:

The best thing about a boolean is [Enter]
even if you are wrong [Enter]
you are only off by a bit. [Enter]

This output is obtained (as expected):

The best thing about a boolean is
even if you are wrong
you are only off by a bit.

The producer.properties file has the producer configuration. Some important properties defined in the producer.properties file are:
metadata.broker.list: The list of brokers used for bootstrapping information on the rest of the cluster, in the format host1:port1, host2:port2:
metadata.broker.list=localhost:9092
compression.codec: The compression codec used, for example, none, gzip, or snappy:
compression.codec=none

Starting the consumer
Kafka has a command to start a message consumer client. It shows the output in the command line as soon as it has subscribed to the topic:

> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic SNSBTopic --from-beginning

Note that the parameter from-beginning is used to show the entire log:

The best thing about a boolean is
even if you are wrong
you are only off by a bit.

One important property defined in the consumer.properties file is:
group.id: This string identifies the consumers in the same group:
group.id=test-consumer-group

There's more...
It is time to play with this technology. Open a new command-line window each for ZooKeeper, a broker, two producers, and two consumers. Type some messages in the producers and watch them get displayed in the consumers. If you don't know or don't remember how to run the commands, run them with no arguments to display the possible values for the parameters.
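As an aside that is not part of the original recipe, the same produce-and-consume exchange can be scripted from Python with the third-party kafka-python package; this is only a sketch, assuming pip install kafka-python has been run and the SNSB broker from this recipe is listening on localhost:9092:

from kafka import KafkaProducer, KafkaConsumer

# Publish one message to the topic created above
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("SNSBTopic", b"The best thing about a boolean is that you are only off by a bit.")
producer.flush()

# Read the topic from the beginning, like --from-beginning on the console consumer
consumer = KafkaConsumer(
    "SNSBTopic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 seconds without new messages
)
for message in consumer:
    print(message.value.decode("utf-8"))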
Configuring a single-node multiple-broker cluster – SNMB
The second cluster configuration is single-node multiple-broker (SNMB). This cluster is used when there is just one node but inner redundancy is needed. When a topic is created in Kafka, the system determines how each replica of a partition is mapped to each broker. In general, Kafka tries to spread the replicas across all available brokers.
The messages are first sent to the first replica of a partition (to the current broker leader of that partition) before they are replicated to the remaining brokers. The producers may choose from different strategies for sending messages (synchronous or asynchronous mode). Producers discover the available brokers in a cluster and the partitions on each broker (all this by registering watchers in ZooKeeper).
In practice, some of the high-volume topics are configured with more than one partition per broker. Remember that having more partitions increases the I/O parallelism for writes, and this increases the degree of parallelism for consumers (the partition is the unit for distributing data to consumers). On the other hand, increasing the number of partitions increases the overhead because:
There are more files, so more open file handles
There are more offsets to be checked by consumers, so the ZooKeeper load is increased
The art of this is to balance these tradeoffs.

Getting ready
Go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users):

> cd /usr/local/kafka

How to do it...
Begin by starting the ZooKeeper server as follows:

> bin/zookeeper-server-start.sh config/zookeeper.properties

A different server.properties file is needed for each broker. Let's call them server-1.properties, server-2.properties, server-3.properties, and so on (original, isn't it?). Each file is a copy of the original server.properties file.
In the server-1.properties file set the following properties:

broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1

Similarly, in the server-2.properties file set the following properties:

broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2

Finally, in the server-3.properties file set the following properties:

broker.id=3
port=9095
log.dir=/tmp/kafka-logs-3

With ZooKeeper running, start the Kafka brokers with these commands:

> bin/kafka-server-start.sh config/server-1.properties
> bin/kafka-server-start.sh config/server-2.properties
> bin/kafka-server-start.sh config/server-3.properties

How it works...
Now the SNMB cluster is running. The brokers are running on the same Kafka node, on ports 9093, 9094, and 9095.

SNMB – creating a topic, producer, and consumer
The SNMB Kafka cluster is running; now let's create a topic, a producer, and a consumer.

Getting ready
We need the previous recipe executed: Kafka already installed, ZooKeeper up and running, and the Kafka brokers up and running. Now, go to the Kafka installation directory (/usr/local/kafka/ for macOS users and /opt/kafka/ for Linux users):

> cd /usr/local/kafka

How to do it...
The following steps will show you how to create an SNMB topic, producer, and consumer.

Creating a topic
Using the command to create topics, let's create a topic called SNMBTopic with three partitions and two replicas:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic SNMBTopic

The following output is displayed:

Created topic "SNMBTopic"

This command has the following effects:
Kafka will create three logical partitions for the topic.
Kafka will create two replicas (copies) per partition. This means that, for each partition, it will pick two brokers that will host those replicas.
For each partition, Kafka will randomly choose a broker leader.
Now ask Kafka for the list of available topics.
The list now includes the new SNMBTopic:

> bin/kafka-topics.sh --zookeeper localhost:2181 --list
SNMBTopic

Starting a producer
Now, start the producer; indicating more brokers in the broker-list is easy:

> bin/kafka-console-producer.sh --broker-list localhost:9093,localhost:9094,localhost:9095 --topic SNMBTopic

If it's necessary to run multiple producers connecting to different brokers, specify a different broker list for each producer.

Starting a consumer
To start a consumer, use the following command:

> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic SNMBTopic

How it works...
The first important fact is the two parameters: replication-factor and partitions. The replication-factor is the number of replicas each partition will have in the topic created. The partitions parameter is the number of partitions for the topic created.

There's more...
If you don't know the cluster configuration or don't remember it, there is a useful option for the kafka-topics command, the describe parameter:

> bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic SNMBTopic

The output is something similar to:

Topic:SNMBTopic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: SNMBTopic Partition: 0 Leader: 2 Replicas: 2,3 Isr: 3,2
Topic: SNMBTopic Partition: 1 Leader: 3 Replicas: 3,1 Isr: 1,3
Topic: SNMBTopic Partition: 2 Leader: 1 Replicas: 1,2 Isr: 1,2

An explanation of the output: the first line gives a summary of all the partitions; each subsequent line gives information about one partition. Since we have three partitions for this topic, there are three such lines:
Leader: This node is responsible for all reads and writes for a particular partition. Each node is the leader for a randomly selected portion of the partitions.
Replicas: This is the list of nodes that duplicate the log for a particular partition, irrespective of whether they are currently alive.
Isr: This is the set of in-sync replicas. It is the subset of the replicas that are currently alive and following the leader.
In order to see the options for creating, deleting, describing, or changing a topic, type this command without parameters:

> bin/kafka-topics.sh

We discussed how to implement Apache Kafka clusters effectively. If you liked this post, be sure to check out Apache Kafka 1.0 Cookbook, which consists of useful recipes to work with Apache Kafka installation.