Getting Started with Selenium WebDriver and Python

Packt
23 Dec 2014
19 min read
In this article by Unmesh Gundecha, author of the book Learning Selenium Testing Tools with Python, we will introduce you to the Selenium WebDriver client library for Python by demonstrating its installation, basic features, and overall structure.

Selenium automates browsers. It automates the interactions we perform in a browser window, such as navigating to a website, clicking on links, filling out forms, submitting forms, navigating through pages, and so on, and it works on every major browser available. To use Selenium WebDriver, we need a programming language in which to write automation scripts, and the language we select should have a Selenium client library available.

Python is a widely used general-purpose, high-level programming language. Its straightforward syntax allows us to express concepts in fewer lines of code, it emphasizes code readability, and it provides constructs that enable us to write programs on both a small and a large scale. It also offers a number of built-in and third-party libraries for accomplishing complex tasks quite easily. The Selenium WebDriver client library for Python provides access to all the Selenium WebDriver features, as well as to the Selenium standalone server for remote and distributed testing of browser-based applications. The Selenium Python language bindings are developed and maintained by David Burns, Adam Goucher, Maik Röder, Jason Huggins, Luke Semerau, Miki Tebeka, and Eric Allenin. The Selenium WebDriver client library is supported on Python versions 2.6, 2.7, 3.2, and 3.3.

In this article, we will cover the following topics:
- Installing Python and the Selenium package
- Selecting and setting up a Python editor
- Implementing a sample script using the Selenium WebDriver Python client library
- Implementing cross-browser support with Internet Explorer and Google Chrome

(For more resources related to this topic, see here.)

Preparing your machine
As a first step of using Selenium with Python, we need to install it on our computer with the minimum requirements possible. Let's set up the basic environment with the steps explained in the following sections.

Installing Python
You will find Python installed by default on most Linux distributions, Mac OS X, and other Unix machines. On Windows, you will need to install it separately. Installers for different platforms can be found at http://python.org/download/.

Installing the Selenium package
The Selenium WebDriver Python client library is available in the Selenium package. The simplest way to install the Selenium package is with the pip installer tool, available at https://pip.pypa.io/en/latest/. With pip, you can install or upgrade the Selenium package using the following command:

pip install -U selenium

This command will set up the Selenium WebDriver client library on your machine, with all the modules and classes that we will need to create automated scripts using Python. The pip tool downloads the latest version of the Selenium package and installs it on your machine. The optional -U flag upgrades an existing installation of the package to the latest version.

You can also download the latest version of the Selenium package source from https://pypi.python.org/pypi/selenium.
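Before moving on to the source installation, you can confirm that the pip installation worked by asking the package for its version from the Python interpreter. This is a minimal check, assuming the package was installed into the interpreter you are running:

import selenium

# prints the installed version string of the Selenium package;
# an ImportError here means the installation did not succeed
print(selenium.__version__)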
To install from source, click on the Download button on the upper-right-hand side of the PyPI page, unarchive the downloaded file, and install it with the following command:

python setup.py install

Browsing the Selenium WebDriver Python documentation
The Selenium WebDriver Python client library documentation is available at http://selenium.googlecode.com/git/docs/api/py/api.html. It offers detailed information on all the core classes and functions of Selenium WebDriver. Also note the following links for Selenium documentation:
- The official documentation at http://docs.seleniumhq.org/docs/ covers all the Selenium components, with examples in the supported languages
- The Selenium Wiki at https://code.google.com/p/selenium/w/list lists some useful topics

Selecting an IDE
Now that we have Python and Selenium WebDriver set up, we need an editor or an Integrated Development Environment (IDE) to write automation scripts. A good editor or IDE increases productivity and simplifies many routine coding tasks. While we can write Python code in simple editors such as Emacs, Vim, or Notepad, an IDE makes life a lot easier. There are many IDEs to choose from. Generally, an IDE provides the following features to accelerate your development and coding time:
- A graphical code editor with code completion and IntelliSense
- A code explorer for functions and classes
- Syntax highlighting
- Project management
- Code templates
- Tools for unit testing and debugging
- Source control support

If you're new to Python, or you're a tester working in Python for the first time, your development team will help you set up the right IDE. However, if you're starting with Python for the first time and don't know which IDE to select, here are a few choices that you might want to consider.

PyCharm
PyCharm is developed by JetBrains, a leading vendor of professional development tools and IDEs such as IntelliJ IDEA, RubyMine, PhpStorm, and TeamCity. PyCharm is a polished, powerful, and versatile IDE that works well. It brings the best of the JetBrains experience in building powerful IDEs, along with many other features, for a highly productive experience. PyCharm is supported on Windows, Linux, and Mac. To learn more about PyCharm and its features, visit http://www.jetbrains.com/pycharm/.

PyCharm comes in two versions: a community edition and a professional edition. The community edition is free, whereas you have to pay for the professional edition. The community edition is great for building and running Selenium scripts, with fantastic debugging support. We will use PyCharm in the rest of this article. Later in this article, we will set up PyCharm and create our first Selenium script. All the examples in this article are built using PyCharm; however, you can easily use these examples in the editor or IDE of your choice.

The PyDev Eclipse plugin
The PyDev Eclipse plugin is another widely used editor among Python developers. Eclipse is a famous open source IDE primarily built for Java; however, it also offers support for various other programming languages and tools through its powerful plugin architecture. Eclipse is a cross-platform IDE supported on Windows, Linux, and Mac. You can get the latest edition of Eclipse at http://www.eclipse.org/downloads/. You need to install the PyDev plugin separately after setting up Eclipse.
Use the tutorial from Lars Vogel at http://www.vogella.com/tutorials/Python/article.html to install PyDev. Installation instructions are also available at http://pydev.org/.

PyScripter
For Windows users, PyScripter can also be a great choice. It is open source, lightweight, and provides all the features that modern IDEs offer, such as IntelliSense and code completion, testing, and debugging support. You can find more about PyScripter, along with its download information, at https://code.google.com/p/pyscripter/.

Setting up PyCharm
Now that we have seen the IDE choices, let's set up PyCharm. All examples in this article are created with PyCharm. However, you can set up any other IDE of your choice and use the examples as they are. We will set up PyCharm with the following steps to get started with Selenium Python:
1. Download and install the PyCharm Community Edition from the JetBrains site at http://www.jetbrains.com/pycharm/download/index.html.
2. Launch the PyCharm Community Edition and click on the Create New Project option on the PyCharm Community Edition dialog box.
3. On the Create New Project dialog box, specify the name of your project in the Project name field. In this example, setests is used as the project name.
4. We need to configure the interpreter for the first time. Click on the button next to the interpreter field to set up the interpreter.
5. On the Python Interpreter dialog box, click on the plus icon. PyCharm will suggest the installed interpreter; select the interpreter from Select Interpreter Path.
6. PyCharm will configure the selected interpreter and show a list of packages that are installed along with Python. Click on the Apply button and then on the OK button.
7. On the Create New Project dialog box, click on the OK button to create the project.

Taking your first steps with Selenium and Python
We are now ready to start creating and running automated scripts in Python. Let's begin with Selenium WebDriver and create a Python script that uses Selenium WebDriver classes and functions to automate browser interaction. We will use a sample web application for most of the examples in this article. This sample application is built on Magento, a well-known e-commerce framework. You can find the application at http://demo.magentocommerce.com/. In this sample script, we will navigate to the demo version of the application, search for products, and list the names of the products from the search result page, with the following steps:
1. Let's use the project that we created earlier while setting up PyCharm and create a simple Python script that will use the Selenium WebDriver client library. In Project Explorer, right-click on setests and navigate to New | Python File from the pop-up menu.
2. On the New Python file dialog box, enter searchproducts in the Name field and click on the OK button.
3. PyCharm will add a new tab, searchproducts.py, in the code editor area.
Copy the following code into the searchproducts.py tab:

from selenium import webdriver

# create a new Firefox session
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.maximize_window()

# navigate to the application home page
driver.get("http://demo.magentocommerce.com/")

# get the search textbox
search_field = driver.find_element_by_name("q")
search_field.clear()

# enter search keyword and submit
search_field.send_keys("phones")
search_field.submit()

# get all the anchor elements which have product names displayed
# currently on result page using find_elements_by_xpath method
products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

# get the number of anchor elements found
print "Found " + str(len(products)) + " products:"

# iterate through each anchor element and print the text that is
# name of the product
for product in products:
    print product.text

# close the browser window
driver.quit()

If you're using any other IDE or editor of your choice, create a new file, copy the code into the new file, and save the file as searchproducts.py.

To run the script, press the Ctrl + Shift + F10 combination in the PyCharm code window or select Run 'searchproducts' from the Run menu. This will start the execution, and you will see a new Firefox window navigating to the demo site and the Selenium commands getting executed in the Firefox window. If all goes well, the script will close the Firefox window at the end and print the list of products in the PyCharm console.

We can also run this script from the command line. Open the command line, navigate to the setests directory, and run the following command:

python searchproducts.py

We will use the command line as the preferred method for executing the tests in the rest of the article.

Let's spend some time looking at the script that we just created. We will go through each statement and understand Selenium WebDriver in brief. The selenium.webdriver module implements the browser driver classes that are supported by Selenium, including Firefox, Chrome, Internet Explorer, Safari, and various other browsers, as well as RemoteWebDriver for testing on browsers hosted on remote machines. We need to import webdriver from the Selenium package to use the Selenium WebDriver methods:

from selenium import webdriver

Next, we need an instance of the browser that we want to use. This provides a programmatic interface to interact with the browser using Selenium commands. In this example, we are using Firefox. We can create an instance of Firefox as shown in the following code:

driver = webdriver.Firefox()

During the run, this will launch a new Firefox window. We also set a few options on the driver:

driver.implicitly_wait(30)
driver.maximize_window()

Here we configured an implicit wait of 30 seconds, which tells Selenium how long to keep retrying when it looks for elements, and we maximized the Firefox window through the Selenium API. Next, we navigate to the demo version of the application using its URL by calling the driver.get() method. After the get() method is called, WebDriver waits until the page is fully loaded in the Firefox window and then returns control to the script.

After loading the page, Selenium will interact with various elements on the page, like a human user. For example, on the Home page of the application, we need to enter a search term in a textbox and click on the Search button.
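Pages can take a moment to render, so before interacting with an element it sometimes helps to wait explicitly for it to appear. The following is a minimal sketch using WebDriverWait, an explicit-wait alternative to the implicit wait configured above; it reuses the search box's q name attribute, which is explained just below, and assumes a 10-second timeout is acceptable:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://demo.magentocommerce.com/")

# wait up to 10 seconds for the search box to be present before using it
search_field = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, "q")))
search_field.send_keys("phones")
search_field.submit()
driver.quit()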
These elements are implemented as HTML input elements, and Selenium needs to find them in order to simulate the user's actions. Selenium WebDriver provides a number of methods to find these elements and interact with them to perform operations such as sending values, clicking buttons, selecting items in drop-downs, and so on. In this example, we find the Search textbox using the find_element_by_name() method, which returns the first element whose name attribute matches the value specified in the find method. HTML elements are defined with a tag and attributes, and we can use this information to locate an element. The Search textbox has its name attribute defined as q, so we can use this attribute as shown in the following code:

search_field = driver.find_element_by_name("q")

Once the Search textbox is found, we interact with this element by clearing the previous value (if any) using the clear() method and entering the new value using the send_keys() method. Next, we submit the search request by calling the submit() method:

search_field.clear()
search_field.send_keys("phones")
search_field.submit()

After the search request is submitted, Firefox loads the result page returned by the application. The result page contains a list of products that match the search term, which is phones. We can read the list of results, and specifically the names of all the products rendered in anchor <a> elements, using the find_elements_by_xpath() method. This returns all the matching elements as a list:

products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

Next, we print the number of products (that is, the number of anchor <a> elements) found on the page and the names of the products using the .text property of each anchor <a> element:

print "Found " + str(len(products)) + " products:"
for product in products:
    print product.text

At the end of the script, we close the Firefox browser using the driver.quit() method:

driver.quit()

This example gives us a concise look at using Selenium WebDriver and Python together to create a simple automation script. We are not testing anything in this script yet. Later, we will extend this simple script into a set of tests and use various other libraries and features of Python.

Cross-browser support
So far we have built and run our script with Firefox. Selenium has extensive support for cross-browser testing: you can automate all the major browsers, including Internet Explorer, Google Chrome, Safari, and Opera, as well as headless browsers such as PhantomJS. In this section, we will set up and run the script that we created in the previous section with Internet Explorer and Google Chrome to see the cross-browser capabilities of Selenium WebDriver.

Setting up Internet Explorer
A little more work is needed to run scripts on Internet Explorer. To run tests on Internet Explorer, we need to download and set up the InternetExplorerDriver server. The InternetExplorerDriver server is a standalone executable that implements WebDriver's wire protocol and acts as glue between the test script and Internet Explorer. It supports the major IE versions on Windows XP, Vista, Windows 7, and Windows 8. Let's set up the InternetExplorerDriver server with the following steps:
1. Download the InternetExplorerDriver server from http://www.seleniumhq.org/download/. You can download the 32- or 64-bit version based on the system configuration that you are using.
2. After downloading the InternetExplorerDriver server, unzip it and copy the file to the same directory where the scripts are stored.
3. On IE 7 or higher, the Protected Mode settings for each zone must have the same value. Protected Mode can be either on or off, as long as the setting is the same for all the zones. To set the Protected Mode settings, choose Internet Options from the Tools menu and click on the Security tab of the Internet Options dialog box. Select each zone listed in Select a zone to view or change security settings and make sure Enable Protected Mode (requires restarting Internet Explorer) is either checked or unchecked for all the zones; all the zones should have the same setting.
4. While using the InternetExplorerDriver server, it is also important to keep the browser zoom level set to 100 percent so that native mouse events can be mapped to the correct coordinates.
5. Finally, modify the script to use Internet Explorer. Instead of creating an instance of the Firefox class, we use the Ie class in the following way:

import os
from selenium import webdriver

# get the path of IEDriverServer
dir = os.path.dirname(__file__)
ie_driver_path = os.path.join(dir, "IEDriverServer.exe")

# create a new Internet Explorer session
driver = webdriver.Ie(ie_driver_path)
driver.implicitly_wait(30)
driver.maximize_window()

# navigate to the application home page
driver.get("http://demo.magentocommerce.com/")

# get the search textbox
search_field = driver.find_element_by_name("q")
search_field.clear()

# enter search keyword and submit
search_field.send_keys("phones")
search_field.submit()

# get all the anchor elements which have product names displayed
# currently on result page using find_elements_by_xpath method
products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

# get the number of anchor elements found
print "Found " + str(len(products)) + " products:"

# iterate through each anchor element and print the text that is
# name of the product
for product in products:
    print product.text

# close the browser window
driver.quit()

In this script, we passed the path of the InternetExplorerDriver server while creating the instance of the IE browser class. Run the script: Selenium will first launch the InternetExplorerDriver server, which in turn launches the browser and executes the steps. The InternetExplorerDriver server acts as an intermediary between the Selenium script and the browser. Execution of the actual steps is very similar to what we observed with Firefox.

Read more about the important configuration options for Internet Explorer at https://code.google.com/p/selenium/wiki/InternetExplorerDriver and in the DesiredCapabilities article at https://code.google.com/p/selenium/wiki/DesiredCapabilities.

Setting up Google Chrome
Setting up and running Selenium scripts on Google Chrome is similar to Internet Explorer. We need to download the ChromeDriver server, similar to InternetExplorerDriver. The ChromeDriver server is a standalone server developed and maintained by the Chromium team. It implements WebDriver's wire protocol for automating Google Chrome and is supported on Windows, Linux, and Mac operating systems. Set up the ChromeDriver server using the following steps:
1. Download the ChromeDriver server from http://chromedriver.storage.googleapis.com/index.html.
2. After downloading the ChromeDriver server, unzip it and copy the file to the same directory where the scripts are stored.
3. Finally, modify the sample script to use Chrome.
Instead of creating an instance of the Firefox class, we use the Chrome class in the following way:

import os
from selenium import webdriver

# get the path of chromedriver
dir = os.path.dirname(__file__)
chrome_driver_path = os.path.join(dir, "chromedriver.exe")
# remove the .exe extension on Linux or Mac platforms

# create a new Chrome session
driver = webdriver.Chrome(chrome_driver_path)
driver.implicitly_wait(30)
driver.maximize_window()

# navigate to the application home page
driver.get("http://demo.magentocommerce.com/")

# get the search textbox
search_field = driver.find_element_by_name("q")
search_field.clear()

# enter search keyword and submit
search_field.send_keys("phones")
search_field.submit()

# get all the anchor elements which have product names displayed
# currently on result page using find_elements_by_xpath method
products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

# get the number of anchor elements found
print "Found " + str(len(products)) + " products:"

# iterate through each anchor element and print the text that is
# name of the product
for product in products:
    print product.text

# close the browser window
driver.quit()

In this script, we passed the path of the ChromeDriver server while creating an instance of the Chrome browser class. Run the script: Selenium will first launch the ChromeDriver server, which launches the Chrome browser and executes the steps. Execution of the actual steps is very similar to what we observed with Firefox. Read more about ChromeDriver at https://code.google.com/p/selenium/wiki/ChromeDriver and https://sites.google.com/a/chromium.org/chromedriver/home.

Summary
In this article, we introduced you to Selenium and its components. We installed the Selenium package using the pip tool. We then looked at various editors and IDEs to ease our coding experience with Selenium and Python, and set up PyCharm. Then we built a simple script on a sample application, covering some of the high-level concepts of the Selenium WebDriver Python client library using Firefox. We ran the script and analyzed the outcome. Finally, we explored the cross-browser testing support of Selenium WebDriver by configuring and running the script with Internet Explorer and Google Chrome.

Learning the QGIS Python API

Packt
23 Dec 2014
44 min read
In this article, we will take a closer look at the Python libraries available to the QGIS Python developer, and also look at the various ways in which we can use these libraries to perform useful tasks within QGIS. In particular, you will learn:
- How the QGIS Python libraries are based on the underlying C++ APIs
- How to use the C++ API documentation as a reference to work with the Python APIs
- How the PyQGIS libraries are organized
- The most important concepts and classes within the PyQGIS libraries and how to use them
- Some practical examples of performing useful tasks using PyQGIS

About the QGIS Python APIs
The QGIS system itself is written in C++ and has its own set of APIs, which are also written in C++. The Python APIs are implemented as wrappers around these C++ APIs. For example, there is a Python class named QgisInterface that acts as a wrapper around a C++ class of the same name. All the methods, class variables, and the like that are implemented by the C++ version of QgisInterface are made available through the Python wrapper. This means that when you access the Python QGIS APIs, you aren't accessing the API directly; instead, the wrapper connects your code to the underlying C++ objects and methods. Fortunately, in most cases, the QGIS Python wrappers simply hide away the complexity of the underlying C++ code, so the PyQGIS libraries work as you would expect them to. There are some gotchas, however, and we will cover these as they come up.

Deciphering the C++ documentation
As QGIS is implemented in C++, the documentation for the QGIS APIs is all based on C++. This can make it difficult for Python developers to understand and work with the QGIS APIs. Consider, for example, the API documentation for the QgisInterface.zoomToActiveLayer() method. If you're not familiar with C++, this can be quite confusing. Fortunately, as a Python programmer, you can skip over much of this complexity because it doesn't apply to you. In particular:
- The virtual keyword is an implementation detail you don't need to worry about
- The word void indicates that the method doesn't return a value
- The double colons in QgisInterface::zoomToActiveLayer are simply a C++ convention for separating the class name from the method name
- Just like in Python, the parentheses show that the method doesn't take any parameters

So if you have an instance of QgisInterface (for example, as the standard iface variable available in the Python Console), you can call this method simply by typing the following:

iface.zoomToActiveLayer()

Now, let's take a look at a slightly more complex example: the C++ documentation for the QgisInterface.addVectorLayer() method. Notice how the virtual keyword is followed by QgsVectorLayer* instead of void. This is the return value for this method; it returns a QgsVectorLayer object. Technically speaking, the * means that the method returns a pointer to an object of type QgsVectorLayer. Fortunately, the Python wrappers automatically handle pointers, so you don't need to worry about this.

Notice the brief description at the bottom of the documentation for this method; while many of the C++ methods have very little, if any, additional information, other methods have quite extensive descriptions. Obviously, you should read these descriptions carefully, as they tell you more about what the method does. Even without any description, the C++ documentation is still useful, as it tells you what the method is called, what parameters it accepts, and what type of data is being returned.
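For instance, once you know the signature, calling this method from the QGIS Python Console is straightforward. Here is a minimal sketch; the shapefile path and layer name are placeholders:

# load a shapefile as a new vector layer using the OGR provider
layer = iface.addVectorLayer("/path/to/data.shp", "my layer", "ogr")
if layer is None:
    print "Layer failed to load!"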
In the preceding method, you can see that there are three parameters listed between the parentheses. As C++ is a strongly typed language, you have to define the type of each parameter when you define a function. This is helpful for Python programmers, as it tells you what type of value to supply. Apart from QGIS objects, you might also encounter the following data types in the C++ documentation:
- int: A standard Python integer value
- long: A standard Python long integer value
- float: A standard Python floating point (real) number
- bool: A Boolean value (true or false)
- QString: A string value. Note that the QGIS Python wrappers automatically convert Python strings to C++ strings, so you don't need to deal with QString objects directly
- QList: This object is used to encapsulate a list of other objects. For example, QList<QString*> represents a list of strings

Just as in Python, a method can have default values for its parameters. For example, in the documentation for the QgisInterface.newProject() method, the thePromptToSaveFlag parameter has a default value, and this default value will be used if no value is supplied.

In Python, classes are initialized using the __init__ method. In C++, this is called a constructor; the documentation for the QgsLabel class, for example, lists its constructor. Just as in Python, C++ classes inherit the methods defined in their superclass. Fortunately, QGIS doesn't have an extensive class hierarchy, so most of the classes don't have a superclass. However, don't forget to check for a superclass if you can't find the method you're looking for in the documentation for the class itself.

Finally, be aware that C++ supports the concept of method overloading. A single method can be defined more than once, where each version accepts a different set of parameters. For example, take a look at the constructor for the QgsRectangle class; you will see that there are four different versions of this method:
- The first version accepts the four coordinates as floating point numbers
- The second version constructs a rectangle using two QgsPoint objects
- The third version copies the coordinates from a QRectF (which is a Qt data type) into a QgsRectangle object
- The final version copies the coordinates from another QgsRectangle object

The C++ compiler chooses the correct method to use based on the parameters that have been supplied. Python has no concept of method overloading; just choose the version of the method that accepts the parameters you want to supply, and the QGIS Python wrappers will automatically choose the correct method for you.

If you keep these guidelines in mind, deciphering the C++ documentation for QGIS isn't all that hard. It just looks more complicated than it really is, thanks to all the complexity specific to C++. However, it won't take long for your brain to start filtering out the C++ and to use the QGIS reference documentation almost as easily as if it were written for Python rather than C++.

Organization of the QGIS Python libraries
Now that we can understand the C++-oriented documentation, let's see how the PyQGIS libraries are structured. All of the PyQGIS libraries are organized under a package named qgis. You wouldn't normally import qgis directly, however, as all the interesting libraries are subpackages within this main package; here are the five packages that make up the PyQGIS library:
- qgis.core: This provides access to the core GIS functionality used throughout QGIS
- qgis.gui: This defines a range of GUI widgets that you can include in your own programs
- qgis.analysis: This provides spatial analysis tools to analyze vector and raster format data
- qgis.networkanalysis: This provides tools to build and analyze topologies
- qgis.utils: This implements miscellaneous functions that allow you to work with the QGIS application using Python

The first two packages (qgis.core and qgis.gui) implement the most important parts of the PyQGIS library, and it's worth spending some time becoming familiar with the concepts and classes they define. Now let's take a closer look at these two packages.

The qgis.core package
The qgis.core package defines fundamental classes used throughout the QGIS system. A large part of this package is dedicated to working with vector and raster format geospatial data, and to displaying these types of data within a map. Let's take a look at how this is done.

Maps and map layers
A map consists of multiple layers drawn one on top of the other. There are three types of map layers supported by QGIS:
- Vector layer: This layer draws geospatial features such as points, lines, and polygons
- Raster layer: This layer draws raster (bitmapped) data onto a map
- Plugin layer: This layer allows a plugin to draw directly onto a map

Each of these types of map layers has a corresponding class within the qgis.core library. For example, a vector map layer is represented by an object of type qgis.core.QgsVectorLayer. We will take a closer look at vector and raster map layers shortly. Before we do this, though, we need to learn how geospatial data (both vector and raster data) is positioned on a map.

Coordinate reference systems
Since the Earth is a three-dimensional object, while maps represent the Earth's surface as a two-dimensional plane, there has to be a way of translating points on the Earth's surface into (x,y) coordinates within a map. This is done using a Coordinate Reference System (CRS). (Globe image courtesy of Wikimedia: http://commons.wikimedia.org/wiki/File:Rotating_globe.gif.)

A CRS has two parts: an ellipsoid, which is a mathematical model of the Earth's surface, and a projection, which is a formula that converts points on the surface of the spheroid into (x,y) coordinates on a map. Generally, you won't need to worry about all these details; you can simply select the appropriate CRS that matches the CRS of the data you are using. However, as many different coordinate reference systems have been devised over the years, it is vital that you use the correct CRS when plotting your geospatial data. If you don't, your features will be displayed in the wrong place or have the wrong shape.

The majority of geospatial data available today uses the EPSG 4326 coordinate reference system (sometimes also referred to as WGS84). This CRS defines coordinates as latitude and longitude values, and it is the default CRS used for new data imported into QGIS. However, if your data uses a different coordinate reference system, you will need to create and use a different CRS for your map layer.

The qgis.core.QgsCoordinateReferenceSystem class represents a CRS. Once you create your coordinate reference system, you can tell your map layer to use that CRS when accessing the underlying data. For example:

crs = QgsCoordinateReferenceSystem(4326,
          QgsCoordinateReferenceSystem.EpsgCrsId)
layer.setCrs(crs)

Note that different map layers can use different coordinate reference systems.
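Pulling these pieces together, here is a rough sketch of loading a shapefile-backed layer and assigning it a CRS explicitly; the file path and layer name are placeholders, and the snippet assumes it runs inside QGIS (for example, from the Python Console), where qgis.core is available:

from qgis.core import QgsVectorLayer, QgsCoordinateReferenceSystem

# load a vector layer from a shapefile using the OGR provider
layer = QgsVectorLayer("/path/to/data.shp", "my layer", "ogr")

# tell the layer to interpret its coordinates as EPSG 4326 (WGS84)
crs = QgsCoordinateReferenceSystem(4326,
          QgsCoordinateReferenceSystem.EpsgCrsId)
layer.setCrs(crs)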
Each layer will use its CRS when drawing the contents of the layer onto the map.

Vector layers
A vector layer draws geospatial data onto a map in the form of points, lines, polygons, and so on. Vector-format geospatial data is typically loaded from a vector data source such as a shapefile or database. Other vector data sources can hold vector data in memory, or load data from a web service across the internet.

A vector-format data source has a number of features, where each feature represents a single record within the data source. The qgis.core.QgsFeature class represents a feature within a data source. Each feature has the following components:
- ID: This is the feature's unique identifier within the data source.
- Geometry: This is the underlying point, line, polygon, and so on, which represents the feature on the map. For example, a data source representing cities would have one feature for each city, and the geometry would typically be either a point that represents the center of the city, or a polygon (or a multipolygon) that represents the city's outline.
- Attributes: These are key/value pairs that provide additional information about the feature. For example, a city data source might have attributes such as total_area, population, elevation, and so on. Attribute values can be strings, integers, or floating point numbers.

In QGIS, a data provider allows the vector layer to access the features within the data source. The data provider, an instance of qgis.core.QgsVectorDataProvider, includes:
- The geometry type that is stored in the data source
- A list of fields that provide information about the attributes stored for each feature
- The ability to search through the features within the data source, using the getFeatures() method and the QgsFeatureRequest class

You can access the various vector (and also raster) data providers by using the qgis.core.QgsProviderRegistry class.

The vector layer itself is represented by a qgis.core.QgsVectorLayer object. Each vector layer includes:
- A data provider: This is the connection to the underlying file or database that holds the geospatial information to be displayed
- A coordinate reference system: This indicates which CRS the geospatial data uses
- A renderer: This chooses how the features are to be displayed

Let's take a closer look at the concept of a renderer and how features are displayed within a vector map layer.

Displaying vector data
The features within a vector map layer are displayed using a combination of renderer and symbol objects. The renderer chooses the symbol to use for a given feature, and the symbol does the actual drawing. There are three basic types of symbols defined by QGIS:
- Marker symbol: This displays a point as a filled circle
- Line symbol: This draws a line using a given line width and color
- Fill symbol: This draws the interior of a polygon with a given color

These three types of symbols are implemented as subclasses of the qgis.core.QgsSymbolV2 class:
- qgis.core.QgsMarkerSymbolV2
- qgis.core.QgsLineSymbolV2
- qgis.core.QgsFillSymbolV2

Internally, symbols are rather complex, as "symbol layers" allow multiple elements to be drawn on top of each other. In most cases, however, you can make use of the "simple" version of a symbol. This makes it easier to create a new symbol without having to deal with the internal complexity of symbol layers.
For example:

symbol = QgsMarkerSymbolV2.createSimple({'width': 1.0, 'color': "255,0,0"})

While symbols draw the features onto the map, a renderer is used to choose which symbol to use to draw a particular feature. In the simplest case, the same symbol is used for every feature within a layer. This is called a single symbol renderer, and it is represented by the qgis.core.QgsSingleSymbolRendererV2 class. Other possibilities include:
- Categorized symbol renderer (qgis.core.QgsCategorizedSymbolRendererV2): This renderer chooses a symbol based on the value of an attribute. The categorized symbol renderer has a mapping from attribute values to symbols.
- Graduated symbol renderer (qgis.core.QgsGraduatedSymbolRendererV2): This type of renderer has a set of attribute value ranges and maps each range to an appropriate symbol.

Using a single symbol renderer is very straightforward:

symbol = ...
renderer = QgsSingleSymbolRendererV2(symbol)
layer.setRendererV2(renderer)

To use a categorized symbol renderer, you first define a list of qgis.core.QgsRendererCategoryV2 objects, and then use that to create the renderer. For example:

symbol_male = ...
symbol_female = ...

categories = []
categories.append(QgsRendererCategoryV2("M", symbol_male, "Male"))
categories.append(QgsRendererCategoryV2("F", symbol_female, "Female"))
renderer = QgsCategorizedSymbolRendererV2("", categories)
renderer.setClassAttribute("GENDER")
layer.setRendererV2(renderer)

Notice that the QgsRendererCategoryV2 constructor takes three parameters: the desired value, the symbol to use, and a label that describes the category.

Finally, to use a graduated symbol renderer, you define a list of qgis.core.QgsRendererRangeV2 objects and then use that to create your renderer. For example:

symbol1 = ...
symbol2 = ...

ranges = []
ranges.append(QgsRendererRangeV2(0, 10, symbol1, "Range 1"))
ranges.append(QgsRendererRangeV2(11, 20, symbol2, "Range 2"))

renderer = QgsGraduatedSymbolRendererV2("", ranges)
renderer.setClassAttribute("FIELD")
layer.setRendererV2(renderer)

Accessing vector data
In addition to displaying the contents of a vector layer within a map, you can use Python to directly access the underlying data. This can be done using the data provider's getFeatures() method. For example, to iterate over all the features within the layer, you can do the following:

provider = layer.dataProvider()
for feature in provider.getFeatures(QgsFeatureRequest()):
    ...

If you want to search for features based on some criteria, you can use the QgsFeatureRequest object's setFilterExpression() method, as follows:

provider = layer.dataProvider()
request = QgsFeatureRequest()
request.setFilterExpression('"GENDER" = "M"')
for feature in provider.getFeatures(request):
    ...

Once you have the features, it's easy to get access to each feature's geometry, ID, and attributes. For example:

geometry = feature.geometry()
id = feature.id()
name = feature.attribute("NAME")

The object returned by the feature.geometry() call, which will be an instance of qgis.core.QgsGeometry, represents the feature's geometry. This object has a number of methods you can use to extract the underlying data and perform various geospatial calculations.

Spatial indexes
In the previous section, we searched for features based on their attribute values. There are times, though, when you might want to find features based on their position in space. For example, you might want to find all the features that lie within a certain distance of a given point.
To do this, you can use a spatial index, which indexes features according to their location and extent. Spatial indexes are represented in QGIS by the QgsSpatialIndex class. For performance reasons, a spatial index is not created automatically for each vector layer. However, it's easy to create one when you need it:

provider = layer.dataProvider()
index = QgsSpatialIndex()
for feature in provider.getFeatures(QgsFeatureRequest()):
    index.insertFeature(feature)

Don't forget that you can use the QgsFeatureRequest.setFilterExpression() method to limit the set of features that get added to the index. Once you have the spatial index, you can use it to perform queries based on the position of the features. In particular:
- You can find one or more features that are closest to a given point using the nearestNeighbor() method. For example:

  features = index.nearestNeighbor(QgsPoint(long, lat), 5)

  Note that this method takes two parameters: the desired point as a QgsPoint object and the number of features to return.
- You can find all the features that intersect a given rectangular area by using the intersects() method, as follows:

  features = index.intersects(QgsRectangle(left, bottom, right, top))

Raster layers
Raster-format geospatial data is essentially a bitmapped image, where each pixel or "cell" in the image corresponds to a particular part of the Earth's surface. Raster data is often organized into bands, where each band represents a different piece of information. A common use for bands is to store the red, green, and blue components of a pixel's color in separate bands. Bands might also represent other types of information, such as moisture level, elevation, or soil type.

There are many ways in which raster information can be displayed. For example:
- If the raster data only has one band, the pixel value can be used as an index into a palette. The palette maps each pixel value to a particular color.
- If the raster data has only one band but no palette is provided, the pixel values can be used directly as grayscale values; that is, larger numbers are lighter and smaller numbers are darker. Alternatively, the pixel values can be passed through a pseudocolor algorithm to calculate the color to be displayed.
- If the raster data has multiple bands, then typically the bands are combined to generate the desired color. For example, one band might represent the red component of the color, another band might represent the green component, and yet another band might represent the blue component.
- Alternatively, a multiband raster data source might be drawn using a palette, or as a grayscale or pseudocolor image, by selecting a particular band to use for the color calculation.

Let's take a closer look at how raster data can be drawn onto the map.

Displaying raster data
The drawing style associated with the raster band controls how the raster data will be displayed. The following drawing styles are currently supported:
- PalettedColor: For a single band raster data source, a palette maps each raster value to a color.
- SingleBandGray: For a single band raster data source, the raster value is used directly as a grayscale value.
- SingleBandPseudoColor: For a single band raster data source, the raster value is used to calculate a pseudocolor.
- PalettedSingleBandGray: For a single band raster data source that has a palette, this drawing style tells QGIS to ignore the palette and use the raster value directly as a grayscale value.
- PalettedSingleBandPseudoColor: For a single band raster data source that has a palette, this drawing style tells QGIS to ignore the palette and use the raster value to calculate a pseudocolor.
- MultiBandColor: For multiband raster data sources, use a separate band for each of the red, green, and blue color components. For this drawing style, the setRedBand(), setGreenBand(), and setBlueBand() methods can be used to choose which band to use for each color component.
- MultiBandSingleBandGray: For multiband raster data sources, choose a single band to use as the grayscale color value. For this drawing style, use the setGrayBand() method to specify the band to use.
- MultiBandSingleBandPseudoColor: For multiband raster data sources, choose a single band to use to calculate a pseudocolor. For this drawing style, use the setGrayBand() method to specify the band to use.

To set the drawing style, use the layer.setDrawingStyle() method, passing in a string that contains the name of the desired drawing style. You will also need to call the various setXXXBand() methods, as described in the preceding list, to tell the raster layer which bands contain the value(s) to use to draw each pixel.

Note that QGIS doesn't automatically update the map when you call the preceding functions to change the way the raster data is displayed. To have your changes displayed right away, you'll need to do the following:
1. Turn off raster image caching. This can be done by calling layer.setImageCache(None).
2. Tell the raster layer to redraw itself by calling layer.triggerRepaint().

Accessing raster data
As with vector-format data, you can access the underlying raster data via the data provider's identify() method. The easiest way to do this is to pass in a single coordinate and retrieve the value or values at that coordinate. For example:

provider = layer.dataProvider()
values = provider.identify(QgsPoint(x, y),
                           QgsRaster.IdentifyFormatValue)
if values.isValid():
    for band, value in values.results().items():
        ...

As you can see, you need to check whether the given coordinate exists within the raster data (using the isValid() call). The values.results() method returns a dictionary that maps band numbers to values. Using this technique, you can extract all the underlying data associated with a given coordinate within the raster layer. You can also use the provider.block() method to retrieve the band data for a large number of coordinates all at once. We will look at how to do this later in this article.

Other useful qgis.core classes
Apart from all the classes and functionality involved in working with data sources and map layers, the qgis.core library also defines a number of other classes that you might find useful:
- QgsProject: This represents the current QGIS project. Note that this is a singleton object, as only one project can be open at a time. The QgsProject class is responsible for loading and storing properties, which can be useful for plugins.
- QGis: This class defines various constants, data types, and functions used throughout the QGIS system.
- QgsPoint: This is a generic class that stores the coordinates for a point within a two-dimensional plane.
- QgsRectangle: This is a generic class that stores the coordinates for a rectangular area within a two-dimensional plane.
- QgsRasterInterface: This is the base class to use for processing raster data.
  It can be used to reproject a set of raster data into a new coordinate system, to apply filters to change the brightness or color of your raster data, to resample the raster data, and to generate new raster data by rendering the existing data in various ways.
- QgsDistanceArea: This class can be used to calculate distances and areas for a given geometry, automatically converting from the source coordinate reference system into meters.
- QgsMapLayerRegistry: This class provides access to all the registered map layers in the current project.
- QgsMessageLog: This class provides general logging features within a QGIS program. It lets you send debugging messages, warnings, and errors to the QGIS "Log Messages" panel.

The qgis.gui package
The qgis.gui package defines a number of user-interface widgets that you can include in your programs. Let's start by looking at the most important qgis.gui classes, and follow this up with a brief look at some of the other classes that you might find useful.

The QgisInterface class
QgisInterface represents the QGIS system's user interface. It allows programmatic access to the map canvas, the menu bar, and other parts of the QGIS application. When running Python code within a script or a plugin, or directly from the QGIS Python console, a reference to QgisInterface is typically available through the iface global variable. The QgisInterface object is only available when running the QGIS application itself; if you are running an external application and import the PyQGIS library into it, QgisInterface won't be available.

Some of the more important things you can do with the QgisInterface object are:
- Get a reference to the list of layers within the current QGIS project via the legendInterface() method.
- Get a reference to the map canvas displayed within the main application window, using the mapCanvas() method.
- Retrieve the currently active layer within the project, using the activeLayer() method, and set the currently active layer by using the setActiveLayer() method.
- Get a reference to the application's main window by calling the mainWindow() method. This can be useful if you want to create additional Qt windows or dialogs that use the main window as their parent.
- Get a reference to the QGIS system's message bar by calling the messageBar() method. This allows you to display messages to the user directly within the QGIS main window.

The QgsMapCanvas class
The map canvas is responsible for drawing the various map layers into a window. The QgsMapCanvas class represents a map canvas. This class includes:
- A list of the currently shown map layers. This can be accessed using the layers() method. Note that there is a subtle difference between the list of map layers available within the map canvas and the list of map layers included in the QgisInterface.legendInterface() method. The map canvas's list of layers only includes the layers currently visible, while QgisInterface.legendInterface() returns all the map layers, including those that are currently hidden.
- The map units used by this map (meters, feet, degrees, and so on). The map's units can be retrieved by calling the mapUnits() method.
- An extent, which is the area of the map that is currently shown within the canvas. The map's extent will change as the user zooms in and out, and pans across the map. The current map extent can be obtained by calling the extent() method.
- A current map tool that controls the user's interaction with the contents of the map canvas.
  The current map tool can be set using the setMapTool() method, and you can retrieve the current map tool (if any) by calling the mapTool() method.
- A background color used to draw the background behind all the map layers. You can change the map's background color by calling the setCanvasColor() method.
- A coordinate transform that converts from map coordinates (that is, coordinates in the data source's coordinate reference system) to pixels within the window. You can retrieve the current coordinate transform by calling the getCoordinateTransform() method.

The QgsMapCanvasItem class
A map canvas item is an item drawn on top of the map canvas. The map canvas item will appear in front of the map layers. While you can create your own subclass of QgsMapCanvasItem if you want to draw custom items on top of the map canvas, it is generally more useful to make use of an existing subclass that will do the work for you. There are currently three subclasses of QgsMapCanvasItem that you might find useful:
- QgsVertexMarker: This draws an icon (an "X", a "+", or a small box) centered on a given point on the map.
- QgsRubberBand: This draws an arbitrary polygon or polyline onto the map. It is intended to provide visual feedback as the user draws a polygon onto the map.
- QgsAnnotationItem: This is used to display additional information about a feature, in the form of a balloon that is connected to the feature. The QgsAnnotationItem class has various subclasses that allow you to customize the way the information is displayed.

The QgsMapTool class
A map tool allows the user to interact with and manipulate the map canvas, capturing mouse events and responding appropriately. A number of QgsMapTool subclasses provide standard map interaction behavior, such as clicking to zoom in, dragging to pan the map, and clicking on a feature to identify it. You can also create your own custom map tools by subclassing QgsMapTool and implementing the various methods that respond to user-interface events such as pressing down the mouse button, dragging the canvas, and so on. Once you have created a map tool, you can allow the user to activate it by associating the map tool with a toolbar button. Alternatively, you can activate it from within your Python code by calling the mapCanvas.setMapTool(...) method. We will look at the process of creating a custom map tool in the Using the PyQGIS library section later in this article.

Other useful qgis.gui classes
While the qgis.gui package defines a large number of classes, the ones you are most likely to find useful are:
- QgsLegendInterface: This provides access to the map legend, that is, the list of map layers within the current project. Note that map layers can be grouped, hidden, and shown within the map legend.
- QgsMapTip: This displays a tip on a map canvas when the user holds the mouse over a feature. The map tip will show the display field for the feature; you can set this by calling layer.setDisplayField("FIELD").
- QgsColorDialog: This is a dialog box that allows the user to select a color.
- QgsDialog: This is a generic dialog with a vertical box layout and a button box, making it easy to add content and standard buttons to your dialog.
- QgsMessageBar: This is a user-interface widget for displaying non-blocking messages to the user. We looked at the message bar class in the previous article.
- QgsMessageViewer: This is a generic class that displays long messages to the user within a modal dialog.
- QgsBlendModeComboBox, QgsBrushStyleComboBox, QgsColorRampComboBox, QgsPenCapStyleComboBox, QgsPenJoinStyleComboBox, QgsScaleComboBox: These QComboBox user-interface widgets allow you to prompt the user for various drawing options. With the exception of QgsScaleComboBox, which lets the user choose a map scale, all these QComboBox subclasses let the user choose various Qt drawing options.

Using the PyQGIS library
In the previous section, we looked at a number of classes provided by the PyQGIS library. Let's make use of these classes to perform some real-world geospatial development tasks.

Analyzing raster data
We're going to start by writing a program to load some raster-format data and analyze its contents. To make this more interesting, we'll use a Digital Elevation Model (DEM) file, which is a raster-format data file that contains elevation data.

The Global Land One-Kilometer Base Elevation Project (GLOBE) provides free DEM data for the world, where each pixel represents one square kilometer of the Earth's surface. GLOBE data can be downloaded from http://www.ngdc.noaa.gov/mgg/topo/gltiles.html. Download the E tile, which includes the western half of the USA. The resulting file, which is named e10g, contains the height information you need. You'll also need to download the e10g.hdr header file so that QGIS can read the file; you can download this from http://www.ngdc.noaa.gov/mgg/topo/elev/esri/hdr. Once you've downloaded these two files, put them together in a convenient directory.

You can now load the DEM data into QGIS using the following code:

registry = QgsProviderRegistry.instance()
provider = registry.provider("gdal", "/path/to/e10g")

Unfortunately, there is a slight complication here: since QGIS doesn't know which coordinate reference system is used for the data, it displays a dialog box that asks you to choose the CRS. Since the GLOBE DEM data is in the WGS84 CRS, which QGIS uses by default, this dialog box is redundant. To disable it, add the following to the top of your program:

from PyQt4.QtCore import QSettings
QSettings().setValue("/Projections/defaultBehaviour", "useGlobal")

Now that we've loaded our raster DEM data into QGIS, we can analyze it. There are lots of things we can do with DEM data, so let's calculate how often each unique elevation value occurs within the data. Notice that we're loading the DEM data directly using QgsRasterDataProvider. We don't want to display this information on a map, so we don't want (or need) to load it into a QgsRasterLayer.

Since the DEM data is in a raster format, you need to iterate over the individual pixels or cells to get each height value. The provider.xSize() and provider.ySize() methods tell us how many cells are in the DEM, while the provider.extent() method gives us the area of the Earth's surface covered by the DEM. Using this information, we can extract the individual elevation values from the contents of the DEM in the following way:

raster_extent = provider.extent()
raster_width = provider.xSize()
raster_height = provider.ySize()
block = provider.block(1, raster_extent, raster_width, raster_height)

The returned block variable is an object of type QgsRasterBlock, which is essentially a two-dimensional array of values. Let's iterate over the raster and extract the individual elevation values:

for x in range(raster_width):
    for y in range(raster_height):
        elevation = block.value(x, y)
        ...

Now that we've loaded the individual elevation values, it's easy to build a histogram out of those values.
Here is the entire program to load the DEM data into memory, and calculate, and display the histogram: from PyQt4.QtCore import QSettingsQSettings().setValue("/Projections/defaultBehaviour", "useGlobal")registry = QgsProviderRegistry.instance()provider = registry.provider("gdal", "/path/to/e10g") raster_extent = provider.extent()raster_width = provider.xSize()raster_height = provider.ySize()no_data_value = provider.srcNoDataValue(1) histogram = {} # Maps elevation to number of occurrences. block = provider.block(1, raster_extent, raster_width,            raster_height)if block.isValid():  for x in range(raster_width):    for y in range(raster_height):      elevation = block.value(x, y)      if elevation != no_data_value:        try:          histogram[elevation] += 1        except KeyError:          histogram[elevation] = 1 for height in sorted(histogram.keys()):  print height, histogram[height] Note that we've added a no data value check to the code. Raster data often includes pixels that have no value associated with them. In the case of a DEM, elevation data is only provided for areas of land; pixels over the sea have no elevation, and we have to exclude them, or our histogram will be inaccurate. Manipulating vector data and saving it to a shapefile Let's create a program that takes two vector data sources, subtracts one set of vectors from the other, and saves the resulting geometries into a new shapefile. Along the way, we'll learn a few important things about the PyQGIS library. We'll be making use of the QgsGeometry.difference() function. This function performs a geometrical subtraction of one geometry from another, similar to this:  Let's start by asking the user to select the first shapefile and open up a vector data provider for that file: filename_1 = QFileDialog.getOpenFileName(iface.mainWindow(),                     "First Shapefile",                     "~", "*.shp")if not filename_1:  return registry = QgsProviderRegistry.instance()provider_1 = registry.provider("ogr", filename_1) We can then read the geometries from that file into memory: geometries_1 = []for feature in provider_1.getFeatures(QgsFeatureRequest()):  geometries_1.append(QgsGeometry(feature.geometry())) This last line of code does something very important that may not be obvious at first. Notice that we use the following: QgsGeometry(feature.geometry()) We use the preceding line instead of the following: feature.geometry() This creates a new instance of the QgsGeometry object, copying the geometry into a new object, rather than just adding the existing geometry object to the list. We have to do this because of a limitation of the way the QGIS Python wrappers work: the feature.geometry() method returns a reference to the geometry, but the C++ code doesn't know that you are storing this reference away in your Python code. So, when the feature is no longer needed, the memory used by the feature's geometry is also released. If you then try to access that geometry later on, the entire QGIS system will crash. To get around this, we make a copy of the geometry so that we can refer to it even after the feature's memory has been released. 
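Before moving on, here is a quick, standalone sketch of the difference() operation we are relying on. It can be pasted into the QGIS Python console; the two overlapping squares and their WKT coordinates are made up purely for illustration and are not part of the shapefile-based script we are building:

# A minimal sketch of QgsGeometry.difference(), using two made-up
# overlapping squares defined in WKT (illustrative values only).
from qgis.core import QgsGeometry

square_1 = QgsGeometry.fromWkt("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))")
square_2 = QgsGeometry.fromWkt("POLYGON((5 5, 15 5, 15 15, 5 15, 5 5))")

# Subtract the second square from the first; the result is an
# L-shaped polygon covering the part of square_1 outside square_2.
result = square_1.difference(square_2)

# Only keep the result if it is a valid, non-empty geometry.
if result.isGeosValid() and not result.isGeosEmpty():
    print result.exportToWkt()

The same pattern—obtain (or copy) a QgsGeometry, call difference(), and then check isGeosValid() and isGeosEmpty()—is exactly what the script below applies to real features read from the shapefiles.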
Now that we've loaded our first set of geometries into memory, let's do the same for the second shapefile: filename_2 = QFileDialog.getOpenFileName(iface.mainWindow(),                     "Second Shapefile",                     "~", "*.shp")if not filename_2:  return provider_2 = registry.provider("ogr", filename_2) geometries_2 = []for feature in provider_2.getFeatures(QgsFeatureRequest()):  geometries_2.append(QgsGeometry(feature.geometry())) With the two sets of geometries loaded into memory, we're ready to start subtracting one from the other. However, to make this process more efficient, we will combine the geometries from the second shapefile into one large geometry, which we can then subtract all at once, rather than subtracting one at a time. This will make the subtraction process much faster: combined_geometry = Nonefor geometry in geometries_2:  if combined_geometry == None:    combined_geometry = geometry  else:    combined_geometry = combined_geometry.combine(geometry) We can now calculate the new set of geometries by subtracting one from the other: dst_geometries = []for geometry in geometries_1:  dst_geometry = geometry.difference(combined_geometry)  if not dst_geometry.isGeosValid(): continue  if dst_geometry.isGeosEmpty(): continue  dst_geometries.append(dst_geometry) Notice that we check to ensure that the destination geometry is mathematically valid and isn't empty. Invalid geometries are a common problem when manipulating complex shapes. There are options for fixing them, such as splitting apart multi-geometries and performing a buffer operation.  Our last task is to save the resulting geometries into a new shapefile. We'll first ask the user for the name of the destination shapefile: dst_filename = QFileDialog.getSaveFileName(iface.mainWindow(),                      "Save results to:",                      "~", "*.shp")if not dst_filename:  return We'll make use of a vector file writer to save the geometries into a shapefile. Let's start by initializing the file writer object: fields = QgsFields()writer = QgsVectorFileWriter(dst_filename, "ASCII", fields,               dst_geometries[0].wkbType(),               None, "ESRI Shapefile")if writer.hasError() != QgsVectorFileWriter.NoError:  print "Error!"  return We don't have any attributes in our shapefile, so the fields list is empty. Now that the writer has been set up, we can save the geometries into the file: for geometry in dst_geometries:  feature = QgsFeature()  feature.setGeometry(geometry)  writer.addFeature(feature) Now that all the data has been written to the disk, let's display a message box that informs the user that we've finished: QMessageBox.information(iface.mainWindow(), "",            "Subtracted features saved to disk.") As you can see, creating a new shapefile is very straightforward in PyQGIS, and it's easy to manipulate geometries using Python—just so long as you copy QgsGeometry you want to keep around. If your Python code starts to crash while manipulating geometries, this is probably the first thing you should look for. Using different symbols for different features within a map Let's use World Borders Dataset that you downloaded in the previous article to draw a world map, using different symbols for different continents. This is a good example of using a categorized symbol renderer, though we'll combine it into a script that loads the shapefile into a map layer, and sets up the symbols and map renderer to display the map exactly as you want it. We'll then save this map as an image. 
Let's start by creating a map layer to display the contents of the World Borders Dataset shapefile: layer = iface.addVectorLayer("/path/to/TM_WORLD_BORDERS-0.3.shp",               "continents", "ogr") Each unique region code in the World Borders Dataset shapefile corresponds to a continent. We want to define the name and color to use for each of these regions, and use this information to set up the various categories to use when displaying the map: from PyQt4.QtGui import QColorcategories = []for value,color,label in [(0,   "#660000", "Antarctica"),                          (2,   "#006600", "Africa"),                          (9,   "#000066", "Oceania"),                          (19,  "#660066", "The Americas"),                          (142, "#666600", "Asia"),                          (150, "#006666", "Europe")]:  symbol = QgsSymbolV2.defaultSymbol(layer.geometryType())  symbol.setColor(QColor(color))  categories.append(QgsRendererCategoryV2(value, symbol, label)) With these categories set up, we simply update the map layer to use a categorized renderer based on the value of the region attribute, and then redraw the map: layer.setRendererV2(QgsCategorizedSymbolRendererV2("region",                          categories))layer.triggerRepaint() There's only one more thing to do; since this is a script that can be run multiple times, let's have our script automatically remove the existing continents layer, if it exists, before adding a new one. To do this, we can add the following to the start of our script: layer_registry = QgsMapLayerRegistry.instance() for layer in layer_registry.mapLayersByName("continents"):   layer_registry.removeMapLayer(layer.id()) When our script is running, it will create one (and only one) layer that shows the various continents in different colors. These will appear as different shades of gray in the printed article, but the colors will be visible on the computer screen: Now, let's use the same data set to color each country based on its relative population. 
We'll start by removing the existing population layer, if it exists: layer_registry = QgsMapLayerRegistry.instance()for layer in layer_registry.mapLayersByName("population"):  layer_registry.removeMapLayer(layer.id()) Next, we open the World Borders Dataset into a new layer called "population": layer = iface.addVectorLayer("/path/to/TM_WORLD_BORDERS-0.3.shp",               "population", "ogr") We then need to set up our various population ranges: from PyQt4.QtGui import QColorranges = []for min_pop,max_pop,color in [(0,        99999,     "#332828"),                              (100000,   999999,    "#4c3535"),                              (1000000,  4999999,   "#663d3d"),                              (5000000,  9999999,   "#804040"),                              (10000000, 19999999,  "#993d3d"),                              (20000000, 49999999,  "#b33535"),                              (50000000, 999999999, "#cc2828")]:  symbol = QgsSymbolV2.defaultSymbol(layer.geometryType())  symbol.setColor(QColor(color))  ranges.append(QgsRendererRangeV2(min_pop, max_pop,                   symbol, "")) Now that we have our population ranges and their associated colors, we simply set up a graduated symbol renderer to choose a symbol based on the value of the pop2005 attribute, and tell the map to redraw itself: layer.setRendererV2(QgsGraduatedSymbolRendererV2("pop2005",                         ranges))layer.triggerRepaint() The result will be a map layer that shades each country according to its population:  Calculating the distance between two user-defined points  In our final example of using the PyQGIS library, we'll write some code that, when run, starts listening for mouse events from the user. If the user clicks on a point, drags the mouse, and then releases the mouse button again, we will display the distance between those two points. This is a good example of how to add your  own map interaction logic to QGIS, using the QgsMapTool class. This is the basic structure for our QgsMapTool subclass: class DistanceCalculator(QgsMapTool):  def __init__(self, iface):    QgsMapTool.__init__(self, iface.mapCanvas())    self.iface = iface   def canvasPressEvent(self, event):    ...   def canvasReleaseEvent(self, event):    ... To make this map tool active, we'll create a new instance of it and pass it to the mapCanvas.setMapTool() method. Once this is done, our canvasPressEvent() and canvasReleaseEvent() methods will be called whenever the user clicks or releases the mouse button over the map canvas. Let's start with the code that handles the user clicking on the canvas. In this method, we're going to convert from the pixel coordinates that the user clicked on to the map coordinates (that is, a latitude and longitude value). We'll then remember these coordinates so that we can refer to them later. Here is the necessary code: def canvasPressEvent(self, event):  transform = self.iface.mapCanvas().getCoordinateTransform()  self._startPt = transform.toMapCoordinates(event.pos().x(),                        event.pos().y()) When the canvasReleaseEvent() method is called, we'll want to do the same with the point at which the user released the mouse button: def canvasReleaseEvent(self, event):  transform = self.iface.mapCanvas().getCoordinateTransform()  endPt = transform.toMapCoordinates(event.pos().x(),                    event.pos().y()) Now that we have the two desired coordinates, we'll want to calculate the distance between them. 
We can do this using a QgsDistanceArea object:      crs = self.iface.mapCanvas().mapRenderer().destinationCrs()  distance_calc = QgsDistanceArea()  distance_calc.setSourceCrs(crs)  distance_calc.setEllipsoid(crs.ellipsoidAcronym())  distance_calc.setEllipsoidalMode(crs.geographicFlag())  distance = distance_calc.measureLine([self._startPt,                     endPt]) / 1000 Notice that we divide the resulting value by 1000. This is because the QgsDistanceArea object returns the distance in meters, and we want to display the distance in kilometers. Finally, we'll display the calculated distance in the QGIS message bar:   messageBar = self.iface.messageBar()  messageBar.pushMessage("Distance = %d km" % distance,              level=QgsMessageBar.INFO,              duration=2) Now that we've created our map tool, we need to activate it. We can do this by adding the following to the end of our script: calculator = DistanceCalculator(iface)iface.mapCanvas().setMapTool(calculator) With the map tool activated, the user can click and drag on the map. When the mouse button is released, the distance (in kilometers) between the two points will be displayed in the message bar: Summary In this article, we took an in-depth look at the PyQGIS libraries and how you can use them in your own programs. We learned that the QGIS Python libraries are implemented as wrappers around the QGIS APIs implemented in C++. We saw how Python programmers can understand and work with the QGIS reference documentation, even though it is written for C++ developers. We also looked at the way the PyQGIS libraries are organized into different packages, and learned about the most important classes defined in the qgis.core and qgis.gui packages. We then saw how a coordinate reference systems (CRS) is used to translate from points on the three-dimensional surface of the Earth to coordinates within a two-dimensional map plane. We learned that vector format data is made up of features, where each feature has an ID, a geometry, and a set of attributes, and that symbols are used to draw vector geometries onto a map layer, while renderers are used to choose which symbol to use for a given feature. We learned how a spatial index can be used to speed up access to vector features. Next, we saw how raster format data is organized into bands that represent information such as color, elevation, and so on, and looked at the various ways in which a raster data source can be displayed within a map layer. Along the way, we learned how to access the contents of a raster data source. Finally, we looked at various techniques for performing useful tasks using the PyQGIS library. In the next article, we will learn more about QGIS Python plugins, and then go on to use the plugin architecture as a way of implementing a useful feature within a mapping application. Resources for Article:   Further resources on this subject: QGIS Feature Selection Tools [article] Server Logs [article]

Combining Vector and Raster Datasets

Packt
22 Dec 2014
12 min read
This article by Michael Dorman, the author of Learning R for Geospatial Analysis, explores the interplay between vector and raster layers, and the way it is implemented in the raster package. The way rasters and vector layers can be interchanged and queried one according to the other will be demonstrated through examples. (For more resources related to this topic, see here.) Creating vector layers from a raster The opposite operation to rasterization, which has been presented in the previous section, is the creation of vector layers from raster data. The procedure of extracting features of interest out of rasters, in the form of vector layers, is often necessary for analogous reasons underlying rasterization—when the data held in a raster is better represented using a vector layer, within the context of specific subsequent analysis or visualization tasks. Scenarios where we need to create points, lines, and polygons from a raster can all be encountered. In this section, we are going to see an example of each. Raster-to-points conversion In raster-to-points conversion, each raster cell center (excluding NA cells) is converted to a point. The resulting point layer has an attribute table with the values of the respective raster cells in it. Conversion to points can be done with the rasterToPoints function. This function has a parameter named spatial that determines whether the returned object is going to be SpatialPointsDataFrame or simply a matrix holding the coordinates, and the respective cell values (spatial=FALSE, the default value). For our purposes, it is thus important to remember to specify spatial=TRUE. As an example of a raster, let's create a subset of the raster r, with only layers 1-2, rows 1-3, and columns 1-3: > u = r[[1:2]][1:3, 1:3, drop = FALSE] To make the example more instructive, we will place NA in some of the cells and see how this affects the raster-to-points conversion: > u[2, 3] = NA > u[[1]][3, 2] = NA Now, we will apply rasterToPoints to create a SpatialPointsDataFrame object named u_pnt out of u: > u_pnt = rasterToPoints(u, spatial = TRUE) Let's visually examine the result we got with the first layer of u serving as the background: > plot(u[[1]]) > plot(u_pnt, add = TRUE) The graphical output is shown in the following screenshot: We can see that a point has been produced at the center of each raster cell, except for the cell at position (2,3), where we assigned NA to both layers. However, at the (3,2) position, NA has been assigned to only one of the layers (the first one); therefore, a point feature has been generated there nevertheless. The attribute table of u_pnt has eight rows (since there are eight points) and two columns (corresponding to the raster layers). > u_pnt@data layer.1 layer.2 1 0.4242 0.4518 2 0.3995 0.3334 3 0.4190 0.3430 4 0.4495 0.4846 5 0.2925 0.3223 6 0.4998 0.5841 7     NA 0.5841 8 0.7126 0.5086 We can see that the seventh point feature, the one corresponding to the (3,2) raster position, indeed contains an NA value corresponding to layer 1. Raster-to-contours conversion Creating points (see the previous section) and polygons (see the next section) from a raster is relatively straightforward. In the former case, points are generated at cell centroids, while in the latter, rectangular polygons are drawn according to cell boundaries. On the other hand, lines can be created from a raster using various different algorithms designed for more specific purposes. 
Two common procedures where lines are generated based on a raster are constructing contours (lines connecting locations of equal value on the raster) and finding least-cost paths (lines going from one location to another along the easiest route when cost of passage is defined by raster values). In this section, we will see an example of how to create contours (readers interested in least-cost path calculation can refer to the gdistance package, which provides this capability in R). As an example, we will create contours from the DEM of Haifa (dem). Creating contours can be done using the rasterToContour function. This function accepts a RasterLayer object and returns a SpatialLinesDataFrame object with the contour lines. The rasterToContour function internally uses the base function contourLines, and arguments can be passed to the latter as part of the rasterToContour function call. For example, using the levels parameter, we can specify the breaks where contours will be generated (rather than letting them be determined automatically). The raster dem consists of elevation values ranging between -14 meters and 541 meters: > range(dem[], na.rm = TRUE) [1] -14 541 Therefore, we may choose to generate six contour lines, at 0, 100, 200, …, 500 meter levels: > dem_contour = rasterToContour(dem, levels = seq(0, 500, 100)) Now, we will plot the resulting SpatialLinesDataFrame object on top of the dem raster: > plot(dem) > plot(dem_contour, add = TRUE) The graphical output is shown in the following screenshot: Mount Carmel is densely covered with elevation contours compared to the plains surrounding it, which are mostly within the 0-100 meter elevation range and thus, have only few a contour lines. Let's take a look at the attribute table of dem_contour: > dem_contour@data    level C_1     0 C_2   100 C_3   200 C_4   300 C_5   400 C_6   500 Indeed, the layer consists of six line features—one for each break we specified with the levels argument. Raster-to-polygons conversion As mentioned previously, raster-to-polygons conversion involves the generation of rectangular polygons in the place of each raster cell (once again, excluding NA cells). Similar to the raster-to-points conversion, the resulting attribute table contains the respective raster values for each polygon created. The conversion to polygons is most useful with categorical rasters when we would like to generate polygons defining certain areas in order to exploit the analysis tools this type of data is associated with (such as extraction of values from other rasters, geometry editing, and overlay). Creation of polygons from a raster can be performed with a function whose name the reader may have already guessed, rasterToPolygons. A useful option in this function is to immediately dissolve the resulting polygons according to their attribute table values; that is, all polygons having the same value are dissolved into a single feature. This functionality internally utilizes the rgeos package and it can be triggered by specifying dissolve=TRUE. In our next example, we will visually compare the average NDVI time series of Lahav and Kramim forests (see earlier), based on all of our Landsat (three dates) and MODIS (280 dates) satellite images. In this article, we will only prepare the necessary data by going through the following intermediate steps: Creating the Lahav and Kramim forests polygonal layer. Extracting NDVI values from the satellite images. Creating a data.frame object that can be passed to graphical functions later. 
Commencing with the first step, using l_rec_focal_clump, we will first create a polygonal layer holding all NDVI>0.2 patches, then subset only those two polygons corresponding to Lahav and Kramim forests. The former is achieved using rasterToPolygons with dissolve=TRUE, converting the patches in l_rec_focal_clumpto 507 individual polygons in a new SpatialPolygonsDataFrame that we hereby name pol: > pol = rasterToPolygons(l_rec_focal_clump, dissolve = TRUE) Plotting pol will show that we have quite a few large patches and many small ones. Since the Lahav and Kramim forests are relatively large, to make things easier, we can omit all polygons with area less than or equal to 1 km2: > pol$area = gArea(pol, byid = TRUE) / 1000^2 > pol = pol[pol$area > 1, ] The attribute table shows that we are left with eight polygons, with area sizes of 1-10 km2. The clumps column, by the way, is where the original l_rec_focal_clump raster value (the clump ID) has been kept ("clumps" is the name of the l_rec_focal_clump raster layer from which the values came). > pol@data    clumps   area 112     2 1.2231 114   200 1.3284 137   221 1.9314 203   281 9.5274 240   314 6.7842 371   432 2.0007 445     5 10.2159 460     56 1.0998 Let's make a map of pol: > plotRGB(l_00, r = 3, g = 2, b = 1, stretch = "lin") > plot(pol, border = "yellow", lty = "dotted", add = TRUE) The graphical output is shown in the following screenshot: The preceding screenshot shows the continuous NDVI>0.2 patches, which are 1 km2 or larger, within the studied area. Two of these, as expected, are the forests we would like to examine. How can we select them? Obviously, we could export pol to a Shapefile and select the features of interest interactively in a GIS software (such as QGIS), then import the result back into R to continue our analysis. The raster package also offers some capabilities for interactive selection (that we do not cover here); for example, a function named click can be used to obtain the properties of the pol features we click in a graphical window such as the one shown in the preceding screenshot. However, given the purpose of this book, we will try to write a code to make the selection automatically without further user input. To write a code that makes the selection, we must choose a certain criterion (either spatial or nonspatial) that separates the features of interest. In this case, for example, we can see that the pol features we wish to select are those closest to Lahav Kibbutz. Therefore, we can utilize the towns point layer (see earlier) to find the distance of each polygon from Lahav Kibbutz, and select the two most proximate ones. Using the gDistance function, we will first find out the distances between each polygon in pol and each point in towns: > dist_towns = gDistance(towns, pol, byid = TRUE) > dist_towns              1         2 112 14524.94060 12697.151 114 5484.66695 7529.195 137 3863.12168 5308.062 203   29.48651 1119.090 240 1910.61525 6372.634 371 11687.63594 11276.683 445 12751.21123 14371.268 460 14860.25487 12300.319 The returned matrix, named dist_towns, contains the pairwise distances, with rows corresponding to the pol feature and columns corresponding to the towns feature. Since Lahav Kibbutz corresponds to the first towns feature (column "1"), we can already see that the fourth and fifth pol features (rows "203" and "240") are the most proximate ones, thus corresponding to the Lahav and Kramim forests. We could subset both forests by simply using their IDs—pol[c("203","240"),]. 
However, as always, we are looking for general code that will select, in this case, the two closest features irrespective of the specific IDs or row indices. For this purpose, we can use the order function, which we have not encountered so far. This function, given a numeric vector, returns the element indices in an increasing order according to element values. For example, applying order to the first column of dist_towns, we can see that the smallest element in this column is in the fourth row, the second smallest is in the fifth row, the third smallest is in the third row, and so on: > dist_order = order(dist_towns[, 1]) > dist_order [1] 4 5 3 2 6 7 1 8 We can use this result to select the relevant features of pol as follows: > forests = pol[dist_order[1:2], ] The subset SpatialPolygonsDataFrame, named forests, now contains only the two features from pol corresponding to the Lahav and Kramim forests. > forests@data    clumps   area 203   281 9.5274 240   314 6.7842 Let's visualize forests within the context of the other data we have by now. We will plot, once again, l_00 as the RGB background and pol on top of it. In addition, we will plot forests (in red) and the location of Lahav Kibbutz (as a red point). We will also add labels for each feature in pol, corresponding to its distance (in meters) from Lahav Kibbutz: > plotRGB(l_00, r = 3, g = 2, b = 1, stretch = "lin") > plot(towns[1, ], col = "red", pch = 16, add = TRUE) > plot(pol, border = "yellow", lty = "dotted", add = TRUE) > plot(forests, border = "red", lty = "dotted", add = TRUE) > text(gCentroid(pol, byid = TRUE), + round(dist_towns[,1]), + col = "White") The graphical output is shown in the following screenshot: The preceding screenshot demonstrates that we did indeed correctly select the features of interest. We can also assign the forest names to the attribute table of forests, relying on our knowledge that the first feature of forests (ID "203") is larger and more proximate to Lahav Kibbutz and corresponds to the Lahav forest, while the second feature (ID "240") corresponds to Kramim. > forests$name = c("Lahav", "Kramim") > forests@data    clumps   area   name 203   281 9.5274 Lahav 240   314 6.7842 Kramim We now have a polygonal layer named forests, with two features delineating the Lahav and Kramim forests, named accordingly in the attribute table. In the next section, we will proceed with extracting the NDVI data for these forests. Summary In this article, we closed the gap between the two main spatial data types (rasters and vector layers). We now know how to make the conversion from a vector layer to raster and vice versa, and we can transfer the geometry and data components from one data model to another when the need arises. We also saw how raster values can be extracted from a raster according to a vector layer, which is a fundamental step in many analysis tasks involving raster data. Resources for Article:  Further resources on this subject: Data visualization[article] Machine Learning in Bioinformatics[article] Specialized Machine Learning Topics[article]

Pipeline and Producer-consumer Design Patterns

Packt
20 Dec 2014
48 min read
In this article created by Rodney Ringler, the author of C# Multithreaded and Parallel Programming, we will explore two popular design patterns to solve concurrent problems—Pipeline and producer-consumer, which are used in developing parallel applications using the TPL. A Pipeline design is one where an application is designed with multiple tasks or stages of functionality with queues of work items between them. So, for each stage, the application will read from a queue of work to be performed, execute the work on that item, and then queue the results for the next stage. By designing the application this way, all of the stages can execute in parallel. Each stage just reads from its work queue, performs the work, and puts the results of the work into the queue for the next stage. Each stage is a task and can run independently of the other stages or tasks. They continue executing until their queue is empty and marked completed. They also block and wait for more work items if the queue is empty but not completed. The producer-consumer design pattern is a similar concept but different. In this design, we have a set of functionality that produces data that is then consumed by another set of functionality. Each set of functionality is a TPL task. So, we have a producer task and a consumer task, with a buffer between them. Each of these tasks can run independently of each other. We can also have multiple producer tasks and multiple consumer tasks. The producers run independently and produce queue results to the buffer. The consumers run independently and dequeue from the buffer and perform work on the item. The producer can block if the buffer is full and wait for room to become available before producing more results. Also, the consumer can block if the buffer is empty, waiting on more results to be available to consume. In this article, you will learn the following: Designing an application with a Pipeline design Designing an application with a producer-consumer design Learning how to use BlockingCollection Learning how to use BufferedBlocks Understanding the classes of the System.Threading.Tasks.Dataflow library (For more resources related to this topic, see here.) Pipeline design pattern The Pipeline design is very useful in parallel design when you can divide an application up into series of tasks to be performed in such a way that each task can run concurrently with other tasks. It is important that the output of each task is in the same order as the input. If the order does not matter, then a parallel loop can be performed. When the order matters and we don't want to wait until all items have completed task A before the items start executing task B, then a Pipeline implementation is perfect. Some applications that lend themselves to pipelining are video streaming, compression, and encryption. In each of these examples, we need to perform a set of tasks on the data and preserve the data's order, but we do not want to wait for each item of data to perform a task before any of the data can perform the next task. The key class that .NET has provided for implementing this design pattern is BlockingCollection of the System.Collections.Concurrent namespace. The BlockingCollection class was introduced with .NET 4.5. It is a thread-safe collection specifically designed for producer-consumer and Pipeline design patterns. It supports concurrently adding and removing items by multiple threads to and from the collection. 
It also has methods to add and remove that block when the collection is full or empty. You can specify a maximum collection size to ensure a producing task that outpaces a consuming task does not make the queue too large. It supports cancellation tokens. Finally, it supports enumerations so that you can use the foreach loop when processing items of the collection. A producer of items to the collection can call the CompleteAdding method when the last item of data has been added to the collection. Until this method is called if a consumer is consuming items from the collection with a foreach loop and the collection is empty, it will block until an item is put into the collection instead of ending the loop. Next, we will see a simple example of a Pipeline design implementation using an encryption program. This program will implement three stages in our pipeline. The first stage will read a text file character-by-character and place each character into a buffer (BlockingCollection). The next stage will read each character out of the buffer and encrypt it by adding 1 to its ASCII number. It will then place the new character into our second buffer and write it to an encryption file. Our final stage will read the character out of the second buffer, decrypt it to its original character, and write it out to a new file and to the screen. As you will see, stages 2 and 3 will start processing characters before stage 1 has finished reading all the characters from the input file. And all of this will be done while maintaining the order of the characters so that the final output file is identical to the input file: Let's get started. How to do it First, let's open up Visual Studio and create a new Windows Presentation Foundation (WPF) application named PipeLineApplication and perform the following steps: Create a new class called Stages.cs. Next, make sure it has the following using statements. using System; using System.Collections.Concurrent; using System.Collections.Generic; using System.IO; using System.Linq; using System.Text; using System.Threading.Tasks; using System.Threading; In the MainWindow.xaml.cs file, make sure the following using statements are present: using System; using System.Collections.Concurrent; using System.Collections.Generic; using System.IO; using System.Linq; using System.Text; using System.Threading.Tasks; using System.Threading; Next, we will add a method for each of the three stages in our pipeline. First, we will create a method called FirstStage. It will take two parameters: one will be a BlockingCollection object that will be the output buffer of this stage, and the second will be a string pointing to the input data file. This will be a text file containing a couple of paragraphs of text to be encrypted. We will place this text file in the projects folder on C:. The FirstStage method will have the following code: public void FirstStage(BlockingCollection<char> output, String PipelineInputFile)        {            String DisplayData = "";            try            {                foreach (char C in GetData(PipelineInputFile))                { //Displayed characters read in from the file.                   DisplayData = DisplayData + C.ToString();   // Add each character to the buffer for the next stage.                    output.Add(C);                  }            }            finally            {                output.CompleteAdding();             }      } Next, we will add a method for the second stage called StageWorker. 
This method will not return any values and will take three parameters. One will be a BlockingCollection value that will be its input buffer, the second one will be the output buffer of the stage, and the final one will be a file path to store the encrypted text in a data file. The code for this method will look like this: public void StageWorker(BlockingCollection<char> input, BlockingCollection<char> output, String PipelineEncryptFile)        {            String DisplayData = "";              try            {                foreach (char C in input.GetConsumingEnumerable())                {                    //Encrypt each character.                    char encrypted = Encrypt(C);                      DisplayData = DisplayData + encrypted.ToString();   //Add characters to the buffer for the next stage.                    output.Add(encrypted);                  }   //write the encrypted string to the output file.                 using (StreamWriter outfile =                            new StreamWriter(PipelineEncryptFile))                {                    outfile.Write(DisplayData);                }              }            finally            {                output.CompleteAdding();            }        } Now, we will add a method for the third and final stage of the Pipeline design. This method will be named FinalStage. It will not return any values and will take two parameters. One will be a BlockingCollection object that is the input buffer and the other will be a string pointing to an output data file. It will have the following code in it: public void FinalStage(BlockingCollection<char> input, String PipelineResultsFile)        {            String OutputString = "";            String DisplayData = "";              //Read the encrypted characters from the buffer, decrypt them, and display them.            foreach (char C in input.GetConsumingEnumerable())            {                //Decrypt the data.                char decrypted = Decrypt(C);                  //Display the decrypted data.                DisplayData = DisplayData + decrypted.ToString();                  //Add to the output string.                OutputString += decrypted.ToString();              }              //write the decrypted string to the output file.            using (StreamWriter outfile =                        new StreamWriter(PipelineResultsFile))            {                outfile.Write(OutputString);            }        } Now that we have methods for the three stages of our pipeline, let's add a few utility methods. The first of these methods will be one that reads in the input data file and places each character in the data file in a List object. This method will take a string parameter that has a filename and will return a List object of characters. It will have the following code: public List<char> GetData(String PipelineInputFile)        {            List<char> Data = new List<char>();              //Get the Source data.            using (StreamReader inputfile = new StreamReader(PipelineInputFile))            {                while (inputfile.Peek() >= 0)                {                    Data.Add((char)inputfile.Read());                }              }              return Data;        } Now we will need a method to encrypt the characters. This will be a simple encryption method. The encryption method is not really important to this exercise. This exercise is designed to demonstrate the Pipeline design, not implement the world's toughest encryption. 
This encryption will simply take each character and add one to its ASCII numerical value. The method will take a character type as an input parameter and return a character. The code for it will be as follows: public char Encrypt(char C)        {            //Take the character, convert to an int, add 1, then convert back to a character.            int i = (int)C;            i = i + 1;            C = Convert.ToChar(i);              return C; } Now we will add one final method to the Stages class to decrypt a character value. It will simply do the reverse of the encrypt method. It will take the ASCII numerical value and subtract 1. The code for this method will look like this: public char Decrypt(char C)      {            int i = (int)C;            i = i - 1;            C = Convert.ToChar(i);              return C;        } Now that we are done with the Stages class, let's switch our focus back to the MainWindow.xaml.cs file. First, you will need to add three using statements. They are for the StreamReader, StreamWriter, Threads, and BlockingCollection classes: using System.Collections.Concurrent; using System.IO; using System.Threading; At the top of the MainWindow class, we need four variables available for the whole class. We need three strings that point to our three data files—the input data, encrypted data, and output data. Then we will need a Stages object. These declarations will look like this: private static String PipelineResultsFile = @"c:projectsOutputData.txt";        private static String PipelineEncryptFile = @"c:projectsEncryptData.txt";        private static String PipelineInputFile = @"c:projectsInputData.txt";        private Stages Stage; Then, in the MainWindow constructor method, right after the InitializeComponent call, add a line to instantiate our Stages object: //Create the Stage object and register the event listeners to update the UI as the stages work. Stage = new Stages(); Next, add a button to the MainWindow.xaml file that will initiate the pipeline and encryption. Name this button control butEncrypt, and set its Content property to Encrypt File. Next, add a click event handler for this button in the MainWindow.xaml.cs file. Its event handler method will be butEncrypt_Click and will contain the main code for this application. It will instantiate two BlockingCollection objects for two queues. One queue between stages 1 and 2, and one queue between stages 2 and 3. This method will then create a task for each stage that executes the corresponding methods from the Stages classes. It will then start these three tasks and wait for them to complete. Finally, it will write the output of each stage to the input, encrypted, and results data files and text blocks for viewing. The code for it will look like the following code: private void butEncrpt_Click(object sender, RoutedEventArgs e)        {            //PipeLine Design Pattern              //Create queues for input and output to stages.            
int size = 20;            BlockingCollection<char> Buffer1 = new BlockingCollection<char>(size);            BlockingCollection<char> Buffer2 = new BlockingCollection<char>(size);              TaskFactory tasks = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.None);              Task Stage1 = tasks.StartNew(() => Stage.FirstStage(Buffer1, PipelineInputFile));            Task Stage2 = tasks.StartNew(() => Stage.StageWorker(Buffer1, Buffer2, PipelineEncryptFile));            Task Stage3 = tasks.StartNew(() => Stage.FinalStage(Buffer2, PipelineResultsFile));              Task.WaitAll(Stage1, Stage2, Stage3);              //Display the 3 files.            using (StreamReader inputfile = new StreamReader(PipelineInputFile))            {                while (inputfile.Peek() >= 0)                {                    tbStage1.Text = tbStage1.Text + (char)inputfile.Read();                }              }            using (StreamReader inputfile = new StreamReader(PipelineEncryptFile))            {                 while (inputfile.Peek() >= 0)                {                    tbStage2.Text = tbStage2.Text + (char)inputfile.Read();                }              }            using (StreamReader inputfile = new StreamReader(PipelineResultsFile))             {                while (inputfile.Peek() >= 0)                {                    tbStage3.Text = tbStage3.Text + (char)inputfile.Read();                }              }      } One last thing. Let's add three textblocks to display the outputs. We will call these tbStage1, tbStage2, and tbStage3. We will also add three label controls with the text Input File, Encrypted File, and Output File. These will be placed by the corresponding textblocks. Now, the MainWindow.xaml file should look like the following screenshot: Now we will need an input data file to encrypt. We will call this file InputData.txt and put it in the C:projects folder on our computer. For our example, we have added the following text to it: We are all finished and ready to try it out. Compile and run the application and you should have a window that looks like the following screenshot: Now, click on the Encrypt File button and you should see the following output: As you can see, the input and output files look the same and the encrypted file looks different. Remember that Input File is the text we put in the input data text file; this is the input from the end of stage 1 after we have read the file in to a character list. Encrypted File is the output from stage 2 after we have encrypted each character. Output File is the output of stage 3 after we have decrypted the characters again. It should match Input File. Now, let's take a look at how this works. How it works Let's look at the butEncrypt click event handler method in the MainWindow.xaml.cs file, as this is where a lot of the action takes place. Let's examine the following lines of code:            //Create queues for input and output to stages.            
int size = 20;            BlockingCollection<char> Buffer1 = new BlockingCollection<char>(size);            BlockingCollection<char> Buffer2 = new BlockingCollection<char>(size);            TaskFactory tasks = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.None);              Task Stage1 = tasks.StartNew(() => Stage.FirstStage(Buffer1, PipelineInputFile));            Task Stage2 = tasks.StartNew(() => Stage.StageWorker(Buffer1, Buffer2, PipelineEncryptFile));            Task Stage3 = tasks.StartNew(() => Stage.FinalStage(Buffer2, PipelineResultsFile)); First, we create two queues that are implemented using BlockingCollection objects. Each of these is set with a size of 20 items. These two queues take a character datatype. Then we create a TaskFactory object and use it to start three tasks. Each task uses a lambda expression that executes one of the stages methods from the Stages class—FirstStage, StageWorker, and FinalStage. So, now we have three separate tasks running besides the main UI thread. Stage1 will read the input data file character by character and place each character in the queue Buffer1. Remember that this queue can only hold 20 items before it will block the FirstStage method waiting on room in the queue. This is how we know that Stage2 starts running before Stage1 completes. Otherwise, Stage1 will only queue the first 20 characters and then block. Once Stage1 has read all of the characters from the input file and placed them into Buffer1, it then makes the following call:            finally            {                output.CompleteAdding();            } This lets the BlockingCollection instance, Buffer1, to know that there are no more items to be put in the queue. So, when Stage2 has emptied the queue after Stage1 has called this method, it will not block but will instead continue until completion. Prior to the CompleteAdding method call, Stage2 will block if Buffer1 is empty, waiting until more items are placed in the queue. This is why a BlockingCollection instance was developed for Pipeline and producer-consumer applications. It provides the perfect mechanism for this functionality. When we created the TaskFactory, we used the following parameter: TaskCreationOptions.LongRunning This tells the threadpool that these tasks may run for a long time and could occasionally block waiting on their queues. In this way, the threadpool can decide how to best manage the threads allocated for these tasks. Now, let's look at the code in Stage2—the StageWorker method. We need a way to remove items in an enumerable way so that we can iterate over the queues items with a foreach loop because we do not know how many items to expect. Also, since BlockingCollection objects support multiple consumers, we need a way to remove items that no other consumer might remove. We use this method of the BlockingCollection class: foreach (char C in input.GetConsumingEnumerable()) This allows multiple consumers to remove items from a BlockingCollection instance while maintaining the order of the items. To further improve performance of this application (assuming we have enough available processing cores), we could create a fourth task that also runs the StageWorker method. So, then we would have two stages and two tasks running. This might be helpful if there are enough processing cores and stage 1 runs faster than stage 2. If this happens, it will continually fill the queue and block until space becomes available. 
But if we run multiple stage 2 tasks, then we will be able to keep up with stage 1. Then, finally we have this line of code: Task.WaitAll(Stage1, Stage2, Stage3); This tells our button handler to wait until all of the tasks are complete. Once we have called the CompleteAdding method on each BlockingCollection instance and the buffers are then emptied, all of our stages will complete and the TaskFactory.WaitAll command will be satisfied and this method on the UI thread can complete its processing, which in this application is to update the UI and data files:            //Display the 3 files.            using (StreamReader inputfile = new StreamReader(PipelineInputFile))            {                while (inputfile.Peek() >= 0)                {                    tbStage1.Text = tbStage1.Text + (char)inputfile.Read();                }              }            using (StreamReader inputfile = new StreamReader(PipelineEncryptFile))            {                while (inputfile.Peek() >= 0)                {                    tbStage2.Text = tbStage2.Text + (char)inputfile.Read();                }              }            using (StreamReader inputfile = new StreamReader(PipelineResultsFile))            {                while (inputfile.Peek() >= 0)                {                    tbStage3.Text = tbStage3.Text + (char)inputfile.Read();                }              } Next, experiment with longer running, more complex stages and multiple consumer stages. Also, try stepping through the application with the Visual Studio debugger. Make sure you understand the interaction between the stages and the buffers. Explaining message blocks Let's talk for a minute about message blocks and the TPL. There is a new library that Microsoft has developed as part of the TPL, but it does not ship directly with .NET 4.5. This library is called the TPL Dataflow library. It is located in the System.Threading.Tasks.Dataflow namespace. It comes with various dataflow components that assist in asynchronous concurrent applications where messages need to be passed between multiple tasks or the data needs to be passed when it becomes available, as in the case of a web camera streaming video. The Dataflow library's message blocks are very helpful for design patterns such as Pipeline and producer-consumer where you have multiple producers producing data that can be consumed by multiple consumers. The two that we will take a look at are BufferBlock and ActionBlock. The TPL Dataflow library contains classes to assist in message passing and parallelizing I/O-heavy applications that have a lot of throughput. It provides explicit control over how data is buffered and passed. Consider an application that asynchronously loads large binary files from storage and manipulates that data. Traditional programming requires that you use callbacks and synchronization classes, such as locks, to coordinate tasks and have access to data that is shared. By using the TPL Dataflow objects, you can create objects that process image files as they are read in from a disk location. You can set how data is handled when it becomes available. Because the CLR runtime engine manages dependencies between data, you do not have to worry about synchronizing access to shared data. Also, since the CLR engine schedules the work depending on the asynchronous arrival of data, the TPL Dataflow objects can improve performance by managing the threads the tasks run on. In this section, we will cover two of these classes, BufferBlock and ActionBlock. 
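To get a feel for how a BufferBlock behaves before we look at the two classes in detail, here is a minimal, self-contained sketch. The class name and values are made up for illustration, and it assumes the Dataflow NuGet package described in the note that follows is already installed:

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class BufferBlockSketch
{
    static void Main()
    {
        // A BufferBlock acts as a first-in-first-out queue of messages.
        BufferBlock<int> buffer = new BufferBlock<int>();

        // The producer side posts messages into the block...
        for (int i = 0; i < 5; i++)
        {
            buffer.Post(i);
        }
        buffer.Complete();   // ...and marks it complete when done.

        // The consumer side receives messages until the block reports
        // that no more output will ever become available.
        Task consumer = Task.Run(async () =>
        {
            while (await buffer.OutputAvailableAsync())
            {
                Console.WriteLine(buffer.Receive());
            }
        });

        consumer.Wait();
    }
}

The producer posts items and calls Complete, while the consumer awaits OutputAvailableAsync and receives items until the block is drained—the same shape we will use in the producer-consumer application later in this article.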
The TPL Dataflow library (System.Threading.Tasks.Dataflow) does not ship with .NET 4.5. To install System.Threading.Tasks.Dataflow, open your project in Visual Studio, select Manage NuGet Packages from under the Project menu and then search online for Microsoft.Tpl.Dataflow. BufferBlock The BufferBlock object in the Dataflow library provides a buffer to store data. The syntax is, BufferBlock<T>. The T indicates that the datatype is generic and can be of any type. All static variables of this object type are guaranteed to be thread-safe. BufferBlock is an asynchronous message structure that stores messages in a first-in-first-out queue. Messages can be "posted" to the queue by multiple producers and "received" from the queue by multiple consumers. The TPL DatafLow library provides interfaces for three types of objects—source blocks, target blocks, and propagator blocks. BufferBlock is a general-purpose message block that can act as both a source and a target message buffer, which makes it perfect for a producer-consumer application design. To act as both a source and a target, it implements two interfaces defined by the TPL Dataflow library—ISourceBlock<TOutput> and ITargetBlock<TOutput>. So, in the application that we will develop in the Producer-consumer design pattern section of this article, you will see that the producer method implements BufferBlock using the ITargetBlock interface and the consumer implements BufferBlock with the ISourceBlock interface. This will be the same BufferBlock object that they will act on but by defining their local objects with a different interface there will be different methods available to use. The producer method will have Post and Complete methods, and the consumer method will use the OutputAvailableAsync and Receive methods. The BufferBlock object only has two properties, namely Count, which is a count of the number of data messages in the queue, and Completion, which gets a task that is an asynchronous operation and completion of the message block. The following is a set of methods for this class: Referenced from http://msdn.microsoft.com/en-us/library/hh160414(v=vs.110).aspx Here is a list of the extension methods provided by the interfaces that it implements: Referenced from http://msdn.microsoft.com/en-us/library/hh160414(v=vs.110).aspx Finally, here are the interface references for this class: Referenced from http://msdn.microsoft.com/en-us/library/hh160414(v=vs.110).aspx So, as you can see, these interfaces make using the BufferBlock object as a general-purpose queue between stages of a pipeline very easy. This technique is also useful between producers and consumers in a producer-consumer design pattern. ActionBlock Another very useful object in the Dataflow library is ActionBlock. Its syntax is ActionBlock<TInput>, where TInput is an Action object. ActionBlock is a target block that executes a delegate when a message of data is received. The following is a very simple example of using an ActionBlock:            ActionBlock<int> action = new ActionBlock<int>(x => Console.WriteLine(x));              action.Post(10); In this sample piece of code, the ActionBlock object is created with an integer parameter and executes a simple lambda expression that does a Console.WriteLine when a message of data is posted to the buffer. So, when the action.Post(10) command is executed, the integer, 10, is posted to the ActionBlock buffer and then the ActionBlock delegate, implemented as a lambda expression in this case, is executed. 
In this example, since this is a target block, we would then need to call the Complete method to ensure the message block is completed. Another handy method of the BufferBlock is the LinkTo method. This method allows you to link ISourceBlock to ITargetBlock. So, you can have a BufferBlock that is implemented as an ISourceBlock and link it to an ActionBlock since it is an ITargetBlock. In this way, an Action delegate can be executed when a BufferBlock receives data. This does not dequeue the data from the message block. It just allows you to execute some task when data is received into the buffer. ActionBlock only has two properties, namely InputCount, which is a count of the number of data messages in the queue, and Completion, which gets a task that is an asynchronous operation and completion of the message block. It has the following methods: Referenced from http://msdn.microsoft.com/en-us/library/hh194684(v=vs.110).aspx The following extension methods are implemented from its interfaces: Referenced from http://msdn.microsoft.com/en-us/library/hh194684(v=vs.110).aspx Also, it implements the following interfaces: Referenced from http://msdn.microsoft.com/en-us/library/hh194684(v=vs.110).aspx Now that we have examined a little of the Dataflow library that Microsoft has developed, let's use it in a producer-consumer application. Producer-consumer design pattern Now, that we have covered the TPL's Dataflow library and the set of objects it provides to assist in asynchronous message passing between concurrent tasks, let's take a look at the producer-consumer design pattern. In a typical producer-consumer design, we have one or more producers putting data into a queue or message data block. Then we have one or more consumers taking data from the queue and processing it. This allows for asynchronous processing of data. Using the Dataflow library objects, we can create a consumer task that monitors a BufferBlock and pulls items of the data from it when they arrive. If no items are available, the consumer method will block until items are available or the BufferBlock has been set to Complete. Because of this, we can start our consumer at any time, even before the producer starts to put items into the queue. Then we create one or more tasks that produce items and place them into the BufferBlock. Once the producers are finished processing all items of data to the BufferBlock, they can mark the block as Complete. Until then, the BufferBlock object is still available to add items into. This is perfect for long-running tasks and applications when we do not know when the data will arrive. Because the producer task is implementing an input parameter of a BufferBlock as an ITargetBlock object and the consumer task is implementing an input parameter of a BufferBlock as an ISourceBlock, they can both use the same BufferBlock object but have different methods available to them. One has methods to produces items to the block and mark it complete. The other one has methods to receive items and wait for more items until the block is marked complete. In this way, the Dataflow library implements the perfect object to act as a queue between our producers and consumers. Now, let's take a look at the application we developed previously as a Pipeline design and modify it using the Dataflow library. We will also remove a stage so that it just has two stages, one producer and one consumer. 
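Before we modify the application, here is a brief, standalone sketch of the LinkTo pattern described above—a BufferBlock linked to an ActionBlock. The names and values are made up for illustration, and DataflowLinkOptions.PropagateCompletion is a library option not covered in this article; this is only a sketch, not part of the application we are about to build:

using System;
using System.Threading.Tasks.Dataflow;

class LinkToSketch
{
    static void Main()
    {
        BufferBlock<int> buffer = new BufferBlock<int>();

        // The ActionBlock's delegate runs for every message it receives.
        ActionBlock<int> printer =
            new ActionBlock<int>(n => Console.WriteLine("Received {0}", n));

        // Link the source block to the target block. PropagateCompletion
        // causes the ActionBlock to complete automatically once the
        // BufferBlock has been marked complete and drained.
        buffer.LinkTo(printer,
            new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < 5; i++)
        {
            buffer.Post(i);
        }
        buffer.Complete();

        // Wait until the ActionBlock has processed all linked messages.
        printer.Completion.Wait();
    }
}

With PropagateCompletion set, marking the BufferBlock complete also completes the linked ActionBlock once it has processed the buffered messages, so waiting on printer.Completion is enough to know that all the work is done.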
How to do it

The first thing we need to do is open Visual Studio and create a new console application called ProducerConsumerConsoleApp. We will use a console application this time just for ease. Our main purpose here is to demonstrate how to implement the producer-consumer design pattern using the TPL Dataflow library. Once you have opened Visual Studio and created the project, we need to perform the following steps:

First, we need to install and add a reference to the TPL Dataflow library. The TPL Dataflow library (System.Threading.Tasks.Dataflow) does not ship with .NET 4.5. Select Manage NuGet Packages from under the Project menu and then search online for Microsoft.Tpl.Dataflow.

Now, we will need to add two using statements to our program, one for StreamReader and StreamWriter and one for the BufferBlock object:

using System.Threading.Tasks.Dataflow;
using System.IO;

Now, let's add two static strings that will point to our input data file and the encrypted data file that we output:

private static String PipelineEncryptFile = @"c:\projects\EncryptData.txt";
private static String PipelineInputFile = @"c:\projects\InputData.txt";

Next, let's add a static method that will act as our producer. This method will have the following code:

// Our Producer method.
static void Producer(ITargetBlock<char> Target)
{
    String DisplayData = "";

    try
    {
        foreach (char C in GetData(PipelineInputFile))
        {
            // Collect the characters read in from the file.
            DisplayData = DisplayData + C.ToString();

            // Add each character to the buffer for the next stage.
            Target.Post(C);
        }
    }
    finally
    {
        Target.Complete();
    }
}

Then we will add a static method to perform our consumer functionality. It will have the following code:

// This is our consumer method. It runs asynchronously.
static async Task<int> Consumer(ISourceBlock<char> Source)
{
    String DisplayData = "";

    // Read from the source buffer until the source buffer has no
    // available output data.
    while (await Source.OutputAvailableAsync())
    {
        char C = Source.Receive();

        // Encrypt each character.
        char encrypted = Encrypt(C);

        DisplayData = DisplayData + encrypted.ToString();
    }

    // Write the encrypted string to the output file.
    using (StreamWriter outfile = new StreamWriter(PipelineEncryptFile))
    {
        outfile.Write(DisplayData);
    }

    return DisplayData.Length;
}

Then, let's create a simple static helper method to read our input data file and put it in a List collection character by character. This will give us a character list for our producer to use. The code in this method will look like this:

public static List<char> GetData(String PipelineInputFile)
{
    List<char> Data = new List<char>();

    // Get the source data.
    using (StreamReader inputfile = new StreamReader(PipelineInputFile))
    {
        while (inputfile.Peek() >= 0)
        {
            Data.Add((char)inputfile.Read());
        }
    }

    return Data;
}

Next, we will add a static method to encrypt our characters. This method will work like the one we used in our pipelining application. It will add one to the ASCII numerical value of the character:

public static char Encrypt(char C)
{
    // Take the character, convert to an int, add 1, then convert back to a character.
    int i = (int)C;
    i = i + 1;
    C = Convert.ToChar(i);

    return C;
}

Then, we need to add the code for our Main method. This method will start our consumer and producer tasks. Then, when they have completed processing, it will display the results in the console. The code for this method looks like this:

static void Main(string[] args)
{
    // Create the buffer block object to use between the producer and consumer.
    BufferBlock<char> buffer = new BufferBlock<char>();

    // The consumer method runs asynchronously. Start it now.
    Task<int> consumer = Consumer(buffer);

    // Post source data to the dataflow block.
    Producer(buffer);

    // Wait for the consumer to process all data.
    consumer.Wait();

    // Print the count of characters from the input file.
    Console.WriteLine("Processed {0} bytes from input file.", consumer.Result);

    // Print out the input file to the console.
    Console.WriteLine("\r\n\r\n");
    Console.WriteLine("This is the input data file. \r\n");
    using (StreamReader inputfile = new StreamReader(PipelineInputFile))
    {
        while (inputfile.Peek() >= 0)
        {
            Console.Write((char)inputfile.Read());
        }
    }

    // Print out the encrypted file to the console.
    Console.WriteLine("\r\n\r\n");
    Console.WriteLine("This is the encrypted data file. \r\n");
    using (StreamReader encryptfile = new StreamReader(PipelineEncryptFile))
    {
        while (encryptfile.Peek() >= 0)
        {
            Console.Write((char)encryptfile.Read());
        }
    }

    // Wait before closing the application so we can see the results.
    Console.ReadLine();
}

That is all the code that is needed. Now, let's build and run the application using the following input data file:

Once it runs and completes, your output should look like the following screenshot:

Now, try this with your own data files and inputs. Let's examine what happened and how this works.

How it works

First we will go through the Main method. The first thing Main does is create a BufferBlock object called buffer. This will be used as the queue of items between our producer and consumer. This BufferBlock is defined to accept character datatypes. Next, we start our consumer task using this command:

Task<int> consumer = Consumer(buffer);

Also, note that when this buffer object goes into the consumer task, it is cast as an ISourceBlock.
Notice the method header of our consumer:

static async Task<int> Consumer(ISourceBlock<char> Source)

Next, our Main method starts our producer task using the following command:

Producer(buffer);

Then we wait until our consumer task finishes, using this command:

consumer.Wait();

So, now our Main method just waits. Its work is done for now. It has started both the producer and consumer tasks. Now our consumer is waiting for items to appear in its BufferBlock so it can process them. The consumer will stay in the following loop until all items have been removed from the message block and the block has been completed, which is done by someone calling its Complete method:

while (await Source.OutputAvailableAsync())
{
    char C = Source.Receive();

    // Encrypt each character.
    char encrypted = Encrypt(C);

    DisplayData = DisplayData + encrypted.ToString();
}

So, now our consumer task will loop asynchronously, removing items from the message queue as they appear. It uses the following command in the while loop to do this:

await Source.OutputAvailableAsync()

Likewise, other consumer tasks can run at the same time and do the same thing. If the producer is adding items to the block quicker than the consumer can process them, then adding another consumer will improve performance. Once an item is available, the consumer calls the following command to get the item from the buffer:

char C = Source.Receive();

Since the buffer contains items of type character, we place the item received into a character value. The consumer then processes it by encrypting the character and appending it to our display string.

Now, let's look at the producer. The producer first gets its data by calling the following command:

GetData(PipelineInputFile)

This method returns a List collection of characters that has an item for each character in the input data file. The producer then iterates through the collection and uses the following command to place each item into the buffer block:

Target.Post(C);

Also, notice in the method header for our producer that we cast our buffer as an ITargetBlock type:

static void Producer(ITargetBlock<char> Target)

Once the producer is done processing characters and adding them to the buffer, it officially closes the BufferBlock object using this command:

Target.Complete();

That is it for the producer and consumer. Once the Main method is done waiting on the consumer to finish, it then uses the following code to write out the number of characters processed, the input data, and the encrypted data:

// Print the count of characters from the input file.
Console.WriteLine("Processed {0} bytes from input file.", consumer.Result);

// Print out the input file to the console.
Console.WriteLine("\r\n\r\n");
Console.WriteLine("This is the input data file. \r\n");
using (StreamReader inputfile = new StreamReader(PipelineInputFile))
{
    while (inputfile.Peek() >= 0)
    {
        Console.Write((char)inputfile.Read());
    }
}

// Print out the encrypted file to the console.
Console.WriteLine("\r\n\r\n");
Console.WriteLine("This is the encrypted data file. \r\n");
using (StreamReader encryptfile = new StreamReader(PipelineEncryptFile))
{
    while (encryptfile.Peek() >= 0)
    {
        Console.Write((char)encryptfile.Read());
    }
}

Now that you are comfortable implementing a basic producer-consumer design using objects from the TPL Dataflow library, try experimenting with this basic idea, but use multiple producers and multiple consumers, all with the same BufferBlock object as the queue between them. Also, try converting our original Pipeline application from the beginning of the article into a TPL Dataflow producer-consumer application with two sets of producers and consumers. The first will act as stage 1 and stage 2, and the second will act as stage 2 and stage 3. So, in effect, stage 2 will be both a consumer and a producer.

Summary

We have covered a lot in this article. We have learned the benefits of the Pipeline and producer-consumer design patterns and how to implement them. As we saw, these are both very helpful design patterns when building parallel and concurrent applications that require multiple asynchronous processes of data between tasks.

In the Pipeline design, we are able to run multiple tasks or stages concurrently, even though the stages rely on data being processed and output by other stages. This is very helpful for performance, since no stage has to wait for another stage to finish processing every item of data. In our example, we are able to start decrypting characters of data while a previous stage is still encrypting data and placing it into the queue. In the Pipeline example, we also examined the benefits of the BlockingCollection class in acting as a queue between the stages of our pipeline.

Next, we explored the new TPL Dataflow library and some of its message block classes. These classes implement several interfaces defined in the library—ISourceBlock, ITargetBlock, and IPropagatorBlock. Implementing these interfaces allows us to write generic producer and consumer task functionality that can be reused in a variety of applications.

Both of these design patterns and the Dataflow library allow for easy implementations of common functionality in a concurrent manner. You will use these techniques in many applications, and this will become a go-to design pattern when you evaluate a system's requirements and determine how to implement concurrency to help improve performance. Like all programming, parallel programming is made easier when you have a toolbox of easy-to-use techniques that you are comfortable with. Most applications that benefit from parallelism will be conducive to some variation of a producer-consumer or Pipeline pattern. Also, the BlockingCollection and Dataflow message block objects are useful mechanisms for coordinating data between parallel tasks, no matter what design pattern is used in the application. It will be very useful to become comfortable with these messaging and queuing classes.

Resources for Article:

Further resources on this subject: Parallel Programming Patterns [article] Watching Multiple Threads in C# [article] Clusters, Parallel Computing, and Raspberry Pi – A Brief Background [article]
Working with Your Team

Packt
19 Dec 2014
14 min read
In this article by Jarosław Krochmalski, author of the book IntelliJ IDEA Essentials, we will talk about working with VCS systems such as Git and Subversion. While working on the code, one of the most important aspects is version control. A Version Control System (VCS) (also known as a Revision Control System) is a repository of source code files with monitored access. Every change made to the source is tracked, along with who made the change, why they made it, and comments about problems fixed or enhancements introduced by the change. It doesn't matter if you work alone or in a team, having the tool to efficiently work with different versions of the code is crucial. Software development is usually carried out by teams, either distributed or colocated. The version control system lets developers work on a copy of the source code and then release their changes back to the common codebase when ready. Other developers work on their own copies of the same code at the same time, unaffected by each other's changes until they choose to merge or commit their changes back to the project. Currently, probably the most popular version control system is Git. After reading this article, you will be able to set up the version control mechanism of your choice, get files from the repository, commit your work, and browse the changes. Let's start with the version control setup. (For more resources related to this topic, see here.) Enabling version control At the IDE level, version control integration is provided through a set of plugins. IntelliJ IDEA comes bundled with a number of plugins to integrate with the most popular version control systems. They include Git, CVS, Subversion, and Mercurial. The Ultimate edition additionally contains Clearcase, Visual SourceSafe, and Perforce plugins. You will need to enable them in the Plugins section of the Settings dialog box. If you find the VCS feature is not enough and you are using some other VCS, try to find it in the Browse Repositories dialog box by choosing VCS Integration from the Category drop-down menu, as shown here: The list of plugins here contains not only integration plugins, but also some useful add-ons for the installed integrations. For example, the SVN Bar plugin will create a quick access toolbar with buttons specific for Subversion (SVN) actions. Feel free to browse the list of plugins here and read the descriptions; you might find some valuable extensions. The basic principles of working with the version control systems in IntelliJ IDEA are rather similar. We will focus on the Git and Subversion integration. This article should give you an overview of how to deal with the setup and version control commands in IntelliJ IDEA in general. If you have the necessary plugins enabled in the Settings dialog box, you can start working with the version control. We will begin with fetching the project out of the version control. Doing this will set up the version control automatically so that further steps will not be required unless you decide not to use the default workflow. Later, we will cover setting the VCS integration manually, so you will be able to tweak IntelliJ's behavior then. Checking out the project from the repository To be able to work on the files, first you need to get them from the repository. 
To get the files from the remote Git repository, you need to use the clone command available in the VCS menu, under the Checkout from Version Control option, as shown here: In the Clone Repository dialog box, provide necessary options, such as the remote repository URL, parent directory, and the directory name to clone into, as shown in the following screenshot: After successful cloning, IntelliJ IDEA will suggest creating a project based on the cloned sources. If you don't have the remote repository for your project, you can work with the offline local Git repository. To create a local Git repository, select Create Git repository from the VCS menu, as shown in the following screenshot: This option will execute the git init command in the directory of your choice; it will most probably be the root directory of your project. For the time being, the Git plugin does not allow you to set up remote repositories. You will probably need to set up the remote host for your newly created Git repository before you can actually fetch and push changes. If you are using GitHub for your projects, the great GitHub integration plugin gives you the option to share the project on GitHub. This will create the remote repository automatically. Later, when you want to get the files from the remote repository, just use the Git Pull command. This will basically retrieve changes (fetch) and apply them to the local branch (merge). To obtain a local working copy of a subversion repository, choose Checkout from Version Control and then Subversion from the VCS menu. In the SVN Checkout Options dialog box, you will be able to specify Subversion-specific settings, such as a revision that needs to be checked (HEAD, for example). Again, IntelliJ IDEA will ask if you want to create the project from checked out sources. If you accept the suggestion to create a new project, New Project from Existing Code Wizard will start. Fetching the project out of the repository will create some default VCS configuration in IntelliJ IDEA. It is usually sufficient, but if needed, the configuration can be changed. Let's discuss how to change the configuration in the next section. Configuring version control The VCS configuration in IntelliJ IDEA can be changed at the project level. Head to the Version Control section in the Settings dialog box, as shown here: The Version Control section contains options that are common for all version control systems and also specific options for the different VCS systems (enabled by installing the corresponding plugins). IntelliJ IDEA uses a directory-based model for version control. The versioning mechanism is assigned to a specific directory that can either be a part of a project or can be just related to the project. This directory is not required to be located under the project root. Multiple directories can have different version control systems linked. To add a directory into the version control integration, use the Alt + Insert keyboard shortcut or click on the green plus button; the Add VCS Directory Mapping dialog box will appear. You have the option to put all the project contents, starting from its base directory to the version control or limit the version control only to specific directories. 
Select the VCS system you need from the VCS drop-down menu, as shown in the following screenshot: By default, IntelliJ IDEA will mark the changed files with a color in the Project tool window, as shown here: If you select the Show directories with changed descendants option, IntelliJ IDEA will additionally mark the directories containing the changed files with a color, giving you the possibility to quickly notice the changes without expanding the project tree, as shown in the following screenshot: The Show changed in last <number> days option will highlight the files changed recently during the debugging process and when displaying stacktraces. Displaying the changed files in color can be very useful. If you see the colored file in the stacktrace, maybe the last change to the file is causing a problem. The subsequent panes contain general version control settings, which apply to all version control systems integrated with the IDE. They include specifying actions that require confirmation, background operations set up, the ignored files list, and issuing of navigation configuration. In the Confirmation section, you specify what version control actions will need your confirmation. The Background section will tell IntelliJ IDEA what operation it should perform in the background, as shown in the following screenshot: If you choose to perform the operation in the background, IntelliJ IDEA will not display any modal windows during and after the operation. The progress and result will be presented in the status bar of the IDE and in the corresponding tool windows. For example, after the successful execution of the Git pull command, IntelliJ IDEA will present the Update Project Info tool window with the files changed and the Event Log tool window with the status of the operation, as shown in the following screenshot: In the Ignored Files section, you can specify a list of files and directories that you do not want to put under version control, as shown in the following screenshot: To add a file or directory, use the Alt + Insert keyboard shortcut or hit the green plus (+) icon. The Ignore Unversioned Files dialog box will pop up as shown here: You can now specify a single file or the directory you want to ignore. There is also the possibility to construct the filename pattern for files to be ignored. Backup and logfiles are good candidates to be specified here, for example. Most of the version control systems support the file with a list of file patterns to ignore. For Git, this will be the .gitignore file. IntelliJ IDEA will analyze such files during the project checkout from the existing repository and will fill the Ignored files list automatically. In the Issue Navigation section, you can create a list of patterns to issue navigation. IntelliJ IDEA will try to use these patterns to create links from the commit messages. These links will then be displayed in the Changes and Version Control tool windows. Clicking on the link will open the browser and take you to the issue tracker of your choice. IntelliJ IDEA comes with predefined patterns for the most popular issue trackers: JIRA and YouTrack. To create a link to JIRA, click on the first button and provide the URL for your JIRA instance, as shown in the following screenshot: To create a link to the YouTrack instance, click on the OK button and provide the URL to the YouTrack instance. If you do not use JIRA or YouTrack, you can also specify a generic pattern. Press the Alt + Insert keyboard shortcut to add a new pattern. 
In the IssueID field, enter the regular expression that IntelliJ IDEA will use to extract a part of the link. In the Issue Link field, provide the link expression that IntelliJ IDEA will use to replace a issue number within. Use the Example section to check if the resulting link is correct, as shown in the following screenshot: The next sections in the Version Control preferences list contain options specific to the version control system you are using. For example, the Git-specific options can be configured in the Git section, as shown here: You can specify the Git command executable here or select the associated SSH executable that will be used to perform the network Git operations such as pull and push. The Auto-update if push of the current branch was rejected option is quite useful—IntelliJ IDEA will execute the pull command first if the push command fails because of the changes in the repository revision. This saves some time.We should now have version control integration up and running. Let's use it. Working with version control Before we start working with version control, we need to know about the concept of the changelist in IntelliJ IDEA. Let's focus on this now. Changelists When it comes to newly created or modified files, IntelliJ IDEA introduces the concept of a changelist. A changelist is a set of file modifications that represents a logical change in the source. Any modified file will go to the Default changelist. You can create new changelists if you like. The changes contained in a specific changelist are not stored in the repository until committed. Only the active changelist contains the files that are going to be committed. If you modify the file that is contained in the non-active change list, there is a risk that it will not be committed. This takes us to the last section of the common VCS settings at Settings | Version Control | Changelist conflicts. In this section, you can configure the protection of files that are present in the changelist that is not currently active. In other words, you define how IntelliJ IDEA should behave when you modify the file that is not in the active changelist. The protection is turned on by default (Enable changelist conflict tracking is checked). If the Resolve Changelist Conflict checkbox is marked, the IDE will display the Resolve Changelist Conflict dialog box when you try to modify such a file. The possible options are to either shelve the changes (we will talk about the concept of shelving in a while), move a file to the active changelist, switch changelists to make the current changelist active, or ignore the conflict. If Highlight files with conflicts is checked and if you try to modify a file from the non-active change list, a warning will pop up in the editor, as shown in the following screenshot: Again, you will have the possibility to move the changes to another change list, switch the active change list, or ignore the conflict. If you select Ignore, the change will be listed in the Files with ignored conflicts list, as shown in the following screenshot: The list of all changelists in the project is listed in the Commit Changes dialog box (we will cover committing files in a while) and in the first tab of the Changes tool window, as shown here: You can create a new changelist by using the Alt + Insert keyboard shortcut. The active list will have its name highlighted in bold. The last list is special; it contains the list of unversioned files. 
You can drag-and-drop files between the changelists (with the exception of unversioned files). Now that we know what a changelist is, let's add some files to the repository now. Adding files to version control You will probably want newly created files to be placed in version control. If you create a file in a directory already associated with the version control system, IntelliJ IDEA will add the file to the active changelist automatically, unless you configured this differently in the Confirmation section of the Version Control pane in the Settings dialog box. If you decided to have Show options before adding to version control checked, IntelliJ IDEA will ask if you want to add the file to the VCS, as shown here: If you decide to check the Remember, don't ask again checkbox, IntelliJ IDEA will throw the future new files into version control silently. You can also add new files to the version control explicitly. Click on the file or directory you want to add in the Project tool window and choose the corresponding VCS command; for example: Alternatively, you can open the Changes tool window, and browse Unversioned Files, where you can right-click on the file you want to add and select Add to VCS from the context menu, as shown in the following screenshot: If there are many unversioned files, IntelliJ IDEA will render a link that allows you to browse the files in a separate dialog box, as shown in the following screenshot: In the Unversioned Files dialog box, right-click on the file you want to add and select Add to VCS from the context menu, as shown in the following screenshot: From now on, the file will be ready to commit to the repository. If you've accidently added some files to version control and want to change them to unversioned, you can always revert the file so that it is no longer marked as part of the versioned files. Summary After reading this article, you know how to set up version control, get the project from the repository, commit your work, and get the changes made by other members of your team. Version control in IntelliJ IDEA is tightly integrated into the IDE. All the versioning activities can be executed from the IDE itself—you will not need to use an external tool for this. I believe it will shortly become natural for you to use the provided functionalities. Not being distracted by the use of external tools will result in higher effectiveness. Resources for Article: Further resources on this subject: Improving Your Development Speed [Article] Ridge Regression [Article] Function passing [Article]
Performance Optimization

Packt
19 Dec 2014
30 min read
In this article by Mark Kerzner and Sujee Maniyam, the authors of HBase Design Patterns, we will talk about how to write high-performance and scalable HBase applications. In particular, we will take a look at the following topics:

The bulk loading of data into HBase
Profiling HBase applications
Tips to get good performance on writes
Tips to get good performance on reads

(For more resources related to this topic, see here.)

Loading bulk data into HBase

When deploying HBase for the first time, we usually need to import a significant amount of data. This is called initial loading or bootstrapping. There are three methods that can be used to import data into HBase, given as follows:

Using the Java API to insert data into HBase. This can be done in a single client, using single or multiple threads.
Using MapReduce to insert data in parallel (this approach also uses the Java API), as shown in the following diagram.
Using MapReduce to generate HBase store files in parallel in bulk and then import them into HBase directly. (This approach does not require the use of the API; it does not require code and is very efficient.)

On comparing the three methods speed-wise, we have the following order:

Java client < MapReduce insert < HBase file import

The Java client and MapReduce use HBase APIs to insert data. MapReduce runs on multiple machines and can exploit parallelism. However, both of these methods go through the write path in HBase. Importing HBase files directly, however, skips the usual write path. HBase files already have data in the correct format that HBase understands. That's why importing them is much faster than using MapReduce and the Java client. We covered the Java API earlier. Let's start with how to insert data using MapReduce.

Importing data into HBase using MapReduce

MapReduce is the distributed processing engine of Hadoop. Usually, programs read/write data from HDFS. Luckily, HBase supports MapReduce. HBase can be the source and the sink for MapReduce programs. A source means MapReduce programs can read from HBase, and sink means results from MapReduce can be sent to HBase. The following diagram illustrates various sources and sinks for MapReduce.

The diagram we just saw can be summarized as follows:

Scenario 1 (source: HDFS, sink: HDFS): This is a typical MapReduce method that reads data from HDFS and also sends the results to HDFS.
Scenario 2 (source: HDFS, sink: HBase): This imports the data from HDFS into HBase. It's a very common method that is used to import data into HBase for the first time.
Scenario 3 (source: HBase, sink: HBase): Data is read from HBase and written to it. It is most likely that these will be two separate HBase clusters. It's usually used for backups and mirroring.

Importing data from HDFS into HBase

Let's say we have lots of data in HDFS and want to import it into HBase. We are going to write a MapReduce program that reads from HDFS and inserts data into HBase. This is depicted in the second scenario in the table we just saw. Now, we'll be setting up the environment for the following discussion. In addition, you can find the code and the data for this discussion in our GitHub repository at https://github.com/elephantscale/hbase-book. The dataset we will use is the sensor data. Our (imaginary) sensor data is stored in HDFS as CSV (comma-separated values) text files.
This is how their format looks: Sensor_id, max temperature, min temperature Here is some sample data: sensor11,90,70 sensor22,80,70 sensor31,85,72 sensor33,75,72 We have two sample files (sensor-data1.csv and sensor-data2.csv) in our repository under the /data directory. Feel free to inspect them. The first thing we have to do is copy these files into HDFS. Create a directory in HDFS as follows: $   hdfs   dfs -mkdir   hbase-import Now, copy the files into HDFS: $   hdfs   dfs   -put   sensor-data*   hbase-import/ Verify that the files exist as follows: $   hdfs   dfs -ls   hbase-import We are ready to insert this data into HBase. Note that we are designing the table to match the CSV files we are loading for ease of use. Our row key is sensor_id. We have one column family and we call it f (short for family). Now, we will store two columns, max temperature and min temperature, in this column family. Pig for MapReduce Pig allows you to write MapReduce programs at a very high level, and inserting data into HBase is just as easy. Here's a Pig script that reads the sensor data from HDFS and writes it in HBase: -- ## hdfs-to-hbase.pigdata = LOAD 'hbase-import/' using PigStorage(',') as (sensor_id:chararray, max:int, min:int);-- describe data;-- dump data; Now, store the data in hbase://sensors using the following line of code: org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:max,f:min'); After creating the table, in the first command, we will load data from the hbase-import directory in HDFS. The schema for the data is defined as follows: Sensor_id : chararray (string)max : intmin : int The describe and dump statements can be used to inspect the data; in Pig, describe will give you the structure of the data object you have, and dump will output all the data to the terminal. The final STORE command is the one that inserts the data into HBase. Let's analyze how it is structured: INTO 'hbase://sensors': This tells Pig to connect to the sensors HBase table. org.apache.pig.backend.hadoop.hbase.HBaseStorage: This is the Pig class that will be used to write in HBase. Pig has adapters for multiple data stores. The first field in the tuple, sensor_id, will be used as a row key. We are specifying the column names for the max and min fields (f:max and f:min, respectively). Note that we have to specify the column family (f:) to qualify the columns. Before running this script, we need to create an HBase table called sensors. We can do this from the HBase shell, as follows: $ hbase shell$ create 'sensors' , 'f'$ quit Then, run the Pig script as follows: $ pig hdfs-to-hbase.pig Now watch the console output. Pig will execute the script as a MapReduce job. Even though we are only importing two small files here, we can insert a fairly large amount of data by exploiting the parallelism of MapReduce. At the end of the run, Pig will print out some statistics: Input(s):Successfully read 7 records (591 bytes) from: "hdfs://quickstart.cloudera:8020/user/cloudera/hbase-import"Output(s):Successfully stored 7 records in: "hbase://sensors" Looks good! We should have seven rows in our HBase sensors table. 
We can inspect the table from the HBase shell with the following commands: $ hbase shell$ scan 'sensors' This is how your output might look: ROW                      COLUMN+CELL sensor11                 column=f:max, timestamp=1412373703149, value=90 sensor11                 column=f:min, timestamp=1412373703149, value=70 sensor22                 column=f:max, timestamp=1412373703177, value=80 sensor22                column=f:min, timestamp=1412373703177, value=70 sensor31                 column=f:max, timestamp=1412373703177, value=85 sensor31                 column=f:min, timestamp=1412373703177, value=72 sensor33                 column=f:max, timestamp=1412373703177, value=75 sensor33                 column=f:min, timestamp=1412373703177, value=72 sensor44                 column=f:max, timestamp=1412373703184, value=55 sensor44                 column=f:min, timestamp=1412373703184, value=42 sensor45                 column=f:max, timestamp=1412373703184, value=57 sensor45                 column=f:min, timestamp=1412373703184, value=47 sensor55                 column=f:max, timestamp=1412373703184, value=55 sensor55                 column=f:min, timestamp=1412373703184, value=427 row(s) in 0.0820 seconds There you go; you can see that seven rows have been inserted! With Pig, it was very easy. It took us just two lines of Pig script to do the import. Java MapReduce We have just demonstrated MapReduce using Pig, and you now know that Pig is a concise and high-level way to write MapReduce programs. This is demonstrated by our previous script, essentially the two lines of Pig code. However, there are situations where you do want to use the Java API, and it would make more sense to use it than using a Pig script. This can happen when you need Java to access Java libraries or do some other detailed tasks for which Pig is not a good match. For that, we have provided the Java version of the MapReduce code in our GitHub repository. Using HBase's bulk loader utility HBase is shipped with a bulk loader tool called ImportTsv that can import files from HDFS into HBase tables directly. It is very easy to use, and as a bonus, it uses MapReduce internally to process files in parallel. Perform the following steps to use ImportTsv: Stage data files into HDFS (remember that the files are processed using MapReduce). Create a table in HBase if required. Run the import. Staging data files into HDFS The first step to stage data files into HDFS has already been outlined in the previous section. The following sections explain the next two steps to stage data files. Creating an HBase table We will do this from the HBase shell. A note on regions is in order here. Regions are shards created automatically by HBase. It is the regions that are responsible for the distributed nature of HBase. However, you need to pay some attention to them in order to assure performance. If you put all the data in one region, you will cause what is called region hotspotting. What is especially nice about a bulk loader is that when creating a table, it lets you presplit the table into multiple regions. Precreating regions will allow faster imports (because the insert requests will go out to multiple region servers). 
Here, we are creating a single column family: $ hbase shellhbase> create 'sensors', {NAME => 'f'}, {SPLITS => ['sensor20', 'sensor40', 'sensor60']}0 row(s) in 1.3940 seconds=> Hbase::Table - sensors hbase > describe 'sensors'DESCRIPTION                                       ENABLED'sensors', {NAME => 'f', DATA_BLOCK_ENCODING => true'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE=> '0', VERSIONS => '1', COMPRESSION => 'NONE',MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}1 row(s) in 0.1140 seconds We are creating regions here. Why there are exactly four regions will be clear from the following diagram:   On inspecting the table in the HBase Master UI, we will see this. Also, you can see how Start Key and End Key, which we specified, are showing up. Run the import Ok, now it's time to insert data into HBase. To see the usage of ImportTsv, do the following: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv This will print the usage as follows: $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,f:max,f:min sensors   hbase-import/ The following table explains what the parameters mean: Parameter Description -Dimporttsv.separator Here, our separator is a comma (,). The default value is tab (t). -Dimporttsv.columns=HBASE_ROW_KEY,f:max,f:min This is where we map our input files into HBase tables. The first field, sensor_id, is our key, and we use HBASE_ROW_KEY to denote that the rest we are inserting into column family f. The second field, max temp, maps to f:max. The last field, min temp, maps to f:min. sensors This is the table name. hbase-import This is the HDFS directory where the data files are located.  When we run this command, we will see that a MapReduce job is being kicked off. This is how an import is parallelized. Also, from the console output, we can see that MapReduce is importing two files as follows: [main] mapreduce.JobSubmitter: number of splits:2 While the job is running, we can inspect the progress from YARN (or the JobTracker UI). One thing that we can note is that the MapReduce job only consists of mappers. This is because we are reading a bunch of files and inserting them into HBase directly. There is nothing to aggregate. So, there is no need for reducers. After the job is done, inspect the counters and we can see this: Map-Reduce Framework Map input records=7 Map output records=7 This tells us that mappers read seven records from the files and inserted seven records into HBase. 
Let's also verify the data in HBase: $   hbase shellhbase >   scan 'sensors'ROW                 COLUMN+CELLsensor11           column=f:max, timestamp=1409087465345, value=90sensor11           column=f:min, timestamp=1409087465345, value=70sensor22           column=f:max, timestamp=1409087465345, value=80sensor22           column=f:min, timestamp=1409087465345, value=70sensor31           column=f:max, timestamp=1409087465345, value=85sensor31           column=f:min, timestamp=1409087465345, value=72sensor33           column=f:max, timestamp=1409087465345, value=75sensor33           column=f:min, timestamp=1409087465345, value=72sensor44            column=f:max, timestamp=1409087465345, value=55sensor44           column=f:min, timestamp=1409087465345, value=42sensor45           column=f:max, timestamp=1409087465345, value=57sensor45           column=f:min, timestamp=1409087465345, value=47sensor55           column=f:max, timestamp=1409087465345, value=55sensor55           column=f:min, timestamp=1409087465345, value=427 row(s) in 2.1180 seconds Your output might vary slightly. We can see that seven rows are inserted, confirming the MapReduce counters! Let's take another quick look at the HBase UI, which is shown here:    As you can see, the inserts go to different regions. So, on a HBase cluster with many region servers, the load will be spread across the cluster. This is because we have presplit the table into regions. Here are some questions to test your understanding. Run the same ImportTsv command again and see how many records are in the table. Do you get duplicates? Try to find the answer and explain why that is the correct answer, then check these in the GitHub repository (https://github.com/elephantscale/hbase-book). Bulk import scenarios Here are a few bulk import scenarios: Scenario Methods Notes The data is already in HDFS and needs to be imported into HBase. The two methods that can be used to do this are as follows: If the ImportTsv tool can work for you, then use it as it will save time in writing custom MapReduce code. Sometimes, you might have to write a custom MapReduce job to import (for example, complex time series data, doing data mapping, and so on). It is probably a good idea to presplit the table before a bulk import. This spreads the insert requests across the cluster and results in a higher insert rate. If you are writing a custom MapReduce job, consider using a high-level MapReduce platform such as Pig or Hive. They are much more concise to write than the Java code. The data is in another database (RDBMs/NoSQL) and you need to import it into HBase. Use a utility such as Sqoop to bring the data into HDFS and then use the tools outlined in the first scenario. Avoid writing MapReduce code that directly queries databases. Most databases cannot handle many simultaneous connections. It is best to bring the data into Hadoop (HDFS) first and then use MapReduce. Profiling HBase applications Just like any software development process, once we have our HBase application working correctly, we would want to make it faster. At times, developers get too carried away and start optimizing before the application is finalized. There is a well-known rule that premature optimization is the root of all evil. One of the sources for this rule is Scott Meyers Effective C++. We can perform some ad hoc profiling in our code by timing various function calls. Also, we can use profiling tools to pinpoint the trouble spots. 
Using profiling tools is highly encouraged for the following reasons: Profiling takes out the guesswork (and a good majority of developers' guesses are wrong). There is no need to modify the code. Manual profiling means that we have to go and insert the instrumentation code all over the code. Profilers work by inspecting the runtime behavior. Most profilers have a nice and intuitive UI to visualize the program flow and time flow. The authors use JProfiler. It is a pretty effective profiler. However, it is neither free nor open source. So, for the purpose of this article, we are going to show you a simple manual profiling, as follows: public class UserInsert {      static String tableName = "users";    static String familyName = "info";      public static void main(String[] args) throws Exception {        Configuration config = HBaseConfiguration.create();        // change the following to connect to remote clusters        // config.set("hbase.zookeeper.quorum", "localhost");        long t1a = System.currentTimeMillis();        HTable htable = new HTable(config, tableName);        long t1b = System.currentTimeMillis();        System.out.println ("Connected to HTable in : " + (t1b-t1a) + " ms");        int total = 100;        long t2a = System.currentTimeMillis();        for (int i = 0; i < total; i++) {            int userid = i;            String email = "user-" + i + "@foo.com";            String phone = "555-1234";              byte[] key = Bytes.toBytes(userid);            Put put = new Put(key);              put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));            put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));            htable.put(put);          }        long t2b = System.currentTimeMillis();        System.out.println("inserted " + total + " users in " + (t2b - t2a) + " ms");        htable.close();      } } The code we just saw inserts some sample user data into HBase. We are profiling two operations, that is, connection time and actual insert time. A sample run of the Java application yields the following: Connected to HTable in : 1139 msinserted 100 users in 350 ms We spent a lot of time in connecting to HBase. This makes sense. The connection process has to go to ZooKeeper first and then to HBase. So, it is an expensive operation. How can we minimize the connection cost? The answer is by using connection pooling. Luckily, for us, HBase comes with a connection pool manager. The Java class for this is HConnectionManager. It is very simple to use. 
Let's update our class to use HConnectionManager: Code : File name: hbase_dp.ch8.UserInsert2.java   package hbase_dp.ch8;   import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.hbase.HBaseConfiguration; import org.apache.hadoop.hbase.client.HConnection; import org.apache.hadoop.hbase.client.HConnectionManager; import org.apache.hadoop.hbase.client.HTable; import org.apache.hadoop.hbase.client.HTableInterface; import org.apache.hadoop.hbase.client.Put; import org.apache.hadoop.hbase.util.Bytes;   public class UserInsert2 {      static String tableName = "users";    static String familyName = "info";      public static void main(String[] args) throws Exception {        Configuration config = HBaseConfiguration.create();        // change the following to connect to remote clusters        // config.set("hbase.zookeeper.quorum", "localhost");               long t1a = System.currentTimeMillis();        HConnection hConnection = HConnectionManager.createConnection(config);        long t1b = System.currentTimeMillis();        System.out.println ("Connection manager in : " + (t1b-t1a) + " ms");          // simulate the first 'connection'        long t2a = System.currentTimeMillis();        HTableInterface htable = hConnection.getTable(tableName) ;        long t2b = System.currentTimeMillis();        System.out.println ("first connection in : " + (t2b-t2a) + " ms");               // second connection        long t3a = System.currentTimeMillis();        HTableInterface htable2 = hConnection.getTable(tableName) ;        long t3b = System.currentTimeMillis();        System.out.println ("second connection : " + (t3b-t3a) + " ms");          int total = 100;        long t4a = System.currentTimeMillis();        for (int i = 0; i < total; i++) {            int userid = i;            String email = "user-" + i + "@foo.com";            String phone = "555-1234";              byte[] key = Bytes.toBytes(userid);            Put put = new Put(key);              put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));            put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));            htable.put(put);          }      long t4b = System.currentTimeMillis();        System.out.println("inserted " + total + " users in " + (t4b - t4a) + " ms");        hConnection.close();    } } A sample run yields the following timings: Connection manager in : 98 ms first connection in : 808 ms second connection : 0 ms inserted 100 users in 393 ms The first connection takes a long time, but then take a look at the time of the second connection. It is almost instant ! This is cool! If you are connecting to HBase from web applications (or interactive applications), use connection pooling. More tips for high-performing HBase writes Here we will discuss some techniques and best practices to improve writes in HBase. Batch writes Currently, in our code, each time we call htable.put (one_put), we make an RPC call to an HBase region server. This round-trip delay can be minimized if we call htable.put() with a bunch of put records. Then, with one round trip, we can insert a bunch of records into HBase. This is called batch puts. Here is an example of batch puts. Only the relevant section is shown for clarity. 
For the full code, see hbase_dp.ch8.UserInsert3.java:

int total = 100;
long t4a = System.currentTimeMillis();
List<Put> puts = new ArrayList<>();
for (int i = 0; i < total; i++) {
    int userid = i;
    String email = "user-" + i + "@foo.com";
    String phone = "555-1234";

    byte[] key = Bytes.toBytes(userid);
    Put put = new Put(key);

    put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));
    put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));

    puts.add(put); // just add to the list
}
htable.put(puts); // do a batch put
long t4b = System.currentTimeMillis();
System.out.println("inserted " + total + " users in " + (t4b - t4a) + " ms");

A sample run with a batch put is as follows:

inserted 100 users in 48 ms

The same code with individual puts took around 350 milliseconds! Use batch writes when you can to minimize latency. Note that the HTableUtil class that comes with HBase implements some smart batching options for your use and enjoyment.

Setting memory buffers

We can control when the puts are flushed by setting the client write buffer option. Once the data in memory exceeds this setting, it is flushed to disk. Its purpose is to limit how much data is stored in the buffer before writing it to disk. The default setting is 2 MB. There are two ways of setting this:

In hbase-site.xml (this setting will be cluster-wide):

<property>
  <name>hbase.client.write.buffer</name>
  <value>8388608</value> <!-- 8 MB -->
</property>

In the application (only applies to that application):

htable.setWriteBufferSize(1024*1024*10); // 10 MB

Keep in mind that a bigger buffer takes more memory on both the client side and the server side. As a practical guideline, estimate how much memory you can dedicate to the client and put the rest of the load on the cluster.

Turning off autoflush

If autoflush is enabled, each htable.put() call incurs a round-trip RPC call to HRegionServer. Turning autoflush off can reduce the number of round trips and decrease latency. To turn it off, use this code:

htable.setAutoFlush(false);

The risk of turning off autoflush is that if the client crashes before the data is sent to HBase, that data will be lost. Still, when will you want to do it? The answer is: when the danger of data loss is not important and speed is paramount. Also, see the batch write recommendations we saw previously.

Turning off WAL

Before we discuss this, we need to emphasize that the write-ahead log (WAL) is there to prevent data loss in the case of server crashes. By turning it off, we are bypassing this protection. Be very careful when choosing this. Bulk loading is one of the cases where turning off WAL might make sense. To turn off WAL, set it for each put:

put.setDurability(Durability.SKIP_WAL);

More tips for high-performing HBase reads

So far, we looked at tips to write data into HBase. Now, let's take a look at some tips to read data faster.

The scan cache

When reading a large number of rows, it is better to set scan caching to a high number (in the hundreds or even thousands of rows). Otherwise, each row that is scanned will result in a trip to HRegionServer. This is especially encouraged for MapReduce jobs as they will likely consume a lot of rows sequentially.
To set scan caching, use the following code: Scan scan = new Scan(); scan.setCaching(1000); Only read the families or columns needed When fetching a row, by default, HBase returns all the families and all the columns. If you only care about one family or a few attributes, specifying them will save needless I/O. To specify a family, use this: scan.addFamily( Bytes.toBytes("familiy1")); To specify columns, use this: scan.addColumn( Bytes.toBytes("familiy1"),   Bytes.toBytes("col1")) The block cache When scanning large rows sequentially (say in MapReduce), it is recommended that you turn off the block cache. Turning off the cache might be completely counter-intuitive. However, caches are only effective when we repeatedly access the same rows. During sequential scanning, there is no caching, and turning on the block cache will introduce a lot of churning in the cache (new data is constantly brought into the cache and old data is evicted to make room for the new data). So, we have the following points to consider: Turn off the block cache for sequential scans Turn off the block cache for random/repeated access Benchmarking or load testing HBase Benchmarking is a good way to verify HBase's setup and performance. There are a few good benchmarks available: HBase's built-in benchmark The Yahoo Cloud Serving Benchmark (YCSB) JMeter for custom workloads HBase's built-in benchmark HBase's built-in benchmark is PerformanceEvaluation. To find its usage, use this: $   hbase org.apache.hadoop.hbase.PerformanceEvaluation To perform a write benchmark, use this: $ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 5 Here we are using five threads and no MapReduce. To accurately measure the throughput, we need to presplit the table that the benchmark writes to. It is TestTable. $ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --presplit=3 randomWrite 5 Here, the table is split in three ways. It is good practice to split the table into as many regions as the number of region servers. There is a read option along with a whole host of scan options. YCSB The YCSB is a comprehensive benchmark suite that works with many systems such as Cassandra, Accumulo, and HBase. Download it from GitHub, as follows: $   git clone git://github.com/brianfrankcooper/YCSB.git Build it like this: $ mvn -DskipTests package Create an HBase table to test against: $ hbase shellhbase> create 'ycsb', 'f1' Now, copy hdfs-site.xml for your cluster into the hbase/src/main/conf/ directory and run the benchmark: $ bin/ycsb load hbase -P workloads/workloada -p columnfamily=f1 -p table=ycsb YCSB offers lots of workloads and options. Please refer to its wiki page at https://github.com/brianfrankcooper/YCSB/wiki. JMeter for custom workloads The standard benchmarks will give you an idea of your HBase cluster's performance. However, nothing can substitute measuring your own workload. We want to measure at least the insert speed or the query speed. We also want to run a stress test. So, we can measure the ceiling on how much our HBase cluster can support. We can do a simple instrumentation as we did earlier too. However, there are tools such as JMeter that can help us with load testing. Please refer to the JMeter website and check out the Hadoop or HBase plugins for JMeter. Monitoring HBase Running any distributed system involves decent monitoring. HBase is no exception. 
Monitoring HBase

Running any distributed system involves decent monitoring. HBase is no exception. Luckily, HBase has the following capabilities:

HBase exposes a lot of metrics
These metrics can be directly consumed by monitoring systems such as Ganglia
We can also obtain these metrics in the JSON format via the REST interface and JMX

Monitoring is a big subject and we consider it part of HBase administration. So, in this section, we will give pointers to tools and utilities that allow you to monitor HBase.

Ganglia

Ganglia is a generic system monitor that can monitor host-level metrics (CPU, disk usage, and so on). The Hadoop stack has had pretty good integration with Ganglia for some time now. HBase and Ganglia integration is set up by modern installers from Cloudera and Hortonworks. To enable Ganglia metrics, update the hadoop-metrics.properties file in the HBase configuration directory. Here's a sample file:

hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=ganglia-server:PORT

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=ganglia-server:PORT

rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=ganglia-server:PORT

This file has to be uploaded to all the HBase servers (master servers as well as region servers). Here are some sample graphs from Ganglia (these particular ones show Wikimedia statistics, as an example):

These graphs show cluster-wide resource utilization.

OpenTSDB

OpenTSDB is a scalable time series database. It can collect and visualize metrics on a large scale. OpenTSDB uses collectors, lightweight agents that gather metrics and send them to the OpenTSDB server, and there is a collector library that can collect metrics from HBase. You can see all the collectors at http://opentsdb.net/docs/build/html/user_guide/utilities/tcollector.html.

An interesting factoid is that OpenTSDB is itself built on Hadoop/HBase.

Collecting metrics via the JMX interface

HBase exposes a lot of metrics via JMX. This page can be accessed from the web dashboard at http://<hbase master>:60010/jmx. For example, for an HBase instance that is running locally, it will be http://localhost:60010/jmx.

Here is a sample screenshot of the JMX metrics via the web UI:

Here's a quick example of how to programmatically retrieve these metrics using curl:

$ curl 'localhost:60010/jmx'

Since this is a web service, we can write a script/application in any language (Java, Python, or Ruby) to retrieve and inspect the metrics.

Summary

In this article, you learned how to push the performance of your HBase applications up. We looked at how to effectively load a large amount of data into HBase. You also learned about benchmarking and monitoring HBase and saw tips on how to do high-performing reads/writes.

Resources for Article:

Further resources on this subject:

The HBase's Data Storage [article]
Hadoop and HDInsight in a Heartbeat [article]
Understanding the HBase Ecosystem [article]

How to Build a Koa Web Application - Part 1

Christoffer Hallas
15 Dec 2014
8 min read
You may be a seasoned or novice web developer, but no matter your level of experience, you must always be able to set up a basic MVC application. This two part series will briefly show you how to use Koa, a bleeding edge Node.js web application framework to create a web application using MongoDB as its database. Koa has a low footprint and tries to be as unbiased as possible. For this series, we will also use Jade and Mongel, two Node.js libraries that provide HTML template rendering and MongoDB model interfacing, respectively. Note that this series requires you to use Node.js version 0.11+. At the end of the series, we will have a small and basic app where you can create pages with a title and content, list your pages, and view them. Let’s get going! Using NPM and Node.js If you do not already have Node.js installed, you can download installation packages at the official Node.js website, http://nodejs.org. I strongly suggest that you install Node.js in order to code along with the article. Once installed, Node.js will add two new programs to your computer that you can access from your terminal; they’re node and npm. The first program is the main Node.js program and is used to run Node.js applications, and the second program is the Node Package Manager and it’s used to install Node.js packages. For this application we start out in an empty folder by using npm to install four libraries: $ npm install koa jade mongel co-body Once this is done, open your favorite text editor and create an index.js file in the folder in which we will now start our creating our application. We start by using the require function to load the four libraries we just installed: var koa = require('koa'); var jade = require('jade'); var mongel = require('mongel'); var parse = require(‘co-body'); This simply loads the functionality of the libraries into the respective variables. This lets us create our Page model and our Koa app variables: var Page = mongel('pages', ‘mongodb://localhost/app'); var app = koa(); As you can see, we now use the variables mongel and koa that we previously loaded into our program using require. To create a model with mongel, all we have to do is give the name of our MongoDB collection and a MongoDB connection URI that represents the network location of the database; in this case we’re using a local installation of MongoDB and a database called app. It’s simple to create a basic Koa application, and as seen in the code above, all we do is create a new variable called app that is the result of calling the Koa library function. Middleware, generators, and JavaScript Koa uses a new feature in JavaScript called generators. Generators are not widely available in browsers yet except for some versions of Google Chrome, but since Node.js is built on the same JavaScript as Google Chrome it can use generators. The generators function is much like a regular JavaScript function, but it has a special ability to yield several values along with the normal ability of returning a single value. Some expert JavaScript programmers used this to create a new and improved way of writing asynchronous code in JavaScript, which is required when building a networked application such as a web application. The generators function is a complex subject and we won’t cover it in detail. We’ll just show you how to use it in our small and basic app. In Koa, generators are used as something called middleware, a concept that may be familiar to you from other languages such as Ruby and Python. 
Think of middleware as a stack of functions through which an HTTP request must travel in order to create an appropriate response. Middleware should be created so that the functionality of a given middleware is encapsulated together. In our case, this means we’ll be creating two pieces of middleware: one to create pages and one to list pages or show a page. Let’s create our first middleware: app.use(function* (next) { … }); As you can see, we start by calling the app.use function, which takes a generator as its argument, and this effectively pushes the generator into the stack. To create a generator, we use a special function syntax where an asterisk is added as seen in the previous code snippet. We let our generator take a single argument called next, which represents the next middleware in the stack, if any. From here on, it is simply a matter of checking and responding to the parameters of the HTTP request, which are accessible to us in the Koa context. This is also the function context, which in JavaScript is the keyword this, similar to other languages and the keyword self: if (this.path != '/create') { yield next; return } Since we’re creating some middleware that helps us create pages, we make sure that this request is for the right path, in our case, /create; if not, we use the yield keyword and the next argument to pass the control of the program to the next middleware. Please note the return keyword that we also use; this is very important in this case as the middleware would otherwise continue while also passing control to the next middleware. This is not something you want to happen unless the middleware you’re in will not modify the Koa context or HTTP response, because subsequent middleware will always expect that they’re now in control. Now that we have checked that the path is correct, we still have to check the method to see if we’re just showing the form to create a page, or if we should actually create a page in the database: if (this.method == 'POST') { var body = yield parse.form(this); var page = yield Page.createOne({    title: body.title,    contents: body.contents }); this.redirect('/' + page._id); return } else if (this.method != 'GET') { this.status = 405; this.body = 'Method Not Allowed'; return } To check the method, we use the Koa context again and the method attribute. If we’re handling a POST request we now know how to create a page, but this also means that we must extract extra information from the request. Koa does not process the body of a request, only the headers, so we use the co-body library that we downloaded early and loaded in as the parse variable. Notice how we yield on the parse.form function; this is because this is an asynchronous function and we have to wait until it is done before we continue the program. Then we proceed to use our mongel model Page to create a page using the data we found in the body of the request, again this is an asynchronous function and we use yield to wait before we finally redirect the request using the page’s database id. If it turns out the method was not POST, we still want to use this middleware to show the form that is actually used to issue the request. That means we have to make sure that the method is GET, so we added an else if statement to the original check, and if the request is neither POST or GET we respond with an HTTP status 405 and the message Method Not Allowed, which is the appropriate response for this case. 
Notice how we don’t yield next; this is because the middleware was able to determine a satisfying response for the request and it requires no further processing. Finally, if the method was GET, that is, the user is simply requesting the form, we use the Jade library that we also installed using npm to render the create.jade template to HTML:

var html = jade.renderFile('create.jade');
this.body = html;

Notice how we set the Koa context’s body attribute to the rendered HTML from Jade; all this does is tell Koa that we want to send that back to the browser that sent the request.

Wrapping up

You are well on your way to creating your Koa app. In Part 2 we will implement Jade templates and list and view pages. Ready for the next step? Read Part 2 here. Explore all of our top Node.js content in one place - visit our Node.js page today!

About the author

Christoffer Hallas is a software developer and entrepreneur from Copenhagen, Denmark. He is a computer polyglot and contributes to and maintains a number of open source projects. When not contemplating his next grand idea (which remains an idea) he enjoys music, sports, and design of all kinds. Christoffer can be found on GitHub as hallas and on Twitter as @hamderhallas.

QGIS Feature Selection Tools

Packt
05 Dec 2014
4 min read
 In this article by Anita Graser, the author of Learning QGIS Third Edition, we will cover the following topics: Selecting features with the mouse Selecting features using expressions Selecting features using Spatial queries (For more resources related to this topic, see here.) Selecting features with the mouse The first group of tools in the Attributes toolbar allows us to select features on the map using the mouse. The following screenshot shows the Select Feature(s) tool. We can select a single feature by clicking on it or select multiple features by drawing a rectangle. The other tools can be used to select features by drawing different shapes: polygons, freehand areas, or circles around the features. All features that intersect with the drawn shape are selected. Holding down the Ctrl key will add the new selection to an existing one. Similarly, holding down Ctrl + Shift will remove the new selection from the existing selection. Selecting features by expression The second type of select tool is called Select by Expression, and it is also available in the Attribute toolbar. It selects features based on expressions that can contain references and functions using feature attributes and/or geometry. The list of available functions is pretty long, but we can use the search box to filter the list by name to find the function we are looking for faster. On the right-hand side of the window, we will find Selected Function Help, which explains the functionality and how to use the function in an expression. The Function List option also shows the layer attribute fields, and by clicking on Load all unique values or Load 10 sample values, we can easily access their content. As with the mouse tools, we can choose between creating a new selection or adding to or deleting from an existing selection. Additionally, we can choose to only select features from within an existing selection. Let's have a look at some example expressions that you can build on and use in your own work: Using the lakes.shp file in our sample data, we can, for example, select big lakes with an area bigger than 1,000 square miles using a simple attribute query, "AREA_MI" > 1000.0, or using geometry functions such as $area > (1000.0 * 27878400). Note that the lakes.shp CRS uses feet, and we, therefore, have to multiply by 27,878,400 to convert from square feet to square miles. The dialog will look like the one shown in the following screenshot. We can also work with string functions, for example, to find lakes with long names, such as length("NAMES") > 12, or lakes with names that contain the s or S character, such as lower("NAMES") LIKE '%s%', which first converts the names to lowercase and then looks for any appearance of s. Selecting features using spatial queries The third type of tool is called Spatial Query and allows us to select features in one layer based on their location, relative to the features in a second layer. These tools can be accessed by going to Vector | Research Tools | Select by location and then going to Vector | Spatial Query | Spatial Query. Enable it in Plugin Manager if you cannot find it in the Vector menu. In general, we want to use the Spatial Query plugin, as it supports a variety of spatial operations such as crosses, equals, intersects, is disjoint, overlaps, touches, and contains, depending on the layer's geometry type. Let's test the Spatial Query plugin using railroads.shp and pipelines.shp from the sample data. 
For example, we might want to find all the railroad features that cross a pipeline; we will, therefore, select the railroads layer, the Crosses operation, and the pipelines layer. After clicking on Apply, the plugin presents us with the query results. There is a list of IDs of the result features on the right-hand side of the window, as you can see in the following screenshot. Below this list, we can select the Zoom to item checkbox, and QGIS will zoom to the feature that belongs to the selected ID. Additionally, the plugin offers buttons to directly save all the resulting features to a new layer. Summary This article introduced you to three solutions to select features in QGIS: selecting features with mouse, using spatial queries, and using expressions. Resources for Article: Further resources on this subject: Editing attributes [article] Server Logs [article] Improving proximity filtering with KNN [article]

OGC for ESRI Professionals

Packt
27 Nov 2014
16 min read
In this article by Stefano Iacovella, author of GeoServer Cookbook, we look into a brief comparison between GeoServer and ArcGIS for Server, a map server created by ESRI. The importance of adopting OGC standards when building a geographical information system is stressed. We will also learn how OGC standards let us create a system where different pieces of software cooperate with each other. (For more resources related to this topic, see here.) ArcGIS versus GeoServer As an ESRI professional, you obviously know the server product from this vendor that can be compared to GeoServer well. It is called ArcGIS for Server and in many ways it can play the same role as that of GeoServer, and the opposite is true as well, of course. Undoubtedly, the big question for you is: why should I use GeoServer and not stand safely on the vendor side, leveraging on integration with the other software members of the big ArcGIS family? Listening to colleagues, asking to experts, and browsing on the Internet, you'll find a lot of different answers to this question, often supported by strong arguments and somehow by a religious and fanatic approach. There are a few benchmarks available on the Internet that compare performances of GeoServer and other open source map servers versus ArcGIS for Server. Although they're not definitely authoritative, a reasonably objective advantage of GeoServer and its OS cousins on ArcGIS for Server is recognizable. Anyway, I don't think that your choice should overestimate the importance of its performance. I'm sorry but my answer to your original question is another question: why should you choose a particular piece of software? This may sound puzzling, so let me elaborate a bit on the topic. Let's say you are an IT architect and a customer asked you to design a solution for a GIS portal. Of course, in that specific case, you have to give him or her a detailed response, containing specific software that'll be used for data publication. Also, as a professional, you'll arrive to the solution by accurately considering all requirements and constraints that can be inferred from the talks and surveying what is already up and running at the customer site. Then, a specific answer to what the software best suited for the task is should exist in any specific case. However, if you consider the question from a more general point of view, you should be aware that a map server, which is the best choice for any specific case, does not exist. You may find that the licensing costs a limit in some case or the performances in some other cases will lead you to a different choice. Also, as in any other job, the best tool is often the one you know better, and this is quite true when you are in a hurry and your customer can't wait to have the site up and running. So the right approach, although a little bit generic, is to keep your mind open and try to pick the right tool for any scenario. However, a general answer does exist. It's not about the vendor or the name of the piece of software you're going to use; it's about the way the components or your system communicate among them and with external systems. It's about standard protocol. This is a crucial consideration for any GIS architect or developer; nevertheless, if you're going to use an ESRI suite of products or open source tools, you should create your system with special care to expose data with open standards. 
Understanding standards Let's take a closer look at what standards are and why they're so important when you are designing your GIS solution. The term standard as mentioned in Wikipedia (http://en.wikipedia.org/wiki/ Technical_standard) may be explained as follows: "An established norm or requirement in regard to technical systems. It is usually a formal document that establishes uniform engineering or technical criteria, methods, processes and practices. In contrast, a custom, convention, company product, corporate standard, etc. that becomes generally accepted and dominant is often called a de facto standard." Obviously, a lot of standards exist if you consider the Information Technology domain. Standards are usually formalized by standards organization, which usually involves several members from different areas, such as government agencies, private companies, education, and so on. In the GIS world, an authoritative organization is the Open Geospatial Consortium (OGC), which you may find often cited in this book in many links to the reference information. In recent years, OGC has been publishing several standards that cover the interaction of the GIS system and details on how data is transferred from one software to another. We'll focus on three of them that are widely used and particularly important for GeoServer and ArcGIS for Server: WMS: This is the acronym for Web Mapping Service. This standard describes how a server should publish data for mapping purposes, which is a static representation of data. WFS: This is the acronym for Web Feature Service. This standard describes the details of publishing data for feature streaming to a client. WCS: This is the acronym for Web Coverage Service. This standard describes the details of publishing data for raster data streaming to a client. It's the equivalent of WFS applied to raster data. Now let's dive into these three standards. We'll explore the similarities and differences among GeoServer and ArcGIS for Server. WMS versus the mapping service As an ESRI user, you surely know how to publish some data in a map service. This lets you create a web service that can be used by a client who wants to show the map and data. This is the proprietary equivalent of exposing data through a WMS service. With WMS, you can inquire the server for its capabilities with an HTTP request: $ curl -XGET -H 'Accept: text/xml' 'http://localhost:8080/geoserver/wms?service=WMS &version=1.1.1&request=GetCapabilities' -o capabilitiesWMS.xml Browsing through the XML document, you'll know which data is published and how this can be represented. If you're using the proprietary way of exposing map services with ESRI, you can perform a similar query that starts from the root: $ curl -XGET 'http://localhost/arcgis/rest/services?f=pjson' -o capabilitiesArcGIS.json The output, in this case formatted as a JSON file, is a text file containing the first of the services and folders available to an anonymous user. It looks like the following code snippet: {"currentVersion": 10.22,"folders": ["Geology","Cultural data",…"Hydrography"],"services": [{"name": "SampleWorldCities","type": "MapServer"}]} At a glance, you can recognize two big differences here. Firstly, there are logical items, which are the folders that work only as a container for services. Secondly, there is no complete definition of items, just a list of elements contained at a certain level of a publishing tree. 
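The WMS capabilities document, by contrast, is self-describing: a single request returns the full list of published layers together with their metadata, so any client can consume it without vendor-specific knowledge. As a rough illustration only, assuming a local GeoServer like the one queried above and using the Scala XML library, a client could list the published layer names like this:

import java.net.URL
import scala.xml.XML

object ListWmsLayers {
  def main(args: Array[String]): Unit = {
    val capabilitiesUrl = new URL(
      "http://localhost:8080/geoserver/wms" +
        "?service=WMS&version=1.1.1&request=GetCapabilities")

    // Fetch and parse the capabilities document in one go.
    val capabilities = XML.load(capabilitiesUrl)

    // Every published layer advertises its name in a <Layer><Name> element.
    val layerNames = (capabilities \\ "Layer" \ "Name").map(_.text).distinct
    layerNames.foreach(println)
  }
}

Getting the equivalent level of detail out of the proprietary REST interface takes one more round trip per item, as shown next.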
To obtain specific information about an element, you can perform another request pointing to the item: $ curl -XGET 'http://localhost/arcgis/rest/ services/SampleWorldCities/MapServer?f=pjson' -o SampleWorldCities.json Setting up an ArcGIS site is out of the scope of this book; besides, this appendix assumes that you are familiar with the software and its terminology. Anyway, all the examples use the SampleWorldCities service, which is a default service created by the standard installation. In the new JSON file, you'll find a lot of information about the specific service: {"currentVersion": 10.22,"serviceDescription": "A sample service just for demonstation.","mapName": "World Cities Population","description": "","copyrightText": "","supportsDynamicLayers": false,"layers": [{"id": 0,"name": "Cities","parentLayerId": -1,"defaultVisibility": true,"subLayerIds": null,"minScale": 0,"maxScale": 0},…"supportedImageFormatTypes":"PNG32,PNG24,PNG,JPG,DIB,TIFF,EMF,PS,PDF,GIF,SVG,SVGZ,BMP",…"capabilities": "Map,Query,Data","supportedQueryFormats": "JSON, AMF","exportTilesAllowed": false,"maxRecordCount": 1000,"maxImageHeight": 4096,"maxImageWidth": 4096,"supportedExtensions": "KmlServer"} Please note the information about the image format supported. We're, in fact, dealing with a map service. As for the operation supported, this one shows three different operations: Map, Query, and Data. For the first two, you can probably recognize the equivalent of the GetMap and GetFeatureinfo operations of WMS, while the third one is little bit more mysterious. In fact, it is not relevant to map services and we'll explore it in the next paragraph. If you're familiar with the GeoServer REST interface, you can see the similarities in the way you can retrieve information. We don't want to explore the ArcGIS for Server interface in detail and how to handle it. What is important to understand is the huge difference with the standard WMS capabilities document. If you're going to create a client to interact with maps produced by a mix of ArcGIS for Server and GeoServer, you should create different interfaces for both. In one case, you can interact with the proprietary REST interface and use the standard WMS for GeoServer. However, there is good news for you. ESRI also supports standards. If you go to the map service parameters page, you can change the way the data is published.   The situation shown in the previous screenshot is the default capabilities configuration. As you can see, there are options for WMS, WFS, and WCS, so you can expose your data with ArcGIS for Server according to the OGC standards. If you enable the WMS option, you can now perform this query: $ curl -XGET 'http://localhost/arcgis/ services/SampleWorldCities/MapServer/ WMSServer?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetCapabilities'    -o capabilitiesArcGISWMS.xml The information contained is very similar to that of the GeoServer capabilities. A point of attention is about fundamental differences in data publishing with the two software. In ArcGIS for Server, you always start from a map project. A map project is a collection of datasets, containing vector or raster data, with a drawing order, a coordinate reference system, and rules to draw. It is, in fact, very similar to a map project you can prepare with a GIS desktop application. Actually, in the ESRI world, you should use ArcGIS for desktop to prepare the map project and then publish it on the server. In GeoServer, the map concept doesn't exist. 
You publish data, setting several parameters, and the map composition is totally demanded to the client. You can only mimic a map, server side, using the group layer for a logical merge of several layers in a single entity. In ArcGIS for Server, the map is central to the publication process; also, if you just want to publish a single dataset, you have to create a map project, containing just that dataset, and publish it. Always remember this different approach; when using WMS, you can use the same operation on both servers. A GetMap request on the previous map service will look like this: $ curl -XGET 'http://localhost/arcgis/services/ SampleWorldCities/MapServer/WMSServer?service= WMS&version=1.1.0&request=GetMap&layers=fields&styles =&bbox=47.130647,8.931116,48.604188,29.54223&srs= EPSG:4326&height=445&width=1073&format=img/png' -o map.png Please note that you can filter what layers will be drawn in the map. By default, all the layers contained in the map service definition will be drawn. WFS versus feature access If you open the capabilities panel for the ArcGIS service again, you will note that there is an option called feature access. This lets you enable the feature streaming to a client. With this option enabled, your clients can acquire features and symbology information to ArcGIS and render them directly on the client side. In fact, feature access can also be used to edit features, that is, you can modify the features on the client and then post the changes on the server. When you check the Feature Access option, many specific settings appear. In particular, you'll note that by default, the Update operation is enabled, but the Geometry Updates is disabled, so you can't edit the shape of each feature. If you want to stream features using a standard approach, you should instead turn on the WFS option. ArcGIS for Server supports versions 1.1 and 1.0 of WFS. Moreover, the transactional option, also known as WFS-T, is fully supported.   As you can see in the previous screenshot, when you check the WFS option, several more options appear. In the lower part of the panel, you'll find the option to enable the transaction, which is the editing feature. In this case, there is no separate option for geometry and attributes; you can only decide to enable editing on any part of your features. After you enable the WFS, you can access the capabilities from this address: $ curl -XGET 'http://localhost/arcgis/services/ SampleWorldCities/MapServer/WFSServer?SERVICE=WFS&VERSION=1.1. 0&REQUEST=GetCapabilities' -o capabilitiesArcGISWFS.xml Also, a request for features is shown as follows: $ curl -XGET "http://localhost/arcgis/services/SampleWorldCities /MapServer/WFSServer?service=wfs&version=1.1.0 &request=GetFeature&TypeName=SampleWorldCities: cities&maxFeatures=1" -o getFeatureArcGIS.xml This will output a GML code as a result of your request. As with WMS, the syntax is the same. 
You only need to pay attention to the difference between the service and the contained layers: <wfs:FeatureCollection xsi:schemaLocation="http://localhost/arcgis/services/SampleWorldCities/MapServer/WFSServer http://localhost/arcgis/services/SampleWorldCities/MapServer/WFSServer?request=DescribeFeatureType%26version=1.1.0%26typename=citieshttp://www.opengis.net/wfs http://schemas.opengis.net/wfs/1.1.0/wfs.xsd"><gml:boundedBy><gml:Envelope srsName="urn:ogc:def:crs:EPSG:6.9:4326"><gml:lowerCorner>-54.7919921875 -176.1514892578125</gml:lowerCorner><gml:upperCorner>78.2000732421875179.221923828125</gml:upperCorner></gml:Envelope></gml:boundedBy><gml:featureMember><SampleWorldCities:cities gml_id="F4__1"><SampleWorldCities:OBJECTID>1</SampleWorldCities:OBJECTID><SampleWorldCities:Shape><gml:Point><gml:pos>-15.614990234375 -56.093017578125</gml:pos></gml:Point></SampleWorldCities:Shape><SampleWorldCities:CITY_NAME>Cuiaba</SampleWorldCities:CITY_NAME><SampleWorldCities:POP>521934</SampleWorldCities:POP><SampleWorldCities:POP_RANK>3</SampleWorldCities:POP_RANK><SampleWorldCities:POP_CLASS>500,000 to999,999</SampleWorldCities:POP_CLASS><SampleWorldCities:LABEL_FLAG>0</SampleWorldCities:LABEL_FLAG></SampleWorldCities:cities></gml:featureMember></wfs:FeatureCollection> Publishing raster data with WCS The WCS option is always present in the panel to configure services. As we already noted, WCS is used to publish raster data, so this may sound odd to you. Indeed, ArcGIS for Server lets you enable the WCS option, only if the map project for the service contains one of the following: A map containing raster or mosaic layers A raster or mosaic dataset A layer file referencing a raster or mosaic dataset A geodatabase that contains raster data If you try to enable the WCS option on SampleWorldCities, you won't get an error. Then, try to ask for the capabilities: $ curl -XGET "http://localhost/arcgis/services /SampleWorldCities/MapServer/ WCSServer?SERVICE=WCS&VERSION=1.1.1&REQUEST=GetCapabilities" -o capabilitiesArcGISWCS.xml You'll get a proper document, compliant to the standard and well formatted, but containing no reference to any dataset. Indeed, the sample service does not contain any raster data:  <Capabilities xsi_schemaLocation="http://www.opengis.net/wcs/1.1.1http://schemas.opengis.net/wcs/1.1/wcsGetCapabilities.xsdhttp://www.opengis.net/ows/1.1/http://schemas.opengis.net/ows/1.1.0/owsAll.xsd"version="1.1.1"><ows:ServiceIdentification><ows:Title>WCS</ows:Title><ows:ServiceType>WCS</ows:ServiceType><ows:ServiceTypeVersion>1.0.0</ows:ServiceTypeVersion><ows:ServiceTypeVersion>1.1.0</ows:ServiceTypeVersion><ows:ServiceTypeVersion>1.1.1</ows:ServiceTypeVersion><ows:ServiceTypeVersion>1.1.2</ows:ServiceTypeVersion><ows:Fees>NONE</ows:Fees><ows:AccessConstraints>None</ows:AccessConstraints></ows:ServiceIdentification>...<Contents><SupportedCRS>urn:ogc:def:crs:EPSG::4326</SupportedCRS><SupportedFormat>image/GeoTIFF</SupportedFormat><SupportedFormat>image/NITF</SupportedFormat><SupportedFormat>image/JPEG</SupportedFormat><SupportedFormat>image/PNG</SupportedFormat><SupportedFormat>image/JPEG2000</SupportedFormat><SupportedFormat>image/HDF</SupportedFormat></Contents></Capabilities> If you want to try out WCS, other than the GetCapabilities operation, you need to publish a service with raster data; or, you may take a look at the sample service from ESRI arcgisonline™. 
Try the following request: $ curl -XGET "http://sampleserver3.arcgisonline.com/ ArcGIS/services/World/Temperature/ImageServer/ WCSServer?SERVICE=WCS&VERSION=1.1.0&REQUEST=GETCAPABILITIES" -o capabilitiesArcGISWCS.xml Parsing the XML file, you'll find that the contents section now contains coverage, raster data that you can retrieve from that server:  …<Contents><CoverageSummary><ows:Title>Temperature1950To2100_1</ows:Title><ows:Abstract>Temperature1950To2100</ows:Abstract><ows:WGS84BoundingBox><ows:LowerCorner>-179.99999999999994 -55.5</ows:LowerCorner><ows:UpperCorner>180.00000000000006 83.5</ows:UpperCorner></ows:WGS84BoundingBox><Identifier>1</Identifier></CoverageSummary><SupportedCRS>urn:ogc:def:crs:EPSG::4326</SupportedCRS><SupportedFormat>image/GeoTIFF</SupportedFormat><SupportedFormat>image/NITF</SupportedFormat><SupportedFormat>image/JPEG</SupportedFormat><SupportedFormat>image/PNG</SupportedFormat><SupportedFormat>image/JPEG2000</SupportedFormat><SupportedFormat>image/HDF</SupportedFormat></Contents> You can, of course, use all the operations supported by standard. The following request will return a full description of one or more coverages within the service in the GML format. An example of the URL is shown as follows: $ curl -XGET "http://sampleserver3.arcgisonline.com/ ArcGIS/services/World/Temperature/ImageServer/ WCSServer?SERVICE=WCS&VERSION=1.1.0&REQUEST=DescribeCoverage& COVERAGE=1" -o describeCoverageArcGISWCS.xml Also, you can obviously request for data, and use requests that will return coverage in one of the supported formats, namely GeoTIFF, NITF, HDF, JPEG, JPEG2000, and PNG. Another URL example is shown as follows: $ curl -XGET "http://sampleserver3.arcgisonline.com/ ArcGIS/services/World/Temperature/ImageServer/ WCSServer?SERVICE=WCS&VERSION=1.0.0 &REQUEST=GetCoverage&COVERAGE=1&CRS=EPSG:4326 &RESPONSE_CRS=EPSG:4326&BBOX=-158.203125,- 105.46875,158.203125,105.46875&WIDTH=500&HEIGHT=500&FORMAT=jpeg" -o coverage.jpeg  Summary In this article, we started with the differences between ArcGIS and GeoServer and then moved on to understanding standards. Then we went on to compare WMS with mapping service as well as WFS with feature access. Finally we successfully published a raster dataset with WCS. Resources for Article: Further resources on this subject: Getting Started with GeoServer [Article] Enterprise Geodatabase [Article] Sending Data to Google Docs [Article]

Setting up Qt Creator for Android

Packt
27 Nov 2014
8 min read
This article by Ray Rischpater, the author of the book Application Development with Qt Creator Second Edition, focuses on setting up Qt Creator for Android. Android's functionality is delimited in API levels; Qt for Android supports Android API level 10 and above: that's Android 2.3.3, a variant of Gingerbread. Fortunately, most devices in the market today are at least Gingerbread, making Qt for Android a viable development platform for millions of devices.

Downloading all the pieces

To get started with Qt Creator for Android, you're going to need to download a lot of stuff. Let's get started:

Begin with a release of Qt for Android, which you can download from http://qt-project.org/downloads.
The Android developer tools require the current version of the Java Development Kit (JDK) (not just the runtime, the Java Runtime Environment, but the whole kit and caboodle); you can download it from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.
You need the latest Android Software Development Kit (SDK), which you can download for Mac OS X, Linux, or Windows at http://developer.android.com/sdk/index.html.
You need the latest Android Native Development Kit (NDK), which you can download at http://developer.android.com/tools/sdk/ndk/index.html.
You need the current version of Ant, the Java build tool, which you can download at http://ant.apache.org/bindownload.cgi.

Download, unzip, and install each of these, in the given order. On Windows, I installed the Android SDK and NDK by unzipping them to the root of my hard drive and installed the JDK at the default location I was offered.

Setting environment variables

Once you install the JDK, you need to be sure that you've set your JAVA_HOME environment variable to point to the directory where it was installed. How you do this differs from platform to platform; on a Mac OS X or Linux box, you'd edit .bashrc, .tcshrc, or the like; on Windows, go to System Properties, click on Environment Variables, and add the JAVA_HOME variable. The path should point to the base of the JDK directory; for me, it was C:\Program Files\Java\jdk1.7.0_25, although the path for you will depend on where you installed the JDK and which version you installed. (Make sure you set the path with the trailing directory separator; the Android SDK is pretty fussy about that sort of thing.)

Next, you need to update your PATH to point to all the stuff you just installed. Again, this is an environment variable and you'll need to add the following:

The bin directory of your JDK
The tools directory of the Android SDK
The platform-tools directory of the Android SDK

For me, on my Windows 8 computer, my PATH includes this now:

…C:\Program Files\Java\jdk1.7.0_25\bin;C:\adt-bundle-windows-x86_64-20130729\sdk\tools;C:\adt-bundle-windows-x86_64-20130729\sdk\platform-tools;…

Don't forget the separators: on Windows, it's a semicolon (;), while on Mac OS X and Linux, it's a colon (:). An environment variable is a variable maintained by your operating system which affects its configuration; see http://en.wikipedia.org/wiki/Environment_variable for more details.
At this point, it's a good idea to restart your computer (if you're running Windows) or log out and log in again (on Linux or Mac OS X) to make sure that all these settings take effect. If you're on a Mac OS X or Linux box, you might be able to start a new terminal and have the same effect (or reload your shell configuration file) instead, but I like the idea of restarting at this point to ensure that the next time I start everything up, it'll work correctly. Finishing the Android SDK installation Now, we need to use the Android SDK tools to ensure that you have a full version of the SDK for at least one Android API level installed. We'll need to start Eclipse, the Android SDK's development environment, and run the Android SDK manager. To do this, follow these steps: Find Eclipse. It's probably in the Eclipse directory of the directory where you installed the Android SDK. If Eclipse doesn't start, check your JAVA_HOME and PATH variables; the odds are that Eclipse will not find the Java environment it needs to run. Click on OK when Eclipse prompts you for a workspace. This doesn't matter; you won't use Eclipse except to download Android SDK components. Click on the Android SDK Manager button in the Eclipse toolbar (circled in the next screenshot): Make sure that you have at least one Android API level above API level 10 installed, along with the Google USB Driver (you'll need this to debug on the hardware). Quit Eclipse. Next, let's see whether the Android Debug Bridge—the software component that transfers your executables to your Android device and supports on-device debugging—is working as it should. Fire up a shell prompt and type adb. If you see a lot of output and no errors, the bridge is correctly installed. If not, go back and check your PATH variable to be sure it's correct. While you're at it, you should developer-enable your Android device too so that it'll work with ADB. Follow the steps provided at http://bit.ly/1a29sal. Configuring Qt Creator Now, it's time to tell Qt Creator about all the stuff you just installed. Perform the following steps: Start Qt Creator but don't create a new project. Under the Tools menu, select Options and then click on Android. Fill in the blanks, as shown in the next screenshot. They should be: The path to the SDK directory, in the directory where you installed the Android SDK. The path to where you installed the Android NDK. Check Automatically create kits for Android tool chains. The path to Ant; here, enter either the path to the Ant executable itself on Mac OS X and Linux platforms or the path to ant.bat in the bin directory of the directory where you unpacked Ant. The directory where you installed the JDK (this might be automatically picked up from your JAVA_HOME directory), as shown in the following screenshot: Click on OK to close the Options window. You should now be able to create a new Qt GUI or Qt Quick application for Android! Do so, and ensure that Android is a target option in the wizard, as the next screenshot shows; be sure to choose at least one ARM target, one x86 target, and one target for your desktop environment: If you want to add Android build configurations to an existing project, the process is slightly different. Perform the following steps: Load the project as you normally would. Click on Projects in the left-hand side pane. The Projects pane will open. Click on Add Kit and choose the desired Android (or other) device build kit. 
The following screenshot shows you where the Projects and Add Kit buttons are in Qt Creator: Building and running your application Write and build your application normally. A good idea is to build the Qt Quick Hello World application for Android first before you go to town and make a lot of changes, and test the environment by compiling for the device. When you're ready to run on the device, perform the following steps: Navigate to Projects (on the left-hand side) and then select the Android for arm kit's Run Settings. Under Package Configurations, ensure that the Android SDK level is set to the SDK level of the SDK you installed. Ensure that the Package name reads something similar to org.qtproject.example, followed by your project name. Connect your Android device to your computer using the USB cable. Select the Android for arm run target and then click on either Debug or Run to debug or run your application on the device. Summary Qt for Android gives you an excellent leg up on mobile development, but it's not a panacea. If you're planning to target mobile devices, you should be sure to have a good understanding of the usage patterns for your application's users as well as the constraints in CPU, GPU, memory, and network that a mobile application must run on. Once we understand these, however, all of our skills with Qt Creator and Qt carry over to the mobile arena. To develop for Android, begin by installing the JDK, Android SDK, Android NDK, and Ant, and then develop applications as usual: compiling for the device and running on the device frequently to iron out any unexpected problems along the way. Resources for Article: Further resources on this subject: Reversing Android Applications [article] Building Android (Must know) [article] Introducing an Android platform [article]

Concurrency in Practice

Packt
26 Nov 2014
25 min read
This article written by Aleksandar Prokopec, the author of Learning Concurrent Programming in Scala, helps you develop skills that are necessary to write correct and efficient concurrent programs. It teaches you about concurrency in Scala through a sequence of programs. (For more resources related to this topic, see here.) "The best theory is inspired by practice."                                          -Donald Knuth We have studied a plethora of different concurrency facilities in this article. By now, you will have learned about dozens of different ways of starting concurrent computations and accessing shared data. Knowing how to use different styles of concurrency is useful, but it might not yet be obvious when to use which. The goal of this article is to introduce the big picture of concurrent programming. We will study the use cases for various concurrency abstractions, see how to debug concurrent programs, and how to integrate different concurrency libraries in larger applications. In this article, we perform the following tasks: Investigate how to deal with various kinds of bugs appearing in concurrent applications Learn how to identify and resolve performance bottlenecks Apply the previous knowledge about concurrency to implement a larger concurrent application, namely, a remote file browser We start with an overview of the important concurrency frameworks that we learned about in this article, and a summary of when to use each of them. Choosing the right tools for the job In this section, we present an overview of the different concurrency libraries that we learned about. We take a step back and look at the differences between these libraries, and what they have in common. This summary will give us an insight into what different concurrency abstractions are useful for. A concurrency framework usually needs to address several concerns: It must provide a way to declare data that is shared between concurrent executions It must provide constructs for reading and modifying program data It must be able to express conditional execution, triggered when a certain set of conditions are fulfilled It must define a way to start concurrent executions Some of the frameworks from this article address all of these concerns; others address only a subset, and transfer part of the responsibility to another framework. Typically, in a concurrent programming model, we express concurrently shared data differently from data intended to be accessed only from a single thread. This allows the JVM runtime to optimize sequential parts of the program more effectively. So far, we've seen a lot of different ways to express concurrently shared data, ranging from the low-level facilities to advanced high-level abstractions. We summarize different data abstractions in the following table: Data abstraction Datatype or annotation Description Volatile variables (JDK) @volatile Ensure visibility and the happens-before relationship on class fields and local variables that are captured in closures. Atomic variables (JDK) AtomicReference[T] AtomicInteger AtomicLong Provide basic composite atomic operations, such as compareAndSet and incrementAndGet. Futures and promises (scala.concurrent) Future[T] Promise[T] Sometimes called single-assignment variables, these express values that might not be computed yet, but will eventually become available. Observables and subjects (Rx) Observable[T] Subject[T] Also known as first-class event streams, these describe many different values that arrive one after another in time. 
Transactional references (Scala Software Transactional Memory (STM)) Ref[T] These describe memory locations that can only be accessed from within memory transactions. Their modifications only become visible after the transaction successfully commits. The next important concern is providing access to shared data, which includes reading and modifying shared memory locations. Usually, a concurrent program uses special constructs to express such accesses. We summarize the different data access constructs in the following table: Data abstraction Data access constructs Description Arbitrary data (JDK) synchronized   Uses intrinsic object locks to exclude access to arbitrary shared data. Atomic variables and classes (JDK) compareAndSet Atomically exchanges the value of a single memory location. It allows implementing lock-free programs. Futures and promises (scala.concurrent) value tryComplete Used to assign a value to a promise, or to check the value of the corresponding future. The value method is not a preferred way to interact with a future. Transactional references (ScalaSTM) atomic orAtomic single Atomically modify the values of a set of memory locations. Reduces the risk of deadlocks, but disallow side effects inside the transactional block. Concurrent data access is not the only concern of a concurrency framework. Concurrent computations sometimes need to proceed only after a certain condition is met. In the following table, we summarize different constructs that enable this: Concurrency framework Conditional execution constructs Description JVM concurrency wait notify notifyAll Used to suspend the execution of a thread until some other thread notifies that the conditions are met. Futures and promises onComplete Await.ready Conditionally schedules an asynchronous computation. The Await.ready method suspends the thread until the future completes. Reactive extensions subscribe Asynchronously or synchronously executes a computation when an event arrives. Software transactional memory retry retryFor withRetryTimeout Retries the current memory transaction when some of the relevant memory locations change. Actors receive Executes the actor's receive block when a message arrives. Finally, a concurrency model must define a way to start a concurrent execution. We summarize different concurrency constructs in the following table: Concurrency framework Concurrency constructs Description JVM concurrency Thread.start Starts a new thread of execution. Execution contexts execute Schedules a block of code for execution on a thread pool. Futures and promises Future.apply Schedules a block of code for execution, and returns the future value with the result of the execution. Parallel collections par Allows invoking data-parallel versions of collection methods. Reactive extensions Observable.create observeOn The create method defines an event source. The observeOn method schedules the handling of events on different threads. Actors actorOf Schedules a new actor object for execution. This breakdown shows us that different concurrency libraries focus on different tasks. For example, parallel collections do not have conditional waiting constructs, because a data-parallel operation proceeds on separate elements independently. Similarly, software transactional memory does not come with a construct to express concurrent computations, and focuses only on protecting access to shared data. 
Actors do not have special constructs for modeling shared data and protecting access to it, because data is encapsulated within separate actors and accessed serially only by the actor that owns it. Having classified concurrency libraries according to how they model shared data and express concurrency, we present a summary of what different concurrency libraries are good for: The classical JVM concurrency model uses threads, the synchronized statement, volatile variables, and atomic primitives for low-level tasks. Uses include implementing a custom concurrency utility, a concurrent data structure, or a concurrency framework optimized for specific tasks. Futures and promises are best suited for referring to concurrent computations that produce a single result value. Futures model latency in the program, and allow composing values that become available later during the execution of the program. Uses include performing remote network requests and waiting for replies, referring to the result of an asynchronous long-running computation, or reacting to the completion of an I/O operation. Futures are usually the glue of a concurrent application, binding the different parts of a concurrent program together. We often use futures to convert single-event callback APIs into a standardized representation based on the Future type. Parallel collections are best suited for efficiently executing data-parallel operations on large datasets. Usages include file searching, text processing, linear algebra applications, numerical computations, and simulations. Long-running Scala collection operations are usually good candidates for parallelization. Reactive extensions are used to express asynchronous event-based programs. Unlike parallel collections, in reactive extensions, data elements are not available when the operation starts, but arrive while the application is running. Uses include converting callback-based APIs, modeling events in user interfaces, modeling events external to the application, manipulating program events with collection-style combinators, streaming data from input devices or remote locations, or incrementally propagating changes in the data model throughout the program. Use STM to protect program data from getting corrupted by concurrent accesses. An STM allows building complex data models and accessing them with the reduced risk of deadlocks and race conditions. A typical use is to protect concurrently accessible data, while retaining good scalability between threads whose accesses to data do not overlap. Actors are suitable for encapsulating concurrently accessible data, and seamlessly building distributed systems. Actor frameworks provide a natural way to express concurrent tasks that communicate by explicitly sending messages. Uses include serializing concurrent access to data to prevent corruption, expressing stateful concurrency units in the system, and building distributed applications like trading systems, P2P networks, communication hubs, or data mining frameworks. Advocates of specific programming languages, libraries, or frameworks might try to convince you that their technology is the best for any task and any situation, often with the intent of selling it. Richard Stallman once said how computer science is the only industry more fashion-driven than women's fashion. As engineers, we need to know better than to succumb to programming fashion and marketing propaganda. 
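To make the earlier point about futures being the glue of a concurrent application a little more concrete, here is a small illustrative sketch that wraps a callback-based API into a Future by way of a Promise. The Timer-based fetchLater API is invented purely for illustration; it stands in for any single-event callback API you might need to adapt.

import java.util.{Timer, TimerTask}
import scala.concurrent.{Future, Promise}

object CallbackToFuture {
  private val timer = new Timer(true)

  // A callback-style API: invoke `onDone` with a result after a delay.
  def fetchLater(onDone: String => Unit): Unit = {
    timer.schedule(new TimerTask {
      def run(): Unit = onDone("reply")
    }, 500)
  }

  // The adapter: complete a Promise inside the callback, hand out its Future.
  def fetchAsFuture(): Future[String] = {
    val p = Promise[String]()
    fetchLater(result => p.trySuccess(result))
    p.future
  }

  def main(args: Array[String]): Unit = {
    import scala.concurrent.ExecutionContext.Implicits.global
    fetchAsFuture().foreach(reply => println(s"got: $reply"))
    Thread.sleep(1000) // keep the JVM alive long enough for the demo
  }
}

The same pattern underlies many of the Future-returning wrappers you find around callback-heavy libraries.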
Different frameworks are tailored towards specific use cases, and the correct way to choose a technology is to carefully weigh its advantages and disadvantages when applied to a specific situation. There is no one-size-fits-all technology. Use your own best judgment when deciding which concurrency framework to use for a specific programming task. Sometimes, choosing the best-suited concurrency utility is easier said than done. It takes a great deal of experience to choose the correct technology. In many cases, we do not even know enough about the requirements of the system to make an informed decision. Regardless, a good rule of thumb is to apply several concurrency frameworks to different parts of the same application, each best suited for a specific task. Often, the real power of different concurrency frameworks becomes apparent when they are used together. This is the topic of the next section. Putting it all together – a remote file browser In this section, we use our knowledge about different concurrency frameworks to build a remote file browser. This larger application example illustrates how different concurrency libraries work together, and how to apply them to different situations. We will name our remote file browser ScalaFTP. The ScalaFTP browser is divided into two main components: the server and the client process. The server process will run on the machine whose filesystem we want to manipulate. The client will run on our own computer, and comprise of a graphical user interface used to navigate the remote filesystem. To keep things simple, the protocol that the client and the server will use to communicate will not really be FTP, but a custom communication protocol. By choosing the correct concurrency libraries to implement different parts of ScalaFTP, we will ensure that the complete ScalaFTP implementation fits inside just 500 lines of code. Specifically, the ScalaFTP browser will implement the following features: Displaying the names of the files and the directories in a remote filesystem, and allow navigating through the directory structure Copying files between directories in a remote filesystem Deleting files in a remote filesystem To implement separate pieces of this functionality, we will divide the ScalaFTP server and client programs into layers. The task of the server program is to answer to incoming copy and delete requests, and to answer queries about the contents of specific directories. To make sure that its view of the filesystem is consistent, the server will cache the directory structure of the filesystem. We divide the server program into two layers: the filesystem API and the server interface. The filesystem API will expose the data model of the server program, and define useful utility methods to manipulate the filesystem. The server interface will receive requests and send responses back to the client. Since the server interface will require communicating with the remote client, we decide to use the Akka actor framework. Akka comes with remote communication facilities. The contents of the filesystem, that is, its state, will change over time. We are therefore interested in choosing proper constructs for data access. In the filesystem API, we can use object monitors and locking to synchronize access to shared state, but we will avoid these due to the risk of deadlocks. We similarly avoid using atomic variables, because they are prone to race conditions. 
We could encapsulate the filesystem state within an actor, but note that this can lead to a scalability bottleneck: an actor would serialize all accesses to the filesystem state. Therefore, we decide to use the ScalaSTM framework to model the filesystem contents. An STM avoids the risk of deadlocks and race conditions, and ensures good horizontal scalability.

The task of the client program will be to graphically present the contents of the remote filesystem, and communicate with the server. We divide the client program into three layers of functionality. The GUI layer will render the contents of the remote filesystem and register user requests such as button clicks. The client API will replicate the server interface on the client side and communicate with the server. We will use Akka to communicate with the server, but expose the results of remote operations as futures. Finally, the client logic will be a gluing layer, which binds the GUI and the client API together.

The architecture of the ScalaFTP browser is illustrated in the following diagram, in which we indicate which concurrency libraries will be used by the separate layers. The dashed line represents the communication path between the client and the server:

We now start by implementing the ScalaFTP server, relying on the bottom-up design approach. In the next section, we will describe the internals of the filesystem API.

Modeling the filesystem

We used atomic variables and concurrent collections to implement a non-blocking, thread-safe filesystem API, which allowed copying files and retrieving snapshots of the filesystem. In this section, we repeat this task using STM. We will see that it is much more intuitive and less error-prone to use an STM.

We start by defining the different states that a file can be in. The file can be currently created, in the idle state, being copied, or being deleted. We model this with a sealed State trait and its four cases:

sealed trait State
case object Created extends State
case object Idle extends State
case class Copying(n: Int) extends State
case object Deleted extends State

A file can only be deleted if it is in the idle state, and it can only be copied if it is in the idle state or already in the Copying state. Since a file can be copied to multiple destinations at a time, the Copying state encodes how many copies are currently under way. We add the methods inc and dec to the State trait, which return a new state with one more or one fewer copy, respectively. For example, the implementation of inc and dec for the Copying state is as follows:

def inc: State = Copying(n + 1)
def dec: State = if (n > 1) Copying(n - 1) else Idle

Similar to the File class in the java.io package, we represent both files and directories with the same entity, and refer to them more generally as files. Each file is represented by the FileInfo class, which encodes the file's path, its name, its parent directory, the date of the last modification, a Boolean value denoting whether the file is a directory, the size of the file, and its State object. The FileInfo class is immutable, and updating the state of the file requires creating a fresh FileInfo object:

case class FileInfo(path: String, name: String,
  parent: String, modified: String, isDir: Boolean,
  size: Long, state: State)

We separately define the factory methods apply and creating that take a File object and return a FileInfo object in the Idle or Created state, respectively.
Depending on where the server is started, the root of the ScalaFTP directory structure is a different subdirectory in the actual filesystem. A FileSystem object tracks the files in the given rootpath directory, using a transactional map called files:

class FileSystem(val rootpath: String) {
  val files = TMap[String, FileInfo]()
}

We introduce a separate init method to initialize the FileSystem object. The init method starts a transaction, clears the contents of the files map, and traverses the files and directories under rootpath using the Apache Commons IO library. For each file and directory, the init method creates a FileInfo object and adds it to the files map, using its path as the key:

def init() = atomic { implicit txn =>
  files.clear()
  val rootDir = new File(rootpath)
  val all = TrueFileFilter.INSTANCE
  val fileIterator =
    FileUtils.iterateFilesAndDirs(rootDir, all, all).asScala
  for (file <- fileIterator) {
    val info = FileInfo(file)
    files(info.path) = info
  }
}

Recall that the ScalaFTP browser must display the contents of the remote filesystem. To enable directory queries, we first add the getFileList method to the FileSystem class, which retrieves the files in the specified dir directory. The getFileList method starts a transaction and filters the files whose direct parent is equal to dir:

def getFileList(dir: String): Map[String, FileInfo] =
  atomic { implicit txn =>
    files.filter(_._2.parent == dir)
  }

We implement the copying logic in the filesystem API with the copyFile method. This method takes a path to the src source file and the dest destination file, and starts a transaction. After checking whether the dest destination file exists or not, the copyFile method inspects the state of the source file entry, and fails unless the state is Idle or Copying. It then calls inc to create a new state with the increased copy count, and updates the source file entry in the files map with the new state. Similarly, the copyFile method creates a new entry for the destination file in the files map. Finally, the copyFile method calls the afterCommit handler to physically copy the file to disk after the transaction completes. Recall that it is not legal to execute side-effecting operations from within the transaction body, so the private copyOnDisk method is called only after the transaction commits:

def copyFile(src: String, dest: String) = atomic { implicit txn =>
  val srcfile = new File(src)
  val destfile = new File(dest)
  val info = files(src)
  if (files.contains(dest)) sys.error(s"Destination exists.")
  info.state match {
    case Idle | Copying(_) =>
      files(src) = info.copy(state = info.state.inc)
      files(dest) = FileInfo.creating(destfile, info.size)
      Txn.afterCommit { _ => copyOnDisk(srcfile, destfile) }
      src
  }
}

The copyOnDisk method calls the copyFile method on the FileUtils class from the Apache Commons IO library. After the file transfer completes, the copyOnDisk method starts another transaction, in which it decreases the copy count of the source file and sets the state of the destination file to Idle:

private def copyOnDisk(srcfile: File, destfile: File) = {
  FileUtils.copyFile(srcfile, destfile)
  atomic { implicit txn =>
    val ninfo = files(srcfile.getPath)
    files(srcfile.getPath) = ninfo.copy(state = ninfo.state.dec)
    files(destfile.getPath) = FileInfo(destfile)
  }
}

The deleteFile method deletes a file in a similar way.
It changes the file state to Deleted, deletes the file, and starts another transaction to remove the file entry:

def deleteFile(srcpath: String): String = atomic { implicit txn =>
  val info = files(srcpath)
  info.state match {
    case Idle =>
      files(srcpath) = info.copy(state = Deleted)
      Txn.afterCommit { _ =>
        FileUtils.forceDelete(info.toFile)
        files.single.remove(srcpath)
      }
      srcpath
  }
}

Modeling the server data model with the STM allows us to seamlessly add different concurrent computations to the server program. In the next section, we will implement a server actor that uses the filesystem API to execute filesystem operations.

Use STM to model concurrently accessible data, as an STM works transparently with most concurrency frameworks.

Having completed the filesystem API, we now proceed to the server interface layer of the ScalaFTP browser.

The server interface

The server interface comprises a single actor called FTPServerActor. This actor will receive client requests and respond to them serially. If it turns out that the server actor is the sequential bottleneck of the system, we can simply add additional server interface actors to improve horizontal scalability.

We start by defining the different types of messages that the server actor can receive. We follow the convention of defining them inside the companion object of the FTPServerActor class:

object FTPServerActor {
  sealed trait Command
  case class GetFileList(dir: String) extends Command
  case class CopyFile(src: String, dest: String) extends Command
  case class DeleteFile(path: String) extends Command
  def apply(fs: FileSystem) = Props(classOf[FTPServerActor], fs)
}

The actor template of the server actor takes a FileSystem object as a parameter. It reacts to the GetFileList, CopyFile, and DeleteFile messages by calling the appropriate methods from the filesystem API:

class FTPServerActor(fileSystem: FileSystem) extends Actor {
  val log = Logging(context.system, this)
  def receive = {
    case GetFileList(dir) =>
      val filesMap = fileSystem.getFileList(dir)
      val files = filesMap.map(_._2).to[Seq]
      sender ! files
    case CopyFile(srcpath, destpath) =>
      Future {
        Try(fileSystem.copyFile(srcpath, destpath))
      } pipeTo sender
    case DeleteFile(path) =>
      Future {
        Try(fileSystem.deleteFile(path))
      } pipeTo sender
  }
}

When the server receives a GetFileList message, it calls the getFileList method with the specified dir directory, and sends a sequence collection with the FileInfo objects back to the client. Since FileInfo is a case class, it extends the Serializable interface, and its instances can be sent over the network.

When the server receives a CopyFile or DeleteFile message, it calls the appropriate filesystem method asynchronously. The methods in the filesystem API throw exceptions when something goes wrong, so we need to wrap calls to them in Try objects. After the asynchronous file operations complete, the resulting Try objects are piped back as messages to the sender actor, using the Akka pipeTo method.

To start the ScalaFTP server, we need to instantiate and initialize a FileSystem object, and start the server actor. We parse the network port command-line argument, and use it to create an actor system that is capable of remote communication. For this, we use the remotingSystem factory method that we introduced. The remoting actor system then creates an instance of the FTPServerActor.
This is shown in the following program:

object FTPServer extends App {
  val fileSystem = new FileSystem(".")
  fileSystem.init()
  val port = args(0).toInt
  val actorSystem = ch8.remotingSystem("FTPServerSystem", port)
  actorSystem.actorOf(FTPServerActor(fileSystem), "server")
}

The ScalaFTP server actor can run inside the same process as the client application, in another process in the same machine, or on a different machine connected with a network. The advantage of the actor model is that we usually need not worry about where the actor runs until we integrate it into the entire application.

When you need to implement a distributed application that runs on different machines, use an actor framework.

Our server program is now complete, and we can run it with the run command from SBT. We set the actor system to use the port 12345:

run 12345

In the next section, we will implement the file navigation API for the ScalaFTP client, which will communicate with the server interface over the network.

Client navigation API

The client API exposes the server interfaces to the client program through asynchronous methods that return future objects. Unlike the server's filesystem API, which runs locally, the client API methods execute remote network requests. Futures are a natural way to model latency in the client API methods, and to avoid blocking during the network requests.

Internally, the client API maintains an actor instance that communicates with the server actor. The client actor does not know the actor reference of the server actor when it is created. For this reason, the client actor starts in an unconnected state. When it receives the Start message with the URL of the server actor system, the client constructs an actor path to the server actor, sends out an Identify message, and switches to the connecting state. If the actor system is able to find the server actor, the client actor eventually receives the ActorIdentity message with the server actor reference. In this case, the client actor switches to the connected state, and is able to forward commands to the server. Otherwise, the connection fails and the client actor reverts to the unconnected state. The state diagram of the client actor is shown in the following figure:

We define the Start message in the client actor's companion object:

object FTPClientActor {
  case class Start(host: String)
}

We then define the FTPClientActor class and give it an implicit Timeout parameter. The Timeout parameter will be used later in the Akka ask pattern, when forwarding client requests to the server actor. The stub of the FTPClientActor class is as follows:

class FTPClientActor(implicit val timeout: Timeout)
extends Actor

Before defining the receive method, we define behaviors corresponding to different actor states. Once the client actor in the unconnected state receives the Start message with the host string, it constructs an actor path to the server, and creates an actor selection object. The client actor then sends the Identify message to the actor selection, and switches its behavior to connecting. This is shown in the following behavior method, named unconnected:

def unconnected: Actor.Receive = {
  case Start(host) =>
    val serverActorPath =
      s"akka.tcp://FTPServerSystem@$host/user/server"
    val serverActorSel = context.actorSelection(serverActorPath)
    serverActorSel ! Identify(())
    context.become(connecting(sender))
}

The connecting method creates a behavior given an actor reference to the sender of the Start message.
We call this actor reference clientApp, because the ScalaFTP client application will send the Start message to the client actor. Once the client actor receives an ActorIdentity message with the ref reference to the server actor, it can send true back to the clientApp reference, indicating that the connection was successful. In this case, the client actor switches to the connected behavior. Otherwise, if the client actor receives an ActorIdentity message without the server reference, the client actor sends false back to the application, and reverts to the unconnected state:

def connecting(clientApp: ActorRef): Actor.Receive = {
  case ActorIdentity(_, Some(ref)) =>
    clientApp ! true
    context.become(connected(ref))
  case ActorIdentity(_, None) =>
    clientApp ! false
    context.become(unconnected)
}

The connected state uses the serverActor server actor reference to forward the Command messages. To do so, the client actor uses the Akka ask pattern, which returns a future object with the server's response. The contents of the future are piped back to the original sender of the Command message. In this way, the client actor serves as an intermediary between the application, which is the sender, and the server actor. The connected method is shown in the following code snippet:

def connected(serverActor: ActorRef): Actor.Receive = {
  case command: Command =>
    (serverActor ? command).pipeTo(sender)
}

Finally, the receive method returns the unconnected behavior, in which the client actor is created:

def receive = unconnected

Having implemented the client actor, we can proceed to the client API layer. We model it as a trait with a connected value, the concrete methods getFileList, copyFile, and deleteFile, and an abstract host method. The client API creates a private remoting actor system and a client actor. It then instantiates the connected future, which computes the connection status by sending a Start message to the client actor.

The methods getFileList, copyFile, and deleteFile are similar. They use the ask pattern on the client actor to obtain a future with the response. Recall that actor messages are not typed, and the ask pattern returns a Future[Any] object. For this reason, each method in the client API uses the mapTo future combinator to restore the type of the message:

trait FTPClientApi {
  implicit val timeout: Timeout = Timeout(4 seconds)
  private val props = Props(classOf[FTPClientActor], timeout)
  private val system = ch8.remotingSystem("FTPClientSystem", 0)
  private val clientActor = system.actorOf(props)
  def host: String

  val connected: Future[Boolean] = {
    val f = clientActor ? FTPClientActor.Start(host)
    f.mapTo[Boolean]
  }

  def getFileList(d: String): Future[(String, Seq[FileInfo])] = {
    val f = clientActor ? FTPServerActor.GetFileList(d)
    f.mapTo[Seq[FileInfo]].map(fs => (d, fs))
  }

  def copyFile(src: String, dest: String): Future[String] = {
    val f = clientActor ? FTPServerActor.CopyFile(src, dest)
    f.mapTo[Try[String]].map(_.get)
  }

  def deleteFile(srcpath: String): Future[String] = {
    val f = clientActor ? FTPServerActor.DeleteFile(srcpath)
    f.mapTo[Try[String]].map(_.get)
  }
}

Note that the client API does not expose the fact that it uses actors for remote communication. Moreover, the client API is similar to the server API, but the return types of its methods are futures instead of normal values. Futures encode the latency of a method without exposing the cause of the latency, so we often find them at the boundaries between different APIs.
We could internally replace the actor communication between the client and the server with remote Observable objects, but that would not change the client API.

In a concurrent application, use futures at the boundaries of the layers to express latency.

Now that we can programmatically communicate with the remote ScalaFTP server, we turn our attention to the user interface of the client program.

Summary

This article summarized what the different concurrency libraries are good for. You learned how to choose the correct concurrency abstraction to solve a given problem, and how to combine different concurrency abstractions when designing larger concurrent applications.

Resources for Article:

Further resources on this subject:
Creating Java EE Applications [Article]
Differences in style between Java and Scala code [Article]
Integrating Scala, Groovy, and Flex Development with Apache Maven [Article]

Modernizing our Spring Boot app

Packt
26 Nov 2014
15 min read

In this article by Greg L. Turnquist, the author of the book Learning Spring Boot, we will discuss modernizing our Spring Boot app with JavaScript and adding production-ready support features. (For more resources related to this topic, see here.)

Modernizing our app with JavaScript

We just saw that, with a single @Grab statement, Spring Boot automatically configured the Thymeleaf template engine and some specialized view resolvers. We took advantage of Spring MVC's ability to pass attributes to the template through ModelAndView. Instead of figuring out the details of view resolvers, we channeled our efforts into building a handy template to render data fetched from the server. We didn't have to dig through reference docs, Google, and Stack Overflow to figure out how to configure and integrate Spring MVC with Thymeleaf. We let Spring Boot do the heavy lifting. But that's not enough, right? Any real application is going to also have some JavaScript. Love it or hate it, JavaScript is the engine for frontend web development. See how the following code lets us make things more modern by creating modern.groovy:

@Grab("org.webjars:jquery:2.1.1")
@Grab("thymeleaf-spring4")
@Controller
class ModernApp {
    def chapters = ["Quick Start With Groovy",
        "Quick Start With Java",
        "Debugging and Managing Your App",
        "Data Access with Spring Boot",
        "Securing Your App"]

    @RequestMapping("/")
    def home(@RequestParam(value="name", defaultValue="World") String n) {
        new ModelAndView("modern")
            .addObject("name", n)
            .addObject("chapters", chapters)
    }
}

A single @Grab statement pulls in jQuery 2.1.1. The rest of our server-side Groovy code is the same as before. There are multiple ways to use JavaScript libraries. For Java developers, it's especially convenient to use the WebJars project (http://webjars.org), where lots of handy JavaScript libraries are wrapped up with Maven coordinates. Every library is found on the /webjars/<library>/<version>/<module> path. To top it off, Spring Boot comes with prebuilt support. Perhaps you noticed this buried in earlier console outputs:

...
2014-05-20 08:33:09.062 ... : Mapped URL path [/webjars/**] onto handler of [...
...

With jQuery added to our application, we can amp up our template (templates/modern.html) like this:

<html>
<head>
    <title>Learning Spring Boot - Chapter 1</title>
    <script src="webjars/jquery/2.1.1/jquery.min.js"></script>
    <script>
        $(document).ready(function() {
            $('p').animate({
                fontSize: '48px',
            }, "slow");
        });
    </script>
</head>
<body>
    <p th:text="'Hello, ' + ${name}"></p>
    <ol>
        <li th:each="chapter : ${chapters}"
            th:text="${chapter}"></li>
    </ol>
</body>
</html>

What's different between this template and the previous one? It has a couple of extra <script> tags in the head section:

The first one loads jQuery from /webjars/jquery/2.1.1/jquery.min.js (implying that we can also grab jquery.js if we want to debug jQuery)
The second script looks for the <p> element containing our Hello, world! message and then performs an animation that increases the font size to 48 pixels after the DOM is fully loaded into the browser

If we run spring run modern.groovy and visit http://localhost:8080, then we can see this simple but stylish animation. It shows us that all of jQuery is available for us to work with in our application.

Using Bower instead of WebJars

WebJars isn't the only option when it comes to adding JavaScript to our app. More sophisticated UI developers might use Bower (http://bower.io), a popular JavaScript library management tool.
WebJars are useful for Java developers, but not every library has been bundled as a WebJar. There is also a huge community of frontend developers more familiar with Bower and NodeJS who will probably prefer using their standard tool chain to do their jobs. We'll see how to plug that into our app.

First, it's important to know some basic options. Spring Boot supports serving up static web resources from the following paths:

/META-INF/resources/
/resources/
/static/
/public/

To craft a Bower-based app with Spring Boot, we first need to craft a .bowerrc file in the same folder in which we plan to create our Spring Boot CLI application. Let's pick public/ as the folder of choice for JavaScript modules and put it in this file, as shown in the following code:

{
    "directory": "public/"
}

Do I have to use public? No. Again, you can pick any of the folders listed previously and Spring Boot will serve up the code. It's a matter of taste and semantics.

Our first step towards a Bower-based app is to define our project by answering a series of questions (this only has to be done once):

$ bower init
[?] name: app_with_bower
[?] version: 0.1.0
[?] description: Learning Spring Boot - bower sample
[?] main file:
[?] what types of modules does this package expose? amd
[?] keywords:
[?] authors: Greg Turnquist <gturnquist@pivotal.io>
[?] license: ASL
[?] homepage: http://blog.greglturnquist.com/category/learning-spring-boot
[?] set currently installed components as dependencies? No
[?] add commonly ignored files to ignore list? Yes
[?] would you like to mark this package as private which prevents it from
being accidentally published to the registry? Yes
...
[?] Looks good? Yes

Now that we have set up our project, let's do something simple such as install jQuery with the following command:

$ bower install jquery --save
bower jquery#*    cached git://github.com/jquery/jquery.git#2.1.1
bower jquery#*    validate 2.1.1 against git://github.com/jquery/jquery.git#*

These two commands will have created the following bower.json file:

{
    "name": "app_with_bower",
    "version": "0.1.0",
    "authors": [
        "Greg Turnquist <gturnquist@pivotal.io>"
    ],
    "description": "Learning Spring Boot - bower sample",
    "license": "ASL",
    "homepage": "http://blog.greglturnquist.com/category/learning-spring-boot",
    "private": true,
    "ignore": [
        "**/.*",
        "node_modules",
        "bower_components",
        "public/",
        "test",
        "tests"
    ],
    "dependencies": {
        "jquery": "~2.1.1"
    }
}

It will also have installed jQuery 2.1.1 into our app with the following directory structure:

public
└── jquery
    ├── MIT-LICENSE.txt
    ├── bower.json
    └── dist
        ├── jquery.js
        └── jquery.min.js

We must include --save (two dashes) whenever we install a module. This ensures that our bower.json file is updated at the same time, allowing us to rebuild things if needed.

The altered version of our app with WebJars removed should now look like this:

@Grab("thymeleaf-spring4")
@Controller
class ModernApp {
    def chapters = ["Quick Start With Groovy",
        "Quick Start With Java",
        "Debugging and Managing Your App",
        "Data Access with Spring Boot",
        "Securing Your App"]

    @RequestMapping("/")
    def home(@RequestParam(value="name", defaultValue="World") String n) {
        new ModelAndView("modern_with_bower")
            .addObject("name", n)
            .addObject("chapters", chapters)
    }
}

The view name has been changed to modern_with_bower, so it doesn't collide with the previous template if found in the same folder.
This version of the template, templates/modern_with_bower.html, should look like this:

<html>
<head>
    <title>Learning Spring Boot - Chapter 1</title>
    <script src="jquery/dist/jquery.min.js"></script>
    <script>
        $(document).ready(function() {
            $('p').animate({
                fontSize: '48px',
            }, "slow");
        });
    </script>
</head>
<body>
    <p th:text="'Hello, ' + ${name}"></p>
    <ol>
        <li th:each="chapter : ${chapters}"
            th:text="${chapter}"></li>
    </ol>
</body>
</html>

The path to jquery is now jquery/dist/jquery.min.js. The rest is the same as the WebJars example. We just launch the app with spring run modern_with_bower.groovy and navigate to http://localhost:8080. (You might need to refresh the page to ensure that the latest HTML is loaded.) The animation should work just the same.

The options shown in this section can quickly give us a taste of how easy it is to use popular JavaScript tools with Spring Boot. We don't have to fiddle with messy tool chains to achieve a smooth integration. Instead, we can use them the way they are meant to be used.

What about an app that is all frontend with no backend? Perhaps we're building an app that gets all its data from a remote backend. In this age of RESTful backends, it's not uncommon to build a single-page frontend that is fed data updates via AJAX. Spring Boot's Groovy support provides the perfect and arguably smallest way to get started. We do so by creating pure_javascript.groovy, as shown in the following code:

@Controller
class JsApp { }

That doesn't look like much, but it accomplishes a lot. Let's see what this tiny fragment of code actually does for us:

The @Controller annotation, like @RestController, causes Spring Boot to auto-configure Spring MVC
Spring Boot will launch an embedded Apache Tomcat server
Spring Boot will serve up static content from resources, static, and public
Since there are no Spring MVC routes in this tiny fragment of code, things will fall to resource resolution

Next, we can create a static/index.html page as follows:

<html>
Greetings from pure HTML which can, in turn, load JavaScript!
</html>

Run spring run pure_javascript.groovy and navigate to http://localhost:8080. We will see the preceding plain text shown in our browser as expected. There is nothing here but pure HTML being served up by our embedded Apache Tomcat server. This is arguably the lightest way to serve up static content. Use spring jar and it's possible to easily bundle up our client-side app to be installed anywhere.

Spring Boot's support for static HTML, JavaScript, and CSS opens the door to many options. We can add WebJar annotations to JsApp or use Bower to introduce third-party JavaScript libraries in addition to any custom client-side code. We might just manually download the JavaScript and CSS. No matter what option we choose, Spring Boot CLI certainly provides a super simple way to add rich-client power for app development. To top it off, RESTful backends that are decoupled from the frontend can have different iteration cycles as well as different development teams.

You might need to configure CORS (http://spring.io/understanding/CORS) to properly handle making remote calls that don't go back to the original server.

Adding production-ready support features

So far, we have created a Spring MVC app with minimal code. We added views and JavaScript. We are on the verge of a production release.
Before deploying our rapidly built and modernized web application, we might want to think about potential issues that might arise in production:

What do we do when the system administrator wants to configure his monitoring software to ping our app to see if it's up?
What happens when our manager wants to know the metrics of people hitting our app?
What are we going to do when the Ops center supervisor calls us at 2:00 a.m. and we have to figure out what went wrong?

The last feature we are going to introduce in this article is Spring Boot's Actuator module and CRaSH remote shell support (http://www.crashub.org). These two modules provide some super slick, Ops-oriented features that are incredibly valuable in a production environment.

We first need to update our previous code (we'll call it ops.groovy), as shown in the following code:

@Grab("spring-boot-actuator")
@Grab("spring-boot-starter-remote-shell")
@Grab("org.webjars:jquery:2.1.1")
@Grab("thymeleaf-spring4")
@Controller
class OpsReadyApp {
    @RequestMapping("/")
    def home(@RequestParam(value="name", defaultValue="World") String n) {
        new ModelAndView("modern").addObject("name", n)
    }
}

This app is exactly like the WebJars example with two key differences: it adds @Grab("spring-boot-actuator") and @Grab("spring-boot-starter-remote-shell").

When you run this version of our app, the same business functionality is available that we saw earlier, but there are additional HTTP endpoints available:

/autoconfig: This reports what Spring Boot did and didn't auto-configure and why
/beans: This reports all the beans configured in the application context (including ours as well as the ones auto-configured by Boot)
/configprops: This exposes all configuration properties
/dump: This creates a thread dump report
/env: This reports on the current system environment
/health: This is a simple endpoint to check life of the app
/info: This serves up custom content from the app
/metrics: This shows counters and gauges on web usage
/mappings: This gives us details about all Spring MVC routes
/trace: This shows details about past requests

Pinging our app for general health

Each of these endpoints can be visited using our browser or using other tools such as curl. For example, let's assume we ran spring run ops.groovy and then opened up another shell. From the second shell, let's run the following curl command:

$ curl localhost:8080/health
{"status":"UP"}

This immediately solves our first need listed previously. We can inform the system administrator that he or she can write a management script to interrogate our app's health.

Gathering metrics

Be warned that each of these endpoints serves up a compact JSON document. Generally speaking, command-line curl probably isn't the best option. While it's convenient on *nix and Mac systems, the content is dense and hard to read. It's more practical to have:

A JSON plugin installed in our browser (such as JSONView at http://jsonview.com)
A script that uses a JSON parsing library if we're writing a management script (such as Groovy's JsonSlurper at http://groovy.codehaus.org/gapi/groovy/json/JsonSlurper.html or JSONPath at https://code.google.com/p/json-path)

Assuming we have JSONView installed, the following screenshot shows a listing of metrics:

It lists counters for each HTTP endpoint. According to this, /metrics has been visited four times with a successful 200 status code. Someone tried to access /foo, but it failed with a 404 error code.
The report also lists gauges for each endpoint, reporting the last response time. In this case, /metrics took 2 milliseconds. Also included are some memory stats as well as the total CPUs available. It's important to realize that the metrics start at 0. To generate some numbers, you might want to first click on some links before visiting /metrics.

The following screenshot shows a trace report:

It shows the entire web request and response for curl localhost:8080/health. This provides a basic framework of metrics to satisfy our manager's needs. It's important to understand that metrics gathered by Spring Boot Actuator aren't persistent across application restarts. So to gather long-term data, we have to gather them and then write them elsewhere. With these options, we can perform the following:

Write a script that gathers metrics every hour and appends them to a running spreadsheet somewhere else in the filesystem, such as a shared drive. This might be simple, but probably also crude.
To step it up, we can dump the data into a Hadoop filesystem for raw collection and configure Spring XD (http://projects.spring.io/spring-xd/) to consume it.

Spring XD stands for Spring eXtreme Data. It is an open source product that makes it incredibly easy to chain together sources and sinks comprised of many components, such as HTTP endpoints, Hadoop filesystems, Redis metrics, and RabbitMQ messaging. Unfortunately, there is no space to dive into this subject.

With any monitoring, it's important to check that we aren't taxing the system too heavily. The same container responding to business-related web requests is also serving metrics data, so it will be wise to engage profilers periodically to ensure that the whole system is performing as expected.

Detailed management with CRaSH

So what can we do when we receive that 2:00 a.m. phone call from the Ops center? After either coming in or logging in remotely, we can access the convenient CRaSH shell we configured.

Every time the app launches, it generates a random password for SSH access and prints this to the local console:

2014-06-11 23:00:18.822 ... : Configuring property ssh.port=2000 from properties
2014-06-11 23:00:18.823 ... : Configuring property ssh.authtimeout=600000 fro...
2014-06-11 23:00:18.824 ... : Configuring property ssh.idletimeout=600000 fro...
2014-06-11 23:00:18.824 ... : Configuring property auth=simple from properties
2014-06-11 23:00:18.824 ... : Configuring property auth.simple.username=user f...
2014-06-11 23:00:18.824 ... : Configuring property auth.simple.password=bdbe4a...

We can easily see that there's SSH access on port 2000 via a user if we use this information to log in:

$ ssh -p 2000 user@localhost
Password authentication
Password:
(... Spring Boot ASCII-art banner ...)
:: Spring Boot ::  (v1.1.6.RELEASE) on retina
>

There's a fistful of commands:

help: This gets a listing of available commands
dashboard: This gets a graphic, text-based display of all the threads, environment properties, memory, and other things
autoconfig: This prints out a report of which Spring Boot auto-configuration rules were applied and which were skipped (and why)

All of the previous commands have man pages:

> man autoconfig
NAME
       autoconfig - Display auto configuration report from ApplicationContext
SYNOPSIS
       autoconfig [-h | --help]
STREAM
       autoconfig <java.lang.Void, java.lang.Object>
PARAMETERS
       [-h | --help]
           Display this help message
...

There are many commands available to help manage our application. More details are available at http://www.crashub.org/1.3/reference.html.

Summary

In this article, we learned about modernizing our Spring Boot app with JavaScript and adding production-ready support features. We plugged in Spring Boot's Actuator module as well as the CRaSH remote shell, configuring it with metrics, health, and management features so that we can monitor it in production by merely adding two lines of extra code.

Resources for Article:

Further resources on this subject:
Getting Started with Spring Security [Article]
Spring Roo 1.1: Working with Roo-generated Web Applications [Article]
Spring Security 3: Tips and Tricks [Article]

Creating an Apache JMeter™ test workbench

Packt
25 Nov 2014
7 min read

This article is written by Colin Henderson, the author of Mastering GeoServer. It will give you a brief introduction to creating an Apache JMeter™ test workbench. (For more resources related to this topic, see here.)

Before we can get into the nitty-gritty of creating a test workbench for Apache JMeter™, we must download and install it. Apache JMeter™ is a 100 percent Java application, which means that it will run on any platform provided there is a Java 6 or higher runtime environment present. The binaries can be downloaded from http://jmeter.apache.org/download_jmeter.cgi, and at the time of writing, the latest version is 2.11. No installation is required; just download the ZIP file and decompress it to a location you can access from a command-line prompt or shell environment.

To launch JMeter on Linux, simply open a shell and enter the following commands:

$ cd <path_to_jmeter>/bin
$ ./jmeter

To launch JMeter on Windows, simply open a command prompt and enter the following commands:

C:> cd <path_to_jmeter>\bin
C:> jmeter

After a short time, the JMeter GUI should appear, where we can construct our test plan.

For ease and convenience, consider setting your system's PATH environment variable to the location of the JMeter bin directory. In future, you will be able to launch JMeter from the command line without having to cd first.

The JMeter workbench will open with an empty configuration ready for us to construct our test strategy:

The first thing we need to do is give our test plan a name; for now, let's call it GeoServer Stress Test. We can also provide some comments, which is good practice as it will help us remember for what reason we devised the test plan in future.

To demonstrate the use of JMeter, we will create a very simple test plan. In this test plan, we will simulate a certain number of users hitting our GeoServer concurrently and requesting maps. To set this up, we first need to add a Thread Group to our test plan. In a JMeter test, a thread is equivalent to a user:

In the left-hand side menu, we need to right-click on the GeoServer Stress Test node and choose the Add | Threads (Users) | Thread Group menu option. This will add a child node to the test plan that we right-clicked on. The right-hand side panel provides options that we can set for the thread group to control how the user requests are executed. For example, we can name it something meaningful, such as Web Map Requests.

In this test, we will simulate 30 users making map requests over a total duration of 10 minutes, with a 10-second delay between each user starting. The number of users is set by entering a value for Number of Threads; in this case, 30. The Ramp-Up Period option controls the delay in starting each user by specifying the duration in which all the threads must start. So, in our case, we enter a duration of 300 seconds, which means all 30 users will be started by the end of 300 seconds. This equates to a 10-second delay between starting threads (300 / 30 = 10). Finally, we will set a duration for the test to run over by ticking the box for Scheduler, and then specifying a value of 600 seconds for Duration. By specifying a duration value, we override the End Time setting.

Next, we need to provide some basic configuration elements for our test. First, we need to set the default parameters for all web requests. Right-click on the Web Map Requests thread group node that we just created, and then navigate to Add | Config Element | User Defined Variables.
This will add a new node in which we can specify the default HTTP request parameters for our test: In the right-hand side panel, we can specify any number of variables. We can use these as replacement tokens later when we configure the web requests that will be sent during our test run. In this panel, we specify all the standard WMS query parameters that we don't anticipate changing across requests. Taking this approach is a good practice as it means that we can create a mix of tests using the same values, so if we change one, we don't have to change all the different test elements. To execute requests, we need to add Logic Controller. JMeter contains a lot of different logic controllers, but in this instance, we will use Simple Controller to execute a request. To add the controller, right-click on the Web Map Requests node and navigate to Add | Logic Controller | Simple Controller. A simple controller does not require any configuration; it is merely a container for activities we want to execute. In our case, we want the controller to read some data from our CSV file, and then execute an HTTP request to WMS. To do this, we need to add a CSV dataset configuration. Right-click on the Simple Controller node and navigate to Add | Config Element | CSV Data Set Config. The settings for the CSV data are pretty straightforward. The filename is set to the file that we generated previously, containing the random WMS request properties. The path can be specified as relative or absolute. The Variable Names property is where we specify the structure of the CSV file. The Recycle on EOF option is important as it means that the CSV file will be re-read when the end of the file is reached. Finally, we need to set Sharing mode to All threads to ensure the data can be used across threads. Next, we need to add a delay to our requests to simulate user activity; in this case, we will introduce a small delay of 5 seconds to simulate a user performing a map-pan operation. Right-click on the Simple Controller node, and then navigate to Add | Timer | Constant Timer: Simply specify the value we want the thread to be paused for in milliseconds. Finally, we need to add a JMeter sampler, which is the unit that will actually perform the HTTP request. Right-click on the Simple Controller node and navigate to Add | Sampler | HTTP Request. This will add an HTTP Request sampler to the test plan: There is a lot of information that goes into this panel; however, all it does is construct an HTTP request that the thread will execute. We specify the server name or IP address along with the HTTP method to use. The important part of this panel is the Parameters tab, which is where we need to specify all the WMS request parameters. Notice that we used the tokens that we specified in the CSV Data Set Config and WMS Request Defaults configuration components. We use the ${token_name} token, and JMeter replaces the token with the appropriate value of the referenced variable. We configured our test plan, but before we execute it, we need to add some listeners to the plan. A JMeter listener is the component that will gather the information from all of the test runs that occur. We add listeners by right-clicking on the thread group node and then navigating to the Add | Listeners menu option. A list of available listeners is displayed, and we can select the one we want to add. For our purposes, we will add the Graph Results, Generate Summary Results, Summary Report, and Response Time Graph listeners. 
Each listener can have its output saved to a datafile for later review. When completed, our test plan structure should look like the following:

Before executing the plan, we should save it for use later.

Summary

In this article, we looked at how Apache JMeter™ can be used to construct and execute test plans to place loads on our servers so that we can analyze the results and gain an understanding of how well our servers perform.

Resources for Article:

Further resources on this subject:
Geo-Spatial Data in Python: Working with Geometry [article]
Working with Geo-Spatial Data in Python [article]
Getting Started with GeoServer [article]

Decoupling Units with unittest.mock

Packt
24 Nov 2014
27 min read

In this article by Daniel Arbuckle, author of the book Learning Python Testing, you'll learn how, by using the unittest.mock package, you can easily perform the following:

Replace functions and objects in your own code or in external packages.
Control how replacement objects behave. You can control what return values they provide, whether they raise an exception, even whether they make any calls to other functions, or create instances of other objects.
Check whether the replacement objects were used as you expected: whether functions or methods were called the correct number of times, whether the calls occurred in the correct order, and whether the passed parameters were correct.

(For more resources related to this topic, see here.)

Mock objects in general

All right, before we get down to the nuts and bolts of unittest.mock, let's spend a few moments talking about mock objects overall. Broadly speaking, mock objects are any objects that you can use as substitutes in your test code, to keep your tests from overlapping and your tested code from infiltrating the wrong tests. However, like most things in programming, the idea works better when it has been formalized into a well-designed library that you can call on when you need it. There are many such libraries available for most programming languages.

Over time, the authors of mock object libraries have developed two major design patterns for mock objects: in one pattern, you can create a mock object and perform all of the expected operations on it. The object records these operations, and then you put the object into playback mode and pass it to your code. If your code fails to duplicate the expected operations, the mock object reports a failure. In the second pattern, you can create a mock object, do the minimal necessary configuration to allow it to mimic the real object it replaces, and pass it to your code. It records how the code uses it, and then you can perform assertions after the fact to check whether your code used the object as expected. The second pattern is slightly more capable in terms of the tests that you can write using it but, overall, either pattern works well.

Mock objects according to unittest.mock

Python has several mock object libraries; as of Python 3.3, however, one of them has been crowned as a member of the standard library. Naturally, that's the one we're going to focus on. That library is, of course, unittest.mock. The unittest.mock library is of the second sort, a record-actual-use-and-then-assert library. The library contains several different kinds of mock objects that, between them, let you mock almost anything that exists in Python. Additionally, the library contains several useful helpers that simplify assorted tasks related to mock objects, such as temporarily replacing real objects with mocks.

Standard mock objects

The basic element of unittest.mock is the unittest.mock.Mock class. Even without being configured at all, Mock instances can do a pretty good job of pretending to be some other object, method, or function.

There are many mock object libraries for Python; so, strictly speaking, the phrase "mock object" could mean any object that was created by any of these libraries.

Mock objects can pull off this impersonation because of a clever, somewhat recursive trick. When you access an unknown attribute of a mock object, instead of raising an AttributeError exception, the mock object creates a child mock object and returns that.
Since mock objects are pretty good at impersonating other objects, returning a mock object instead of the real value works at least in the common case. Similarly, mock objects are callable; when you call a mock object as a function or method, it records the parameters of the call and then, by default, returns a child mock object. A child mock object is a mock object in its own right, but it knows that it's connected to the mock object it came from—its parent. Anything you do to the child is also recorded in the parent's memory. When the time comes to check whether the mock objects were used correctly, you can use the parent object to check on all of its descendants.

Example: Playing with mock objects in the interactive shell (try it for yourself!):

$ python3.4
Python 3.4.0 (default, Apr  2 2014, 08:10:08)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from unittest.mock import Mock, call
>>> mock = Mock()
>>> mock.x
<Mock name='mock.x' id='140145643647832'>
>>> mock.x
<Mock name='mock.x' id='140145643647832'>
>>> mock.x('Foo', 3, 14)
<Mock name='mock.x()' id='140145643690640'>
>>> mock.x('Foo', 3, 14)
<Mock name='mock.x()' id='140145643690640'>
>>> mock.x('Foo', 99, 12)
<Mock name='mock.x()' id='140145643690640'>
>>> mock.y(mock.x('Foo', 1, 1))
<Mock name='mock.y()' id='140145643534320'>
>>> mock.method_calls
[call.x('Foo', 3, 14), call.x('Foo', 3, 14), call.x('Foo', 99, 12),
 call.x('Foo', 1, 1), call.y(<Mock name='mock.x()' id='140145643690640'>)]
>>> mock.assert_has_calls([call.x('Foo', 1, 1)])
>>> mock.assert_has_calls([call.x('Foo', 1, 1), call.x('Foo', 99, 12)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/unittest/mock.py", line 792, in assert_has_calls
    ) from cause
AssertionError: Calls not found.
Expected: [call.x('Foo', 1, 1), call.x('Foo', 99, 12)]
Actual: [call.x('Foo', 3, 14), call.x('Foo', 3, 14), call.x('Foo', 99, 12),
 call.x('Foo', 1, 1), call.y(<Mock name='mock.x()' id='140145643690640'>)]
>>> mock.assert_has_calls([call.x('Foo', 1, 1),
...     call.x('Foo', 99, 12)], any_order = True)
>>> mock.assert_has_calls([call.y(mock.x.return_value)])

There are several important things demonstrated in this interactive session. First, notice that the same mock object was returned each time that we accessed mock.x. This always holds true: if you access the same attribute of a mock object, you'll get the same mock object back as the result.

The next thing to notice might seem more surprising. Whenever you call a mock object, you get the same mock object back as the return value. The returned mock isn't made new for every call, nor is it unique for each combination of parameters. We'll see how to override the return value shortly but, by default, you get the same mock object back every time you call a mock object. This mock object can be accessed using the return_value attribute name, as you might have noticed from the last statement of the example.

The unittest.mock package contains a call object that helps to make it easier to check whether the correct calls have been made. The call object is callable, and takes note of its parameters in a way similar to mock objects, making it easy to compare it to a mock object's call history. However, the call object really shines when you have to check for calls to descendant mock objects.
As you can see in the previous example, while call('Foo', 1, 1) would match a call to the parent mock object made with these parameters, call.x('Foo', 1, 1) matches a call to the child mock object named x. You can build up a long chain of lookups and invocations. For example:

>>> mock.z.hello(23).stuff.howdy('a', 'b', 'c')
<Mock name='mock.z.hello().stuff.howdy()' id='140145643535328'>
>>> mock.assert_has_calls([
...     call.z.hello().stuff.howdy('a', 'b', 'c')
... ])
>>>

Notice that the original invocation included hello(23), but the call specification wrote it simply as hello(). Each call specification is only concerned with the parameters of the object that was finally called after all of the lookups. The parameters of intermediate calls are not considered. That's okay because they always produce the same return value anyway unless you've overridden that behavior, in which case they probably don't produce a mock object at all.

You might not have encountered an assertion before. Assertions have one job, and one job only: they raise an exception if something is not as expected. The assert_has_calls method, in particular, raises an exception if the mock object's history does not include the specified calls. In our example, the call history matches, so the assertion method doesn't do anything visible.

You can check whether the intermediate calls were made with the correct parameters, though, because the mock object recorded a call immediately to mock.z.hello(23) before it recorded a call to mock.z.hello().stuff.howdy('a', 'b', 'c'):

>>> mock.mock_calls.index(call.z.hello(23))
6
>>> mock.mock_calls.index(call.z.hello().stuff.howdy('a', 'b', 'c'))
7

This also points out the mock_calls attribute that all mock objects carry. If the various assertion functions don't quite do the trick for you, you can always write your own functions that inspect the mock_calls list and check whether things are or are not as they should be. We'll discuss the mock object assertion methods shortly.

Non-mock attributes

What if you want a mock object to give back something other than a child mock object when you look up an attribute? It's easy; just assign a value to that attribute:

>>> mock.q = 5
>>> mock.q
5

There's one other common case where mock objects' default behavior is wrong: what if accessing a particular attribute is supposed to raise an AttributeError? Fortunately, that's easy too:

>>> del mock.w
>>> mock.w
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/unittest/mock.py", line 563, in __getattr__
    raise AttributeError(name)
AttributeError: w

Non-mock return values and raising exceptions

Sometimes, actually fairly often, you'll want mock objects posing as functions or methods to return a specific value, or a series of specific values, rather than returning another mock object.
To make a mock object always return the same value, just change the return_value attribute:

>>> mock.o.return_value = 'Hi'
>>> mock.o()
'Hi'
>>> mock.o('Howdy')
'Hi'

If you want the mock object to return a different value each time it's called, you need to assign an iterable of return values to the side_effect attribute instead, as follows:

>>> mock.p.side_effect = [1, 2, 3]
>>> mock.p()
1
>>> mock.p()
2
>>> mock.p()
3
>>> mock.p()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/unittest/mock.py", line 885, in __call__
    return _mock_self._mock_call(*args, **kwargs)
  File "/usr/lib64/python3.4/unittest/mock.py", line 944, in _mock_call
    result = next(effect)
StopIteration

If you don't want your mock object to raise a StopIteration exception, you need to make sure to give it enough return values for all of the invocations in your test. If you don't know how many times it will be invoked, an infinite iterator such as itertools.count might be what you need. This is easily done:

>>> mock.p.side_effect = itertools.count()

If you want your mock to raise an exception instead of returning a value, just assign the exception object to side_effect, or put it into the iterable that you assign to side_effect:

>>> mock.e.side_effect = [1, ValueError('x')]
>>> mock.e()
1
>>> mock.e()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/unittest/mock.py", line 885, in __call__
    return _mock_self._mock_call(*args, **kwargs)
  File "/usr/lib64/python3.4/unittest/mock.py", line 946, in _mock_call
    raise result
ValueError: x

The side_effect attribute has another use as well, which we'll talk about shortly.

Mocking class or function details

Sometimes, the generic behavior of mock objects isn't a close enough emulation of the object being replaced. This is particularly the case when it's important that they raise exceptions when used improperly, since mock objects are usually happy to accept any usage.

The unittest.mock package addresses this problem using a technique called speccing. If you pass an object into unittest.mock.create_autospec, the returned value will be a mock object, but it will do its best to pretend that it's the same object you passed into create_autospec. This means that it will:

Raise an AttributeError if you attempt to access an attribute that the original object doesn't have, unless you first explicitly assign a value to that attribute
Raise a TypeError if you attempt to call the mock object when the original object wasn't callable
Raise a TypeError if you pass the wrong number of parameters, or pass a keyword parameter that isn't viable, if the original object was callable
Trick isinstance into thinking that the mock object is of the original object's type

Mock objects made by create_autospec share this trait with all of their children as well, which is usually what you want. If you really just want a specific mock to be specced, while its children are not, you can pass the template object into the Mock constructor using the spec keyword.
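The spec keyword itself is not demonstrated in this excerpt, so here is a minimal sketch of the behavior just described. This is our own illustration rather than the book's, and the names template and specced are invented for the example:

from unittest.mock import Mock

# Any real object can serve as the template; an exception instance is handy
# because it has a small, well-known set of attributes.
template = Exception('Bad', 'Wolf')

# spec makes only this one mock mimic the template; its children stay generic.
specced = Mock(spec=template)

print(isinstance(specced, Exception))   # True: isinstance is fooled by the spec

child = specced.args                    # 'args' exists on the template, so this works
print(child.anything_goes)              # the child is unspecced, so any attribute is fine

try:
    specced.no_such_attribute           # not on the template, so this raises
except AttributeError as error:
    print('AttributeError:', error)

With autospeccing, the same strictness also applies to the children, as the following demonstration of create_autospec shows.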
Here's a short demonstration of using create_autospec:

>>> from unittest.mock import create_autospec
>>> x = Exception('Bad', 'Wolf')
>>> y = create_autospec(x)
>>> isinstance(y, Exception)
True
>>> y
<NonCallableMagicMock spec='Exception' id='140440961099088'>

Mocking function or method side effects

Sometimes, for a mock object to successfully take the place of a function or method, it has to actually perform calls to other functions, set variable values, or generally do whatever a function can do. This need is less common than you might think, and it's also somewhat dangerous for testing purposes because, when your mock objects can execute arbitrary code, there's a possibility that they stop being a simplifying tool for enforcing test isolation and become a complex part of the problem instead.

Having said that, there are still times when you need a mocked function to do something more complex than simply returning a value, and we can use the side_effect attribute of mock objects to achieve this. We've seen side_effect before, when we assigned an iterable of return values to it.

If you assign a callable to side_effect, this callable will be called whenever the mock object is called, and it will be passed the same parameters. If the side_effect function raises an exception, that is what the mock object does as well; otherwise, the side_effect return value is returned by the mock object. In other words, if you assign a function to a mock object's side_effect attribute, the mock object in effect becomes that function, with the only important difference being that the mock object still records the details of how it's used.

The code in a side_effect function should be minimal, and should not try to actually do the job of the code the mock object is replacing. All it should do is perform any expected externally visible operations and then return the expected result.

Mock object assertion methods

As we saw in the Standard mock objects section, you can always write code that checks the mock_calls attribute of mock objects to see whether or not things are behaving as they should. However, there are some particularly common checks that have already been written for you, and are available as assertion methods of the mock objects themselves. As is normal for assertions, these assertion methods return None if they pass, and raise an AssertionError if they fail.

The assert_called_with method accepts an arbitrary collection of arguments and keyword arguments, and raises an AssertionError unless these parameters were passed to the mock the last time it was called.

The assert_called_once_with method behaves like assert_called_with, except that it also checks whether the mock was only called once, and raises an AssertionError if that is not true.

The assert_any_call method accepts arbitrary arguments and keyword arguments, and raises an AssertionError if the mock object has never been called with these parameters.

We've already seen the assert_has_calls method. This method accepts a list of call objects, checks whether they appear in the history in the same order, and raises an exception if they do not. Note that extra calls are allowed before and after the sequence you specify, but the calls you list must appear consecutively, in exactly the order given, in the history. This behavior changes if you assign a true value to the any_order argument. In that case, assert_has_calls doesn't care about the order of the calls, and only checks whether they all appear in the history.
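As a quick illustrative sketch (the method names first, second, and third here are our own), the difference the any_order flag makes looks like this:

from unittest.mock import Mock, call

mock = Mock()
mock.first(1)
mock.second(2)
mock.third(3)

# Default behavior: the listed calls must appear consecutively, in this order.
mock.assert_has_calls([call.first(1), call.second(2)])

# With any_order=True, only membership in the history matters.
mock.assert_has_calls([call.third(3), call.first(1)], any_order=True)

# Individual child mocks can also be checked directly.
mock.second.assert_called_once_with(2)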
The assert_not_called method raises an exception if the mock has ever been called.

Mocking containers and objects with a special behavior

One thing the Mock class does not handle is the so-called magic methods that underlie Python's special syntactic constructions: __getitem__, __add__, and so on. If you need your mock objects to record and respond to magic methods (in other words, if you want them to pretend to be container objects such as dictionaries or lists, respond to mathematical operators, act as context managers, or do any of the other things where syntactic sugar translates into a method call underneath), you're going to use unittest.mock.MagicMock to create your mock objects.

There are a few magic methods that are not supported even by MagicMock, due to details of how they (and mock objects) work: __getattr__, __setattr__, __init__, __new__, __prepare__, __instancecheck__, __subclasscheck__, and __del__.

Here's a simple example in which we use MagicMock to create a mock object supporting the in operator:

>>> from unittest.mock import MagicMock
>>> mock = MagicMock()
>>> 7 in mock
False
>>> mock.mock_calls
[call.__contains__(7)]
>>> mock.__contains__.return_value = True
>>> 8 in mock
True
>>> mock.mock_calls
[call.__contains__(7), call.__contains__(8)]

Things work similarly with the other magic methods. For example, addition:

>>> mock + 5
<MagicMock name='mock.__add__()' id='140017311217816'>
>>> mock.mock_calls
[call.__contains__(7), call.__contains__(8), call.__add__(5)]

Notice that the return value of the addition is a mock object, a child of the original mock object, but the in operator returned a Boolean value. Python ensures that some magic methods return a value of a particular type, and will raise an exception if that requirement is not fulfilled. In these cases, MagicMock's implementations of the methods return a best-guess value of the proper type, instead of a child mock object.

There's something you need to be careful of when it comes to the in-place mathematical operators, such as += (__iadd__) and |= (__ior__): MagicMock handles them somewhat strangely. What it does is still useful, but it might well catch you by surprise:

>>> mock += 10
>>> mock.mock_calls
[]

What was that? Did it erase our call history? Fortunately, no, it didn't. What it did was assign the child mock created by the addition operation to the variable called mock. That is entirely in accordance with how the in-place math operators are supposed to work. Unfortunately, it has still cost us our ability to access the call history, since we no longer have a variable pointing at the parent mock object.

Make sure that you have the parent mock object set aside in a variable that won't be reassigned if you're going to be checking in-place math operators. Also, you should make sure that your mocked in-place operators return the result of the operation, even if that just means return self.return_value, because otherwise Python will assign None to the left-hand variable.

There's another detail of how in-place operators work that you should keep in mind:

>>> mock = MagicMock()
>>> x = mock
>>> x += 5
>>> x
<MagicMock name='mock.__iadd__()' id='139845830142216'>
>>> x += 10
>>> x
<MagicMock name='mock.__iadd__().__iadd__()' id='139845830154168'>
>>> mock.mock_calls
[call.__iadd__(5), call.__iadd__().__iadd__(10)]

Because the result of the operation is assigned to the original variable, a series of in-place math operations builds up a chain of child mock objects. If you think about it, that's the right thing to do, but it is rarely what people expect at first.
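As a small sketch of the earlier advice about setting the parent aside (the variable names here are our own), keeping a second, stable name for the parent mock preserves access to the history:

from unittest.mock import MagicMock

parent = MagicMock()   # stable reference that we never reassign
counter = parent       # this is the name the in-place operators will rebind
counter += 5
counter += 10

# counter now points at a child mock, but the full history is still
# reachable through the name we set aside.
print(parent.mock_calls)
print(counter is parent)   # False: += rebound counter to a child mock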
Mock objects for properties and descriptors

There's another category of things that basic Mock objects don't do a good job of emulating: descriptors.

Descriptors are objects that allow you to interfere with the normal variable access mechanism. The most commonly used descriptors are created by Python's property built-in function, which simply allows you to write functions to control getting, setting, and deleting a variable.

To mock a property (or other descriptor), create a unittest.mock.PropertyMock instance and assign it to the property name. The only complication is that you can't assign a descriptor to an object instance; you have to assign it to the object's type, because descriptors are looked up in the type without first checking the instance. That's not hard to do with mock objects, fortunately:

>>> from unittest.mock import PropertyMock
>>> mock = Mock()
>>> prop = PropertyMock()
>>> type(mock).p = prop
>>> mock.p
<MagicMock name='mock()' id='139845830215328'>
>>> mock.mock_calls
[]
>>> prop.mock_calls
[call()]
>>> mock.p = 6
>>> prop.mock_calls
[call(), call(6)]

The thing to be mindful of here is that the property is not a child of the object named mock. Because of this, we have to keep it around in its own variable, because otherwise we'd have no way of accessing its history. PropertyMock objects record variable lookup as a call with no parameters, and variable assignment as a call with the new value as a parameter.

You can use a PropertyMock object if you actually need to record variable accesses in your mock object history. Usually you don't need to do that, but the option exists.

Even though you set a property by assigning it to an attribute of a type, you don't have to worry about having your PropertyMock objects bleed over into other tests. Each Mock you create has its own type object, even though they all claim to be of the same class:

>>> type(Mock()) is type(Mock())
False

Thanks to this feature, any changes that you make to a mock object's type object are unique to that specific mock object.

Mocking file objects

It's likely that you'll occasionally need to replace a file object with a mock object. The unittest.mock library helps you with this by providing mock_open, which is a factory for fake open functions. These functions have the same interface as the real open function, but they return a mock object that's been configured to pretend that it's an open file object. This sounds more complicated than it is. See for yourself:

>>> from unittest.mock import mock_open
>>> open = mock_open(read_data = 'moose')
>>> with open('/fake/file/path.txt', 'r') as f:
...     print(f.read())
...
moose

If you pass a string value to the read_data parameter, the mock file object that eventually gets created will use that value as the data source when its read methods get called. As of Python 3.4.0, read_data only supports string objects, not bytes. If you don't pass read_data, read method calls will return an empty string.

The problem with the previous code is that it makes the real open function inaccessible, and leaves a mock object lying around where other tests might stumble over it. Read on to see how to fix these problems.

Replacing real code with mock objects

The unittest.mock library gives us a very nice tool for temporarily replacing objects with mock objects, and then undoing the change when our test is done. This tool is unittest.mock.patch.
There are a lot of different ways in which patch can be used: it works as a context manager, a function decorator, and a class decorator; additionally, it can create a mock object to use for the replacement, or it can use a replacement object that you specify. There are a number of other optional parameters that can further adjust the behavior of patch. Basic usage is easy:

>>> from unittest.mock import patch, mock_open
>>> with patch('builtins.open', mock_open(read_data = 'moose')) as mock:
...     with open('/fake/file.txt', 'r') as f:
...         print(f.read())
...
moose
>>> open
<built-in function open>

As you can see, patch dropped the mock open function created by mock_open over the top of the real open function; then, when we left the context, it put the original back for us automatically.

The first parameter of patch is the only one that is required. It is a string describing the absolute path to the object to be replaced. The path can have any number of package and subpackage names, but it must include the module name and the name of the object inside the module that is being replaced. If the path is incorrect, patch will raise an ImportError, TypeError, or AttributeError, depending on what exactly is wrong with the path.

If you don't want to worry about making a mock object to be the replacement, you can just leave that parameter off:

>>> import io
>>> with patch('io.BytesIO'):
...     x = io.BytesIO(b'ascii data')
...     io.BytesIO.mock_calls
[call(b'ascii data')]

The patch function creates a new MagicMock for you if you don't tell it what to use for the replacement object. This usually works pretty well, but you can pass the new parameter (also the second parameter, as we used it in the first example of this section) to specify that the replacement should be a particular object, or you can pass the new_callable parameter to make patch use the value of that parameter to create the replacement object.

We can also force patch to use create_autospec to make the replacement object, by passing autospec=True:

>>> with patch('io.BytesIO', autospec = True):
...     io.BytesIO.melvin
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib64/python3.4/unittest/mock.py", line 557, in __getattr__
    raise AttributeError("Mock object has no attribute %r" % name)
AttributeError: Mock object has no attribute 'melvin'

The patch function will normally refuse to replace an object that does not exist; however, if you pass it create=True, it will happily drop a mock object wherever you like. Naturally, this is not compatible with autospec=True.

The patch function covers the most common cases. There are a few related functions that handle less common but still useful cases.

The patch.object function does the same thing as patch, except that, instead of taking a path string, it accepts an object and an attribute name as its first two parameters. Sometimes this is more convenient than figuring out the path to an object. Many objects don't even have valid paths (for example, objects that exist only in a function's local scope), although the need to patch them is rarer than you might think.

The patch.dict function temporarily drops one or more objects into a dictionary under specific keys. The first parameter is the target dictionary; the second is a dictionary from which to get the key and value pairs to put into the target. If you pass clear=True, the target will be emptied before the new values are inserted. Notice that patch.dict doesn't create the replacement values for you. You'll need to make your own mock objects if you want them.
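As a rough sketch of those two helpers (the Service class, the config dictionary, and the key names here are our own, not from the text):

from unittest.mock import Mock, patch

class Service:
    def fetch(self):
        return 'real data'

# patch.object takes the object and the attribute name instead of a path string.
with patch.object(Service, 'fetch', return_value='fake data'):
    assert Service().fetch() == 'fake data'
assert Service().fetch() == 'real data'

# patch.dict drops values we supply into a dictionary, then restores it afterwards.
config = {'debug': False}
with patch.dict(config, {'debug': True, 'client': Mock(name='client')}):
    assert config['debug'] is True
    assert 'client' in config
assert config == {'debug': False}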
Mock objects in action

That was a lot of theory, interspersed with unrealistic examples. Let's take what we've learned and apply it to get a more realistic view of how these tools can help us.

Better PID tests

The PID tests suffered mostly from having to do a lot of extra work to patch and unpatch time.time, and had some difficulty breaking the dependence on the constructor.

Patching time.time

Using patch, we can remove a lot of the repetitiveness of dealing with time.time; this means that it's less likely that we'll make a mistake somewhere, and it saves us from spending time on something that's kind of boring and annoying. All of the tests can benefit from similar changes:

>>> from unittest.mock import Mock, patch
>>> with patch('time.time', Mock(side_effect = [1.0, 2.0, 3.0, 4.0, 5.0])):
...     import pid
...     controller = pid.PID(P = 0.5, I = 0.5, D = 0.5, setpoint = 0,
...                          initial = 12)
...     assert controller.gains == (0.5, 0.5, 0.5)
...     assert controller.setpoint == [0.0]
...     assert controller.previous_time == 1.0
...     assert controller.previous_error == -12.0
...     assert controller.integrated_error == 0.0

Apart from using patch to handle time.time, this test has been changed so that we now use assert to check whether things are correct, instead of having doctest compare the values directly. There's hardly any difference between the two approaches, except that we can place the assert statements inside the context managed by patch.

Decoupling from the constructor

Using mock objects, we can finally separate the tests for the PID methods from the constructor, so that mistakes in the constructor cannot affect the outcome:

>>> with patch('time.time', Mock(side_effect = [2.0, 3.0, 4.0, 5.0])):
...     pid = imp.reload(pid)
...     mock = Mock()
...     mock.gains = (0.5, 0.5, 0.5)
...     mock.setpoint = [0.0]
...     mock.previous_time = 1.0
...     mock.previous_error = -12.0
...     mock.integrated_error = 0.0
...     assert pid.PID.calculate_response(mock, 6) == -3.0
...     assert pid.PID.calculate_response(mock, 3) == -4.5
...     assert pid.PID.calculate_response(mock, -1.5) == -0.75
...     assert pid.PID.calculate_response(mock, -2.25) == -1.125

What we've done here is set up a mock object with the proper attributes and pass it into calculate_response as the self parameter. We could do this because we didn't create a PID instance at all. Instead, we looked up the method's function inside the class and called it directly, allowing us to pass whatever we wanted as the self parameter, instead of having Python's automatic mechanisms handle it. Never invoking the constructor means that we're immune to any errors it might contain, and it guarantees that the object state is exactly what we expect in our calculate_response test.

Summary

In this article, we've learned about a family of objects that specialize in impersonating other classes, objects, methods, and functions. We've seen how to configure these objects to handle corner cases where their default behavior isn't sufficient, and we've learned how to examine the activity logs that these mock objects keep, so that we can decide whether the objects are being used properly or not.

Resources for Article:

Further resources on this subject:

Installing NumPy, SciPy, matplotlib, and IPython [Article]
Machine Learning in IPython with scikit-learn [Article]
Python 3: Designing a Tasklist Application [Article]
Function passing

Packt
19 Nov 2014
6 min read
In this article by Simon Timms, the author of the book Mastering JavaScript Design Patterns, we will cover function passing. In functional programming languages, functions are first-class citizens. Functions can be assigned to variables and passed around just like any other variable. This is not entirely a foreign concept. Even languages such as C have function pointers that can be treated just like other variables. C# has delegates and, in more recent versions, lambdas. The latest release of Java has also added support for lambdas, as they have proven to be so useful.

(For more resources related to this topic, see here.)

JavaScript allows for functions to be treated as variables and even as objects and strings. In this way, JavaScript is functional in nature.

Because of JavaScript's single-threaded nature, callbacks are a common convention and you can find them pretty much everywhere. Consider calling a function at a later date on a web page. This is done by setting a timeout on the window object as follows:

setTimeout(function(){ alert("Hello from the past"); }, 5 * 1000);

The arguments for the setTimeout function are a function to call and a time to delay in milliseconds.

No matter the JavaScript environment in which you're working, it is almost impossible to avoid functions in the shape of callbacks. The asynchronous processing model of Node.js is highly dependent on being able to call a function and pass in something to be completed at a later date. Making calls to external resources in a browser is also dependent on a callback to notify the caller that some asynchronous operation has completed. In basic JavaScript, this looks like the following code:

var xmlhttp = new XMLHttpRequest();
xmlhttp.onreadystatechange = function() {
  if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
    //process returned data
  }
};
xmlhttp.open("GET", "http://some.external.resource", true);
xmlhttp.send();

You may notice that we assign onreadystatechange before we even send the request. This is because assigning it later may result in a race condition in which the server responds before the function is attached to the ready state change. In this case, we've used an inline function to process the returned data. Because functions are first-class citizens, we can change this to look like the following code:

var xmlhttp;

function requestData() {
  xmlhttp = new XMLHttpRequest();
  xmlhttp.onreadystatechange = processData;
  xmlhttp.open("GET", "http://some.external.resource", true);
  xmlhttp.send();
}

function processData() {
  if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
    //process returned data
  }
}

This is typically a cleaner approach and avoids performing complex processing in line with another function. However, you might be more familiar with the jQuery version of this, which looks something like this:

$.getJSON('http://some.external.resource', function(json){
  //process returned data
});

In this case, the boilerplate of dealing with ready state changes is handled for you. There is even convenience provided for you should the request for data fail, with the following code:

$.ajax('http://some.external.resource', {
  success: function(json){
    //process returned data
  },
  error: function(){
    //process failure
  },
  dataType: "json"
});

In this case, we've passed an object into the ajax call, which defines a number of properties. Amongst these properties are function callbacks for success and failure. This method of passing numerous functions into another suggests a great way of providing expansion points for classes.
Likely, you've seen this pattern in use before without even realizing it. Passing functions into constructors as part of an options object is a commonly used approach to providing extension hooks in JavaScript libraries.

Implementation

In Westeros, the tourism industry is almost nonexistent. There are great difficulties with bandits killing tourists and tourists becoming entangled in regional conflicts. Nonetheless, some enterprising folks have started to advertise a grand tour of Westeros in which they will take those with the means on a tour of all the major attractions. From King's Landing to the Eyrie, to the great mountains of Dorne, the tour will cover it all. In fact, a rather mathematically inclined member of the tourism board has taken to calling it a Hamiltonian tour, as it visits everywhere once.

The HamiltonianTour class takes an options object that defines the various points to which a callback can be attached. In our case, the interface for it would look something like the following code:

export class HamiltonianTourOptions {
  onTourStart: Function;
  onEntryToAttraction: Function;
  onExitFromAttraction: Function;
  onTourCompletion: Function;
}

The full HamiltonianTour class looks like the following code:

var HamiltonianTour = (function () {
  function HamiltonianTour(options) {
    this.options = options;
  }
  HamiltonianTour.prototype.StartTour = function () {
    if (this.options.onTourStart && typeof (this.options.onTourStart) === "function")
      this.options.onTourStart();
    this.VisitAttraction("King's Landing");
    this.VisitAttraction("Winterfell");
    this.VisitAttraction("Mountains of Dorne");
    this.VisitAttraction("Eyrie");
    if (this.options.onTourCompletion && typeof (this.options.onTourCompletion) === "function")
      this.options.onTourCompletion();
  };
  HamiltonianTour.prototype.VisitAttraction = function (AttractionName) {
    if (this.options.onEntryToAttraction && typeof (this.options.onEntryToAttraction) === "function")
      this.options.onEntryToAttraction(AttractionName);
    //do whatever one does at an attraction
    if (this.options.onExitFromAttraction && typeof (this.options.onExitFromAttraction) === "function")
      this.options.onExitFromAttraction(AttractionName);
  };
  return HamiltonianTour;
})();

You can see in the if statements how we check the options and then execute each callback as needed. Using the class is as simple as the following code:

var tour = new HamiltonianTour({
  onEntryToAttraction: function(cityname){
    console.log("I'm delighted to be in " + cityname);
  }
});
tour.StartTour();

The output of the preceding code will be:

I'm delighted to be in King's Landing
I'm delighted to be in Winterfell
I'm delighted to be in Mountains of Dorne
I'm delighted to be in Eyrie

Summary

In this article, we have learned about function passing. Passing functions is a great approach to solving a number of problems in JavaScript, and it tends to be used extensively by libraries such as jQuery and frameworks such as Express. It is so commonly adopted that using it adds no barriers to your code's readability.

Resources for Article:

Further resources on this subject:

Creating Java EE Applications [article]
Meteor.js JavaScript Framework: Why Meteor Rocks! [article]
Dart with JavaScript [article]