

Playing with Swift

Packt
23 Dec 2014
23 min read
Xcode ships with both a command-line interpreter and a graphical interface called the playground, which can be used to prototype and test Swift code snippets. Code typed into the playground is compiled and executed interactively, which permits a fluid style of development. In addition, the user interface can present a graphical view of variables as well as a timeline, which can show how loops are executed. Finally, playgrounds can mix and match code and documentation, which makes it possible to provide example code as playgrounds and to use playgrounds to learn how to use existing APIs and frameworks. This article by Alex Blewitt, the author of Swift Essentials, will present the following topics:

- How to create a playground
- Displaying values in the timeline
- Presenting objects with Quick Look
- Running asynchronous code
- Using playground live documentation
- Generating playgrounds with Markdown and AsciiDoc
- Limitations of playgrounds

(For more resources related to this topic, see here.)

Getting started with playgrounds

When Xcode is started, a welcome screen is shown with various options, including the ability to create a playground. Playgrounds can also be created from the File | New | Playground menu.

Creating a playground

Using either the Xcode welcome screen (which can be opened by navigating to Window | Welcome to Xcode) or navigating to File | New | Playground, create MyPlayground in a suitable location targeting iOS. Creating the playground on the Desktop will allow easy access to test Swift code, but it can be located anywhere on the filesystem. Playgrounds can be targeted either towards OS X applications or towards iOS applications. This can be configured when the playground is created, or by switching to the Utilities view by navigating to View | Utilities | Show File Inspector or pressing Command + Option + 1 and changing the dropdown from OS X to iOS or vice versa.

When initially created, the playground will have a code snippet that looks as follows:

// Playground - noun: a place where people can play

import UIKit

var str = "Hello, playground"

Playgrounds targeting OS X will read import Cocoa instead. On the right-hand side, a column will show the value of the code when each line is executed. In the previous example, the word Hello, playgr... is seen, which is the result of the string assignment. By grabbing the vertical divider between the Swift code and the output, the output can be resized to show the full text message. Alternatively, by moving the mouse over the right-hand side of the playground, the Quick Look icon (the eye symbol) will appear; if clicked on, a pop-up box will show the full details.

Viewing the console output

The console output can be viewed on the right-hand side by opening the Assistant Editor. This can be opened by pressing Command + Option + Enter or by navigating to View | Assistant Editor | Show Assistant Editor. This will show the result of any println statements executed in the code. Add a simple for loop to the playground and show the Assistant Editor:

for i in 1...12 {
    println("I is \(i)")
}

The output is shown on the right-hand side. The Assistant Editor can be configured to be displayed in different locations, such as at the bottom, or stacked horizontally or vertically, by navigating to the View | Assistant Editor menu.

Viewing the timeline

The timeline shows what other values are displayed as a result of executing the code. In the case of the print loop shown previously, the output was displayed as Console Output in the timeline.
However, it is possible to use the playground to inspect the value of an expression on a line, without having to display it directly. In addition, results can be graphed to show how the values change over time. Add another line above the println statement to calculate the result of executing an expression, (i-6)*(i-7), and store it in a variable, j:

for i in 1...12 {
    var j = (i-7) * (i-6)
    println("I is \(i)")
}

On the line next to the variable definition, click on the add variable history symbol (+), which is in the right-hand column (visible when the mouse moves over that area). After it is clicked on, it will change to a (o) symbol and display the graph on the right-hand side. The same can be done for the println statement as well. The slider at the bottom, indicated by the red tick mark, can be used to slide the vertical bar to see the exact value at certain points.

To show several values at once, use additional variables to hold the values and display them in the timeline as well:

for i in 1...12 {
    var j = (i-7) * (i-6)
    var k = i
    println("I is \(i)")
}

When the slider is dragged, both values will be shown at the same time.

Displaying objects with Quick Look

The playground timeline can display objects as well as numbers and simple strings. It is possible to load and view images in a playground using classes such as UIImage (or NSImage on OS X). These are known as Quick Look supported objects, and by default include:

- Strings (attributed and unattributed)
- Views
- Class and struct types (members are shown)
- Colors

It is possible to build support for custom types in Swift by implementing a debugQuickLookObject method that returns a graphical view of the data (a minimal sketch of such a method appears below).

Showing colored labels

To show a colored label, a color needs to be obtained first. When building against iOS, this will be UIColor; but when building against OS X, it will be NSColor. The methods and types are largely equivalent between the two, but this article will focus on the iOS types. A color can be acquired with an initializer or by using one of the predefined colors that are exposed in Swift using methods:

import UIKit // AppKit for OS X

let blue = UIColor.blueColor() // NSColor.blueColor() for OS X

The color can be used in a UILabel, which displays a text string in a particular size and color. The UILabel needs a size, which is represented by a CGRect, and can be defined with an x and y position along with a width and height. The x and y positions are not relevant for playgrounds and so can be left as zero:

let size = CGRect(x: 0, y: 0, width: 200, height: 100)
let label = UILabel(frame: size) // NSTextField for OS X

Finally, the text needs to be displayed in blue and with a larger font size:

label.text = str // from the first line of the code
label.textColor = blue
label.font = UIFont.systemFontOfSize(24) // NSFont for OS X

When the playground is run, the color and font are shown in the timeline and available for Quick Look. Even though the same UILabel instance is being shown, the timeline and the Quick Look values show a snapshot of the state of the object at each point, making it easy to see what has happened between changes.

Showing images

Images can be created and loaded into a playground using the UIImage constructor (or NSImage on OS X). Both take a named argument, which is used to find an image with the given name from the playground's Resources folder.
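As an aside, the following is a minimal sketch of the custom Quick Look support mentioned earlier. It assumes the Swift 1.x/Xcode 6 environment used in this article; the Temperature class and its property are hypothetical, and the exact hook Xcode uses for custom types may vary between Xcode versions.

import UIKit

class Temperature: NSObject {
    let celsius: Double

    init(celsius: Double) {
        self.celsius = celsius
        super.init()
    }

    // Quick Look calls this method, if present, and renders whatever
    // supported value it returns (a string, color, image, and so on).
    func debugQuickLookObject() -> AnyObject? {
        return "Temperature: \(celsius) C"
    }
}

let freezing = Temperature(celsius: 0.0)
// Clicking the Quick Look (eye) icon next to this line should show the
// string returned above instead of the raw object description.

With that aside covered, the next step is to add an image to the playground's Resources folder.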
To download a logo, open Terminal.app and run the following commands:

$ mkdir MyPlayground.playground/Resources
$ curl http://alblue.bandlem.com/images/AlexHeadshotLeft.png > MyPlayground.playground/Resources/logo.png

An image can now be loaded in Swift with:

let logo = UIImage(named:"logo")

The location of the Resources associated with a playground can be seen in the File Inspector utilities view, which can be opened by pressing Command + Option + 1. The loaded image can be displayed using Quick Look or by adding it to the value history.

It is possible to use a URL to acquire an image by creating an NSURL with NSURL(string:"http://..."), then loading the contents of the URL with NSData(contentsOfURL:), and finally using UIImage(data:) to convert it to an image. However, as Swift will keep re-executing the code over and over again, the URL will be hit multiple times in a single debugging session without caching. It is recommended that NSData(contentsOfURL:) and similar networking classes be avoided in playgrounds.

Advanced techniques

The playground has its own framework, XCPlayground, which can be used to perform certain tasks. For example, individual values can be captured during loops for later analysis. It also permits asynchronous code to continue to execute once the playground has finished running.

Capturing values explicitly

It is possible to explicitly add values to the timeline by importing the XCPlayground framework and calling XCPCaptureValue with a value that should be displayed in the timeline. This takes an identifier, which is used both as the title and for grouping related data values in the same series. When the value history button is selected, it essentially inserts a call to XCPCaptureValue with the value of the expression as the identifier. For example, to add the logo to the timeline automatically:

import XCPlayground

XCPCaptureValue("logo", logo)

It is possible to use an identifier to group the data that is being shown in a loop, with the identifier representing categories of the values. For example, to display a list of all even and odd numbers between 1 and 6, the following code could be used:

for n in 1...6 {
    if n % 2 == 0 {
        XCPCaptureValue("even", n)
        XCPCaptureValue("odd", 0)
    } else {
        XCPCaptureValue("odd", n)
        XCPCaptureValue("even", 0)
    }
}

The result, when executed, will look as follows:

Running asynchronous code

By default, when the execution hits the bottom of the playground, the execution stops. In most cases, this is desirable, but when asynchronous code is involved, execution might need to run even if the main code has finished executing. This might be the case if networking data is involved or if there are multiple tasks whose results need to be synchronized. For example, wrapping the previous even/odd split in an asynchronous call will result in no data being displayed:

dispatch_async(dispatch_get_main_queue()) {
    for n in 1...6 {
        // as before
    }
}

This uses one of Swift's language features: the dispatch_async method is actually a two-argument method that takes a queue and a block type. However, if the last argument is a block type, then it can be represented as a trailing closure rather than an argument. To allow the playground to continue executing after reaching the bottom, add the following call:

XCPSetExecutionShouldContinueIndefinitely()

Although this suggests that the execution will run forever, it is limited to 30 seconds of runtime, or whatever value is displayed at the bottom-right corner of the screen.
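Putting the pieces of this section together, the following is a minimal sketch of how the even/odd capture can be combined with dispatch_async and XCPSetExecutionShouldContinueIndefinitely. It assumes the same Swift 1.x/Xcode 6.1 playground environment used throughout this article.

import UIKit
import XCPlayground

// Let the playground keep running after its last line so that the
// asynchronous block below has time to execute (subject to the timeout
// shown at the bottom-right corner of the window).
XCPSetExecutionShouldContinueIndefinitely()

// The trailing closure is the block argument passed to dispatch_async.
dispatch_async(dispatch_get_main_queue()) {
    for n in 1...6 {
        if n % 2 == 0 {
            XCPCaptureValue("even", n)
            XCPCaptureValue("odd", 0)
        } else {
            XCPCaptureValue("odd", n)
            XCPCaptureValue("even", 0)
        }
    }
}

With this in place, the even and odd series should appear in the timeline shortly after the playground reaches its last line, rather than not at all.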
This timeout can be changed by typing in a new value or using the + and – buttons to increase/decrease the time by one second.

Playgrounds and documentation

Playgrounds can contain a mix of code and documentation. This allows a set of code samples and explanations to be mixed in with the playground itself. Although there is no way of using Xcode to add sections in the UI at present, the playground itself is an XML file that can be edited using an external text editor such as TextEdit.app.

Learning with playgrounds

As playgrounds can contain a mixture of code and documentation, they are an ideal format for viewing annotated code snippets. In fact, Apple's Swift Tour book can be opened as a playground file. Xcode documentation can be searched by navigating to the Help | Documentation and API Reference menu, or by pressing Command + Shift + 0. In the search field that is presented, type Swift Tour and then select the first result. The Swift Tour book should be presented in Xcode's help system.

A link to download and open the documentation as a playground is given in the first section; if this is downloaded, it can be opened in Xcode as a standalone playground. This provides the same information, but allows the code examples to be dynamic and show the results in the window.

A key advantage of learning through playground-based documentation is that the code can be experimented with. In the Simple Values section of the documentation, where myVariable is assigned, the right-hand side of the playground shows the values. If the literal numbers are changed, the new values will be recalculated and shown on the right-hand side. Some examples are presented solely in playground form; for example, the Balloons demo, which was used in the introduction of Swift in the WWDC 2014 keynote, is downloadable as a playground from https://developer.apple.com/swift/resources/. Note that the Balloons playground requires OS X 10.10 and Xcode 6.1 to run.

Understanding the playground format

A playground is an OS X bundle, which means that it is a directory that looks like a single file. If a playground is selected either in TextEdit.app or in Finder, then it looks like a regular file. Under the covers, it is actually a directory:

$ ls -F
MyPlayground.playground/

Inside the directory, there are a number of files:

$ ls -1 MyPlayground.playground/*
MyPlayground.playground/Resources
MyPlayground.playground/contents.xcplayground
MyPlayground.playground/section-1.swift
MyPlayground.playground/timeline.xctimeline

The files are as follows:

- The Resources directory, which was created earlier to hold the logo image
- The contents.xcplayground file, which is an XML table of contents of the files that make up the playground
- The section-1.swift file, which is the Swift file created by default when a new playground is created, and contains the code that is typed in for any new playground content
- The timeline.xctimeline file, which is an automatically generated file containing timestamps of execution, which the runtime generates when executing a Swift file and the timeline is open

The table of contents file defines which runtime environment is being targeted (for example, iOS or OS X), a list of sections, and a reference to the timeline file:

<playground version='3.0' sdk='iphonesimulator'>
  <sections>
    <code source-file-name='section-1.swift'/>
  </sections>
  <timeline fileName='timeline.xctimeline'/>
</playground>

This file can be edited to add new sections, provided that it is not open in Xcode at the same time.
An Xcode playground directory is deleted and recreated whenever changes are made in Xcode. Any Terminal.app windows open in that directory will no longer show any files. As a result, using external tools and editing the files in place might result in changes being lost. In addition, if you are using ancient version control systems, such as SVN and CVS, you might find your version control metadata being wiped out between saves. Xcode ships with the industry-standard Git version control system, which should be preferred instead.

Adding a new documentation section

To add a new documentation section, ensure that the playground is not open in Xcode and then edit the contents.xcplayground file. The file itself can be opened by right-clicking on the playground in Finder and choosing Show Package Contents. This will open up a new Finder window, with the contents displayed as a top-level set of elements. The individual files can then be opened for editing by right-clicking on the contents.xcplayground file, choosing Open With | Other..., and selecting an application, such as TextEdit.app.

Alternatively, the file can be edited from the command line using an editor such as pico, vi, or emacs. Although there are few technology debates more contentious than whether vi or emacs is better, the recommended advice is to learn how to be productive in at least one of them. Like learning to touch-type, being productive in a command-line editor is something that will pay dividends in the future if the initial learning challenge can be overcome. For those who don't have time, pico (also known as nano) can be a useful tool in command-line situations, and the on-screen help makes it easier to learn to use. Note that the caret symbol (^) means Control, so ^X means Control + X.

To add a new documentation section, create a directory called Documentation, and inside it, create a file called hello.html. The HTML file is an HTML5 document, with a declaration and a body. A minimal file looks like this:

<!DOCTYPE html>
<html>
  <body>
    <h1>Welcome to Swift Playground</h1>
  </body>
</html>

The content needs to be added to the table of contents (contents.xcplayground) in order to display it in the playground itself, by adding a documentation element under the sections element:

<playground version='3.0' sdk='iphonesimulator'>
  <sections>
    <code source-file-name='section-1.swift'/>
    <documentation relative-path='hello.html'/>
  </sections>
  <timeline fileName='timeline.xctimeline'/>
</playground>

The relative-path attribute is relative to the Documentation directory. All content in the Documentation directory is copied between saves in the timeline and can be used to store other text content such as CSS files. Binary content, including images, should be stored in the Resources directory. When viewed as a playground, the content will be shown in the same window as the documentation.

If the content is truncated in the window, then a horizontal rule can be added at the bottom with <hr/>, or the documentation can be styled, as shown in the next section.

Styling the documentation

As the documentation is written in HTML, it is possible to style it using CSS. For example, the background of the documentation is transparent, which results in the text overlapping both the margins as well as the output.
To add a style sheet to the documentation, create a file called stylesheet.css in the Documentation directory and add the following content:

body { background-color: white }

To add the style sheet to the HTML file, add a style sheet link reference to the head element in hello.html:

<head>
  <link rel="stylesheet" type="text/css" href="stylesheet.css"/>
</head>

Now when the playground is opened, the text will have a solid white background and will not be obscured by the margins.

Adding resources to a playground

Images and other resources can also be added to a playground. Resources need to be added to a directory called Resources, which is copied as is between different versions of the playground. To add an image to the document, create a Resources folder and then insert an image. For example, earlier in this article, an image was downloaded by using the following commands:

$ mkdir MyPlayground.playground/Resources
$ curl http://alblue.bandlem.com/images/AlexHeadshotLeft.png > MyPlayground.playground/Resources/logo.png

The image can then be referred to in the documentation using an img tag and a relative path from the Documentation directory:

<img src="../Resources/logo.png" alt="Logo"/>

Other supported resources (such as JPEG and GIF) can be added to the Resources folder as well. It is also possible to add other content (such as a ZIP file of examples) to the Resources folder and provide hyperlinks from the documentation to the resource files:

<a href="../Resources/AlexBlewitt.vcf">Download contact card</a>

Additional entries in the header

The previous example showed the minimum amount of content required for playground documentation. However, there are other meta elements that can be added to the document that have specific purposes and which might be found in other playground examples on the Internet. Here is a more comprehensive example of using meta elements:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8"/>
    <link rel="stylesheet" type="text/css" href="stylesheet.css"/>
    <title>Welcome to Swift Playground</title>
    <meta name="xcode-display" content="render"/>
    <meta name="apple-mobile-web-app-capable" content="yes"/>
    <meta name="viewport" content="width=device-width,maximum-scale=1.0"/>
  </head>
  <body>...</body>
</html>

In this example, the document is declared as being written in English (lang="en" on the html element) and in the UTF-8 character set. The <meta charset="utf-8"/> element should always be the first element in the HTML head section, and the UTF-8 encoding should always be preferred for writing documents. If this is missed, the document will default to a different encoding, such as ISO-8859-1, which can lead to strange characters appearing. Always use UTF-8 for writing HTML documents.

The link and title are standard HTML elements that associate the style sheet (from before) and the title of the document. The title is not displayed in Xcode, but it can be shown if the HTML document is opened in a browser instead. As the documentation is reusable between playgrounds and the web, it makes sense to give it a sensible title.

The link should be the second element after the charset definition. In fact, all externally linked resources, such as style sheets and scripts, should occur near the top of the document. This allows the HTML parser to initiate the download of external resources as soon as possible. This also includes the HTML5 prefetch link type, which is not supported in Safari or playground at the time of writing.
The meta tags are instructions to Safari to render the document in different ways (Safari is the web engine that is used to present the documentation content in playground). Safari-specific meta tags are described at https://developer.apple.com/library/safari/documentation/AppleApplications/Reference/SafariHTMLRef/Articles/MetaTags.html and include the following:

- The xcode-display=render meta tag, which indicates that Xcode should show the content of the document instead of the HTML source code when opening in Xcode
- The apple-mobile-web-app-capable=yes meta tag, which indicates that Safari should show this fullscreen if necessary when running on a mobile device
- The viewport=width=device-width,maximum-scale=1.0 meta tag, which allows the document body to be resized to fit the user's viewable area without scaling

Generating playgrounds automatically

The format of the playground files is well known, and several utilities have been created to generate playgrounds from documentation formats, such as Markdown or AsciiDoc. These are text-based documentation formats that provide a standard means to generate output documents, particularly HTML-based ones.

Markdown

Markdown (a word play on markup) was created to provide a standard syntax to generate web page documentation with links and references in a plain text format. More information about Markdown can be found at the home page (http://daringfireball.net/projects/markdown/), and more about the standardization of Markdown into CommonMark (used by StackOverflow, GitHub, Reddit, and others) can be found at http://commonmark.org.

Embedding code in documentation is fairly common in Markdown. The file is treated as a top-level document, with sections to separate out the documentation and the code blocks. In CommonMark, these are separated with back ticks (```), often with the name of the language to add different script rendering types:

## Markdown Example ##

This is an example CommonMark document. Blank lines separate paragraphs.
Code blocks are introduced with three back-ticks and closed with back-ticks:

```swift
println("Welcome to Swift")
```

Other text and other blocks can follow below.

The most popular tool for converting Markdown/CommonMark documents into playgrounds (at the time of writing) is Jason Sandmeyer's swift-playground-builder at https://github.com/jas/swift-playground-builder/. The tool uses Node to execute JavaScript and can be installed using the npm install -g swift-playground-builder command. Both Node and npm can be installed from http://nodejs.org. Once installed, documents can be translated using playground --platform ios --destination outdir --stylesheet stylesheet.css. If code samples should not be editable, then the --no-refresh argument should be added.

AsciiDoc

AsciiDoc is similar in intent to Markdown, except that it can render to more backends than just HTML5. AsciiDoc is growing in popularity for documenting code, primarily because the standard is much better defined than Markdown's. The de facto standard translation tool for AsciiDoc is written in Ruby and can be installed using the sudo gem install asciidoctor command.

Code blocks in AsciiDoc are represented by a [source] block. For Swift, this will be [source, swift]. The block starts and ends with two hyphens (--):

.AsciiDoc Example

This is an example AsciiDoc document. Blank lines separate paragraphs.
Code blocks are introduced with a source block and two hyphens:

[source, swift]
--
println("Welcome to Swift")
--

Other text and other code blocks can follow below.
AsciiDoc files typically use the ad extension, and the ad2play tool can be installed from James Carlson's repository at https://github.com/jxxcarlson/ad2play. Saving the preceding example as example.ad and running ad2play example.ad will result in the generation of the example.playground file. More information about AsciiDoc, including the syntax and backends, can be found at the AsciiDoc home page at http://www.methods.co.nz/asciidoc/ or on the Asciidoctor home page at http://asciidoctor.org.

Limitations of playgrounds

Although playgrounds can be very powerful for interacting with code, there are some limitations that are worth being aware of. There is no debugging support in the playground: it is not possible to add a breakpoint and use the debugger to find out what the values are. Given that the UI allows tracking values, and that it is very easy to add new lines with just the value to be tracked, this is not much of a hardship. Other limitations of playgrounds include:

- Only the simulator can be used for the execution of iOS-based playgrounds. This prevents the use of hardware-specific features that might only be present on a device.
- The performance of playground scripts is mainly driven by how many lines are executed and how much output is saved by the debugger. Playgrounds should not be used to test the performance of performance-sensitive code.
- Although the playground is well suited to presenting user interface components, it cannot be used for user input.
- Anything requiring entitlements (such as in-app purchases or access to iCloud) is not possible in playground at the time of writing.

Note that while earlier releases of playground did not support custom frameworks, Xcode 6.1 permits frameworks to be loaded into playground, provided that the framework is built and marked as public and that it is in the same workspace as the playground.

Summary

This article presented playgrounds, an innovative way of running Swift code with graphical representations of values and introspection of running code. Both expressions and the timeline were presented as a way of showing the state of the program at any time, as well as graphically inspecting objects using Quick Look. The XCPlayground framework can also be used to record specific values and allow asynchronous code to be executed. Being able to mix code and documentation in the same playground is also a great way of showing what functions exist, and how to create self-documenting playgrounds was presented. In addition, tools for the creation of such playgrounds using either AsciiDoc or Markdown (CommonMark) were introduced.

Resources for Article:

Further resources on this subject:

- Using OpenStack Swift [article]
- Sparrow iOS Game Framework - The Basics of Our Game [article]
- Adding Real-time Functionality Using Socket.io [article]


Beagle Boards

Packt
23 Dec 2014
10 min read
In this article by Hunyue Yau, author of Learning BeagleBone, we will provide a background on the entire family of Beagle boards with brief highlights of what is unique about every member, such as things that favor one member over the other. This article will help you identify Beagle boards that might have been mislabeled. The following topics will be covered here:

- What are Beagle boards
- How do they relate to other development boards
- BeagleBoard Classic
- BeagleBoard-xM
- BeagleBone White
- BeagleBone Black

(For more resources related to this topic, see here.)

The Beagle board family

The Beagle boards are a family of low-cost, open development boards that provide everyday students, developers, and other interested people with access to current mobile processor technology on a path toward developing ideas. Prior to the invention of the Beagle family of boards, the available options were primarily limited to either low-computing-power boards, such as the 8-bit microcontroller-based Arduino boards, or dead-end options, such as repurposing existing products. There were even other options, such as compromising on physical size or electrical power consumption by utilizing non-mobile-oriented technology, for example, embedding a small laptop or desktop into a project. The Beagle boards attempt to address these points and more.

The Beagle board family provides you with access to the technologies that were originally developed for mobile devices, such as phones and tablets, and lets you use them to develop projects and for educational purposes. By leveraging the same technology for education, students can be less reliant on obsolete technologies. All this access comes affordably. Prior to the Beagle boards being available, development boards of this class easily exceeded thousands of dollars. In contrast, the initial Beagle board offering was priced at a mere 150 dollars!

The Beagle boards

The Beagle family of boards began in late 2008 with the original Beagle board. The original board has quite a few characteristics similar to all members of the Beagle board family. All the current boards are based on an ARM core and can be powered by a single 5V source or by varying degrees from a USB port. All boards have a USB port for expansion and provide direct access to the processor I/O for advanced interfacing and expansion. Examples of the processor I/O available for expansion include Serial Peripheral Interface (SPI), I2C, pulse width modulation (PWM), and general-purpose input/output (GPIO). The USB expansion path was introduced at an early stage, providing a cheap path to add features by leveraging existing desktop and laptop accessories.

All the boards are designed with the beginner in mind and, as such, are impossible to brick on a software basis. To brick a board is a common slang term that refers to damaging a board beyond recovery, thus turning the board from an embedded development system into something as useful for embedded development as a brick. This doesn't mean that they cannot be damaged electrically or physically.

For those who are interested, the design and manufacturing material is also available for all the boards. The bill of materials is designed to be available via the distribution so that the boards themselves can be customized and manufactured even in small quantities. This allows projects to be manufactured if desired.

Do not power up the board on any conductive surfaces or near conductive materials, such as metal tools or exposed wires.
The board is fully exposed, and doing so can subject your board to electrical damage. The only exception is a proper ESD mat designed for use with electronics. Proper ESD mats are designed to be only conductive enough to discharge static electricity without damaging the circuits.

The following sections highlight the specifications of each member, presented in the order they were introduced. They are based on the latest revision of each board. As these boards leverage mobile technology, availability changes and the designs are partly revised to accommodate the available parts. The design information for older versions is available at http://www.beagleboard.org/.

BeagleBoard Classic

The initial member of the Beagle board family is the BeagleBoard Classic (BBC), which features the following specs:

- OMAP3530 clocked up to 720 MHz, featuring an ARM Cortex-A8 core along with integrated 3D and video decoding accelerators
- 256 MB of LPDDR (low-power DDR) memory with 512 MB of integrated (NAND) flash memory on board; older revisions had less memory
- USB OTG (switchable between a USB device and a USB host) along with a pure USB high-speed host-only port
- A low-level debug port accessible using a common desktop DB-9 adapter
- Analog audio in and out
- DVI-D video output to connect to a desktop monitor or a digital TV
- A full-size SD card interface
- A 28-pin general expansion header along with two 20-pin headers for video expansion
- 1.8V I/O

Only a nominal 5V is available on the expansion connector. Expansion boards should have their own regulator.

At the original release of the BBC in 2008, the OMAP3530 was comparable to the processors of mobile phones of that time. The BBC is the only member to feature a full-size SD card interface. You can see the BeagleBoard Classic in the following image:

BeagleBoard-xM

As an upgrade to the BBC, the BeagleBoard-xM (BBX) was introduced later. It features the following specs:

- DM3730 clocked up to 1 GHz, featuring an ARM Cortex-A8 core along with integrated 3D and video decoding accelerators, compared to 720 MHz on the BBC.
- 512 MB of LPDDR but no onboard flash memory, compared to 256 MB of LPDDR with up to 512 MB of onboard flash memory.
- USB OTG (switchable between a USB device and a USB host) along with an onboard hub to provide four USB host ports and an onboard USB-connected Ethernet interface. The hub and Ethernet connect to the same port as the only high-speed port of the BBC. The hub allows low-speed devices to work with the BBX.
- A low-level debug port accessible with a standard DB-9 serial cable. An adapter is no longer needed.
- Analog audio in and out. This is the same analog audio in and out as that of the BBC.
- DVI-D video output to connect to a desktop monitor or a digital TV. This is the same DVI-D video output as used in the BBC.
- A microSD interface. It replaces the full-size SD interface on the BBC. The difference is mainly the physical size.
- A 28-pin expansion interface and two 20-pin video expansion interfaces along with an additional camera interface board. The 28-pin and two 20-pin interfaces are physically and electrically compatible with the BBC.
- 1.8V I/O.

Only a nominal 5V is available on the expansion connector. Expansion boards should have their own regulator.

The BBX has a faster processor and added capabilities when compared to the BBC. The camera interface is a unique feature of the BBX and provides a direct interface for raw camera sensors.
The 28-pin interface, along with the two 20-pin video interfaces, is electrically and mechanically compatible with the BBC. The mechanical mounting holes were purposely made backward compatible. Beginning with the BBX, boards were shipped with a microSD card containing the Angström Linux distribution. The latest versions of the kernel and bootloader are shared between the BBX and BBC. The software can detect and utilize the features available on each board, as the DM3730 and the OMAP3530 processors are internally very similar. You can see the BeagleBoard-xM in the following image:

BeagleBone

To simplify things and to bring in a lower entry cost, the BeagleBone subfamily of boards was introduced. While many concepts in this article can be shared with the entire Beagle family, this article will focus on this subfamily. All current members of BeagleBone can be purchased for less than 100 dollars.

BeagleBone White

The initial member of this subfamily is the BeagleBone White (BBW). This new form factor has a footprint that allows the board itself to be stored inside an Altoids tin. The Altoids tin is conductive and can electrically damage the board if an operational BeagleBone is placed inside it without additional protection. The BBW features the following specs:

- AM3358 clocked at up to 720 MHz, featuring an ARM Cortex-A8 core along with a 3D accelerator, an ARM Cortex-M3 for power management, and a unique feature: the Programmable Real-time Unit Subsystem (PRUSS)
- 256 MB of DDR2 memory
- Two USB ports, namely, a dedicated USB host and a dedicated USB device
- An onboard JTAG debugger
- An onboard USB interface to access the low-level serial interfaces
- A 10/100 Mb Ethernet interface
- Two 46-pin expansion interfaces with up to eight channels of analog input
- A 10-pin power expansion interface
- A microSD interface
- 3.3V digital I/O
- 1.8V analog I/O

As with the BBX, the BBW ships with the Angström Linux distribution. You can see the BeagleBone White in the following image:

BeagleBone Black

Intended as a lower-cost version of the BeagleBone, the BeagleBone Black (BBB) features the following specs:

- AM3358 clocked at up to 1 GHz, featuring an ARM Cortex-A8 core along with a 3D accelerator, an ARM Cortex-M3 for power management, and a unique feature: the PRUSS. This is an improved revision of the same processor in the BBW.
- 512 MB of DDR3 memory compared to 256 MB of DDR2 memory on the BBW.
- 4 GB of onboard flash embedded MMC (eMMC) memory for the latest version, compared to a complete lack of onboard flash memory on the BBW.
- Two USB ports, namely, a dedicated USB host and a dedicated USB device.
- A low-level serial interface available as a dedicated 6-pin header.
- A 10/100 Mb Ethernet interface.
- Two 46-pin expansion interfaces with up to eight channels of analog input.
- A microSD interface.
- A micro HDMI interface to connect to a digital monitor or a digital TV. Digital audio is available on the same interface. This is new to the BBB.
- 3.3V digital I/O.
- 1.8V analog I/O.

The overall mechanical form factor of the BBB is the same as that of the BBW. However, due to the added features, there are some slight electrical changes in the expansion interface. The power expansion header was removed to make room for the added features. Unlike other boards, the BBB ships with a Linux distribution on the internal flash memory. Early revisions shipped with Angström Linux and later revisions shipped with Debian Linux in an attempt to simplify things for new users.
Unlike the BBW, the BBB does not provide an onboard JTAG debugger or an onboard USB-to-serial converter. Both these features were provided by a single chip on the BBW and were removed from the BBB for cost reasons. JTAG debugging is possible on the BBB by soldering a connector to the back of the board and using an external debugger. Serial port access on the BBB is provided by a serial header.

This article will focus solely on the BeagleBone subfamily (BBW and BBB). The differences between them will be noted where applicable. It should be noted that for more advanced projects, the BBC/BBX should be considered, as they offer additional unique features that are not available on the BBW/BBB. Most concepts learned on the BBB/BBW boards are entirely applicable to the BBC/BBX boards. You can see the BeagleBone Black in the following image:

Summary

In this article, we looked at the Beagle board offerings and a few unique features of each offering. Then, we went through the process of setting up a BeagleBone board and understood the basics of accessing it from a laptop/desktop.

Resources for Article:

Further resources on this subject:

- Protecting GPG Keys in BeagleBone [article]
- Making the Unit Very Mobile - Controlling Legged Movement [article]
- Pulse width modulator [article]


Best Practices

Packt
23 Dec 2014
29 min read
This article by Prabath Siriwardena, author of Mastering Apache Maven 3, focuses on best practices associated with all the core concepts. The following best practices are essential ingredients in creating a successful, productive build environment. The following criteria will help you evaluate the efficiency of your Maven project if you are mostly dealing with a large-scale, multi-module project:

- The time it takes for a developer to get started with a new project and add it to the build system
- The effort it requires to upgrade a version of a dependency across all the project modules
- The time it takes to build the complete project with a fresh local Maven repository
- The time it takes to do a complete offline build
- The time it takes to update the versions of Maven artifacts produced by the project, for example, from 1.0.0-SNAPSHOT to 1.0.0
- The effort it requires for a completely new developer to understand what your Maven build does
- The effort it requires to introduce a new Maven repository
- The time it takes to execute unit tests and integration tests

(For more resources related to this topic, see here.)

Dependency management

In the following example, you will notice that the dependency versions are added to each and every dependency defined in the application POM file:

<dependencies>
  <dependency>
    <groupId>com.nimbusds</groupId>
    <artifactId>nimbus-jose-jwt</artifactId>
    <version>2.26</version>
  </dependency>
  <dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.2</version>
  </dependency>
</dependencies>

Imagine that you have a set of application POM files in a multi-module Maven project that has the same set of dependencies. If you have duplicated the artifact version with each and every dependency, then to upgrade to the latest dependency, you need to update all the POM files, which could easily lead to a mess. Not just that, if you have different versions of the same dependency used in different modules of the same project, then it's going to be a debugging nightmare in case of an issue. With proper dependency management, we can overcome both the previous issues. If it's a multi-module Maven project, you need to introduce the dependencyManagement configuration element in the parent POM so that it will be inherited by all the other child modules:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.nimbusds</groupId>
      <artifactId>nimbus-jose-jwt</artifactId>
      <version>2.26</version>
    </dependency>
    <dependency>
      <groupId>commons-codec</groupId>
      <artifactId>commons-codec</artifactId>
      <version>1.2</version>
    </dependency>
  </dependencies>
</dependencyManagement>

Once you define dependencies under the dependencyManagement section as shown in the previous code, you only need to refer to a dependency by its groupId and artifactId tags. The version tag is picked from the appropriate dependencyManagement section:

<dependencies>
  <dependency>
    <groupId>com.nimbusds</groupId>
    <artifactId>nimbus-jose-jwt</artifactId>
  </dependency>
  <dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
  </dependency>
</dependencies>

With the previous code snippet, if you want to upgrade or downgrade a dependency, you only need to change the version of the dependency under the dependencyManagement section. The same principle applies to plugins as well.
If you have a set of plugins that are used across multiple modules, you should define them under the pluginManagement section of the parent module. In this way, you can downgrade or upgrade plugin versions seamlessly just by changing the pluginManagement section of the parent POM, as shown in the following code:

<pluginManagement>
  <plugins>
    <plugin>
      <artifactId>maven-resources-plugin</artifactId>
      <version>2.4.2</version>
    </plugin>
    <plugin>
      <artifactId>maven-site-plugin</artifactId>
      <version>2.0-beta-6</version>
    </plugin>
    <plugin>
      <artifactId>maven-source-plugin</artifactId>
      <version>2.0.4</version>
    </plugin>
    <plugin>
      <artifactId>maven-surefire-plugin</artifactId>
      <version>2.13</version>
    </plugin>
  </plugins>
</pluginManagement>

Once you define the plugins in the pluginManagement section, as shown in the previous code, you only need to refer to a plugin by its groupId (optional) and artifactId tags. The version tag is picked from the appropriate pluginManagement section:

<plugins>
  <plugin>
    <artifactId>maven-resources-plugin</artifactId>
    <executions>……</executions>
  </plugin>
  <plugin>
    <artifactId>maven-site-plugin</artifactId>
    <executions>……</executions>
  </plugin>
  <plugin>
    <artifactId>maven-source-plugin</artifactId>
    <executions>……</executions>
  </plugin>
  <plugin>
    <artifactId>maven-surefire-plugin</artifactId>
    <executions>……</executions>
  </plugin>
</plugins>

Defining a parent module

In most multi-module Maven projects, there are many things that are shared across multiple modules. Dependency versions, plugin versions, properties, and repositories are only some of them. It is a common as well as a best practice to create a separate module called parent and, in its POM file, define everything in common. The packaging type of this POM file is pom. The artifact generated by the pom packaging type is itself a POM file. The following are a few examples:

- The Apache Axis2 project, available at http://svn.apache.org/repos/asf/axis/axis2/java/core/trunk/modules/parent/
- The WSO2 Carbon project, available at https://svn.wso2.org/repos/wso2/carbon/platform/trunk/parent/

Not all projects follow this approach. Some just keep the parent POM file under the root directory (not under the parent module). The following are a couple of examples:

- The Apache Synapse project, available at http://svn.apache.org/repos/asf/synapse/trunk/java/pom.xml
- The Apache HBase project, available at http://svn.apache.org/repos/asf/hbase/trunk/pom.xml

Both approaches deliver the same results. However, the first one is much preferred. With the first approach, the parent POM file only defines the shared resources across different Maven modules in the project, while there is another POM file at the root of the project, which defines all the modules to be included in the project build. With the second approach, you define all the shared resources as well as all the modules to be included in the project build in the same POM file, which is under the project's root directory. The first approach is better than the second one, based on the separation of concerns principle.

POM properties

There are six types of properties that you can use within a Maven application POM file:

- Built-in properties
- Project properties
- Local settings
- Environment variables
- Java system properties
- Custom properties

It is always recommended that you use properties, instead of hardcoding values in application POM files.
Let's look at a few examples. Take the application POM file inside the Apache Axis2 distribution module, available at http://svn.apache.org/repos/asf/axis/axis2/java/core/trunk/modules/distribution/pom.xml. This defines all the artifacts created in the Axis2 project that need to be included in the final distribution. All the artifacts share the same groupId tag as well as the version tag of the distribution module. This is a common scenario in most multi-module Maven projects. Most of the modules (if not all) share the same groupId tag and version tag:

<dependencies>
  <dependency>
    <groupId>org.apache.axis2</groupId>
    <artifactId>axis2-java2wsdl</artifactId>
    <version>${project.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.axis2</groupId>
    <artifactId>axis2-kernel</artifactId>
    <version>${project.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.axis2</groupId>
    <artifactId>axis2-adb</artifactId>
    <version>${project.version}</version>
  </dependency>
</dependencies>

In the previous configuration, instead of duplicating the version element, Axis2 uses the project property ${project.version}. When Maven finds this project property, it reads the value from the project POM's version element. If the project POM file does not have a version element, then Maven will try to read it from the immediate parent POM file. The benefit here is that, when you upgrade your project version some day, you only need to upgrade the version element of the distribution POM file (or its parent). The previous configuration is not perfect; it can be further improved in the following manner:

<dependencies>
  <dependency>
    <groupId>${project.groupId}</groupId>
    <artifactId>axis2-java2wsdl</artifactId>
    <version>${project.version}</version>
  </dependency>
  <dependency>
    <groupId>${project.groupId}</groupId>
    <artifactId>axis2-kernel</artifactId>
    <version>${project.version}</version>
  </dependency>
  <dependency>
    <groupId>${project.groupId}</groupId>
    <artifactId>axis2-adb</artifactId>
    <version>${project.version}</version>
  </dependency>
</dependencies>

Here, we also replace the hardcoded value of groupId in all the dependencies with the project property ${project.groupId}. When Maven finds this project property, it reads the value from the project POM's groupId element. If the project POM file does not have a groupId element, then Maven will try to read it from the immediate parent POM file. The following lists some of the Maven built-in properties and project properties:

- project.version: This refers to the value of the version element of the project POM file
- project.groupId: This refers to the value of the groupId element of the project POM file
- project.artifactId: This refers to the value of the artifactId element of the project POM file
- project.name: This refers to the value of the name element of the project POM file
- project.description: This refers to the value of the description element of the project POM file
- project.baseUri: This refers to the path of the project's base directory

The following is an example that shows the usage of this project property.
Here, we have a system dependency that needs to be referred to from a filesystem path:

<dependency>
  <groupId>org.apache.axis2.wso2</groupId>
  <artifactId>axis2</artifactId>
  <version>1.6.0.wso2v2</version>
  <scope>system</scope>
  <systemPath>${project.basedir}/lib/axis2-1.6.jar</systemPath>
</dependency>

In addition to the project properties, you can also read properties from the USER_HOME/.m2/settings.xml file. For example, if you want to read the path to the local Maven repository, you can use the ${settings.localRepository} property. In the same way, with the same pattern, you can read any of the configuration elements that are defined in the settings.xml file.

The environment variables defined in the system can be read using the env prefix within an application POM file. The ${env.M2_HOME} property will return the path to the Maven home, while ${env.JAVA_HOME} returns the path to the Java home directory. These properties can be quite useful within certain Maven plugins.

Maven also lets you define your own set of custom properties. Custom properties are mostly used when defining dependency versions. You should not scatter custom properties all over the place. The ideal place to define them is the parent POM file of a multi-module Maven project, which will then be inherited by all the other child modules. If you look at the parent POM file of the WSO2 Carbon project, you will find a large set of custom properties, defined at https://svn.wso2.org/repos/wso2/carbon/platform/branches/turing/parent/pom.xml. The following lists some of them:

<properties>
  <rampart.version>1.6.1-wso2v10</rampart.version>
  <rampart.mar.version>1.6.1-wso2v10</rampart.mar.version>
  <rampart.osgi.version>1.6.1.wso2v10</rampart.osgi.version>
</properties>

When you add a dependency to the Rampart JAR, you do not need to specify the version there. Just refer to it by the ${rampart.version} property name. Also, keep in mind that all custom-defined properties are inherited and can be overridden in any child POM file:

<dependency>
  <groupId>org.apache.rampart.wso2</groupId>
  <artifactId>rampart-core</artifactId>
  <version>${rampart.version}</version>
</dependency>

Avoiding repetitive groupId and version tags by inheriting from the parent POM

In a multi-module Maven project, most of the modules (if not all) share the same groupId and version elements. In this case, you can avoid adding the version and groupId elements to your application POM file. These will be automatically inherited from the corresponding parent POM. If you look at axis2-kernel (which is a module of the Apache Axis2 project), you will find that no groupId or version is defined at http://svn.apache.org/repos/asf/axis/axis2/java/core/trunk/modules/kernel/pom.xml. Maven reads them from the parent POM file:

<project>
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.axis2</groupId>
    <artifactId>axis2-parent</artifactId>
    <version>1.7.0-SNAPSHOT</version>
    <relativePath>../parent/pom.xml</relativePath>
  </parent>
  <artifactId>axis2-kernel</artifactId>
  <name>Apache Axis2 - Kernel</name>
</project>

Following naming conventions

When defining coordinates for your Maven project, you must always follow the naming conventions. The value of the groupId element should follow the same naming convention you use for Java package names. It has to be a domain name (the reverse of the domain name) that you own, or at least one that your project is developed under.
The following lists out some of the naming conventions related to groupId:

- The name of the groupId element has to be in lower case.
- Use the reverse of a domain name that can be used to uniquely identify your project. This will also help to avoid collisions between artifacts produced by different projects.
- Avoid using digits or special characters (that is, org.wso2.carbon.identity-core).
- Do not try to group two words into a single word by camel casing (that is, org.wso2.carbon.identityCore).
- Make sure that all the subprojects developed under different teams in the same company finally inherit from the same groupId element and extend the name of the parent groupId element rather than defining their own.

Let's go through some examples. You will notice that all the open source projects developed under the Apache Software Foundation (ASF) use the same parent groupId (org.apache) and define their own groupId elements, which extend from the parent:

- The Apache Axis2 project uses org.apache.axis2, which inherits from the org.apache parent groupId
- The Apache Synapse project uses org.apache.synapse, which inherits from the org.apache parent groupId
- The Apache ServiceMix project uses org.apache.servicemix, which inherits from the org.apache parent groupId
- The WSO2 Carbon project uses org.wso2.carbon

Apart from the groupId element, you should also follow the naming conventions while defining artifactIds. The following lists out some of the naming conventions related to artifactId:

- The name of the artifactId has to be in lower case.
- Avoid duplicating the value of groupId inside the artifactId element. If you find a need to start your artifactId with the value of the groupId element and add something to the end, then you need to revisit the structure of your project. You might need to add more module groups.
- Avoid using special characters (that is, #, $, &, %, and so on).
- Do not try to group two words into a single word by camel casing (that is, identityCore).

Following naming conventions for the version is also equally important. The version of a given Maven artifact can be divided into four categories:

<Major version>.<Minor version>.<Incremental version>-<Build number or the qualifier>

The major version reflects the introduction of a new major feature. A change in the major version of a given artifact can also mean that the new changes are not necessarily backward compatible with the previously released artifact. The minor version reflects an introduction of a new feature to the previously released version, in a backward compatible manner. The incremental version reflects a bug-fixed release of the artifact. The build number can be the revision number from the source code repository.

This versioning convention is not just for Maven artifacts. Apple did a major release of its iOS mobile operating system in September 2014: iOS 8.0.0. Soon after the release, they discovered a critical bug in it that had an impact on cellular network connectivity and TouchID on iPhone. Then, they released iOS 8.0.1 as a patch release to fix the issues.

Let's go through some of the examples:

- The Apache Axis2 1.6.0 release, available at http://svn.apache.org/repos/asf/axis/axis2/java/core/tags/v1.6.0/pom.xml.
- The Apache Axis2 1.6.2 release, available at http://svn.apache.org/repos/asf/axis/axis2/java/core/tags/v1.6.2/pom.xml.
- The Apache Axis2 1.7.0-SNAPSHOT release, available at http://svn.apache.org/repos/asf/axis/axis2/java/core/trunk/pom.xml.
SNAPSHOT releases are done from the trunk of the source repository with the latest available code. The Apache Synapse 2.1.0-wso2v5 release, available at http://svn.wso2.org/repos/wso2/tags/carbon/3.2.3/dependencies/synapse/2.1.0-wso2v5/pom.xml. Here, the Synapse code is maintained under the WSO2 source repository and not under the Apache repository. In this case, we use the wso2v5 classifier to make it different from the same artifact produced by Apache Synapse.

Maven profiles

When do we need Maven profiles and why are they a best practice? Think about a large-scale multi-module Maven project. One of the best examples I am aware of is the WSO2 Carbon project. If you look at the application POM file available at http://svn.wso2.org/repos/wso2/tags/carbon/3.2.3/components/pom.xml, you will notice that there are more than a hundred modules. Also, if you go deeper into each module, you will further notice that there are more modules within them: http://svn.wso2.org/repos/wso2/tags/carbon/3.2.3/components/identity/pom.xml. As a developer of the WSO2 Carbon project, you do not need to build all these modules. In this specific example, different groups of the modules are later aggregated to build multiple products. However, a given product does not need to build all the modules defined in the parent POM file. If you are a developer in a product team, you only need to worry about building the set of modules related to your product; if not, it's an utter waste of productive time. Maven profiles help you to do this.

With Maven profiles, you can activate a subset of configurations defined in your application POM file, based on some criteria. If we take the same example we took previously, you will find that multiple profiles are defined under the <profiles> element: http://svn.wso2.org/repos/wso2/tags/carbon/3.2.3/components/pom.xml. Each profile element defines the set of modules that is relevant to it and is identified by a unique ID. Also, for each profile, you need to define a criterion to activate it, under the activation element. By setting the value of the activeByDefault element to true, we make sure that the corresponding profile will get activated when no other profile is picked. In this particular example, if we just execute mvn clean install, the profile with the default ID will get executed. Keep in mind that the magic here does not lie in the name of the profile ID, default, but in the value of the activeByDefault element, which is set to true for the default profile. The value of the id element can be any name:

<profiles>
  <profile>
    <id>product-esb</id>
    <activation>
      <property>
        <name>product</name>
        <value>esb</value>
      </property>
    </activation>
    <modules></modules>
  </profile>
  <profile>
    <id>product-greg</id>
    <activation>
      <property>
        <name>product</name>
        <value>greg</value>
      </property>
    </activation>
    <modules></modules>
  </profile>
  <profile>
    <id>product-is</id>
    <activation>
      <property>
        <name>product</name>
        <value>is</value>
      </property>
    </activation>
    <modules></modules>
  </profile>
  <profile>
    <id>default</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <modules></modules>
  </profile>
</profiles>

If I am a member of the WSO2 Identity Server (IS) team, then I will execute the build in the following manner:

$ mvn clean install -Dproduct=is

Here, we pass the system property product with the value is.
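Following the same pattern, a member of the ESB team would activate the product-esb profile shown in the previous listing by passing esb as the value of the product property:

$ mvn clean install -Dproduct=esb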
If you look at the activation criteria for all the profiles, all are based on the system property: product. If the value of the system property is is, then Maven will pick the build profile corresponding to the Identity Server: <activation>   <property>     <name>product</name>     <value>is</value>   </property> </activation> You also can define an activation criterion to execute a profile, in the absence of a property. For example, the following configuration shows how to activate a profile if the product property is missing: <activation>   <property>     <name>!product</name>   </property> </activation> The profile activation criteria can be based on a system property, the JDK version, or an operating system parameter where you run the build. The following sample configuration shows how to activate a build profile for JDK 1.6: <activation>   <jdk>1.6</jdk> </activation> The following sample configuration shows how to activate a build profile based on operating system parameters: <activation>   <os>     <name>mac os x</name>     <family>mac</family>     <arch>x86_64</arch>    <version>10.8.5</version>   </os> </activation> The following sample configuration shows how to activate a build profile based on the presence or absence of a file: <activation>   <file>     <exists>……</exists>     <missing>……</missing>   </file> </activation> In addition to the activation configuration, you can also execute a Maven profile just by its ID, which is defined within the id element. In this case, you need a prefix; use the profile ID with –P, as shown in the following command: $ mvn clean install -Pproduct-is Think twice before you write your own plugin Maven is all about plugins! There is a plugin out there for almost everything. If you find a need to write a plugin, spend some time researching on the Web to see whether you can find something similar—the chances are very high. You can also find a list of available Maven plugins at http://maven.apache.org/plugins. The Maven release plugin Releasing a project requires a lot of repetitive tasks. The objective of the Maven release plugin is to automate them. The release plugin defines the following eight goals, which are executed in two stages, preparing the release and performing the release: release:clean: This goal cleans up after a release preparation release:prepare: This goal prepares for a release in Software Configuration Management (SCM) release:prepare-with-pom: This goal prepares for a release in SCM and generates release POMs by fully resolving the dependencies release:rollback: This goal rolls back to a previous release release:perform: This goal performs a release from SCM release:stage: This goal performs a release from SCM into a staging folder/repository release:branch: This goal creates a branch of the current project with all versions updated release:update-versions: This goal updates versions in the POM(s) The preparation stage will complete the following tasks with the release:prepare goal: Verify that all the changes in the source code are committed. Make sure that there are no SNAPSHOT dependencies. During the project development phase, we use SNAPSHOT dependencies but at the time of the release, all dependencies should be changed to a released version. Change the version of project POM files from SNAPSHOT to a concrete version number. Change the SCM information in the project POM to include the final destination of the tag. Execute all the tests against the modified POM files. 
Commit the modified POM files to the SCM and tag the code with the version name. Change the version of POM files in the trunk to a SNAPSHOT version and then commit the modified POM files to the trunk. Finally, the release will be performed with the release:perform goal. This will check the code from the release tag in the SCM and run a set of predefined goals: site, deploy-site. The maven-release-plugin is not defined in the super POM and should be explicitly defined in your project POM file. The releaseProfiles configuration element defines the profiles to be released and the goals configuration element defines the plugin goals to be executed during the release:perform goal. In the following configuration, the deploy goal of the maven-deploy-plugin and the single goal of the maven-assembly-plugin will get executed: <plugin>   <artifactId>maven-release-plugin</artifactId>   <version>2.5</version>   <configuration>     <releaseProfiles>release</releaseProfiles>     <goals>deploy assembly:single</goals>   </configuration> </plugin> More details about the Maven release plugin are available at http://maven.apache.org/maven-release/maven-release-plugin/. The Maven enforcer plugin The Maven enforcer plugin lets you control or enforce constraints in your build environment. These could be the Maven version, Java version, operating system parameters, and even user-defined rules. The plugin defines two goals: enforce and displayInfo. The enforcer:enforce goal will execute all the defined rules against all the modules in a multi-module Maven project, while enforcer:displayInfo will display the project compliance details with respect to the standard rule set. The maven-enforcer-plugin is not defined in the super POM and should be explicitly defined in your project POM file: <plugins>   <plugin>     <groupId>org.apache.maven.plugins</groupId>     <artifactId>maven-enforcer-plugin</artifactId>     <version>1.3.1</version>     <executions>       <execution>         <id>enforce-versions</id>         <goals>           <goal>enforce</goal>         </goals>         <configuration>           <rules>             <requireMavenVersion>               <version>3.2.1</version>             </requireMavenVersion>             <requireJavaVersion>               <version>1.6</version>             </requireJavaVersion>             <requireOS>               <family>mac</family>             </requireOS>           </rules>         </configuration>       </execution>     </executions>   </plugin> </plugins> The previous plugin configuration enforces the Maven version to be 3.2.1, the Java version to be 1.6, and the operating system to be in the Mac family. The Apache Axis2 project uses the enforcer plugin to make sure that no application POM file defines Maven repositories. All the artifacts required by Axis2 are expected to be in the Maven central repository. The following configuration element is extracted from http://svn.apache.org/repos/asf/axis/axis2/java/core/trunk/modules/parent/pom.xml. 
Here, it bans all the repositories and plugin repositories, except snapshot repositories:

<plugin>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.1</version>
  <executions>
    <execution>
      <phase>validate</phase>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <requireNoRepositories>
            <banRepositories>true</banRepositories>
            <banPluginRepositories>true</banPluginRepositories>
            <allowSnapshotRepositories>true</allowSnapshotRepositories>
            <allowSnapshotPluginRepositories>true</allowSnapshotPluginRepositories>
          </requireNoRepositories>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>

In addition to the standard rule set that ships with the enforcer plugin, you can also define your own rules. More details about how to write custom rules are available at http://maven.apache.org/enforcer/enforcer-api/writing-a-custom-rule.html.

Avoid using un-versioned plugins

If you have associated a plugin with your application POM, without a version, then Maven will download the corresponding maven-metadata.xml file and store it locally. Only the latest released version of the plugin will be downloaded and used in the project. This can easily create certain uncertainties. Your project might work fine with the current version of a plugin, but later, if there is a new release of the same plugin, your Maven project will start to use the latest one automatically. This can result in unpredictable behavior and lead to a debugging mess. It is always recommended that you specify the plugin version along with the plugin configuration. You can enforce this as a rule, with the Maven enforcer plugin, as shown in the following code:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.3.1</version>
  <executions>
    <execution>
      <id>enforce-plugin-versions</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <requirePluginVersions>
            <message>…………</message>
            <banLatest>true</banLatest>
            <banRelease>true</banRelease>
            <banSnapshots>true</banSnapshots>
            <phases>clean,deploy,site</phases>
            <additionalPlugins>
              <additionalPlugin>
                org.apache.maven.plugins:maven-eclipse-plugin
              </additionalPlugin>
              <additionalPlugin>
                org.apache.maven.plugins:maven-reactor-plugin
              </additionalPlugin>
            </additionalPlugins>
            <unCheckedPluginList>
              org.apache.maven.plugins:maven-enforcer-plugin,org.apache.maven.plugins:maven-idea-plugin
            </unCheckedPluginList>
          </requirePluginVersions>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>

The following points explain each of the key configuration elements defined in the previous code:

message: Use this to define an optional message to the user if the rule execution fails.
banLatest: Use this to restrict the use of LATEST as the version for any plugin.
banRelease: Use this to restrict the use of RELEASE as the version for any plugin.
banSnapshots: Use this to restrict the use of SNAPSHOT plugins.
banTimestamps: Use this to restrict the use of SNAPSHOT plugins with the timestamp version.
phases: This is a comma-separated list of phases that should be used to find lifecycle plugin bindings. The default value is clean,deploy,site. additionalPlugins: This is a list of additional plugins to enforce to have versions. These plugins might not be defined in application POM files, but are used anyway, such as help and eclipse. The plugins should be specified in the groupId:artifactId form. unCheckedPluginList: This is a comma-separated list of plugins to skip version checking. You can read more details about the requirePluginVersions rule at http://maven.apache.org/enforcer/enforcer-rules/requirePluginVersions.html. Using exclusive and inclusive routes When Maven asks for an artifact from a Nexus proxy repository, Nexus knows where to look at exactly. For example, say we have a proxy repository that runs at http://localhost:8081/nexus/content/repositories/central/, which internally points to the remote repository running at https://repo1.maven.org/maven2/. As there is one-to-one mapping between the proxy repository and the corresponding remote repository, Nexus can route the requests without much trouble. However, if Maven looks for an artifact via a Nexus group repository, then Nexus has to iterate through all the repositories in that group repository to find the exact artifact. There can be cases where we have even more than 20 repositories in a single group repository, which can easily bring delays at the client side. To optimize artifact discovery in group repositories, we need to set correct inclusive/exclusive routing rules. Avoid having both release and snapshot repositories in the same group repository With the Nexus repository manager, you can group both the release repositories and snapshot repositories together into a single group repository. This is treated as an extreme malpractice. Ideally, you should be able to define distinct update policies for release repositories and snapshot repositories. Summary In this article, we looked at and highlighted some of the best practices to be followed in a large-scale development project with Maven. It is always recommended to follow best practices, as it will drastically improve developer productivity and will reduce any maintenance nightmares. Resources for Article: Further resources on this subject: APACHE MAVEN AND M2ECLIPSE [article] Coverage With Apache Karaf Pax Exam Tests [article] Apache Karaf – Provisioning and Clusters [article]

article-image-sneak-peek-ios-touch-id
Packt
23 Dec 2014
4 min read
Save for later

Sneak peek into iOS Touch ID

Packt
23 Dec 2014
4 min read
This article, created by Mayank Birani, the author of Learning iOS 8 for Enterprise, looks at Touch ID authentication, a feature that Apple introduced in iOS 7. Previously, there was only four-digit passcode security in iPhones; now, Apple has extended security and introduced a new security pattern in iPhones. In Touch ID authentication, our fingerprint acts as a password. After launching the Touch ID fingerprint-recognition technology in the iPhone 5S last year, Apple is now providing it for developers with iOS 8. Now, third-party apps will be able to use Touch ID for authentication in the new iPhone and iPad OSes. Accounting apps, and other apps that contain personal and important data, will be protected with Touch ID. Now, you can protect all your apps with your fingerprint password.

(For more resources related to this topic, see here.)

There are two ways to use Touch ID as an authentication mechanism in our iOS 8 applications. They are explained in the following sections.

Touch ID through touch authentication

The Local Authentication API returns a Boolean value to accept or decline the fingerprint. If there is an error, an error code is returned that tells us what the issue is. Certain conditions have to be met when using Local Authentication. They are as follows:

The application must be in the foreground (this doesn't work with background processes)
If you're using the straight Local Authentication method, you will be responsible for handling all the errors and properly responding with your UI to ensure that there is an alternative method to log in to your apps

Touch ID through Keychain Access

Keychain Access includes the new Touch ID integration in iOS 8. In Keychain Access, we don't have to work on implementation details; it automatically handles the passcode implementation using the user's passcode. Several keychain items can be chosen to use Touch ID to unlock the item when requested in code through the use of the new Access Control Lists (ACLs). ACL is a feature of iOS 8. If Touch ID has been locked out, then it will allow the user to enter the device's passcode to proceed without any interruption. There are some features of Keychain Access that make it the best option for us. They are listed here:

Keychain Access uses Touch ID, and its attributes won't be synced by any cloud services. This makes it very safe to use.
If users overlay more than one query, the system gets confused about the correct user, and it will pop up a dialog box reporting multiple touch issues.

How to use the Local Authentication framework

Apple provides a framework called Local Authentication to use Touch ID in our apps. This framework was introduced for iOS 8. To build an app that includes Touch ID authentication, we need to import this framework in our code. It is present in Apple's framework library. Let's see how to use the Local Authentication framework:

Import the Local Authentication framework as follows:

#import <LocalAuthentication/LocalAuthentication.h>

This framework will work on Xcode 6 and above.
To use this API, we have to create a Local Authentication context, as follows:

LAContext *passcode = [[LAContext alloc] init];

Now, check whether Touch ID is available or not and whether it can be used for authentication:

- (BOOL)canEvaluatePolicy:(LAPolicy)policy error:(NSError * __autoreleasing *)error;

To display Touch ID, use the following code:

- (void)evaluatePolicy:(LAPolicy)policy localizedReason:(NSString *)localizedReason reply:(void(^)(BOOL success, NSError *error))reply;

Take a look at the following example of Touch ID:

LAContext *passcode = [[LAContext alloc] init];
NSError *error = nil;
NSString *Reason = <#String explaining why our app needs authentication#>;

if ([passcode canEvaluatePolicy:LAPolicyDeviceOwnerAuthenticationWithBiometrics error:&error]) {
    [passcode evaluatePolicy:LAPolicyDeviceOwnerAuthenticationWithBiometrics
             localizedReason:Reason
                       reply:^(BOOL success, NSError *error) {
        if (success)
        {
            // User authenticated successfully
        } else
        {
            // User did not authenticate successfully, go through the error
        }
    }];
} else {
    // Could not go through policy; look at the error and show an appropriate message to the user
}

Summary

In this article, we focused on the Touch ID API, which was introduced in iOS 8. We also discussed how Apple has improved its security feature using this API.

Resources for Article:

Further resources on this subject:

Sparrow iOS Game Framework - The Basics of Our Game [article]
Updating data in the background [article]
Physics with UIKit Dynamics [article]

article-image-learning-qgis-python-api
Packt
23 Dec 2014
44 min read
Save for later

Learning the QGIS Python API

Packt
23 Dec 2014
44 min read
In this article, we will take a closer look at the Python libraries available for the QGIS Python developer, and also look at the various ways in which we can use  these libraries to perform useful tasks within QGIS. In particular, you will learn: How the QGIS Python libraries are based on the underlying C++ APIs How to use the C++ API documentation as a reference to work with the Python APIs How the PyQGIS libraries are organized The most important concepts and classes within the PyQGIS libraries and how to use them Some practical examples of performing useful tasks using PyQGIS About the QGIS Python APIs The QGIS system itself is written in C++ and has its own set of APIs that are also written in C++. The Python APIs are implemented as wrappers around these C++ APIs. For example, there is a Python class named QgisInterface that acts as a wrapper around a C++ class of the same name. All the methods, class variables, and the like, which are implemented by the C++ version of QgisInterface are made available through the Python wrapper. What this means is that when you access the Python QGIS APIs, you aren't accessing the API directly. Instead, the wrapper connects your code to the underlying C++ objects and methods, as follows :  Fortunately, in most cases, the QGIS Python wrappers simply hide away the complexity of the underlying C++ code, so the PyQGIS libraries work as you would expect them to. There are some gotchas, however, and we will cover these as they come up. Deciphering the C++ documentation As QGIS is implemented in C++, the documentation for the QGIS APIs is all based on C++. This can make it difficult for Python developers to understand and work with the QGIS APIs. For example, the API documentation for the QgsInterface.zoomToActiveLayer() method:  If you're not familiar with C++, this can be quite confusing. Fortunately, as a Python programmer, you can skip over much of this complexity because it doesn't apply to you. In particular: The virtual keyword is an implementation detail you don't need to worry about The word void indicates that the method doesn't return a value The double colons in QgisInterface::zoomToActiveLayer are simply a C++ convention for separating the class name from the method name Just like in Python, the parentheses show that the method doesn't take any parameters. So if you have an instance of QgisInterface (for example, as the standard iface variable available in the Python Console), you can call this method simply by typing the following: iface.zoomToActiveLayer() Now, let's take a look at a slightly more complex example: the C++ documentation for the QgisInterface.addVectorLayer() method looks like the following:  Notice how the virtual keyword is followed by QgsVectorLayer* instead of void. This is the return value for this method; it returns a QgsVector object. Technically speaking, * means that the method returns a pointer to an object of type QgsVectorLayer. Fortunately, Python wrappers automatically handle pointers, so you don't need to worry about this.  Notice the brief description at the bottom of the documentation for this method; while many of the C++ methods have very little, if any, additional information, other methods have quite extensive descriptions. Obviously, you should read these descriptions carefully as they tell you more about what the method does. Even without any description, the C++ documentation is still useful as it tells you what the method is called, what parameters it accepts, and what type of data is being returned. 
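To see how this reads in practice, here is how the method might be called from the Python Console; the shapefile path and layer name are placeholders:

layer = iface.addVectorLayer("/path/to/shapefile.shp", "layer name", "ogr")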
In the preceding method, you can see that there are three parameters listed in between the parentheses. As C++ is a strongly typed language, you have to define the type of each parameter when you define a function. This is helpful for Python programmers as it tells you what type of value to supply. Apart from QGIS objects, you might also encounter the following data types in the C++ documentation: Data type Description int A standard Python integer value long A standard Python long integer value float A standard Python floating point (real) number bool A Boolean value (true or false) QString A string value. Note that the QGIS Python wrappers automatically convert Python strings to C++ strings, so you don't need to deal with QString objects directly QList This object is used to encapsulate a list of other objects. For example, QList<QString*> represents a list of strings Just as in Python, a method can take default values for each parameter. For example, the QgisInterface.newProject() method looks like the following:  In this case, the thePromptToSaveFlag parameter has a default value, and this default value will be used if no value is supplied. In Python, classes are initialized using the __init__ method. In C++, this is called a constructor. For example, the constructor for the QgsLabel class looks like the following:  Just as in Python, C++ classes inherit the methods defined in their superclass. Fortunately, QGIS doesn't have an extensive class hierarchy, so most of the classes don't have a superclass. However, don't forget to check for a superclass if you can't find the method you're looking for in the documentation for the class itself. Finally, be aware that C++ supports the concept of method overloading. A single method can be defined more than once, where each version accepts a different set of parameters. For example, take a look at the constructor for the QgsRectangle class—you will see that there are four different versions of this method. The first version accepts the four coordinates as floating point numbers:  The second version constructs a rectangle using two QgsPoint objects:  The third version copies the coordinates from QRectF (which is a Qt data type) into a QgsRectangle object:  The final version copies the coordinates from another QgsRectangle object:  The C++ compiler chooses the correct method to use based on the parameters that have been supplied. Python has no concept of method overloading; just choose the version of the method that accepts the parameters you want to supply, and the QGIS Python wrappers will automatically choose the correct method for you. If you keep these guidelines in mind, deciphering the C++ documentation for QGIS isn't all that hard. It just looks more complicated than it really is, thanks to all the complexity specific to C++. However, it won't take long for your brain to start filtering out C++ and use the QGIS reference documentation almost as easily as if it was written for Python rather than C++. Organization of the QGIS Python libraries Now that we can understand the C++-oriented documentation, let's see how the PyQGIS libraries are structured. All of the PyQGIS libraries are organized under a package named qgis. You wouldn't normally import qgis directly, however, as all the interesting libraries are subpackages within this main package; here are the five packages that make up the PyQGIS library: qgis.core This provides access to the core GIS functionality used throughout QGIS. 
qgis.gui This defines a range of GUI widgets that you can include in your own programs.
qgis.analysis This provides spatial analysis tools to analyze vector and raster format data.
qgis.networkanalysis This provides tools to build and analyze topologies.
qgis.utils This implements miscellaneous functions that allow you to work with the QGIS application using Python.

The first two packages (qgis.core and qgis.gui) implement the most important parts of the PyQGIS library, and it's worth spending some time to become more familiar with the concepts and classes they define. Now let's take a closer look at these two packages.

The qgis.core package

The qgis.core package defines fundamental classes used throughout the QGIS system. A large part of this package is dedicated to working with vector and raster format geospatial data, and displaying these types of data within a map. Let's take a look at how this is done.

Maps and map layers

A map consists of multiple layers drawn one on top of the other. There are three types of map layers supported by QGIS:

Vector layer: This layer draws geospatial features such as points, lines, and polygons
Raster layer: This layer draws raster (bitmapped) data onto a map
Plugin layer: This layer allows a plugin to draw directly onto a map

Each of these types of map layers has a corresponding class within the qgis.core library. For example, a vector map layer will be represented by an object of type qgis.core.QgsVectorLayer. We will take a closer look at vector and raster map layers shortly. Before we do this, though, we need to learn how geospatial data (both vector and raster data) is positioned on a map.

Coordinate reference systems

Since the Earth is a three-dimensional object, maps will only represent the Earth's surface as a two-dimensional plane, so there has to be a way of translating points on the Earth's surface into (x,y) coordinates within a map. This is done using a Coordinate Reference System (CRS). (Globe image courtesy Wikimedia: http://commons.wikimedia.org/wiki/File:Rotating_globe.gif.)

A CRS has two parts: an ellipsoid, which is a mathematical model of the Earth's surface, and a projection, which is a formula that converts points on the surface of the spheroid into (x,y) coordinates on a map. Generally you won't need to worry about all these details. You can simply select the appropriate CRS that matches the CRS of the data you are using. However, as there are many different coordinate reference systems that have been devised over the years, it is vital that you use the correct CRS when plotting your geospatial data. If you don't do this, your features will be displayed in the wrong place or have the wrong shape.

The majority of geospatial data available today uses the EPSG 4326 coordinate reference system (sometimes also referred to as WGS84). This CRS defines coordinates as latitude and longitude values. This is the default CRS used for new data imported into QGIS. However, if your data uses a different coordinate reference system, you will need to create and use a different CRS for your map layer.

The qgis.core.QgsCoordinateReferenceSystem class represents a CRS. Once you create your coordinate reference system, you can tell your map layer to use that CRS when accessing the underlying data. For example:

crs = QgsCoordinateReferenceSystem(4326,
          QgsCoordinateReferenceSystem.EpsgCrsId)
layer.setCrs(crs)

Note that different map layers can use different coordinate reference systems.
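For instance, if another layer's data was stored in the British National Grid (EPSG 27700) rather than WGS84, you could configure it in the same way; the other_layer variable name here is just a placeholder:

crs = QgsCoordinateReferenceSystem(27700,
          QgsCoordinateReferenceSystem.EpsgCrsId)
other_layer.setCrs(crs)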
Each layer will use its CRS when drawing the contents of the layer onto the map. Vector layers A vector layer draws geospatial data onto a map in the form of points, lines, polygons, and so on. Vector-format geospatial data is typically loaded from a vector data source such as a shapefile or database. Other vector data sources can hold vector data in memory, or load data from a web service across the internet. A vector-format data source has a number of features, where each feature represents a single record within the data source. The qgis.core.QgsFeature class represents a feature within a data source. Each feature has the following principles: ID: This is the feature's unique identifier within the data source. Geometry: This is the underlying point, line, polygon, and so on, which represents the feature on the map. For example, a data source representing cities would have one feature for each city, and the geometry would typically be either a point that represents the center of the city, or a polygon (or a multipolygon) that represents the city's outline. Attributes: These are key/value pairs that provide additional information about the feature. For example, a city data source might have attributes such as total_area, population, elevation, and so on. Attribute values can be strings, integers, or floating point numbers. In QGIS, a data provider allows the vector layer to access the features within the data source. The data provider, an instance of qgis.core.QgsVectorDataProvider, includes: A geometry type that is stored in the data source A list of fields that provide information about the attributes stored for each feature The ability to search through the features within the data source, using the getFeatures() method and the QgsFeatureRequest class You can access the various vector (and also raster) data providers by using the qgis.core.QgsProviderRegistry class. The vector layer itself is represented by a qgis.core.QgsVectorLayer object. Each vector layer includes: Data provider: This is the connection to the underlying file or database that holds the geospatial information to be displayed Coordinate reference system: This indicates which CRS the geospatial data uses Renderer: This chooses how the features are to be displayed Let's take a closer look at the concept of a renderer and how features are displayed within a vector map layer. Displaying vector data The features within a vector map layer are displayed using a combination of renderer and symbol objects. The renderer chooses the symbol to use for a given feature, and the symbol that does the actual drawing. There are three basic types of symbols defined by QGIS: Marker symbol: This displays a point as a filled circle Line symbol: This draws a line using a given line width and color Fill symbol: This draws the interior of a polygon with a given color These three types of symbols are implemented as subclasses of the qgis.core.QgsSymbolV2 class: qgis.core.QgsMarkerSymbolV2 qgis.core.QgsLineSymbolV2 qgis.core.QgsFillSymbolV2 Internally, symbols are rather complex, as "symbol layers" allow multiple elements to be drawn on top of each other. In most cases, however, you can make use of the "simple" version of the symbol. This makes it easier to create a new symbol without having to deal with the internal complexity of symbol layers. 
For example:

symbol = QgsMarkerSymbolV2.createSimple({'width' : 1.0,
                                         'color' : "255,0,0"})

While symbols draw the features onto the map, a renderer is used to choose which symbol to use to draw a particular feature. In the simplest case, the same symbol is used for every feature within a layer. This is called a single symbol renderer, and is represented by the qgis.core.QgsSingleSymbolRendererV2 class. Other possibilities include:

Categorized symbol renderer (qgis.core.QgsCategorizedSymbolRendererV2): This renderer chooses a symbol based on the value of an attribute. The categorized symbol renderer has a mapping from attribute values to symbols.
Graduated symbol renderer (qgis.core.QgsGraduatedSymbolRendererV2): This type of renderer has a range of attribute values, and maps each range to an appropriate symbol.

Using a single symbol renderer is very straightforward:

symbol = ...
renderer = QgsSingleSymbolRendererV2(symbol)
layer.setRendererV2(renderer)

To use a categorized symbol renderer, you first define a list of qgis.core.QgsRendererCategoryV2 objects, and then use that to create the renderer. For example:

symbol_male = ...
symbol_female = ...

categories = []
categories.append(QgsRendererCategoryV2("M", symbol_male, "Male"))
categories.append(QgsRendererCategoryV2("F", symbol_female, "Female"))
renderer = QgsCategorizedSymbolRendererV2("", categories)
renderer.setClassAttribute("GENDER")
layer.setRendererV2(renderer)

Notice that the QgsRendererCategoryV2 constructor takes three parameters: the desired value, the symbol used, and a label used to describe that category.

Finally, to use a graduated symbol renderer, you define a list of qgis.core.QgsRendererRangeV2 objects and then use that to create your renderer. For example:

symbol1 = ...
symbol2 = ...

ranges = []
ranges.append(QgsRendererRangeV2(0, 10, symbol1, "Range 1"))
ranges.append(QgsRendererRangeV2(11, 20, symbol2, "Range 2"))

renderer = QgsGraduatedSymbolRendererV2("", ranges)
renderer.setClassAttribute("FIELD")
layer.setRendererV2(renderer)

Accessing vector data

In addition to displaying the contents of a vector layer within a map, you can use Python to directly access the underlying data. This can be done using the data provider's getFeatures() method. For example, to iterate over all the features within the layer, you can do the following:

provider = layer.dataProvider()
for feature in provider.getFeatures(QgsFeatureRequest()):
  ...

If you want to search for features based on some criteria, you can use the QgsFeatureRequest object's setFilterExpression() method, and then pass the request to getFeatures(), as follows:

provider = layer.dataProvider()
request = QgsFeatureRequest()
request.setFilterExpression('"GENDER" = \'M\'')
for feature in provider.getFeatures(request):
  ...

Once you have the features, it's easy to get access to the feature's geometry, ID, and attributes. For example:

geometry = feature.geometry()
id = feature.id()
name = feature.attribute("NAME")

The object returned by the feature.geometry() call, which will be an instance of qgis.core.QgsGeometry, represents the feature's geometry. This object has a number of methods you can use to extract the underlying data and perform various geospatial calculations.

Spatial indexes

In the previous section, we searched for features based on their attribute values. There are times, though, when you might want to find features based on their position in space. For example, you might want to find all features that lie within a certain distance of a given point.
To do this, you can use a spatial index, which indexes features according to their location and extent. Spatial indexes are represented in QGIS by the QgsSpatialIndex class. For performance reasons, a spatial index is not created automatically for each vector layer. However, it's easy to create one when you need it:

provider = layer.dataProvider()
index = QgsSpatialIndex()
for feature in provider.getFeatures(QgsFeatureRequest()):
  index.insertFeature(feature)

Don't forget that you can use the QgsFeatureRequest.setFilterExpression() method to limit the set of features that get added to the index. Once you have the spatial index, you can use it to perform queries based on the position of the features. In particular:

You can find one or more features that are closest to a given point using the nearestNeighbor() method. For example:

features = index.nearestNeighbor(QgsPoint(long, lat), 5)

Note that this method takes two parameters: the desired point as a QgsPoint object and the number of features to return.

You can find all features that intersect with a given rectangular area by using the intersects() method, as follows:

features = index.intersects(QgsRectangle(left, bottom, right, top))

Raster layers

Raster-format geospatial data is essentially a bitmapped image, where each pixel or "cell" in the image corresponds to a particular part of the Earth's surface. Raster data is often organized into bands, where each band represents a different piece of information. A common use for bands is to store the red, green, and blue components of the pixel's color in separate bands. Bands might also represent other types of information, such as moisture level, elevation, or soil type. There are many ways in which raster information can be displayed. For example:

If the raster data only has one band, the pixel value can be used as an index into a palette. The palette maps each pixel value to a particular color.
If the raster data has only one band but no palette is provided, the pixel values can be used directly as grayscale values; that is, larger numbers are lighter and smaller numbers are darker. Alternatively, the pixel values can be passed through a pseudocolor algorithm to calculate the color to be displayed.
If the raster data has multiple bands, then typically, the bands would be combined to generate the desired color. For example, one band might represent the red component of the color, another band might represent the green component, and yet another band might represent the blue component. Alternatively, a multiband raster data source might be drawn using a palette, or as a grayscale or a pseudocolor image, by selecting a particular band to use for the color calculation.

Let's take a closer look at how raster data can be drawn onto the map.

Displaying raster data

The drawing style associated with the raster band controls how the raster data will be displayed. The following drawing styles are currently supported:

Drawing style Description
PalettedColor For a single band raster data source, a palette maps each raster value to a color.
SingleBandGray For a single band raster data source, the raster value is used directly as a grayscale value.
SingleBandPseudoColor For a single band raster data source, the raster value is used to calculate a pseudocolor.
PalettedSingleBandGray For a single band raster data source that has a palette, this drawing style tells QGIS to ignore the palette and use the raster value directly as a grayscale value.
PalettedSingleBandPseudoColor For a single band raster data source that has a palette, this drawing style tells QGIS to ignore the palette and use the raster value to calculate a pseudocolor. MultiBandColor For multiband raster data sources, use a separate band for each of the red, green, and blue color components. For this drawing style, the setRedBand(), setGreenBand(), and setBlueBand() methods can be used to choose which band to use for each color component. MultiBandSingleBandGray For multiband raster data sources, choose a single band to use as the grayscale color value. For this drawing style, use the setGrayBand() method to specify the band to use. MultiBandSingleBandPseudoColor For multiband raster data sources, choose a single band to use to calculate a pseudocolor. For this drawing style, use the setGrayBand() method to specify the band to use.  To set the drawing style, use the layer.setDrawingStyle() method, passing in a string that contains the name of the desired drawing style. You will also need to call the various setXXXBand() methods, as described in the preceding table, to tell the raster layer which bands contain the value(s) to use to draw each pixel. Note that QGIS doesn't automatically update the map when you call the preceding functions to change the way the raster data is displayed. To have your changes displayed right away, you'll need to do the following: Turn off raster image caching. This can be done by calling layer.setImageCache(None). Tell the raster layer to redraw itself, by calling layer.triggerRepaint(). Accessing raster data As with vector-format data, you can access the underlying raster data via the data provider's identify() method. The easiest way to do this is to pass in a single coordinate and retrieve the value or values of that coordinate. For example: provider = layer.dataProvider()values = provider.identify(QgsPoint(x, y),              QgsRaster.IdentifyFormatValue)if values.isValid():  for band,value in values.results().items():    ... As you can see, you need to check whether the given coordinate exists within the raster data (using the isValid() call). The values.results() method returns a dictionary that maps band numbers to values. Using this technique, you can extract all the underlying data associated with a given coordinate within the raster layer. You can also use the provider.block() method to retrieve the band data for a large number of coordinates all at once. We will look at how to do this later in this article. Other useful qgis.core classes Apart from all the classes and functionality involved in working with data sources and map layers, the qgis.core library also defines a number of other classes that you might find useful: Class Description QgsProject This represents the current QGIS project. Note that this is a singleton object, as only one project can be open at a time. The QgsProject class is responsible for loading and storing properties, which can be useful for plugins. QGis This class defines various constants, data types, and functions used throughout the QGIS system. QgsPoint This is a generic class that stores the coordinates for a point within a two-dimensional plane. QgsRectangle This is a generic class that stores the coordinates for a rectangular area within a two-dimensional plane. QgsRasterInterface This is the base class to use for processing raster data. 
This can be used, to reproject a set of raster data into a new coordinate system, to apply filters to change the brightness or color of your raster data, to resample the raster data, and to generate new raster data by rendering the existing data in various ways. QgsDistanceArea This class can be used to calculate distances and areas for a given geometry, automatically converting from the source coordinate reference system into meters. QgsMapLayerRegistry This class provides access to all the registered map layers in the current project. QgsMessageLog This class provides general logging features within a QGIS program. This lets you send debugging messages, warnings, and errors to the QGIS "Log Messages" panel.  The qgis.gui package The qgis.gui package defines a number of user-interface widgets that you can include in your programs. Let's start by looking at the most important qgis.gui classes, and follow this up with a brief look at some of the other classes that you might find useful. The QgisInterface class QgisInterface represents the QGIS system's user interface. It allows programmatic access to the map canvas, the menu bar, and other parts of the QGIS application. When running Python code within a script or a plugin, or directly from the QGIS Python console, a reference to QgisInterface is typically available through the iface global variable. The QgisInterface object is only available when running the QGIS application itself. If you are running an external application and import the PyQGIS library into your application, QgisInterface won't be available. Some of the more important things you can do with the QgisInterface object are: Get a reference to the list of layers within the current QGIS project via the legendInterface() method. Get a reference to the map canvas displayed within the main application window, using the mapCanvas() method. Retrieve the currently active layer within the project, using the activeLayer() method, and set the currently active layer by using the setActiveLayer() method. Get a reference to the application's main window by calling the mainWindow() method. This can be useful if you want to create additional Qt windows or dialogs that use the main window as their parent. Get a reference to the QGIS system's message bar by calling the messageBar() method. This allows you to display messages to the user directly within the QGIS main window. The QgsMapCanvas class The map canvas is responsible for drawing the various map layers into a window. The QgsMapCanvas class represents a map canvas. This class includes: A list of the currently shown map layers. This can be accessed using the layers() method. Note that there is a subtle difference between the list of map layers available within the map canvas and the list of map layers included in the QgisInterface.legendInterface() method. The map canvas's list of layers only includes the list of layers currently visible, while QgisInterface.legendInterface() returns all the map layers, including those that are currently hidden.  The map units used by this map (meters, feet, degrees, and so on). The map's units can be retrieved by calling the mapUnits() method. An extent,which is the area of the map that is currently shown within the canvas. The map's extent will change as the user zooms in and out, and pans across the map. The current map extent can be obtained by calling the extent() method. A current map tool that controls the user's interaction with the contents of the map canvas. 
The current map tool can be set using the setMapTool() method, and you can retrieve the current map tool (if any) by calling the mapTool() method. A background color used to draw the background behind all the map layers. You can change the map's background color by calling the canvasColor() method. A coordinate transform that converts from map coordinates (that is, coordinates in the data source's coordinate reference system) to pixels within the window. You can retrieve the current coordinate transform by calling the getCoordinateTransform() method. The QgsMapCanvasItem class A map canvas item is an item drawn on top of the map canvas. The map canvas item will appear in front of the map layers. While you can create your own subclass of QgsMapCanvasItem if you want to draw custom items on top of the map canvas, it would be more useful for you to make use of an existing subclass that will do the work for you. There are currently three subclasses of QgsMapCanvasItem that you might find useful: QgsVertexMarker: This draws an icon (an "X", a "+", or a small box) centered around a given point on the map. QgsRubberBand: This draws an arbitrary polygon or polyline onto the map. It is intended to provide visual feedback as the user draws a polygon onto the map. QgsAnnotationItem: This is used to display additional information about a feature, in the form of a balloon that is connected to the feature. The QgsAnnotationItem class has various subclasses that allow you to customize the way the information is displayed. The QgsMapTool class A map tool allows the user to interact with and manipulate the map canvas, capturing mouse events and responding appropriately. A number of QgsMapTool subclasses provide standard map interaction behavior such as clicking to zoom in, dragging to pan the map, and clicking on a feature to identify it. You can also create your own custom map tools by subclassing QgsMapTool and implementing the various methods that respond to user-interface events such as pressing down the mouse button, dragging the canvas, and so on. Once you have created a map tool, you can allow the user to activate it by associating the map tool with a toolbar button. Alternatively, you can activate it from within your Python code by calling the mapCanvas.setMapTool(...) method. We will look at the process of creating your own custom map tool in the section Using the PyQGIS library in the following table: Other useful qgis.gui classes While the qgis.gui package defines a large number of classes, the ones you are most likely to find useful are given in the following table: Class Description QgsLegendInterface This provides access to the map legend, that is, the list of map layers within the current project. Note that map layers can be grouped, hidden, and shown within the map legend. QgsMapTip This displays a tip on a map canvas when the user holds the mouse over a feature. The map tip will show the display field for the feature; you can set this by calling layer.setDisplayField("FIELD"). QgsColorDialog This is a dialog box that allows the user to select a color. QgsDialog This is a generic dialog with a vertical box layout and a button box, making it easy to add content and standard buttons to your dialog. QgsMessageBar This is a user interface widget for displaying non-blocking messages to the user. We looked at the message bar class in the previous article. QgsMessageViewer This is a generic class that displays long messages to the user within a modal dialog. 
QgsBlendModeComboBox QgsBrushStyleComboBox QgsColorRampComboBox QgsPenCapStyleComboBox QgsPenJoinStyleComboBox QgsScaleComboBox These QComboBox user-interface widgets allow you to prompt the user for various drawing options. With the exception of the QgsScaleComboBox, which lets the user choose a map scale, all the other QComboBox subclasses let the user choose various Qt drawing options.

Using the PyQGIS library

In the previous section, we looked at a number of classes provided by the PyQGIS library. Let's make use of these classes to perform some real-world geospatial development tasks.

Analyzing raster data

We're going to start by writing a program to load in some raster-format data and analyze its contents. To make this more interesting, we'll use a Digital Elevation Model (DEM) file, which is a raster format data file that contains elevation data. The Global Land One-Kilometer Base Elevation Project (GLOBE) provides free DEM data for the world, where each pixel represents one square kilometer of the Earth's surface. GLOBE data can be downloaded from http://www.ngdc.noaa.gov/mgg/topo/gltiles.html. Download the E tile, which includes the western half of the USA. The resulting file, which is named e10g, contains the height information you need. You'll also need to download the e10g.hdr header file so that QGIS can read the file; you can download this from http://www.ngdc.noaa.gov/mgg/topo/elev/esri/hdr. Once you've downloaded these two files, put them together into a convenient directory.

You can now load the DEM data into QGIS using the following code:

registry = QgsProviderRegistry.instance()
provider = registry.provider("gdal", "/path/to/e10g")

Unfortunately, there is a slight complexity here. Since QGIS doesn't know which coordinate reference system is used for the data, it displays a dialog box that asks you to choose the CRS. Since the GLOBE DEM data is in the WGS84 CRS, which QGIS uses by default, this dialog box is redundant. To disable it, you need to add the following to the top of your program:

from PyQt4.QtCore import QSettings
QSettings().setValue("/Projections/defaultBehaviour", "useGlobal")

Now that we've loaded our raster DEM data into QGIS, we can analyze it. There are lots of things we can do with DEM data, so let's calculate how often each unique elevation value occurs within the data. Notice that we're loading the DEM data directly using QgsRasterDataProvider. We don't want to display this information on a map, so we don't want (or need) to load it into QgsRasterLayer.

Since the DEM data is in a raster format, you need to iterate over the individual pixels or cells to get each height value. The provider.xSize() and provider.ySize() methods tell us how many cells are in the DEM, while the provider.extent() method gives us the area of the Earth's surface covered by the DEM. Using this information, we can extract the individual elevation values from the contents of the DEM in the following way:

raster_extent = provider.extent()
raster_width = provider.xSize()
raster_height = provider.ySize()
block = provider.block(1, raster_extent, raster_width, raster_height)

The returned block variable is an object of type QgsRasterBlock, which is essentially a two-dimensional array of values. Let's iterate over the raster and extract the individual elevation values:

for x in range(raster_width):
  for y in range(raster_height):
    elevation = block.value(x, y)
    ...

Now that we've loaded the individual elevation values, it's easy to build a histogram out of those values.
Here is the entire program to load the DEM data into memory, and calculate and display the histogram:

from PyQt4.QtCore import QSettings

QSettings().setValue("/Projections/defaultBehaviour", "useGlobal")

registry = QgsProviderRegistry.instance()
provider = registry.provider("gdal", "/path/to/e10g")

raster_extent = provider.extent()
raster_width = provider.xSize()
raster_height = provider.ySize()
no_data_value = provider.srcNoDataValue(1)

histogram = {} # Maps elevation to number of occurrences.

block = provider.block(1, raster_extent, raster_width,
                       raster_height)
if block.isValid():
  for x in range(raster_width):
    for y in range(raster_height):
      elevation = block.value(x, y)
      if elevation != no_data_value:
        try:
          histogram[elevation] += 1
        except KeyError:
          histogram[elevation] = 1

for height in sorted(histogram.keys()):
  print height, histogram[height]

Note that we've added a no data value check to the code. Raster data often includes pixels that have no value associated with them. In the case of a DEM, elevation data is only provided for areas of land; pixels over the sea have no elevation, and we have to exclude them, or our histogram will be inaccurate.

Manipulating vector data and saving it to a shapefile

Let's create a program that takes two vector data sources, subtracts one set of vectors from the other, and saves the resulting geometries into a new shapefile. Along the way, we'll learn a few important things about the PyQGIS library. We'll be making use of the QgsGeometry.difference() function. This function performs a geometrical subtraction of one geometry from another.

Let's start by asking the user to select the first shapefile and open up a vector data provider for that file:

filename_1 = QFileDialog.getOpenFileName(iface.mainWindow(),
                                         "First Shapefile",
                                         "~", "*.shp")
if not filename_1:
  return

registry = QgsProviderRegistry.instance()
provider_1 = registry.provider("ogr", filename_1)

We can then read the geometries from that file into memory:

geometries_1 = []
for feature in provider_1.getFeatures(QgsFeatureRequest()):
  geometries_1.append(QgsGeometry(feature.geometry()))

This last line of code does something very important that may not be obvious at first. Notice that we use the following:

QgsGeometry(feature.geometry())

We use the preceding line instead of the following:

feature.geometry()

This creates a new instance of the QgsGeometry object, copying the geometry into a new object, rather than just adding the existing geometry object to the list. We have to do this because of a limitation of the way the QGIS Python wrappers work: the feature.geometry() method returns a reference to the geometry, but the C++ code doesn't know that you are storing this reference away in your Python code. So, when the feature is no longer needed, the memory used by the feature's geometry is also released. If you then try to access that geometry later on, the entire QGIS system will crash. To get around this, we make a copy of the geometry so that we can refer to it even after the feature's memory has been released.
Now that we've loaded our first set of geometries into memory, let's do the same for the second shapefile: filename_2 = QFileDialog.getOpenFileName(iface.mainWindow(),                     "Second Shapefile",                     "~", "*.shp")if not filename_2:  return provider_2 = registry.provider("ogr", filename_2) geometries_2 = []for feature in provider_2.getFeatures(QgsFeatureRequest()):  geometries_2.append(QgsGeometry(feature.geometry())) With the two sets of geometries loaded into memory, we're ready to start subtracting one from the other. However, to make this process more efficient, we will combine the geometries from the second shapefile into one large geometry, which we can then subtract all at once, rather than subtracting one at a time. This will make the subtraction process much faster: combined_geometry = Nonefor geometry in geometries_2:  if combined_geometry == None:    combined_geometry = geometry  else:    combined_geometry = combined_geometry.combine(geometry) We can now calculate the new set of geometries by subtracting one from the other: dst_geometries = []for geometry in geometries_1:  dst_geometry = geometry.difference(combined_geometry)  if not dst_geometry.isGeosValid(): continue  if dst_geometry.isGeosEmpty(): continue  dst_geometries.append(dst_geometry) Notice that we check to ensure that the destination geometry is mathematically valid and isn't empty. Invalid geometries are a common problem when manipulating complex shapes. There are options for fixing them, such as splitting apart multi-geometries and performing a buffer operation.  Our last task is to save the resulting geometries into a new shapefile. We'll first ask the user for the name of the destination shapefile: dst_filename = QFileDialog.getSaveFileName(iface.mainWindow(),                      "Save results to:",                      "~", "*.shp")if not dst_filename:  return We'll make use of a vector file writer to save the geometries into a shapefile. Let's start by initializing the file writer object: fields = QgsFields()writer = QgsVectorFileWriter(dst_filename, "ASCII", fields,               dst_geometries[0].wkbType(),               None, "ESRI Shapefile")if writer.hasError() != QgsVectorFileWriter.NoError:  print "Error!"  return We don't have any attributes in our shapefile, so the fields list is empty. Now that the writer has been set up, we can save the geometries into the file: for geometry in dst_geometries:  feature = QgsFeature()  feature.setGeometry(geometry)  writer.addFeature(feature) Now that all the data has been written to the disk, let's display a message box that informs the user that we've finished: QMessageBox.information(iface.mainWindow(), "",            "Subtracted features saved to disk.") As you can see, creating a new shapefile is very straightforward in PyQGIS, and it's easy to manipulate geometries using Python—just so long as you copy QgsGeometry you want to keep around. If your Python code starts to crash while manipulating geometries, this is probably the first thing you should look for. Using different symbols for different features within a map Let's use World Borders Dataset that you downloaded in the previous article to draw a world map, using different symbols for different continents. This is a good example of using a categorized symbol renderer, though we'll combine it into a script that loads the shapefile into a map layer, and sets up the symbols and map renderer to display the map exactly as you want it. We'll then save this map as an image. 
Let's start by creating a map layer to display the contents of the World Borders Dataset shapefile:

layer = iface.addVectorLayer("/path/to/TM_WORLD_BORDERS-0.3.shp",
                             "continents", "ogr")

Each unique region code in the World Borders Dataset shapefile corresponds to a continent. We want to define the name and color to use for each of these regions, and use this information to set up the various categories to use when displaying the map:

from PyQt4.QtGui import QColor

categories = []
for value,color,label in [(0,   "#660000", "Antarctica"),
                          (2,   "#006600", "Africa"),
                          (9,   "#000066", "Oceania"),
                          (19,  "#660066", "The Americas"),
                          (142, "#666600", "Asia"),
                          (150, "#006666", "Europe")]:
  symbol = QgsSymbolV2.defaultSymbol(layer.geometryType())
  symbol.setColor(QColor(color))
  categories.append(QgsRendererCategoryV2(value, symbol, label))

With these categories set up, we simply update the map layer to use a categorized renderer based on the value of the region attribute, and then redraw the map:

layer.setRendererV2(QgsCategorizedSymbolRendererV2("region", categories))
layer.triggerRepaint()

There's only one more thing to do; since this is a script that can be run multiple times, let's have our script automatically remove the existing continents layer, if it exists, before adding a new one. To do this, we can add the following to the start of our script:

layer_registry = QgsMapLayerRegistry.instance()
for layer in layer_registry.mapLayersByName("continents"):
  layer_registry.removeMapLayer(layer.id())

When our script is run, it will create one (and only one) layer that shows the various continents in different colors. These will appear as different shades of gray in the printed article, but the colors will be visible on the computer screen.
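The previous section also promised to save the map as an image. The listings above stop at drawing the layer; one straightforward way to finish the job, assuming the layer has been rendered in the QGIS map canvas and that your QGIS 2.x build provides QgsMapCanvas.saveAsImage(), is to ask the canvas to export its current contents. The output path below is only a placeholder:

# Export whatever is currently drawn in the map canvas to a PNG file.
# The output path is hypothetical; change it to suit your environment.
iface.mapCanvas().saveAsImage("/path/to/continents.png")

Now, let's use the same dataset to color each country based on its relative population.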
We'll start by removing the existing population layer, if it exists: layer_registry = QgsMapLayerRegistry.instance()for layer in layer_registry.mapLayersByName("population"):  layer_registry.removeMapLayer(layer.id()) Next, we open the World Borders Dataset into a new layer called "population": layer = iface.addVectorLayer("/path/to/TM_WORLD_BORDERS-0.3.shp",               "population", "ogr") We then need to set up our various population ranges: from PyQt4.QtGui import QColorranges = []for min_pop,max_pop,color in [(0,        99999,     "#332828"),                              (100000,   999999,    "#4c3535"),                              (1000000,  4999999,   "#663d3d"),                              (5000000,  9999999,   "#804040"),                              (10000000, 19999999,  "#993d3d"),                              (20000000, 49999999,  "#b33535"),                              (50000000, 999999999, "#cc2828")]:  symbol = QgsSymbolV2.defaultSymbol(layer.geometryType())  symbol.setColor(QColor(color))  ranges.append(QgsRendererRangeV2(min_pop, max_pop,                   symbol, "")) Now that we have our population ranges and their associated colors, we simply set up a graduated symbol renderer to choose a symbol based on the value of the pop2005 attribute, and tell the map to redraw itself: layer.setRendererV2(QgsGraduatedSymbolRendererV2("pop2005",                         ranges))layer.triggerRepaint() The result will be a map layer that shades each country according to its population:  Calculating the distance between two user-defined points  In our final example of using the PyQGIS library, we'll write some code that, when run, starts listening for mouse events from the user. If the user clicks on a point, drags the mouse, and then releases the mouse button again, we will display the distance between those two points. This is a good example of how to add your  own map interaction logic to QGIS, using the QgsMapTool class. This is the basic structure for our QgsMapTool subclass: class DistanceCalculator(QgsMapTool):  def __init__(self, iface):    QgsMapTool.__init__(self, iface.mapCanvas())    self.iface = iface   def canvasPressEvent(self, event):    ...   def canvasReleaseEvent(self, event):    ... To make this map tool active, we'll create a new instance of it and pass it to the mapCanvas.setMapTool() method. Once this is done, our canvasPressEvent() and canvasReleaseEvent() methods will be called whenever the user clicks or releases the mouse button over the map canvas. Let's start with the code that handles the user clicking on the canvas. In this method, we're going to convert from the pixel coordinates that the user clicked on to the map coordinates (that is, a latitude and longitude value). We'll then remember these coordinates so that we can refer to them later. Here is the necessary code: def canvasPressEvent(self, event):  transform = self.iface.mapCanvas().getCoordinateTransform()  self._startPt = transform.toMapCoordinates(event.pos().x(),                        event.pos().y()) When the canvasReleaseEvent() method is called, we'll want to do the same with the point at which the user released the mouse button: def canvasReleaseEvent(self, event):  transform = self.iface.mapCanvas().getCoordinateTransform()  endPt = transform.toMapCoordinates(event.pos().x(),                    event.pos().y()) Now that we have the two desired coordinates, we'll want to calculate the distance between them. 
We can do this using a QgsDistanceArea object:

  crs = self.iface.mapCanvas().mapRenderer().destinationCrs()
  distance_calc = QgsDistanceArea()
  distance_calc.setSourceCrs(crs)
  distance_calc.setEllipsoid(crs.ellipsoidAcronym())
  distance_calc.setEllipsoidalMode(crs.geographicFlag())
  distance = distance_calc.measureLine([self._startPt, endPt]) / 1000

Notice that we divide the resulting value by 1000. This is because the QgsDistanceArea object returns the distance in meters, and we want to display the distance in kilometers. Finally, we'll display the calculated distance in the QGIS message bar:

  messageBar = self.iface.messageBar()
  messageBar.pushMessage("Distance = %d km" % distance,
                         level=QgsMessageBar.INFO,
                         duration=2)

Now that we've created our map tool, we need to activate it. We can do this by adding the following to the end of our script:

calculator = DistanceCalculator(iface)
iface.mapCanvas().setMapTool(calculator)

With the map tool activated, the user can click and drag on the map. When the mouse button is released, the distance (in kilometers) between the two points will be displayed in the message bar.

Summary

In this article, we took an in-depth look at the PyQGIS libraries and how you can use them in your own programs. We learned that the QGIS Python libraries are implemented as wrappers around the QGIS APIs implemented in C++. We saw how Python programmers can understand and work with the QGIS reference documentation, even though it is written for C++ developers. We also looked at the way the PyQGIS libraries are organized into different packages, and learned about the most important classes defined in the qgis.core and qgis.gui packages.

We then saw how a coordinate reference system (CRS) is used to translate from points on the three-dimensional surface of the Earth to coordinates within a two-dimensional map plane. We learned that vector-format data is made up of features, where each feature has an ID, a geometry, and a set of attributes, and that symbols are used to draw vector geometries onto a map layer, while renderers are used to choose which symbol to use for a given feature. We learned how a spatial index can be used to speed up access to vector features.

Next, we saw how raster-format data is organized into bands that represent information such as color, elevation, and so on, and looked at the various ways in which a raster data source can be displayed within a map layer. Along the way, we learned how to access the contents of a raster data source. Finally, we looked at various techniques for performing useful tasks using the PyQGIS library.

In the next article, we will learn more about QGIS Python plugins, and then go on to use the plugin architecture as a way of implementing a useful feature within a mapping application.

Resources for Article:

Further resources on this subject: QGIS Feature Selection Tools [article] Server Logs [article]
article-image-building-middle-tier
Packt
23 Dec 2014
34 min read

Building the Middle-Tier

In this article by Kerri Shotts , the author of the book PhoneGap for Enterprise covered how to build a web server that bridges the gap between our database backend and our mobile application. If you browse any Cordova/PhoneGap forum, you'll often come across posts asking how to connect to and query a backend database. In this article, we will look at the reasons why it is necessary to interact with your backend database using an intermediary service. If the business logic resides within the database, the middle-tier might be a very simple layer wrapping the data store, but it can also implement a significant portion of business logic as well. The middle-tier also usually handles session authentication logic. Although many enterprise projects will already have a middle-tier in place, it's useful to understand how a middle-tier works, and how to implement one if you ever need to build a solution from the ground up. In this article, we'll focus heavily on these topics: Typical middle-tier architecture Designing a RESTful-like API Implementing a RESTful-like hypermedia API using Node.js Connecting to the backend database Executing queries Handling authentication using Passport Building API handlers You are welcome to implement your middle-tier using any technology with which you are comfortable. The topics that we will cover in this article can be applied to any middle-tier platform. Middle-tier architecture It's tempting, especially for simple applications, to have the desire to connect your mobile app directly to your data store. This is an incredibly bad idea, which means your data store is vulnerable and exposed to attacks from the outside world (unless you require the user to log in to a VPN). It also means that your mobile app has a lot of code dedicated solely to querying your data store, which makes for a tightly coupled environment. If you ever want to change your database platform or modify the table structures, you will need to update the app, and any app that wasn't updated will stop working. Furthermore, if you want another system to access the data, for example, a reporting solution, you will need to repeat the same queries and logic already implemented in your app in order to ensure consistency. For these reasons alone, it's a bad idea to directly connect your mobile app to your backend database. However, there's one more good reason: Cordova has no nonlocal database drivers whatsoever. Although it's not unusual for a desktop application to make a direct connection to your database on an internal network, Cordova has no facility to load a database driver to interface directly with an Oracle or MySQL database. This means that you must build an intermediary service to bridge the gap from your database backend to your mobile app. No middle-tier is exactly the same, but for web and mobile apps, this intermediary service—also called an application server—is typically a relatively simple web server. This server accepts incoming requests from a client (our mobile app or a website), processes them, and returns the appropriate results. In order to do so, the web server parses these requests using a variety of middleware (security, session handling, cookie handling, request parsing, and so on) and then executes the appropriate request handler for the request. This handler then needs to pass this request on to the business logic handler, which, in our case, lives on the database server. 
The business logic will determine how to react to the request and returns the appropriate data to the request handler. The request handler transforms this data into something usable by the client, for example, JSON or XML, and returns it to the client. The middle-tier provides an Application Programming Interface (API). Beyond authentication and session handling, the middle-tier provides a set of reusable components that perform specific tasks by delegating these tasks to lower tiers. As an example, one of the components of our Tasker app is named get-task-comments. Provided the user is properly authenticated, the component will request a specific task from the business logic and return the attached comments. Our mobile app (or any other consumer) only needs to know how to call get-task-comments. This decouples the client from the database and ensures that we aren't unnecessarily repeating code. The flow of request and response looks a lot like the following figure: Designing a RESTful-like API A mobile app interfaces with your business logic and data store via an API provided by the application server middle-tier. Exactly how this API is implemented and how the client uses it is up to the developers of the system. In the past, this has often meant using web services (over HTTP) with information interchange via Simple Object Access Protocol (SOAP). Recently, RESTful APIs have become the norm when working with web and mobile applications. These APIs conform to the following constraints: Client/Server: Clients are not concerned with how data is stored, (that's the server's job), and servers are not concerned with state (that's the client's job). They should be able to be developed and/or replaced completely independently of each other (low coupling) as long as the API remains the same. Stateless: Each request should have the necessary information contained within it so that the server can properly handle the request. The server isn't concerned about session states; this is the sole domain of the client. Cacheable: Responses must specify if they can be cached or not. Proper management of this can greatly improve performance and scalability. Layered: The client shouldn't be able to tell if there are any intermediary servers between it and the server. This ensures that additional servers can be inserted into the chain to provide caching, security, load balancing, and so on. Code-on-demand: This is an optional constraint. The server can send the necessary code to handle the response to the client. For a mobile PhoneGap app, this might involve sending a small snippet of JavaScript, for example, to handle how to display and interact with a Facebook post. Uniform Interface: Resources are identified by a Uniform Resource Identifier (URI), for example, https://pge-as.example.com/task/21 refers to the task with an identifier of 21. These resources can be expressed in any number of formats to facilitate data interchange. Furthermore, when the client has the resource (in whatever representation it is provided), the client should also have enough information to manipulate the resource. Finally, the representation should indicate valid state transitions by providing links that the client can use to navigate the state tree of the system. There are many good web APIs in production, but often they fail to address the last constraint very well. 
They might represent resources using URIs, but typically the client is expected to know all the endpoints of the API and how to transition between them without the server telling the client how to do so. This means that the client is tightly coupled to the API. If the URIs or the API change, then the client breaks. RESTful APIs should instead provide all the valid state transitions with each response. This lets the client reduce its coupling by looking for specific actions rather than assuming that a specific URI request will work. Properly implemented, the underlying URIs could change and the client app would be unaffected. The only thing that needs to be constant is the entry URI to the API. There are many good examples of these kinds of APIs, PayPal's is quite good as are many others. The responses from these APIs always contain enough information for the client to advance to the next state in the chain. So in the case of PayPal, a response will always contain enough information to advance to the next step of the monetary transaction. Because the response contains this information, the client only needs to look at the response rather than having the URI of the next step hardcoded. RESTful APIs aren't standardized; one API might provide links to the next state in one format, while another API might use a different format. That said, there are several attempts to create a standard response format, Collection+JSON is just one example. The lack of standardization in the response format isn't as bad as it sounds; the more important issue is that as long as your app understands the response format, it can be decoupled from the URI structure of your API and its resources. The API becomes a list of methods with explicit transitions rather than a list of URIs alone. As long as the action names remain the same, the underlying URIs can be changed without affecting the client. This works well when it comes to most APIs where authorization is provided using an API key or an encoded token. For example, an API will often require authorization via OAuth 2.0. Your code asks for the proper authorization first, and upon each subsequent request, it passes an appropriate token that enables access to the requested resource. Where things become problematic, and why we're calling our API RESTful-like, is when it comes to the end user authentication. Whether the user of our mobile app recognizes it or not, they are an immediate consumer of our API. Because the data itself is protected based upon the roles and access of each particular user, users must authenticate themselves prior to accessing any data. When an end user is involved with authentication, the idea of sessions is inevitably required largely for the end user's convenience. Some sessions can be incredibly short-lived, for example, many banks will terminate a session if no activity is seen for 10 minutes, while others can be long-lived, and others might even be effectively eternal until explicitly revoked by the user. Regardless of the session length, the fact that a session is present indicates that the server must often store some information about state. Even if this information applies only to the user's authentication and session validity, it still violates the second rule of RESTful APIs. Tasker's web API, then, is a RESTful-like API. In everything except session handling and authentication, our API is like any other RESTful API. 
However, when it comes to authentication, the server maintains some state in order to ensure that users are properly authenticated. In the case of Tasker, the maintained state is limited. Once a user authenticates, a unique single-use token and an Hash Message Authentication Code (HMAC) secret are generated and returned to the client. This token is expected to be sent with the next API request and this request is expected to be signed with the HMAC secret. Upon completion of this API request, a new token is generated. Each token expires after a specified amount of time, or can be expired immediately by an explicit logout. Each token is stored in the backend, which means we violate the stateless rule. Our tokens are just a cryptographically random series of bytes, and because of this, there's nothing in the token that can be used to identify the user. This means we need to maintain the valid tokens and their user associations in the database. If the token contained user-identifiable information, we could technically avoid maintaining state, but this also means that the token could be forged if the attacker knew how tokens were constructed. A random token, on the other hand, means that there's no method of construction that can fool the server; the attacker will have to be very lucky to guess it right. Since Tasker's tokens are continually expiring after a short period of time and are continually regenerated upon each request, guessing a token is that much more difficult. Of course, it's not impossible for an attacker to get lucky and guess the right token on the first try, but considering the amount of entropy in most usernames and passwords, it's more likely that the attacker could guess the user's password than they could guess the correct token. Because these tokens are managed by the backend, our Tasker's API isn't truly stateless, and so it's not truly RESTful, hence the term RESTful-like. If you want to implement your API as a pure RESTful API, feel free. If your API is like that of many other APIs (such as Twitter, PayPal, Facebook, and so on), you'll probably want to do so. All this sounds well and good, but how should we go about designing and defining our API? Here's how I suggest going about it: Identify the resources. In Tasker, the resources are people, tasks, and task comments. Essentially, these are the data models. (If you take security into account, Tasker also has user and role resources in addition to sessions.) Define how the URI should represent the resource. For example, Bob Smith might be represented by /person/bob-smith or /person/29481. Query parameters are also acceptable: /person?administeredBy=john-doe will refer to the set of all individuals who have John Doe as their administrator. If this helps, think of each instance of a resource and each collection of these resources as web pages each having their own URL. Identify the actions that can be performed for each resource. For example, a task can be created and modified by the owner of the task. This task can be assigned to another user. A task's status and progress can be updated by both the owner and the assignee. With RESTful APIs, these actions are typically handled by using the HTTP verbs (also known as methods) GET, POST, PUT, and DELETE. Others can also be used, such as OPTIONS, PATCH, and so on. We'll cover in a moment how these usually line up against typical Create, Read, Update, Delete (CRUD) operations. Identify the state transitions that are valid for resources. 
As an example, a client's first steps might be to request a list of all tasks assigned to a particular user. As part of the response, it should be given URIs that indicate how the app should retrieve information about a particular task. Furthermore, within this single task's response, there should be information that tells the client how to modify the task.

Most APIs generally mirror the typical CRUD operations. The following is how the HTTP verbs line up against the familiar CRUD counterparts for a collection of items:

GET (READ): This returns the collection of items in the desired format. Often can be filtered and sorted via query parameters.
POST (CREATE): This creates an item within the collection. The return result includes the URI for the new resource.
DELETE (N/A): This is not typically used at the collection level, unless one wants to remove the entire collection.
PUT (N/A): This is not typically used at the collection level, though it can be used to update/replace each item in the collection.

The same verbs are used for items within a collection:

GET (READ): This returns a specific item, given the ID.
POST (N/A): This is not typically used at the item level.
DELETE (DELETE): This deletes a specific item, given the ID.
PUT (UPDATE): This updates an existing item. Sometimes PATCH is used to update only specific properties of the item.

Here's an example of a state transition diagram for a portion of the Tasker API along with the corresponding HTTP verbs. Now that we've determined the states and the valid transitions, we're ready to start modeling the API and the responses it should generate. This is particularly useful before you start coding, as one will often notice issues with the API during this phase, and it's far easier to fix them now rather than after a lot of code has been written (or worse, after the API is in production).

How you model your API is up to you. If you want to create a simple text document that describes the various requests and expected responses, that's fine. You can also use any number of tools that aid in modeling your API. Some even allow you to provide mock responses for testing. Some of these are identified as follows:

RAML (http://raml.org): This is a markup language to model RESTful-like APIs. You can build API models using any text editor, but there is also an API designer online.
Apiary (http://apiary.io): Apiary uses a markdown-like language (API blueprint) to model APIs. If you're familiar with markdown, you shouldn't have much trouble using this service. API mocking and automated testing are also provided.
Swagger (http://swagger.io): This is similar to RAML, where it uses YAML as the modeling language. Documentation and client code can be generated directly from the API model.

Building our API using Node.js

In this section, we'll cover connecting our web service to our Oracle database, handling user authentication and session management using Passport, and defining handlers for state transitions. You'll definitely want to take a look at the /tasker-srv directory in the code package for this book, which contains the full web server for Tasker. In the following sections, we've only highlighted some snippets of the code.

Connecting to the backend database

Node.js's community has provided a large number of database drivers, so chances are good that whatever your backend, Node.js has a driver available for it.
In our example app, we're using an Oracle database as the backend, which means we'll be using the oracle driver (https://www.npmjs.org/package/oracle). Connecting to the database is actually pretty easy, the following code shows how: var oracle = require("oracle"); oracle.connect ( { hostname: "localhost", port: 1521, database: "xe", user: "tasker", password: "password" }, function (err, client) { if (err) { /* error; return or next(err) */ } /* query the database; when done call client.close() */ }); In the real world, a development version of our server will be using a test database, and a production version of our server will use the production database. To facilitate this, we made the connection information configurable. The /config/development.json and /config/production.json files contain connection information, and the main code simply requests the configuration information when making a connection, the following code line is used to get the configuration information: oracle.connect ( config.get ( "oracle" ), … ); Since we're talking about the real world, we also need to recognize that database connections are slow and they need to be pooled in order to improve performance as well as permit parallel execution. To do this, we added the generic-pool NPM module (https://www.npmjs.org/package/generic-pool) and added the following code to app.js: var clientPool = pool.Pool( { name: "oracle", create: function ( cb ) {    return new oracle.connect( config.get("oracle"),      function ( err, client ) {        cb ( err, client );      }    ) }, destroy: function ( client ) {    try {      client.close();    } catch (err) {      // do nothing, but if we don't catch the error,      // the server crashes    } }, max: 5, min: 1, idleTimeoutMillis: 30000 }); Because our pool will always contain at least one connection, we need to ensure that when the process exits, the pool is properly drained, as follows: process.on("exit", function () { clientPool.drain( function () {    clientPool.destroyAllNow(); }); }); On its own, this doesn't do much yet. We need to ensure that the pool is available to the entire app: app.set ( "client-pool", clientPool ); Executing queries We've built our business logic in the Oracle database using PL/SQL stored procedures and functions. In PL/SQL, functions can return table-like structures. While this is similar in concept to a view, writing a function using PL/SQL provides us more flexibility. As such, our queries won't actually be talking to the base tables, they'll be talking to functions that return results based on the user's authorization. This means that we don't need additional conditions in a WHERE clause to filter based on the user's authorization, which helps eliminate code duplication. Regardless of the previous statement, executing queries and stored procedures is done using the same method, that is execute. Before we can execute anything, we need to first acquire a client connection from the pool. To this end, we added a small set of database utility methods; you can see the code in the /db-utils directory. 
The query utility method is shown in the following code snippet: DBUtils.prototype.query = function ( sql, bindParameters, cb ) { var self = this,    clientPool = self._clientPool,    deferred = Q.defer();    clientPool.acquire( function ( err, client ) {    if ( err ) {    winston.error("Failed to acquire connection.");      if ( cb ) {        cb( new Error( err ) );      else {        deferred.reject( err );      }    }    try {      client.execute( sql, bindParameters,        function ( err, results ) {          if ( err ) {            clientPool.release( client );            if ( cb ) {            cb( new Error( err ) );            } else {              deferred.reject( err );            }           }          clientPool.release( client );            if ( cb ) {            cb( err, results );          } else {            deferred.resolve( results );          }        } );      }      catch ( err2 ) {      try {        clientPool.release( client );      }      catch ( err3 ) {        // can't do anything...      }      if ( cb ) {        cb( err2 );      } else { deferred.reject( err2 );      }    } } ); if ( !cb ) {    return deferred.promise; } }; It's then possible to retrieve the results to an arbitrary query using the preceding method, as shown in the following code snippet: dbUtil.query( "SELECT * FROM " + "table(tasker.task_mgmt.get_task(:1,:2))", [ taskId, req.user.userId ] ) .then( function ( results ) { // if no results, return 404 not found if ( results.length === 0 ) {    return next( Errors.HTTP_NotFound() ); } // create a new task with the database results // (will be in first row) req.task = new Task( results[ 0 ] ); return next(); } ) .catch( function ( err ) { return next( new Error( err ) ); } ) .done(); The query used in the preceding code is an example of calling a stored function that returns a table structure. The results of the SELECT statement will depend on parameters (taskId and username), and get_task will decide what data can be returned based on the user's authorization. Using Passport to handle authentication and sessions Although we've implemented our own authentication protocol, it's better that we use one that has already been well vetted and is well understood as well as one that suits our particular needs. In our case, we needed the demo to stand on its own without a lot of additional services, and as such, we built our own protocol. Even so, we chose a well known cryptographic method (PBKDF2), and are using a large number of iterations and large key lengths. In order to implement authentication easily in Node.js, you'll probably want to use Passport (https://www.npmjs.org/package/passport). It has a large community, and supports a large number of authentication schemes. If at all possible, try to use third-party authentication systems as often as possible (for example, LDAP, AD, Kerberos, and so on). In our case, because our authentication method is custom, we chose to use the passport-req strategy (https://www.npmjs.org/package/passport-req). Since Tasker's authentication is token-based, we will use this to inspect a custom header that the client will use to pass us the authentication token. The following is a simplified diagram of how Tasker's authentication process works: Please don't use our authentication strategy for anything that requires high levels of security. It's just an example, and isn't guaranteed to be secure in any way. Before we can actually use Passport, we need to define how our authentication strategy actually works. 
We do this by calling passport.use in our app.js file: var passport = require("passport"); var ReqStrategy = require("passport-req").Strategy; var Session = require("./models/session"); passport.use ( new ReqStrategy ( function ( req, done ) {    var clientAuthToken = req.headers["x-auth-token"];    var session = new Session ( new DBUtils ( clientPool ) );    session.findSession( clientAuthToken )    .then( function ( results ) {    if ( !results ) { return done( null, false ); }    done( null, results );    } )    .catch( function ( err ) {    return done( err );    } )    .done(); } )); In the preceding code, we've given Passport a new authentication strategy. Now, whenever Passport needs to authenticate a request, it will call this small section of code. You might be wondering what's going on in findSession. Here's the code: Session.prototype.findSession = function ( clientAuthToken, cb ) { var self = this, deferred = Q.defer(); // if no token, no sense in continuing if ( typeof clientAuthToken === "undefined" ) {    if ( cb ) { return cb( null, false ); }    else { deferred.reject(); } } // an auth token is of the form 1234.ABCDEF10284128401ABC13... var clientAuthTokenParts = clientAuthToken.split( "." ); if ( !clientAuthTokenParts ) {    if ( cb ) { return cb( null, false ); }    else { deferred.reject(); } } // no auth token, no session. // get the parts var sessionId = clientAuthTokenParts[ 0 ], authToken = clientAuthTokenParts[ 1 ]; // ask the database via dbutils if the token is recognized self._dbUtils.execute( "CALL tasker.security.verify_token (:1, :2, :3, :4, :5 ) INTO :6", [ sessionId, authToken, // authorization token self._dbUtils.outVarchar2( { size: 32 } ), self._dbUtils.outVarchar2( { size: 4000 } ), self._dbUtils.outVarchar2( { size: 4000 } ), self._dbUtils.outVarchar2( { size: 1 } ) ] ) .then( function ( results ) {    // returnParam3 has a Y or N; Y is good auth    if ( results.returnParam3 === "Y" ) {      // notify callback of successful auth      var user = {        userId:   results.returnParam, sessionId: sessionId,        nextToken: results.returnParam1,        hmacSecret: results.returnParam2      };      if ( cb ) { cb( null, user ) }      else { deferred.resolve( user ); }    } else {      // auth failed      if ( cb ) { cb( null, false ); } else { deferred.reject(); }    } } ) .catch( function ( err ) {    if ( cb ) { return cb( err, false ); }    else { deferred.reject(); } } ) .done(); if ( !cb ) { return deferred.promise; } }; The dbUtils.execute() method is a wrapper method around the Oracle query method we covered in the Executing queries section. Once a session has been retrieved from the database, Passport will want to serialize the user. This is usually just the user's ID, but we serialize a little more (which, from the preceding code, is the user's ID, session ID, and the HMAC secret): passport.serializeUser(function( user, done ) { done (null, user); }); The serializeUser method is called after a successful authentication and it must be present, or an error will occur. There's also a deserializeUser method if you're using typical Passport sessions: this method is designed to restore the user information from the Passport session. Before any of this will work, we also need to tell Express to use the Passport middleware: app.use ( passport.initialize() ); Passport makes handling authentication simple, but it also provides session support as well. 
While we don't use it for Tasker, you can use it to support a typical session-based username/password authentication system quite easily with a single line of code: app.use ( passport.session() ); If you're intending to use sessions with Passport, make sure you also provide a deserializeUser method. Next, we need to implement the code to authenticate a user with their username and password. Remember, we initially require the user to log in using their username and password, and once authenticated, we handle all further requests using tokens. To do this, we need to write a portion of our API code. Building API handlers We won't cover the entire API in this section, but we will cover a couple of small pieces, especially as they pertain to authentication and retrieving data. First, we've codified our API in /tasker-srv/api-def in the code package for this book. You'll also want to take a look at /tasker-srv/api-utils to see how we parse out this data structure into useable routes for the Express router. Basically, we codify our API by building a simple structure: [ { "route": "/auth", "actions": [ … ] }, { "route": "/task", "actions": [ … ] }, { "route": "/task/{:taskId}", "params": [ … ],   "actions": [ … ] }, … ] Each route can have any number of actions and parameters. Parameters are equivalent to the Express Router's parameters. In the preceding example, {:taskId} is a parameter that will take on the value of whatever is in that particular location in the URI. For example, /task/21 will result in taskId with the value of 21. This is useful for our actions because each action can then assume that the parameters have already been parsed, so any actions on the /task/{:taskId} route will already have task information at hand. The parameters are defined as follows: { "name": "taskId", "type": "number", "description": "…", "returns": [ … ], "securedBy": "tasker-auth", "handler": function (req, res, next, taskId) {…} } Actions are defined as follows: { "title": "Task", "action": "get-task", "verb": "get", "description": { … }, // hypermedia description "returns": [ … ],     // http status codes that are returned "example": { … },     // example response "href": "/task/{taskId}", "template": true, "accepts": [ "application/json", … ], "sends": [ "application/json", … ], "securedBy": "tasker-auth", "hmac": "tasker-256", "store": { … }, "query-parameters": { … }, "handler": function ( req, res, next ) { … } } Each handler is called whenever that particular route is accessed by a client using the correct HTTP verbs (identified by verb in the prior code). This allows us to write a handler for each specific state transition in our API, which is nicer than having to write a large method that's responsible for the entire route. It also makes describing the API using hypermedia that much simpler, since we can require a portion of the API and call a simple utility method (/tasker-srv/api-utils/index.js) to generate the description for the client. 
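To see what this buys the consumer, here is a small, hypothetical Python sketch of a client that logs in and then navigates by the links the server advertises rather than by hardcoded URIs. The requests library, the example credentials, and the exact shape of the _links entries (a get-task-list link with an href field) are assumptions made for illustration; the x-auth-token header, the sessionId.nextToken token format, and the login body fields come from the server code shown in this article, and the HMAC signing step that Tasker also requires is omitted for brevity:

import requests

BASE = "https://pge-as.example.com"   # the entry point is the only hardcoded URI

# Log in; the response body carries the session details plus hypermedia links.
response = requests.post(BASE + "/auth",
                         json={"userId": "bob-smith",
                               "candidatePassword": "secret"})
body = response.json()
token = "%s.%s" % (body["sessionId"], body["nextToken"])

# Follow the link the server advertised for the task list instead of assuming
# a fixed "/task" URI. (A real client would also sign this request with the
# HMAC secret returned at login; that step is omitted here.)
task_list_href = body["_links"]["get-task-list"]["href"]
tasks = requests.get(BASE + task_list_href,
                     headers={"x-auth-token": token}).json()

Because the client only depends on the action name, the server is free to reorganize its URI space without breaking the app, which is exactly the decoupling the hypermedia description is meant to provide.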
Since we're still working on how to handle authentication, here's how the API definition for the POST /auth route looks (the complete version is located at /tasker-srv/api-def/auth/login.js): action = {    "title": "Authenticate User",    "action": "login",    "description": [ … ], "example":     { … },    "returns":     {      200: "User authenticated; see information in body.",      401: "Incorrect username or password.", …    },    "verb": "post", "href": "/auth",    "accepts": [ "application/json", … ],    "sends": [ "application/json", … ],    "csrf": "tasker-csrf",    "store": {      "body": [ { name: "session-id", key: "sessionId" },      { name: "hmac-secret", key: "hmacSecret" },      { name: "user-id", key: "userId" },      { name: "next-token", key: "nextToken" } ]    },    "template": {      "user-id": {        "title": "User Name", "key": "userId",        "type": "string", "required": true,        "maxLength": 32, "minLength": 1      },      "candidate-password": {        "title": "Password", "key": "candidatePassword",        "type": "string", "required": true,        "maxLength": 255, "minLength": 1      }    }, The earlier code is largely documentation (but it is returned to the client when they request this resource). The following code handler is what actually performs the authentication:    "handler": function ( req, res, next ) {      var session = new Session( new DBUtils(      req.app.get( "client-pool" ) ) ),        username,        password;      // does our input validate?      var validationResults =       objUtils.validate( req.body, action.template );      if ( !validationResults.validates ) {        return next(         Errors.HTTP_Bad_Request( validationResults.message ) );      }      // got here -- good; copy the values out      username = req.body.userId;      password = req.body.candidatePassword;      // create a session with the username and password      session.createSession( username, password )        .then( function ( results ) {          // no session? 
bad username or password          if ( !results ) {            return next( Errors.HTTP_Unauthorized() );          }        // return the session information to the client        var o = {          sessionId: results.sessionId,          hmacSecret: results.hmacSecret,          userId:   results.userId,          nextToken: results.nextToken,          _links:   {}, _embedded: {}       };        // generate hypermedia        apiUtils.generateHypermediaForAction(         action, o._links, security, "self" );          [ require( "../task/getTaskList" ),          require( "../task/getTask" ), …          require( "../auth/logout" )          ].forEach( function ( apiAction ) {            apiUtils.generateHypermediaForAction(            apiAction, o._links, security );          } );          resUtils.json( res, 200, o );        } )        .catch( function ( err ) {          return next( err );          } )        .done();      }    }; The session.createSession method looks very similar to session.findSession, as shown in the following code: Session.prototype.createSession = function ( userName, candidatePassword, cb ) { var self = this, deferred = Q.defer(); if ( typeof userName === "undefined" || typeof candidatePassword === "undefined" ) {    if ( cb ) { return cb( null, false ); }    else { deferred.reject(); } } // attempt to authenticate self._dbUtils.execute( "CALL tasker.security.authenticate_user( :1, :2, :3," + " :4, :5 ) INTO :6", [ userName, candidatePassword, self._dbUtils.outVarchar2( { size: 4000 }, self._dbUtils.outVarchar2( { size: 4000 } ), self._dbUtils.outVarchar2( { size: 4000 } ), self._dbUtils.outVarchar2( { size: 1 } ] ) .then( function ( results ) {    // ReturnParam3 has Y or N; Y is good auth    if ( results.returnParam3 === "Y" ) {      // notify callback of auth info      var user = {        userId:   userName,        sessionId: results.returnParam,        nextToken: results.returnParam1,        hmacSecret: results.returnParam2      };      if ( cb ) { cb( null, user ); }      else { deferred.resolve( user ); }    } else {      // auth failed      if ( cb ) { cb( null, false ); }      else { deferred.reject(); }    } } ) .catch( function ( err ) {    if ( cb ) { return cb( err, false ) }    else { deferred.reject(); } } ) .done(); if ( !cb ) { return deferred.promise; } }; Once the API is fully codified, we need to go back to app.js and tell Express that it should use the API's routes: app.use ( "/", apiUtils.createRouterForApi(apiDef, checkAuth)); We also add a global variable so that whenever an API section needs to return the entire API as a hypermedia structure, it can do so without traversing the entire API again: app.set( "x-api-root", apiUtils.generateHypermediaForApi( apiDef, securityDef ) ); The checkAuth method shown previously is pretty simple; all it does is ensure that we don't authenticate more than once in a single request: function checkAuth ( req, res, next ) { if (req.isAuthenticated()) {    return next(); } passport.authenticate ( "req" )(req, res, next); } You might be wondering where we're actually forcing our handlers to use authentication. There's actually a bit of magic in /tasker-srv/api-utils. I've highlighted the relevant portions: createRouterForApi:function (api, checkAuthFn) { var router = express.Router(); // process each route in the api; a route consists of the // uri (route) and a series of verbs (get, post, etc.) 
api.forEach ( function ( apiRoute ) {    // add params    if ( typeof apiRoute.params !== "undefined" ) {      apiRoute.params.forEach ( function ( param ) {        if (typeof param.securedBy !== "undefined" ) {          router.param( param.name, function ( req, res,          next, v) {            return checkAuthFn( req, res,            param.handler.bind(this, req, res, next, v) );          });        } else {          router.param(param.name, param.handler);        }      });    }    var uri = apiRoute.route;    // create a new route with the uri    var route = router.route ( uri );    // process through each action    apiRoute.actions.forEach ( function (action) {      // just in case we have more than one verb, split them out      var verbs = action.verb.split(",");      // and add the handler specified to the route      // (if it's a valid verb)      verbs.forEach ( function (verb) {        if (typeof route[verb] === "function") {          if (typeof action.securedBy !== "undefined") {            route[verb]( checkAuthFn, action.handler );          } else {            route[verb]( action.handler );          }        }      });    }); }); return router; }; Once you've finished writing even a few handlers, you should be able to verify that the system works by posting requests to your API. First, make sure your server has started ; we use the following code line to start the server: export NODE_ENV=development; npm start For some of the routes, you could just load up a browser and point it at your server. If you type https://localhost:4443/ in your browser, you should see a response that looks a lot like this: If you're thinking this looks styled, you're right. The Tasker API generates responses based on the client's requested format. The browser requests data in HTML, and so our API generates a styled HTML page as a response. For an app, the response is JSON because the app requests that the response be in JSON. If you want to see how this works, see /tasker-srv/res-utils/index.js. If you want to actually send and receive data, though, you'll want to get a REST client rather than using the browser. There are many good free clients: Firefox has a couple of good clients as does Chrome. Or you can find a native client for your operating system. Although you can do everything with curl on the command prompt, RESTful clients are much easier to use and often offer useful features, such as dynamic variables, various authentication methods built in, and many can act as simple automated testers. Summary In this article, we've covered how to build a web server that bridges the gap between our database backend and our mobile application. We've provided an overview of RESTful-like APIs, and we've also quickly shown how to implement such a web API using Node.js. We've also covered authentication and session handling using Passport. Resources for Article:  Further resources on this subject: Building Mobile Apps [article] Adding a Geolocation Trigger to the Salesforce Account Object [article] Introducing SproutCore [article]
article-image-importance-hyper-v-security
Packt
23 Dec 2014
19 min read

The importance of Hyper-V Security

In this article, by Eric Siron and Andy Syrewicze, authors of the book, Hyper-V Security, we will be introduced to one of the most difficult tribulations in the entire realm of computing—security. Computers are tools, and just like any tool, they are designed to be used. Unfortunately, not every usage is proper, and not every computer should be accessed by just anyone. A computer really has no way to classify proper usage against improper usage, or differentiate between a valid user and an unauthorized user any more than a hammer would. The act of securing them is quite literally an endeavor to turn them against their purpose. Hyper-V adds new dimensions to the security problem. Virtual machines have protection options that mirror their physical counterparts, but present unique challenges. The hypervisor presents challenges of its own, both in its role as the host for those virtual machines and through the management operating system that manifests it. In this article, we'll cover: The importance of Hyper-V security Basic security concerns A starting point to security The terminology of Hyper-V Acquiring Hyper-V (For more resources related to this topic, see here.) For many, security seems like a blatantly obvious necessity. For others, the need isn't as clear. Many decision-makers don't believe that their organization's product requires in-depth protection. Many administrators believe that the default protections are sufficient. There are certainly some institutions whose needs don't require an elaborate regimen of protections, but no one can skip due diligence. Your clients expect it The exact definition of a "client" varies from organization to organization, but every organization type provides some sort of service to someone. Whether you are a retail outlet or a non-profit organization that provides intangible services to individuals in need that cannot pay for them, your institution has an implicit agreement to protect the information relevant to those who depend on you. They most likely won't have any idea what Hyper-V is or what you use it for, but they will know enough to be displeased if it is revealed that any of your computer systems are not secure. Your organization could be vulnerable to litigation if clients believe their data is not being treated with sufficient importance. Your stakeholders expect it As with clients, stakeholders can mean many things. Simplistically, it's anyone who has a "stake" in the well-being of your organization. This could be members of the board of directors who aren't privy to day-to-day operations. It could be external investors. It could even include the previously mentioned clients. Even if they have no way to understand what's necessary or unnecessary to secure, they expect that it's being handled. Furthermore, they may disagree with you on what data is important to protect. If it's later discovered that something wasn't fully guarded that they assumed was being treated as highly confidential, the response could have extremely negative consequences. Your employees and volunteers expect it Almost all organizations have digitized some vital information of its employees and volunteers. They expect that this data is held in the highest confidentiality and is well guarded against theft and espionage. Even if the rest of your institution's data requires no particular protection, personnel data must always be safeguarded. In many jurisdictions, this is a legal requirement. 
Even if you aren't under the rule of law, civil litigation is always a possibility. Experience has taught us that security is important In the past, it was believed that attackers came from outside the institution and were simply after quick and easy money sources, such as credit card numbers. However, reality has shown that breaches occur for a wide variety of reasons, and many aren't obvious until after it's too late to do anything about it. The next section, Basic Security Concerns, will highlight a number of both common and unexpected attack types. Weak points aren't always obvious You know that you need to protect access to sensitive backend data with frontend passwords. You know that information traveling between the two needs to be encrypted. However, are you aware of every single point that the data will travel through? Is the storage location unprotected? Has there been a recent audit of individuals with access? Is there another application on one of the component systems that allows for unencrypted communications or remote access? Treating any system as though it doesn't need to be secured could allow it to become a gateway for others. The costs of repair exceeds the costs of prevention The summary of this section's message is that failing to enact security measures is not an acceptable option. It's not unusual to find people who understand that security is important, but believe that it's simply too expensive and that the systems to be protected are just not worth the effort. In reality, the costs of a breach can be catastrophic. Just adding up the previous points can lead you to that conclusion. Between lawyer bills, court costs, and any awards, litigation costs can be unbearably high. Of course, a breach might directly result in a financial loss of some kind. Beyond that, a loss of trust inevitably follows the compromise of systems, and this can have a greater long-term impact than anything else. Even when all those problems are taken care of, it's still necessary to clean up any damage to the systems and close the exploited breach points. Basic security concerns With a topic as large as computer security, it's always tough to know where to start. The best place is generally to begin by getting an idea of where and what your largest risk factors are. Every organization will have its own specific areas of concern, but there are a number of common elements that everyone needs to worry about. Attack motivations To understand what risks you face, it helps to know the reasons for which you might find yourself under attack. For many malware generators, there isn't a lot of reason involved. They write destructive code because they like destruction; they might be working from a place of genuine malice or a simple disregard for the well-being of others. For many others, their work comes from a need for vengeance over a real or perceived slight. The trespasses they seek revenge for could be relatively petty things, but some attacks are carried out over much more serious events, even major political affairs. Some authors seek a degree of notoriety, perhaps not among the public at large as much as a small group or subculture. Financial motivation can be the source of both the most benign and the most dangerous security compromise. For instance, someone may want to prove eligibility for a job by showing that they possess the necessary skills to secure a system. One possible way is by demonstrating an ability to compromise that system. 
Such breaches generally require a deep understanding of the relevant technology, so they can effectively illustrate thorough knowledge. As long as these examples are never released "into the wild" and are instead disclosed to the system manufacturer so that a fix can be engineered, they are ultimately harmless. Unfortunately, a great many attackers seek a shorter-term gain through methods such as extortion from the manufacturer or owners of compromised systems or theft of sensitive data. Data theft is often thought of in terms of financial information, such as credit card data. However, intellectual property should also be kept heavily guarded. Data that seems relatively benign might also be a target; if an attacker discovers that your company uses a specific e-mail template and can also obtain a list of customer e-mail accounts; they have enough information to launch a very convincing phishing campaign. Untargeted attacks The untargeted attack is likely the most common of all attacks, and can be the most disruptive. These generally manifest as viruses and worms. In the earlier days of computing, the most common distribution methods were, surprisingly, media that had been created by software makers for distribution of applications. Someone would modify the image data during the duplication process and ship malware to customers. As the Internet rose in popularity, it introduced new ways for malware to make the rounds. First came e-mail. Next, websites became pick-up locations for all types of malicious software. New technologies that allowed for enhanced interactivity and the embedding of rich media, such as JavaScript and Adobe's (originally Shockwave's) Flash, were also used as vehicles for destructive software. Most of the early malware was simply destructive. It wreaked havoc on data, corrupted systems, and locked users out of their own hardware. Later, they became money-making avenues for the unscrupulous. An example is key loggers, which capture key presses and sometimes mouse movements and clicks in an attempt to compromise logins and other sensitive data, such as credit card numbers. Another much more recent introduction is ransomware, which encrypts or deletes information with a promise to restore the data on payment. Some of the most surreptitious untargeted attacks are relatively low-tech. One such attack is called phishing. This involves using some form of convincing technique, usually through e-mail, to lure users into volunteering sensitive information. An attack vector related to phishing is spam e-mail. Most people just consider spam to be annoying, untargeted e-mail advertisements, but results from an experiment conducted in 2008 by McAfee, Inc., called Spammed Persistently All Month (SPAM), would seem to indicate that most spam also qualifies as a scam in some form or another. Another untargeted attack vector is any connection that a computer system makes into a public network. In the modern era, this is generally through a system's entry point into the Internet. With a limited number of Internet-accessible IP addresses available, attackers can simply scan large ranges of them, seeking systems that respond. Using automated tools, they can attempt to break through any security barriers that are in place. Untargeted attacks pose few risks that are specific to Hyper-V, so this book won't spend a great deal of time on that topic. While no defense can be perfect, they are generally mitigated effectively through standard practices. 
Targeted attacks The most common attacks are untargeted, but targeted attacks can be the most dangerous. These come in a variety of forms but often use similar techniques to untargeted attacks. One example would be a phishing e-mail that appears to have been sent from your internal IT department, asking you to confirm your user name and password. Another would be a website that looks like an internal corporate site, such as a payroll page, which captures your login information instead of displaying your latest pay stub. Some targeted attacks work against an organization's exposed faces. An immediately recognizable example is online banking. Most banks provide some method for their customers to access their accounts online, and they almost invariably include powerful tools such as money transfer systems. Of course, theft isn't necessarily the goal of a targeted attack. One well-known activity is the denial-of-service attack, in which an immense number of bogus requests are sent to a target system in a short amount of time, causing its services to be unavailable to legitimate users. The computing device Most of the compromises you are likely to deal with occur at the level of the computing device. Some of the most complex software in use today is the operating system. With thousands of programmers working on millions of lines of code, much of it left over from previous versions and programmers, it's just an unavoidable fact that all major operating systems contain security flaws. With millions of people working to locate these holes, regardless of their intentions, it's equally inevitable that these faults will be discovered and compromised. The advent and rising popularity of smartphones and tablets have increased the number of potential attack sources. As more and more devices become "smart," such as common environmental controls and food storage equipment, they too introduce new entry points from which an entire network can be compromised. The network The true risk of a single compromised device is the network that it's attached to. By breaching the network itself, an attacker potentially gains the ability to eavesdrop on all communications or launch a direct attack against specific computers or groups of systems. Since many organizations consider some areas to be secure because they are behind measures such as firewalls, breaching the protecting devices exposes everything that they are intended to protect. Data-processing points Raw data is rarely useful to end users. There are many systems in place whose jobs are to sort, process, retrieve, and organize information, and they often use well-known techniques to do this. Anything that's well-known is open to assault. Common examples are SQL database servers, e-mail systems, content management applications, and customer relationship management software. When these systems are broken into, the data they work with is ripe for the taking. Data storage A lot of effort is poured into securing end points, processing systems, and networks, but a disturbingly high number of data storage locations are left relatively unprotected. Many administrators simply believe that all paths to the storage are well protected, so the storage location itself is of little concern. What this often means is that a breach farther up the line results in an easily compromised storage system. For best resistance against attack, care must be taken at all levels. People By and large, the most vulnerable aspect of any computer system is its users. 
This includes not just the users who don't understand technology, but also the administrators who have grown lax. Passwords are written down; convincing requests for sensitive information are erroneously granted; inappropriate shortcuts are taken. One of the easiest and most common ways in which computers are breached is social engineering. Before undertaking a lot of complicated steps to steal your information, an attacker may try to simply ask you for it. People are trusting by nature, and often naively believe that anyone who asks has a legitimate reason to do so. On the other side, malicious internal staff can be a serious threat. Disgruntled employees, especially those in the IT department, already have access to sensitive areas and information. If they have vengeance in mind, their goal may be disruption and destruction more than theft. A starting point to security Now that you have some idea of what you're up against, you can start thinking of how you want to approach the problems. The easiest thing to do is look over the preceding items and identify what your current configuration is weakest against. You'll also want to identify what your organization considers the most important points and data to protect. Once that's done, it's a good idea to perform some sort of an inventory in an attempt to discover sensitive points that may not have made the list for some reason or another. Sometimes, this can be done simply by asking questions such as "What would the impact be if someone saw that file?". At all times, it's important to remember that there is no way a system can be truly secured without making it completely inaccessible to anyone. If even one person can get into the system, it's also possible for someone else. Computer security is not a one-time event; it is an ongoing process of re-evaluation. It's also important to remember that computers are just machines. No matter how advanced the hardware and software is, the computer does not think. If an instruction makes it all the way to the CPU, it won't stop to ponder if the user or program that submitted it should be allowed to do so. It won't consider the moral implications of carrying out the instruction. It will simply do as it's told. Security is a human endeavor. This book advocates both for taking specific steps to secure specific systems and for a defense in depth approach. The defense in depth style recognizes that not all attacks can be known or planned for in advance, so it attempts to mitigate them by using a layered strategy. If the firewall is penetrated, an internal network access control list may halt a break-in. If that doesn't work, intrusion prevention software may stop the attack. If that also fails, a simple password challenge may keep the intruder out. Hyper-V terminology Before we can properly discuss how to secure Hyper-V, we must reach an agreement on the words that we use. Terminology is a common point of confusion when it comes to Hyper-V and related technologies. This section will provide a definitive explanation for these terms, not only as they are used within this book, but also how they are generally used in official documentation and by experts. Term Definition Hyper-V The lone word Hyper-V represents the type 1 hypervisor technology developed and provided by Microsoft. This term does not refer to any particular product. It appears as an installable feature in Windows Server beginning with Version 2008, and in Professional and Enterprise desktop Windows operating system starting with version 8. 
Hyper-V Server Hyper-V Server is a standalone product available directly from Microsoft. It is a no-cost distribution of the hypervisor that is packaged in a heavily modified version of Windows Server. Client Hyper-V Client Hyper-V is the name given to Hyper-V as it appears in the desktop editions of Windows. The distinction is necessary as it has requirements and limitations that set it apart from Hyper-V as it exists in the server editions. Host The physical computer system that runs Hyper-V is called the host. Guest The term guest is often used interchangeably with "virtual machine." It is most commonly used to refer to the operating system inside the virtual machine. Management operating system As a type 1 hypervisor, Hyper-V is in direct control of the host's hardware and has no interface of its own. A management operating system is a special virtual machine that can interact with the hypervisor to control it and the hardware. In other hypervisors, this is known as the parent partition. The commonly used term Hyper-V Core and variants have no official meaning. Core is a special mode for Windows Server that does not include a GUI. It is often used to refer to Hyper-V Server, as that product also has no GUI. Crossing Hyper-V Server with the core modifier should be avoided as it leads to confusion. Acquiring Hyper-V This book expects that you have some familiarity with Hyper-V and will therefore not provide an installation walkthrough. The purpose of this section is to provide a basic comparison of the delivery methods for Hyper-V so that you can make an informed decision in light of the security concerns. Hyper-V Server Hyper-V Server is freely available from Microsoft. It is a complete product and installs directly to the host computer. You can download it from the evaluation center on Technet at the following URL: http://www.microsoft.com/en-us/evalcenter/evaluate-hyper-v-server-2012-r2. Despite being listed alongside evaluation software, Hyper-V Server does not expire and does not require any product keys. Before installing, please read the system requirements, which are linked to the download page. The reason why Hyper-V Server is often (erroneously) referred to as core is because it has no graphical interface of any kind. The only control options available on the console are the command-line and PowerShell. This is not the same thing as a Core installation of Windows as most of the Windows roles and features are not available. There are a number of benefits and disadvantages to using Hyper-V in this fashion. The primary benefit in the realm of security is that there are fewer components in the base installation image and there are fewer potential weak points for an attacker to compromise. Windows Server Windows Server is Microsoft's general-purpose server software. Out of the box, it contains a great many server technologies and can fit into just about any conceivable server role. Among those offerings, you'll find Hyper-V. Windows Server comes in two major editions with full Hyper-V support: Standard and Datacenter. The primary difference between these two is the licensing granted to guests that run Windows Server operating systems. Please consult a Microsoft licensing expert for more information. Technologically, the two editions are nearly identical. The lone difference is the presence of Automatic Virtual Machine Activation in the Datacenter edition, which allows it to activate Windows Server guests using its own license. 
Windows Server can be installed in three separate modes: Core, Minimal Server Interface, and full GUI mode. Each of these modes affects the actions you must take to secure the system. Like Hyper-V Server, each has advantages and disadvantages. Client Hyper-V Client Hyper-V is only available in Professional and higher desktop editions of Windows, but that's not all that makes it distinct from its cousin on the Server platforms. It requires a processor that can perform Second Level Address Translation (SLAT). It also has a smaller feature set. Among the technologies not included are RemoteFX, Hyper-V Replica, and Live Migration. Client Hyper-V is also less inclined to consume all available host memory for the purpose of running guests. While Client Hyper-V is not the focus of this book, many of the same concepts still apply. A very common use for Client Hyper-V is application development. Most software development firms consider their in-development programs to be highly valuable assets, so they should be as protected as any server-based asset. Summary This article introduced you to the "whys" of Hyper-V security and provided a brief introduction to the overall risks that almost all security systems face, and discussed generic responses. It also covered Hyper-V terminology and the available installation modes for the hypervisor. Resources for Article: Further resources on this subject: Your first step towards Hyper-V Replica [Article] Insight into Hyper-V Storage [Article] Disaster Recovery for Hyper-V [Article]

article-image-getting-started-selenium-webdriver-and-python
Packt
23 Dec 2014
19 min read
Save for later

Getting Started with Selenium WebDriver and Python

In this article by Unmesh Gundecha, author of the book Learning Selenium Testing Tools with Python, we will introduce you to the Selenium WebDriver client library for Python by demonstrating its installation, basic features, and overall structure. Selenium automates browsers. It automates the interaction we do in a browser window such as navigating to a website, clicking on links, filling out forms, submitting forms, navigating through pages, and so on. It works on every major browser available out there. In order to use Selenium WebDriver, we need a programming language to write automation scripts. The language that we select should also have a Selenium client library available. Python is a widely used general-purpose, high-level programming language. It's easy and its syntax allows us to express concepts in fewer lines of code. It emphasizes code readability and provides constructs that enable us to write programs on both the small and large scale. It also provides a number of in-built and user-written libraries to achieve complex tasks quite easily. The Selenium WebDriver client library for Python provides access to all the Selenium WebDriver features and Selenium standalone server for remote and distributed testing of browser-based applications. Selenium Python language bindings are developed and maintained by David Burns, Adam Goucher, Maik Röder, Jason Huggins, Luke Semerau, Miki Tebeka, and Eric Allenin. The Selenium WebDriver client library is supported on Python Version 2.6, 2.7, 3.2, and 3.3. In this article, we will cover the following topics: Installing Python and the Selenium package Selecting and setting up a Python editor Implementing a sample script using the Selenium WebDriver Python client library Implementing cross-browser support with Internet Explorer and Google Chrome (For more resources related to this topic, see here.) Preparing your machine As a first step of using Selenium with Python, we'll need to install it on our computer with the minimum requirements possible. Let's set up the basic environment with the steps explained in the following sections. Installing Python You will find Python installed by default on most Linux distributions, Mac OS X, and other Unix machines. On Windows, you will need to install it separately. Installers for different platforms can be found at http://python.org/download/. Installing the Selenium package The Selenium WebDriver Python client library is available in the Selenium package. To install the Selenium package in a simple way, use the pip installer tool available at https://pip.pypa.io/en/latest/. With pip, you can simply install or upgrade the Selenium package using the following command: pip install -U selenium This is a fairly simple process. This command will set up the Selenium WebDriver client library on your machine with all modules and classes that we will need to create automated scripts using Python. The pip tool will download the latest version of the Selenium package and install it on your machine. The optional -U flag will upgrade the existing version of the installed package to the latest version. You can also download the latest version of the Selenium package source from https://pypi.python.org/pypi/selenium. 
Just click on the Download button on the upper-right-hand side of the page, unarchive the downloaded file, and install it with following command: python setup.py install Browsing the Selenium WebDriver Python documentation The Selenium WebDriver Python client library documentation is available at http://selenium.googlecode.com/git/docs/api/py/api.html as shown in the following screenshot:   It offers detailed information on all core classes and functions of Selenium WebDriver. Also note the following links for Selenium documentation: The official documentation at http://docs.seleniumhq.org/docs/ offers documentation for all the Selenium components with examples in supported languages Selenium Wiki at https://code.google.com/p/selenium/w/list lists some useful topics. Selecting an IDE Now that we have Python and Selenium WebDriver set up, we will need an editor or an Integrated Development Environment (IDE) to write automation scripts. A good editor or IDE increases the productivity and helps in doing a lot of other things that make the coding experience simple and easy. While we can write Python code in simple editors such as Emacs, Vim, or Notepad, using an IDE will make life a lot easier. There are many IDEs to choose from. Generally, an IDE provides the following features to accelerate your development and coding time: A graphical code editor with code completion and IntelliSense A code explorer for functions and classes Syntax highlighting Project management Code templates Tools for unit testing and debugging Source control support If you're new to Python, or you're a tester working for the first time in Python, your development team will help you to set up the right IDE. However, if you're starting with Python for the first time and don't know which IDE to select, here are a few choices that you might want to consider. PyCharm PyCharm is developed by JetBrains, a leading vendor of professional development tools and IDEs such as IntelliJ IDEA, RubyMine, PhpStorm, and TeamCity. PyCharm is a polished, powerful, and versatile IDE that works pretty well. It brings best of the JetBrains experience in building powerful IDEs with lots of other features for a highly productive experience. PyCharm is supported on Windows, Linux, and Mac. To know more about PyCharm and its features visit http://www.jetbrains.com/pycharm/. PyCharm comes in two versions—a community edition and a professional edition. The community edition is free, whereas you have to pay for the professional edition. Here is the PyCharm community edition running a sample Selenium script in the following screenshot:   The community edition is great for building and running Selenium scripts with its fantastic debugging support. We will use PyCharm in the rest of this Article. Later in this article, we will set up PyCharm and create our first Selenium script. All the examples in this article are built using PyCharm; however, you can easily use these examples in your choice of editor or IDE. The PyDev Eclipse plugin The PyDev Eclipse plugin is another widely used editor among Python developers. Eclipse is a famous open source IDE primarily built for Java; however, it also offers support to various other programming languages and tools through its powerful plugin architecture. Eclipse is a cross-platform IDE supported on Windows, Linux, and Mac. You can get the latest edition of Eclipse at http://www.eclipse.org/downloads/. You need to install the PyDev plugin separately after setting up Eclipse. 
Use the tutorial from Lars Vogel at http://www.vogella.com/tutorials/Python/article.html to install PyDev. Installation instructions are also available at http://pydev.org/. Here's the Eclipse PyDev plugin running a sample Selenium script as shown in the following screenshot: PyScripter For Windows users, PyScripter can also be a great choice. It is open source, lightweight, and provides all the features that modern IDEs offer such as IntelliSense and code completion, testing, and debugging support. You can find more about PyScripter along with its download information at https://code.google.com/p/pyscripter/. Here's PyScripter running a sample Selenium script as shown in the following screenshot: Setting up PyCharm Now that we have seen IDE choices, let's set up PyCharm. All examples in this article are created with PyCharm. However, you can set up any other IDE of your choice and use examples as they are. We will set up PyCharm with the following steps to get started with Selenium Python: Download and install the PyCharm Community Edition from the JetBrains site http://www.jetbrains.com/pycharm/download/index.html. Launch the PyCharm Community Edition. Click on the Create New Project option on the PyCharm Community Edition dialog box as shown in the following screenshot: On the Create New Project dialog box, as shown in the next screenshot, specify the name of your project in the Project name field. In this example, setests is used as the project name. We need to configure the interpreter for the first time. Click on the button to set up the interpreter, as shown in the following screenshot: On the Python Interpreter dialog box, click on the plus icon. PyCharm will suggest the installed interpreter similar to the following screenshot. Select the interpreter from Select Interpreter Path. PyCharm will configure the selected interpreter as shown in the following screenshot. It will show a list of packages that are installed along with Python. Click on the Apply button and then on the OK button: On the Create New Project dialog box, click on the OK button to create the project: Taking your first steps with Selenium and Python We are now ready to start with creating and running automated scripts in Python. Let's begin with Selenium WebDriver and create a Python script that uses Selenium WebDriver classes and functions to automate browser interaction. We will use a sample web application for most of the examples in this article. This sample application is built on a famous e-commerce framework, Magento. You can find the application at http://demo.magentocommerce.com/. In this sample script, we will navigate to the demo version of the application, search for products, and list the names of products from the search result page with the following steps: Let's use the project that we created earlier while setting up PyCharm. Create a simple Python script that will use the Selenium WebDriver client library. In Project Explorer, right-click on setests and navigate to New | Python File from the pop-up menu: On the New Python file dialog box, enter searchproducts in the Name field and click on the OK button: PyCharm will add a new tab searchproducts.py in the code editor area. 
Copy the following code in the searchproducts.py tab:
from selenium import webdriver

# create a new Firefox session
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.maximize_window()

# navigate to the application home page
driver.get("http://demo.magentocommerce.com/")

# get the search textbox
search_field = driver.find_element_by_name("q")
search_field.clear()

# enter search keyword and submit
search_field.send_keys("phones")
search_field.submit()

# get all the anchor elements which have product names displayed
# currently on result page using find_elements_by_xpath method
products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

# get the number of anchor elements found
print "Found " + str(len(products)) + " products:"

# iterate through each anchor element and print the text that is
# name of the product
for product in products:
    print product.text

# close the browser window
driver.quit()
If you're using any other IDE or editor of your choice, create a new file, copy the code to the new file, and save the file as searchproducts.py. To run the script, press the Ctrl + Shift + F10 combination in the PyCharm code window or select Run 'searchproducts' from the Run menu. This will start the execution and you will see a new Firefox window navigating to the demo site and the Selenium commands getting executed in the Firefox window. If all goes well, at the end, the script will close the Firefox window. The script will print the list of products in the PyCharm console as shown in the following screenshot: We can also run this script through the command line with the following command. Open the command line, then open the setests directory, and run the following command: python searchproducts.py We will use the command line as the preferred method in the rest of the article to execute the tests. We'll spend some time looking into the script that we created just now. We will go through each statement and understand Selenium WebDriver in brief. The selenium.webdriver module implements the browser driver classes that are supported by Selenium, including Firefox, Chrome, Internet Explorer, Safari, and various other browsers, and RemoteWebDriver to test on browsers that are hosted on remote machines. We need to import webdriver from the Selenium package to use the Selenium WebDriver methods: from selenium import webdriver Next, we need an instance of a browser that we want to use. This will provide a programmatic interface to interact with the browser using the Selenium commands. In this example, we are using Firefox. We can create an instance of Firefox as shown in the following code: driver = webdriver.Firefox() During the run, this will launch a new Firefox window. We also set a few options on the driver: driver.implicitly_wait(30) driver.maximize_window() We configured a timeout for Selenium to execute steps using an implicit wait of 30 seconds for the driver and maximized the Firefox window through the Selenium API. Next, we will navigate to the demo version of the application using its URL by calling the driver.get() method. After the get() method is called, WebDriver waits until the page is fully loaded in the Firefox window and returns the control to the script. After loading the page, Selenium will interact with various elements on the page, like a human user. For example, on the Home page of the application, we need to enter a search term in a textbox and click on the Search button. 
These elements are implemented as HTML input elements and Selenium needs to find these elements to simulate the user action. Selenium WebDriver provides a number of methods to find these elements and interact with them to perform operations such as sending values, clicking buttons, selecting items in dropdowns, and so on. In this example, we are finding the Search textbox using the find_element_by_name method. This will return the first element matching the name attribute specified in the find method. The HTML elements are defined with tag and attributes. We can use this information to find an element, by following the given steps: In this example, the Search textbox has the name attribute defined as q and we can use this attribute as shown in the following code example: search_field = driver.find_element_by_name("q") Once the Search textbox is found, we will interact with this element by clearing the previous value (if entered) using the clear() method and entering the specified new value using the send_keys() method. Next, we will submit the search request by calling the submit() method: search_field.clear() search_field.send_keys("phones") search_field.submit() After submission of the search request, Firefox will load the result page returned by the application. The result page has a list of products that match the search term, which is phones. We can read the list of results and specifically the names of all the products that are rendered in the anchor <a> element using the find_elements_by_xpath() method. This will return more than one matching element as a list: products = driver.find_elements_by_xpath("//h2[@class='product-name']/a") Next, we will print the number of products (that is, the number of anchor <a> elements) that are found on the page and the names of the products using the .text property of all the anchor <a> elements:
print "Found " + str(len(products)) + " products:"
for product in products:
    print product.text
At the end of the script, we will close the Firefox browser using the driver.quit() method: driver.quit() This example script gives us a concise example of using Selenium WebDriver and Python together to create a simple automation script. We are not testing anything in this script yet. We will extend this simple script into a set of tests and use various other libraries and features of Python. Cross-browser support So far we have built and run our script with Firefox. Selenium has extensive support for cross-browser testing where you can automate on all the major browsers including Internet Explorer, Google Chrome, Safari, Opera, and headless browsers such as PhantomJS. In this section, we will set up and run the script that we created in the previous section with Internet Explorer and Google Chrome to see the cross-browser capabilities of Selenium WebDriver. Setting up Internet Explorer There is a little more work needed to run scripts on Internet Explorer. To run tests on Internet Explorer, we need to download and set up the InternetExplorerDriver server. The InternetExplorerDriver server is a standalone server executable that implements WebDriver's wire protocol to work as glue between the test script and Internet Explorer. It supports major IE versions on Windows XP, Vista, Windows 7, and Windows 8 operating systems. Let's set up the InternetExplorerDriver server with the following steps: Download the InternetExplorerDriver server from http://www.seleniumhq.org/download/. You can download 32- or 64-bit versions based on the system configuration that you are using. 
After downloading the InternetExplorerDriver server, unzip and copy the file to the same directory where the scripts are stored. On IE 7 or higher, the Protected Mode settings for each zone must have the same value. Protected Mode can either be on or off, as long as it is the same for all the zones. To set the Protected Mode settings: Choose Internet Options from the Tools menu. On the Internet Options dialog box, click on the Security tab. Select each zone listed in Select a zone to view or change security settings and make sure Enable Protected Mode (requires restarting Internet Explorer) is either checked or unchecked for all the zones. All the zones should have the same settings as shown in the following screenshot: While using the InternetExplorerDriver server, it is also important to keep the browser zoom level set to 100 percent so that the native mouse events can be set to the correct coordinates. Finally, modify the script to use Internet Explorer. Instead of creating an instance of the Firefox class, we will use the IE class in the following way:
import os
from selenium import webdriver

# get the path of IEDriverServer
dir = os.path.dirname(__file__)
ie_driver_path = os.path.join(dir, "IEDriverServer.exe")

# create a new Internet Explorer session
driver = webdriver.Ie(ie_driver_path)
driver.implicitly_wait(30)
driver.maximize_window()

# navigate to the application home page
driver.get("http://demo.magentocommerce.com/")

# get the search textbox
search_field = driver.find_element_by_name("q")
search_field.clear()

# enter search keyword and submit
search_field.send_keys("phones")
search_field.submit()

# get all the anchor elements which have product names displayed
# currently on result page using find_elements_by_xpath method
products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

# get the number of anchor elements found
print "Found " + str(len(products)) + " products:"

# iterate through each anchor element and print the text that is
# name of the product
for product in products:
    print product.text

# close the browser window
driver.quit()
In this script, we passed the path of the InternetExplorerDriver server while creating the instance of the IE browser class. Run the script and Selenium will first launch the InternetExplorerDriver server, which launches the browser, and execute the steps. The InternetExplorerDriver server acts as an intermediary between the Selenium script and the browser. Execution of the actual steps is very similar to what we observed with Firefox. Read more about the important configuration options for Internet Explorer at https://code.google.com/p/selenium/wiki/InternetExplorerDriver and the DesiredCapabilities article at https://code.google.com/p/selenium/wiki/DesiredCapabilities. Setting up Google Chrome Setting up and running Selenium scripts on Google Chrome is similar to Internet Explorer. We need to download the ChromeDriver server similar to the InternetExplorerDriver server. The ChromeDriver server is a standalone server developed and maintained by the Chromium team. It implements WebDriver's wire protocol for automating Google Chrome. It is supported on Windows, Linux, and Mac operating systems. Set up the ChromeDriver server using the following steps: Download the ChromeDriver server from http://chromedriver.storage.googleapis.com/index.html. After downloading the ChromeDriver server, unzip and copy the file to the same directory where the scripts are stored. Finally, modify the sample script to use Chrome. 
Instead of creating an instance of the Firefox class, we will use the Chrome class in the following way:
import os
from selenium import webdriver

# get the path of chromedriver
dir = os.path.dirname(__file__)
chrome_driver_path = os.path.join(dir, "chromedriver.exe")
# remove the .exe extension on Linux or Mac platforms

# create a new Chrome session
driver = webdriver.Chrome(chrome_driver_path)
driver.implicitly_wait(30)
driver.maximize_window()

# navigate to the application home page
driver.get("http://demo.magentocommerce.com/")

# get the search textbox
search_field = driver.find_element_by_name("q")
search_field.clear()

# enter search keyword and submit
search_field.send_keys("phones")
search_field.submit()

# get all the anchor elements which have product names displayed
# currently on result page using find_elements_by_xpath method
products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

# get the number of anchor elements found
print "Found " + str(len(products)) + " products:"

# iterate through each anchor element and print the text that is
# name of the product
for product in products:
    print product.text

# close the browser window
driver.quit()
In this script, we passed the path of the ChromeDriver server while creating an instance of the Chrome browser class. Run the script. Selenium will first launch the ChromeDriver server, which launches the Chrome browser, and execute the steps. Execution of the actual steps is very similar to what we observed with Firefox. Read more about ChromeDriver at https://code.google.com/p/selenium/wiki/ChromeDriver and https://sites.google.com/a/chromium.org/chromedriver/home. Summary In this article, we introduced you to Selenium and its components. We installed the selenium package using the pip tool. Then we looked at various editors and IDEs to ease our coding experience with Selenium and Python and set up PyCharm. Then we built a simple script on a sample application covering some of the high-level concepts of the Selenium WebDriver Python client library using Firefox. We ran the script and analyzed the outcome. Finally, we explored the cross-browser testing support of Selenium WebDriver by configuring and running the script with Internet Explorer and Google Chrome. Resources for Article: Further resources on this subject: Quick Start into Selenium Tests [article] Exploring Advanced Interactions of WebDriver [article] Mobile Devices [article]

article-image-hadoop-and-sql
Packt
23 Dec 2014
61 min read
Save for later

Hadoop and SQL

In this article by Garry Turkington and Gabriele Modena, the author of the book Learning Hadoop 2. MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. However, it does require a different mindset and some training and experience on the model of breaking processing analytics into a series of map and reduce steps. There are several products that are built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This article will explore the other most common abstraction implemented atop Hadoop SQL. (For more resources related to this topic, see here.) In this article, we will cover the following topics: What the use cases for SQL on Hadoop are and why it is so popular HiveQL, the SQL dialect introduced by Apache Hive Using HiveQL to perform SQL-like analysis of the Twitter dataset How HiveQL can approximate common features of relational databases such as joins and views How HiveQL allows the incorporation of user-defined functions into its queries How SQL on Hadoop complements Pig Other SQL-on-Hadoop products such as Impala and how they differ from Hive Why SQL on Hadoop Until now, we saw how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL. Back in 2008, Facebook released Hive, the first widely used implementation of SQL on Hadoop. Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user. This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive. The combination of these attributes is that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of user-defined functions, enabling the base SQL dialect to be customized with business-specific functionality. Other SQL-on-Hadoop solutions Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. Later in this article, we will also discuss Impala, released in 2013 and already a very popular tool, particularly for low-latency queries. There are others, but we will mostly discuss Hive and Impala as they have been the most successful. While introducing the core features and capabilities of SQL on Hadoop, however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. 
We'll generally be looking at aspects of the feature set that is common to both, but if you use both products, it's important to read the latest release notes to understand the differences. Prerequisites Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this article. We'll create a modified version of a former Pig script as the main functionality for this. The script in this article assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows: -- load JSON data tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad'); -- Tweets tweets_tsv = foreach tweets { generate    (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,    (chararray)$0#'id_str', (chararray)$0#'text' as text,    (chararray)$0#'in_reply_to', (boolean)$0#'retweeted' as is_retweeted, (chararray)$0#'user'#'id_str' as user_id, (chararray)$0#'place'#'id' as place_id; } store tweets_tsv into '$outputDir/tweets' using PigStorage('u0001'); -- Places needed_fields = foreach tweets {    generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,      (chararray)$0#'id_str' as id_str, $0#'place' as place; } place_fields = foreach needed_fields { generate    (chararray)place#'id' as place_id,    (chararray)place#'country_code' as co,    (chararray)place#'country' as country,    (chararray)place#'name' as place_name,    (chararray)place#'full_name' as place_full_name,    (chararray)place#'place_type' as place_type; } filtered_places = filter place_fields by co != ''; unique_places = distinct filtered_places; store unique_places into '$outputDir/places' using PigStorage('u0001');   -- Users users = foreach tweets {    generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)$0#'id_str' as id_str, $0#'user' as user; } user_fields = foreach users {    generate    (chararray)CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)user#'id_str' as user_id, (chararray)user#'location' as user_location, (chararray)user#'name' as user_name, (chararray)user#'description' as user_description, (int)user#'followers_count' as followers_count, (int)user#'friends_count' as friends_count, (int)user#'favourites_count' as favourites_count, (chararray)user#'screen_name' as screen_name, (int)user#'listed_count' as listed_count;   } unique_users = distinct user_fields; store unique_users into '$outputDir/users' using PigStorage('u0001'); Have a look at the following code: $ pig –f extract_for_hive.pig –param inputDir=<json input> -param outputDir=<output path> The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to unicode value U0001, or you can also use Ctrl +C + A. This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields. Overview of Hive We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. 
In this example, and in the remainder of the article, we will assume that queries are typed into the shell that can be invoked by executing the hive command. Even though the classic CLI tool for Hive was the tool with the same name, it is specifically called hive (all lowercase); recently a client called Beeline also became available and will likely be the preferred CLI client in the near future. When importing any new data into Hive, there is generally a three-stage process, as follows: Create the specification of the table into which the data is to be imported Import the data into the created table Execute HiveQL queries against the table Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this article, but if you need a refresher, there are numerous good online learning resources. Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the name and types of its columns, and some metadata about how the table is stored: CREATE table tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; The statement creates a new table tweet defined by a list of names for columns in the dataset and their data type. We specify that fields are delimited by a tab character t and that the format used to store data is TEXTFILE. Data can be imported from a location in HDFS tweets/ into hive using the LOAD DATA statement: LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets; By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later. Once data has been imported into Hive, we can run queries against it. For instance: SELECT COUNT(*) FROM tweets; The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase. If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times. The nature of Hive tables Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point. 
Neither the CREATE TABLE nor the LOAD DATA statement truly creates concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored. This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises. Hive architecture Until version 2, Hadoop was primarily a batch system. MapReduce jobs tend to have high latency and overhead derived from submission and scheduling. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later. Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2. Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password. HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2. HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2. Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively. In the examples we saw before and in the remainder of this article, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. For compatibility and maturity reasons, Beeline being relatively new, both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded mode with the following command: $ beeline -u jdbc:hive2:// Data types HiveQL supports many of the common data types provided by standard database systems. 
These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can distinguish Hive data types into the following five broad categories: Numeric: tinyint, smallint, int, bigint, float, double, and decimal Date and time: timestamp and date String: string, varchar, and char Collections: array, map, struct, and uniontype Misc: boolean, binary, and NULL DDL statements HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. The following SHOW [DATABASES, TABLES, VIEWS] statement displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use: CREATE DATABASE twitter; SHOW databases; USE twitter; SHOW TABLES; The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception. Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally. The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code: CREATE EXTERNAL TABLE tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/tweets'; This table will be created in metastore, but the data will not be copied into the /user/hive/warehouse directory. Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse. The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we might want to create a view to isolate retweets from other messages, as follows: CREATE VIEW retweets COMMENT 'Tweets that have been retweeted' AS SELECT * FROM tweets WHERE retweeted = true; Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views. 
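Returning to the collection types listed under Data types, the following is a minimal sketch rather than an example from the original article; the table and column names are purely illustrative. It shows how array, map, and struct columns can be declared and then addressed with index, key, and dot notation respectively:
CREATE TABLE profile_sketch (
  user_id string,
  interests array<string>,
  counters map<string, int>,
  home struct<city: string, country: string>
);

SELECT
  interests[0],           -- first element of the array
  counters['followers'],  -- value stored under a map key
  home.country            -- individual field of the struct
FROM profile_sketch;
When such a table is backed by delimited text files, the ROW FORMAT DELIMITED clause can additionally specify COLLECTION ITEMS TERMINATED BY and MAP KEYS TERMINATED BY characters so that Hive knows how to split the nested values.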
The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected. Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns. When adding columns, it is important to remember that only metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table, which didn't exist in older files, then, while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view. File formats and storage The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH. Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause. Hive currently uses the following FileFormat classes to read and write HDFS files: TextInputFormat and HiveIgnoreKeyTextOutputFormat: These will read/write data in plain text file format SequenceFileInputFormat and SequenceFileOutputFormat: These classes read/write data in the Hadoop SequenceFile format Additionally, the following SerDe classes can be used to serialize and deserialize data: MetadataTypedColumnsetSerDe: This will read/write delimited records such as CSV or tab-separated records ThriftSerDe and DynamicSerDe: These will read/write Thrift objects JSON As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe JSON SerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules. We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde. As with any third-party module, we load the SerDe JARs into Hive with the following code: ADD JAR json-serde-1.3-jar-with-dependencies.jar; Then, we issue the usual create statement, as follows: CREATE EXTERNAL TABLE tweets (    contributors string,    coordinates struct <      coordinates: array <float>,      type: string>,    created_at string,    entities struct <      hashtags: array <struct <            indices: array <tinyint>,            text: string>>, … ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE LOCATION 'tweets'; With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). 
In Hive 0.13 and later, the ROW FORMAT SERDE property for JSON data can be expressed as 'org.apache.hive.hcatalog.data.JsonSerDe'. Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary. In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code: SELECT user.screen_name, user.description FROM tweets_json LIMIT 10; Avro AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from Hive 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose. This dataset was created using Pig's AvroStorage class, which generated the following schema: { "type":"record", "name":"record", "fields": [    {"name":"topic","type":["null","int"]},    {"name":"source","type":["null","int"]},    {"name":"rank","type":["null","float"]} ] } The structure is quite self-explanatory. The table structure is captured in an Avro record, which contains header information (a name and an optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string. For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and it is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema. With this definition, we can now create a Hive table that uses this schema for its table specification, as follows: CREATE EXTERNAL TABLE tweets_pagerank ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{    "type":"record",    "name":"record",    "fields": [        {"name":"topic","type":["null","int"]},        {"name":"source","type":["null","int"]},        {"name":"rank","type":["null","float"]}    ] }') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '${data}/ch5-pagerank'; We can then look at the resulting table definition from within Hive (the same metadata is also visible to other metastore clients, such as HCatalog): DESCRIBE tweets_pagerank; OK topic                 int                   from deserializer   source               int                   from deserializer   rank                 float                 from deserializer In the DDL, we told Hive that data is stored in the Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal. Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure.
Create the preceding schema in a file called pagerank.avsc—this is the standard file extension for Avro schemas. Then place it on HDFS; we want to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc'). If the Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, for example, on the Cloudera CDH5 VM: ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar; We can then use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank: SELECT source, topic FROM tweets_pagerank WHERE rank >= 0.9; We will see how Avro and avro.schema.url play an instrumental role in enabling schema migrations. Columnar stores Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats. If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile, each full row and all its columns will be read from the disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest. Traditional relational databases also store data on a row basis, and a type of database called a columnar store changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns would be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries. Queries Tables can be queried using the familiar SELECT … FROM statement. The WHERE clause allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset: SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt DESC LIMIT 10; The following are the top 10 most prolific users in the dataset: NULL 7091 1332188053 4 959468857 3 1367752118 3 362562944 3 58646041 3 2375296688 3 1468188529 3 37114209 3 2385040940 3 This allows us to identify the number of tweets, 7,091, with no user object. We can improve the readability of the hive output with the following setting: SET hive.cli.print.header=true; This will instruct hive, though not beeline, to print column names as part of the output. You can add the command to the .hiverc file usually found in the root of the executing user's home directory to have it apply to all hive CLI sessions. HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into Hive using external tables.
We first create a user table to store user data, as follows: CREATE EXTERNAL TABLE user ( created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE LOCATION '${input}/users'; We then create a place table to store location data, as follows: CREATE EXTERNAL TABLE place ( place_id string, country_code string, country string, `name` string, full_name string, place_type string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE LOCATION '${input}/places'; We can use the JOIN operator to display the names of the 10 most prolific users, as follows: SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt FROM tweets JOIN user ON user.user_id = tweets.user_id GROUP BY tweets.user_id, user.user_id, user.name ORDER BY cnt DESC LIMIT 10; Only equality, outer, and left (semi) joins are supported in Hive. Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets table. Alternatively, we can rewrite the previous query as follows: SELECT tweets.user_id, u.name, COUNT(*) AS cnt FROM tweets JOIN (SELECT user_id, name FROM user GROUP BY user_id, name) u ON u.user_id = tweets.user_id GROUP BY tweets.user_id, u.name ORDER BY cnt DESC LIMIT 10; Instead of directly joining the user table, we execute a subquery, as follows: SELECT user_id, name FROM user GROUP BY user_id, name; The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries; historically, it permitted a subquery only in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause as well. HiveQL is an ever-evolving rich language, a full exposition of which is beyond the scope of this article. A description of its query and DDL capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual. Structuring Hive tables for given workloads Often Hive isn't used in isolation; instead, tables are created with particular workloads in mind, or queries are invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios. Partitioning a table With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning. When creating a partitioned table, a column is specified as the partition key. All records with the same value of that key are then stored together. In Hive's case, different subdirectories for each partition key value are created under the table directory in the warehouse location on HDFS. It's important to understand the cardinality of the partition column. With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered.
Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows: CREATE TABLE partitioned_user ( created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) PARTITIONED BY (created_at_date string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE; To load data into a partition, we can explicitly give a value for the partition in which to insert the data, as follows: INSERT INTO TABLE partitioned_user PARTITION( created_at_date = '2014-01-01') SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count FROM user; This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables: SET hive.exec.dynamic.partition = true; SET hive.exec.dynamic.partition.mode = nonstrict; SET hive.exec.max.dynamic.partitions.pernode=5000; The first statement enables dynamic partitioning, the second allows all partition columns to be dynamic (the nonstrict option), and the third allows 5,000 distinct partitions to be created on each mapper and reducer node. We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row: INSERT INTO TABLE partitioned_user PARTITION( created_at_date ) SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date FROM user; Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause. Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table, as in the preceding code. We use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string. Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01. If data is added directly to the filesystem, for instance, by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition: ALTER TABLE <table_name> ADD PARTITION (<partition_spec>) LOCATION '<location>'; Using the MSCK REPAIR TABLE <table_name>; statement, all metadata for all partitions not currently present in the metastore will be added. On EMR, this is equivalent to executing the following code: ALTER TABLE <table_name> RECOVER PARTITIONS; Notice that both statements will also work with EXTERNAL tables. In the following article, we will see how this pattern can be exploited to create flexible and interoperable pipelines. Overwriting and updating data Partitioning is also useful when we need to update a portion of a table.
Normally a statement of the following form will replace all the data for the destination table: INSERT OVERWRITE TABLE <table>… If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched. If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement. We can also add conditions to the SELECT statement. Say, for example, we only wanted to update data for a certain month: INSERT OVERWRITE TABLE partitioned_user PARTITION (created_at_date) SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date FROM user WHERE to_date(created_at) BETWEEN '2014-03-01' AND '2014-03-31'; Bucketing and sorting Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure. Let's create bucketed versions of our tweets and user tables; note the additional CLUSTERED BY and SORTED BY clauses in the following CREATE TABLE statements: CREATE TABLE bucketed_tweets ( tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) PARTITIONED BY (created_at string) CLUSTERED BY(user_id) INTO 64 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE; CREATE TABLE bucketed_user ( user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) PARTITIONED BY (created_at string) CLUSTERED BY(user_id) SORTED BY(name) INTO 64 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' STORED AS TEXTFILE; Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned. Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table: SET hive.enforce.bucketing=true; Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table. When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause. One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example.
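As a concrete illustration of the INSERT…SELECT… approach, the following is a minimal sketch of how bucketed_user might be populated from the plain user table created earlier; the SET statements simply repeat, for completeness, the session flags described previously:

SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column (created_at) comes last; Hive hashes user_id to choose the bucket
INSERT OVERWRITE TABLE bucketed_user PARTITION (created_at)
SELECT user_id, `location`, name, description, followers_count,
       friends_count, favourites_count, screen_name, listed_count, created_at
FROM user;

With both tables populated this way, Hive can exploit the common bucketing of bucketed_user and bucketed_tweets when they are joined.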
So, for example, any query of the following form would be vastly improved: SET hive.optimize.bucketmapjoin=true; SELECT … FROM bucketed_user u JOIN bucketed_tweets t ON u.user_id = t.user_id; With the join being performed on the column used to bucket the tables, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id values in both tables. When determining which rows to match, only those in the corresponding bucket need to be compared, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second. Sampling data Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as value ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size. For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table: SELECT max(friends_count) FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name); In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function. Though successful, this is highly inefficient as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure that the bucket count in the TABLESAMPLE clause is equal to, or a factor of, the number of buckets in the table, then Hive will only read the buckets in question. The following code is representative of this case: SELECT MAX(friends_count) FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id); In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition.
The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively: TABLESAMPLE(0.5 PERCENT) TABLESAMPLE(1G) TABLESAMPLE(100 ROWS) If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input formats and file formats that are supported. Writing scripts We can place Hive commands in a file and run them with the -f option in the hive CLI utility: $ cat show_tables.hql show tables; $ hive -f show_tables.hql We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example: $ cat show_tables2.hql show tables like '${hiveconf:TABLENAME}'; $ hive -hiveconf TABLENAME=user -f show_tables2.hql The variable can also be set within the Hive script or an interactive session: SET TABLENAME='user'; The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows: $ cat show_tables3.hql show tables like '${hivevar:TABLENAME}'; $ hive -hivevar TABLENAME=user -f show_tables3.hql Or we can set the variable interactively: SET hivevar:TABLENAME='user'; Hive and Amazon Web Services With ElasticMapReduce as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, be it within EMR or your own local cluster. Hive and S3 It is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. But it doesn't have to be an all-or-nothing thing; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster to be processed, and any resulting data can either be written to a different S3 location (the same table cannot be the source and destination of a single query) or onto HDFS. We can take a file of our tweet data and place it onto a location in S3 with a command such as the following: $ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/ We first need to specify the access key and secret access key that can access the bucket. This can be done in three ways: Set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey to the appropriate values in the Hive CLI Set the same values in hive-site.xml, though note that this limits the use of S3 to a single set of credentials Specify the table location explicitly in the table URL, that is, s3n://<access key>:<secret access key>@<bucket>/<path> Then we can create a table referencing this data, as follows: CREATE TABLE remote_tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) CLUSTERED BY(user_id) INTO 64 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://<bucket-name>/tweets'; This can be an incredibly effective way of pulling S3 data into a local Hadoop cluster for processing. In order to use AWS credentials in the URI of an S3 location, regardless of how the parameters are passed, the secret and access keys must not contain /, +, =, or \ characters. If necessary, a new set of credentials can be generated from the IAM console at https://console.aws.amazon.com/iam/.
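Once defined, remote_tweets behaves like any other Hive table; the following hedged sketch shows a query served directly from S3 and, as an alternative, a one-off copy into a local, HDFS-backed table (local_tweets is a hypothetical name, and a CREATE TABLE AS SELECT target does not retain the bucketing of the source definition):

-- Executed against the data in S3 at query time
SELECT retweeted, COUNT(*) AS cnt
FROM remote_tweets
GROUP BY retweeted;

-- Materialize a local copy for repeated processing
CREATE TABLE local_tweets AS SELECT * FROM remote_tweets;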
In theory, you can just leave the data in the external table and refer to it only when needed, though the WAN data transfer latencies (and costs) mean that it often makes sense to pull the data into a local table and do future processing from there. If the table is partitioned, then you might find yourself retrieving a new partition each day, for example. Hive on ElasticMapReduce On one level, using Hive within Amazon ElasticMapReduce is just the same as everything discussed in this article. You can create a persistent cluster, log in to the master node, and use the Hive CLI to create tables and submit queries. Doing all this will use the local storage on the EC2 instances for the table data. Not surprisingly, jobs on EMR clusters can also refer to tables whose data is stored on S3 (or DynamoDB). And, not surprisingly, Amazon has made extensions to its version of Hive to make all this very seamless. It is quite simple from within an EMR job to pull data from a table stored in S3, process it, write any intermediate data to the EMR local storage, and then write the output results into S3, DynamoDB, or one of a growing list of other AWS services. The pattern mentioned earlier where new data is added to a new partition directory for a table each day has proved very effective in S3; it is often the storage location of choice for large and incrementally growing datasets. There is a syntax difference when using EMR; instead of the MSCK command mentioned earlier, the command to update a Hive table with new data added to a partition directory is as follows: ALTER TABLE <table-name> RECOVER PARTITIONS; Consult the EMR documentation for the latest enhancements at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html. Also, consult the broader EMR documentation. In particular, the integration points with other AWS services are an area of rapid growth. Extending HiveQL The HiveQL language can be extended by means of plugins and third-party functions. In Hive, there are three types of functions, characterized by the number of rows they take as input and produce as output: User Defined Functions (UDFs): These are simpler functions that act on one row at a time. User Defined Aggregate Functions (UDAFs): These functions take multiple rows as input and generate a single output row. These are aggregate functions to be used in conjunction with a GROUP BY statement (similar to COUNT(), AVG(), MIN(), MAX(), and so on). User Defined Table Functions (UDTFs): These take a single row as input and generate a logical table comprised of multiple rows, typically used with constructs such as LATERAL VIEW. These APIs are provided only in Java. For other languages, it is possible to stream data through a user-defined script using the TRANSFORM, MAP, and REDUCE clauses that act as a frontend to Hadoop's streaming capabilities. Two APIs are available to write UDFs. A simple API, org.apache.hadoop.hive.ql.exec.UDF, can be used for functions that read and return basic writable types. A richer API, which provides support for data types other than writables, is available in the org.apache.hadoop.hive.ql.udf.generic.GenericUDF package. We'll now illustrate how org.apache.hadoop.hive.ql.exec.UDF can be used to implement a string to ID function similar to the one we used in Iterative Computation with Spark, to map hashtags to integers in Pig.
Building a UDF with this API only requires extending the UDF class and writing an evaluate() method, as follows: public class StringToInt extends UDF {    public Integer evaluate(Text input) {        if (input == null)            return null;            String str = input.toString();          return str.hashCode();    } } The function takes a Text object as input and maps it to an integer value with the hashCode() method. The source code of this function can be found at https://github.com/learninghadoop2/book-examples/ch7/udf/ com/learninghadoop2/hive/udf/StringToInt.java. A more robust hash function should be used in production. We compile the class and archive it into a JAR file, as follows: $ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/udf/StringToInt.java $ jar cvf myudfs-hive.jar com/learninghadoop2/hive/udf/StringToInt.class Before being able to use it, a UDF must be registered in Hive with the following commands: ADD JAR myudfs-hive.jar; CREATE TEMPORARY FUNCTION string_to_int AS 'com.learninghadoop2.hive.udf.StringToInt'; The ADD JAR statement adds a JAR file to the distributed cache. The CREATE TEMPORARY FUNCTION <function> AS <class> statement registers as a function in Hive that implements a given Java class. The function will be dropped once the Hive session is closed. As of Hive 0.13, it is possible to create permanent functions whose definition is kept in the metastore using CREATE FUNCTION … . Once registered, StringToInt can be used in a Hive query just as any other function. In the following example, we first extract a list of hashtags from the tweet's text by applying the regexp_extract function. Then, we use string_to_int to map each tag to a numerical ID: SELECT unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag) AS tag_id FROM    (        SELECT regexp_extract(text,            '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') as hashtag        FROM tweets        GROUP BY regexp_extract(text,        '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') ) unique_hashtags GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag); We can use the preceding query to create a lookup table, as follows: CREATE TABLE lookuptable (tag string, tag_id bigint); INSERT OVERWRITE TABLE lookuptable SELECT unique_hashtags.hashtag,    string_to_int(unique_hashtags.hashtag) as tag_id FROM (    SELECT regexp_extract(text,        '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag          FROM tweets          GROUP BY regexp_extract(text,            '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)')    ) unique_hashtags GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag); Programmatic interfaces In addition to the hive and beeline command-line tools, it is possible to submit HiveQL queries to the system via the JDBC and Thrift programmatic interfaces. Support for odbc was bundled in older versions of Hive, but as of Hive 0.12, it needs to be built from scratch. More information on this process can be found at https://cwiki.apache.org/confluence/display/Hive/HiveODBC. JDBC A Hive client written using JDBC APIs looks exactly the same as a client program written for other database systems (for example MySQL). The following is a sample Hive client program using JDBC APIs. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/ch7/clients/ com/learninghadoop2/hive/client/HiveJdbcClient.java. 
import java.sql.*; public class HiveJdbcClient {      private static String driverName = "org.apache.hive.jdbc.HiveDriver";           // connection string      public static String URL = "jdbc:hive2://localhost:10000";        // Show all tables in the default database      public static String QUERY = "show tables";        public static void main(String[] args) throws SQLException {          try {                Class.forName(driverName);          }          catch (ClassNotFoundException e) {                e.printStackTrace();                System.exit(1);          }          Connection con = DriverManager.getConnection(URL);          Statement stmt = con.createStatement();                   ResultSet resultSet = stmt.executeQuery(QUERY);          while (resultSet.next()) {                System.out.println(resultSet.getString(1));          }    } } The URL part is the JDBC URI that describes the connection end point. The format for establishing a remote connection is jdbc:hive2://<host>:<port>/<database>. Connections in embedded mode can be established by not specifying a host or port: jdbc:hive2://. The hive and hive2 parts identify the driver to be used when connecting to HiveServer and HiveServer2, respectively. The QUERY string contains the HiveQL query to be executed. Hive's JDBC interface exposes only the default database. In order to access other databases, you need to reference them explicitly in the underlying queries using the <database>.<table> notation. First we load the HiveServer2 JDBC driver, org.apache.hive.jdbc.HiveDriver. Use org.apache.hadoop.hive.jdbc.HiveDriver to connect to HiveServer instead. Then, as with any other JDBC program, we establish a connection to URL and use it to instantiate a Statement object. We execute QUERY, with no authentication, and store the output dataset in the ResultSet object. Finally, we scan resultSet and print its content to the command line. Compile and execute the example with the following commands: $ javac HiveJdbcClient.java $ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar: com.learninghadoop2.hive.client.HiveJdbcClient Thrift Thrift provides lower-level access to Hive and has a number of advantages over the JDBC implementation of HiveServer. Primarily, it allows multiple connections from the same client, and it allows programming languages other than Java to be used with ease. With HiveServer2, it is a less commonly used option but still worth mentioning for compatibility. A sample Thrift client implemented using the Java API can be found at https://github.com/learninghadoop2/book-examples/ch7/clients/com/learninghadoop2/hive/client/HiveThriftClient.java. This client can be used to connect to HiveServer, but due to protocol differences, the client won't work with HiveServer2. In the example, we define a getClient() method that takes as input the host and port of a HiveServer service and returns an instance of org.apache.hadoop.hive.service.ThriftHive.Client.
A client is obtained by first instantiating a socket connection, org.apache.thrift.transport.TSocket, to the HiveServer service, and by specifying a protocol, org.apache.thrift.protocol.TBinaryProtocol, to serialize and transmit data, as follows:        TSocket transport = new TSocket(host, port);        transport.setTimeout(TIMEOUT);        transport.open();        TBinaryProtocol protocol = new TBinaryProtocol(transport);        client = new ThriftHive.Client(protocol); Finally, we call getClient() from the main method and use the client to execute a query against an instance of HiveServer running on localhost on port 11111, as follows:      public static void main(String[] args) throws Exception {          Client client = getClient("localhost", 11111);          client.execute("show tables");          List<String> results = client.fetchAll();           for (String result : results) { System.out.println(result);           }      } Make sure that HiveServer is running on port 11111, and if not, start an instance with the following command: $ sudo hive --service hiveserver -p 11111 Compile and execute the HiveThriftClient.java example with the following commands: $ javac -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/client/HiveThriftClient.java $ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*: com.learninghadoop2.hive.client.HiveThriftClient Stinger initiative Hive has remained very successful and capable since its earliest releases, particularly in its ability to provide SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation of being relatively slow, particularly in regard to lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries. These perceived limitations were less due to Hive itself and more a consequence of how the translation of SQL queries into the MapReduce model has much built-in inefficiency when compared to other ways of implementing a SQL query. Particularly in regard to very large datasets, MapReduce saw lots of I/O (and consequently time) spent writing out the results of one MapReduce job just to have them read by another. As discussed in Processing - MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule tasks on a Hadoop cluster as a graph of tasks that does not require inefficient writes and reads between tasks in the graph. The following query will be used to compare execution on the MapReduce framework versus Tez: SELECT a.country, COUNT(b.place_id) FROM place a JOIN tweets b ON (a.place_id = b.place_id) GROUP BY a.country; The following figure contrasts the execution plan for the preceding query on the MapReduce framework versus Tez: Hive on MapReduce versus Tez In plain MapReduce, two jobs are created for the GROUP BY and JOIN clauses. The first job is composed of a set of MapReduce tasks that read data from the disk to carry out the grouping. The reducers write intermediate results to the disk so that output can be synchronized. The mappers in the second job read the intermediate results from the disk as well as data from table b. The combined dataset is then passed to the reducers where shared keys are joined. Were we to execute an ORDER BY statement, this would have resulted in a third job and further MapReduce passes. The same query is executed on Tez as a single job by a single set of Map tasks that read data from the disk; grouping and joining are pipelined across the reducers.
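One way to inspect what Hive actually plans to run for a given query, and to compare the shape of the plan under different execution engines, is the EXPLAIN statement; the following sketch reuses the query above, and its output (not shown here, as it varies by Hive version and engine) lists the stage dependencies and the operators in each stage:

EXPLAIN
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;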
Alongside these architectural limitations, there were quite a few areas around SQL language support that could also provide better efficiency, and in early 2013, the Stinger initiative was launched with an explicit goal of making Hive over 100 times as fast and with much richer SQL support. Hive 0.13 has all the features of the three phases of Stinger, resulting in a much more complete SQL dialect. Also, Tez is offered as an execution framework in addition to a more efficient MapReduce-based implementation atop YARN. With Tez as the execution engine, Hive is no longer limited to a series of linear MapReduce jobs and can instead build a processing graph where any given step can, for example, stream results to multiple sub-steps. To take advantage of the Tez framework, there is a new Hive variable setting, as follows: set hive.execution.engine=tez; This setting relies on Tez being installed on the cluster; it is available in source form from http://tez.incubator.apache.org or in several distributions, though at the time of writing, not Cloudera, due to its support of Impala. The alternative value is mr, which uses the classic MapReduce model (atop YARN), so it is possible in a single installation to compare the performance of Hive using Tez. Impala Hive is not the only product providing the SQL-on-Hadoop capability. The second most widely used is likely Impala, announced in late 2012 and released in spring 2013. Though originally developed internally within Cloudera, its source code is periodically pushed to an open source Git repository (https://github.com/cloudera/impala). Impala was created out of the same perception of Hive's weaknesses that led to the Stinger initiative. Impala also took some inspiration from Google Dremel (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf) which was first openly described by a paper published in 2009. Dremel was built at Google to address the gap between the need for very fast queries on very large datasets and the high latency inherent in the existing MapReduce model underpinning Hive at the time. Dremel was a sophisticated approach to this problem that, rather than building mitigations atop MapReduce such as implemented by Hive, instead created a new service that accessed the same data stored in HDFS. Dremel also benefited from significant work to optimize the storage format of the data in a way that made it more amenable to very fast analytic queries. The architecture of Impala The basic architecture has three main components; the Impala daemons, the state store, and the clients. Recent versions have added additional components that improve the service, but we'll focus on the high-level architecture. The Impala daemon (impalad) should be run on each host where a DataNode process is managing HDFS data. Note that impalad does not access the filesystem blocks through the full HDFS FileSystem API; instead, it uses a feature called short-circuit reads to make data access more efficient. When a client submits a query, it can do so to any of the running impalad processes, and this one will become the coordinator for the execution of that query. The key aspect of Impala's performance is that for each query, it generates custom native code, which is then pushed to and executed by all the impalad processes on the system. 
This highly optimized code performs the query on the local data, and each impalad then returns its subset of the result set to the coordinator node, which performs the final data consolidation to produce the final result. This type of architecture should be familiar to anyone who has worked with any of the (usually commercial and expensive) Massively Parallel Processing (MPP) data warehouse solutions available today, MPP being the term used for this type of shared scale-out architecture. As the cluster runs, the state store ensures that each impalad process is aware of all the others and provides a view of the overall cluster health. Co-existing with Hive Impala, as a newer product, tends to have a more restricted set of SQL data types and supports a more constrained dialect of SQL than Hive. It is, however, expanding this support with each new release. Refer to the Impala documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/impala.html) to get an overview of the current level of support. Impala supports the metastore mechanism used by Hive as a store of the metadata surrounding its table structure and storage. This means that on a cluster with an existing Hive setup, it should be immediately possible to use Impala as it will access the same metastore and therefore provide access to the same tables available in Hive. But be warned that the differences in SQL dialect and data types might cause unexpected results when working in a combined Hive and Impala environment. Some queries might work on one but not the other, they might show very different performance characteristics (more on this later), or they might actually give different results. This last point might become apparent when using data types such as float and double that are simply treated differently in the underlying systems (Hive is implemented in Java while Impala is written in C++). As of version 1.2, Impala supports UDFs written in both C++ and Java, although C++ is strongly recommended as a much faster solution. Keep this in mind if you are looking to share custom functions between Hive and Impala. A different philosophy When Impala was first released, its greatest benefit was in how it truly enabled what is often called speed of thought analysis. Queries could be returned sufficiently fast that an analyst could explore a thread of analysis in a completely interactive fashion without having to wait for minutes at a time for each query to complete. It's fair to say that most adopters of Impala were at times stunned by its performance, especially when compared to the version of Hive shipping at the time. The Impala focus has remained mostly on these shorter queries, and this does impose some limitations on the system. Impala tends to be quite memory-heavy as it relies on in-memory processing to achieve much of its performance. If a query requires more data to be held in memory than is available on the executing nodes, then that query will simply fail in versions of Impala before 2.0. Comparing the work on Stinger to Impala, it could be argued that Impala has a much stronger focus on excelling in the shorter (and arguably more common) queries that support interactive data analysis. Many business intelligence tools and services are now certified to run directly on Impala. The Stinger initiative has put less effort into making Hive just as fast in the area where Impala excels but has instead improved Hive (to varying degrees) for all workloads.
Impala is still developing at a fast pace and Stinger has put additional momentum into Hive, so it is most likely wise to consider both products and determine which best meets the performance and functionality requirements of your projects and workflows. It should also be kept in mind that there are competitive commercial pressures shaping the direction of Impala and Hive. Impala was created and is still driven by Cloudera, the most popular vendor of Hadoop distributions. The Stinger initiative, though contributed to by many companies as diverse as Microsoft (yes, really!) and Intel, was lead by Hortonworks, probably the second largest vendor of Hadoop distributions. The fact is that if you are using the Cloudera distribution of Hadoop, then some of the core features of Hive might be slower to arrive, whereas Impala will always be up-to-date. Conversely, if you use another distribution, you might get the latest Hive release, but that might either have an older Impala or, as is currently the case, you might have to download and install it yourself. A similar situation has arisen with the Parquet and ORC file formats mentioned earlier. Parquet is preferred by Impala and developed by a group of companies led by Cloudera, while ORC is preferred by Hive and is championed by Hortonworks. Unfortunately, the reality is that Parquet support is often very quick to arrive in the Cloudera distribution but less so in say the Hortonworks distribution, where the ORC file format is preferred. These themes are a little concerning since, although competition in this space is a good thing, and arguably the announcement of Impala helped energize the Hive community, there is a greater risk that your choice of distribution might have a larger impact on the tools and file formats that will be fully supported, unlike in the past. Hopefully, the current situation is just an artifact of where we are in the development cycles of all these new and improved technologies, but do consider your choice of distribution carefully in relation to your SQL-on-Hadoop needs. Drill, Tajo, and beyond You should also consider that SQL on Hadoop no longer only refers to Hive or Impala. Apache Drill (http://drill.incubator.apache.org) is a fuller implementation of the Dremel model first described by Google. Although Impala implements the Dremel architecture across HDFS data, Drill looks to provide similar functionality across multiple data sources. It is still in its early stages, but if your needs are broader than what Hive or Impala provides, it might be worth considering. Tajo (http://tajo.apache.org) is another Apache project that seeks to be a full data warehouse system on Hadoop data. With an architecture similar to that of Impala, it offers a much richer system with components such as multiple optimizers and ETL tools that are commonplace in traditional data warehouses but less frequently bundled in the Hadoop world. It has a much smaller user base but has been used by certain companies very successfully for a significant length of time, and might be worth considering if you need a fuller data warehousing solution. Other products are also emerging in this space, and it's a good idea to do some research. Hive and Impala are awesome tools, but if you find that they don't meet your needs, then look around—something else might. Summary In its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. 
Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world. HiveQL is an implementation of SQL on Hadoop and was the primary focus of this article. In regard to HiveQL and its implementations, we covered the following topics: How HiveQL provides a logical model atop data stored in HDFS in contrast to relational databases where the table structure is enforced in advance How HiveQL supports many standard SQL data types and commands including joins and views The ETL-like features offered by HiveQL, including the ability to import data into tables and optimize the table structure through partitioning and similar mechanisms How HiveQL offers the ability to extend its core set of operators with user-defined code and how this contrasts to the Pig UDF mechanism The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez The broader ecosystem around HiveQL that now includes products such as Impala, Tajo, and Drill and how each of these focuses on specific areas in which to excel With Pig and Hive, we've introduced alternative models to process MapReduce data, but so far we've not looked at another question: what approaches and tools are required to actually allow this massive dataset being collected in Hadoop to remain useful and manageable over time? In the next article, we'll take a slight step up the abstraction hierarchy and look at how to manage the life cycle of this enormous data asset. Resources for Article: Further resources on this subject: Big Data Analysis [Article] Understanding MapReduce [Article] Amazon DynamoDB - Modelling relationships, Error handling [Article]
Cloud distribution points

Packt
22 Dec 2014
8 min read
In this article by Vangel Krstevski, author of Mastering System Center Configuration manager, we will learn that a cloud distribution point is a fallback distribution point for the Configuration Manager clients and supports most of the content types. To create a cloud distribution point, you need a Windows Azure subscription, a DNS server, and certificates. For your production environment, you can use the Azure pricing calculator to calculate your subscription fee at http://azure.microsoft.com/en-us/pricing/calculator/?scenario=full. (For more resources related to this topic, see here.) Starting with System Center Configuration Manager SP1, you can use a Windows Azure cloud service to host a distribution point server. When you deploy a cloud-based distribution point server, you configure the client settings and through them, enable users and devices to access the content. You also have to specify a primary site that will manage the content transfer to the cloud-based distribution point. Additionally, you need to specify the thresholds for the amount of content that you want to store on the distribution point and the amount of content that you want to enable clients to transfer from the distribution point. Based on these thresholds, the Configuration Manager can raise alerts that warn you when the combined amount of content that you have stored on the distribution point is near the specified storage amount, or when the transfer of data by the clients is close to the threshold that you defined. The following features are supported by both on-premise and cloud-based distribution points: The management of cloud-based distribution points individually or as members of distribution point groups A cloud-based distribution point can be used as a fallback content location You receive support for both intranet- and Internet-based clients A cloud-based distribution point provides the following additional benefits: The content that is sent to the cloud-based distribution point is encrypted by Configuration Manager before the Configuration Manager sends it to Windows Azure In Windows Azure, you can manually scale the cloud service to meet the changing demands for content request by clients, without the requirement to install and provision additional distribution points The cloud-based distribution point supports the download of content by clients that are configured for Windows BranchCache A cloud-based distribution point has the following limitations: You cannot use a cloud-based distribution point to host software update packages. You cannot use a cloud-based distribution point for PXE-enabled or multicast-enabled deployments. Clients are not offered a cloud-based distribution point as a content location for a task sequence that is deployed using the Download content locally when needed by running task sequence deployment option. However, task sequences that are deployed using the Download all content locally before starting task sequence deployment option can use a cloud-based distribution point as a valid content location. A cloud-based distribution point does not support packages that run from the distribution point. All content must be downloaded by the client and then run locally. A cloud-based distribution point does not support the streaming of applications by using Application Virtualization or similar programs. Prestaged content is not supported. The primary site Distribution Manager that is used for distribution point management does all the content transfers to the distribution point. 
A cloud-based distribution point cannot be configured as a pull-distribution point. To configure a cloud-based distribution point, follow these steps: Create a management certificate and install it on the site server. This certificate establishes a trust relationship between the site server and Windows Azure. Create a cloud distribution point service certificate and install it on the site server. Create a Windows Azure subscription and import the previously created management certificate in Windows Azure through the management portal. Install a cloud distribution point role in Configuration Manager. Set up the client settings to allow Configuration Manager clients to use the cloud-based distribution point. Create a record in your DNS with the IP address of the cloud distribution point. Cloud distribution points – prerequisites A cloud-based distribution point has the following prerequisites: A Windows Azure subscription. A self-signed or PKI management certificate for communication between the Configuration Manager primary site server and the Windows Azure Cloud Service. A service certificate (PKI) that Configuration Manager clients will use in order to connect to the cloud-based distribution points and also to download content from these distribution points using secure transfer or HTTPS. Before users and devices can access the content on a cloud-based distribution point, a device or a user has to receive the client setting for cloud services of Allow access to cloud distribution points set to Yes. By default, this value is set to No. A client must be able to resolve the name of the cloud service, which requires a Domain Name System (DNS) alias and CNAME record, in your DNS namespace. A client must have Internet access in order to use the cloud-based distribution point. Creating certificates Use the following link to create the needed certificates for the Cloud distribution point creation: http://technet.microsoft.com/en-us/library/230dfec0-bddb-4429-a5db-30020e881f1e#BKMK_clouddp2008_cm2012 Importing the certificates in Windows Azure First, what you need to do is log in to your Windows Azure subscription. To do this, you have to perform the following steps: Go to https://manage.windowsazure.com. After you log in, go to SETTINGS from the menu on the left-hand side, as shown in the following screenshot: Click on MANAGEMENT CERTIFICATES, as shown here: Upload the management certificate that you created for the site server, as shown in the following screenshot: After the import, you will be able to see the certificate in the list of imported MANAGEMENT CERTIFICATES, as shown here: Creating the cloud distribution point In order to create the cloud distribution point, you have to do the following: Start the System Center Configuration Manager console. Navigate to Administration | Hierarchy Configuration | Cloud Services | Cloud Distribution Points, as shown in the following screenshot: From the ribbon bar, click on Create Cloud Distribution Point. On the General page, you have to enter the Windows Azure subscription ID. You can find your Windows Azure subscription ID in the Settings section of the Windows Azure management portal. Click on Browse… to select the certificate that you created for the site server, as shown here: On the Settings page, select the region, for example, West Europe. 
Click on Browse… and import the cloud distribution point service certificate, as shown in the next screenshot: On the Alerts page, you can configure the settings about the threshold levels of your cloud distribution point. These levels are important because they can alert you when levels drop below a certain level that you have defined. For the purpose of this project, just click on Next: Review all the settings in the Summary page and click on Next to start the cloud distribution point's installation process. After the Cloud distribution point is created, you will be able to see it in the list of Cloud Distribution Points in the System Center Configuration Manager console, as shown here: Configuring DNS for the cloud distribution point For clients to download content from a cloud distribution point, a DNS record must exist for the cloud distribution point's IP address. You can do this by adding a CNAME record in your DNS server that points to the site URL of the Windows Azure Cloud Service. The FQDN of your Windows Azure Cloud Service can be found by proceeding with the following steps: Log in to your Windows Azure subscription. Select Cloud Services from the menu on the left-hand side. From the list of cloud services, click on the service name that represents your cloud distribution point. This will open the cloud service dashboard. The site URL information can be found on the right-hand side of the dashboard, as shown in the following screenshot: Open your DNS server and create the CNAME record. For the alias name, enter CloudDP and for the FQDN of the target host, enter the site URL of your Windows Azure, shown as follows: Summary In this article we saw that the main benefit of a cloud distribution point is that it can work as a backup distribution point. We also saw how we can use System Center Configuration Manager 2012 R2 to deliver applications to different mobile device platforms. We also learned how to connect the Configuration Manager to Windows Intune in order to provide mobile device management and application deployment and to ensure secure and managed access to company resources. Resources for Article: Further resources on this subject: Configuring Roles for High Availability [article] Configuring and Managing the Mailbox Server Role [article] Integration with System Center Operations Manager 2012 SP1 [article]

Maintaining Your GitLab Instance

Packt
22 Dec 2014
14 min read
In this article by Jeroen van Baarsen, the author of the book GitLab Cookbook, we will see how we can manage GitLab. We look at procedures to update GitLab and also concentrate on troubleshooting problems faced. Updating an Omnibus installation Updating GitLab from a source installation Troubleshooting your GitLab installation Creating a backup Restoring a backup Importing Git repositories (For more resources related to this topic, see here.) Introduction When running your GitLab installation, you need to update it every once in a while. GitLab is released every 22nd day of the month, so around the end of the month would be a nice time to update! The releases on the 22nd day are well tested, as gitlab.com is using the release candidates in production all the time. This way, you can be sure that the release meets the standards of the GitLab team! In this chapter, we will take a look at how you can update your GitLab server and how you can create backups and restore them. If you want to know what has changed in the new release, you can take a look at the change log provided in the repository for GitLab at https://gitlab.com/gitlab-org/gitlab-ce/blob/master/CHANGELOG. Updating an Omnibus installation In this article, we will take a look at how you can update your GitLab installation when you install it via the Omnibus package. In this article, I'll make the assumption that you're using Ubuntu 12.04. You can use other Linux distributions; the steps for these can be found at about.gitlab.com. How to do it… Let's update the Omnibus installation with the following steps: Log in to your server via SSH. Stop the unicorn server: $ sudo gitlab-ctl stop unicorn Stop the background job server: $ sudo gitlab-ctl stop sidekiq Create a backup in case the upgrade fails: $ sudo gitlab-rake gitlab:backup:create Download the package from the GitLab website (https://about.gitlab.com/downloads/): $ wget https://downloads-packages.s3.amazonaws.com/ubuntu-12.04/gitlab_7.1.1-omnibus.1-1_amd64.deb Install the new package (change the x.x.x part to the correct version number from your download): sudo dpkg -i gitlab_x.x.x-omnibus.xxx.deb Reconfigure GitLab: sudo gitlab-ctl reconfigure Restart all the services: sudo gitlab-ctl restart How it works… On the 22nd day of every month, a new version of GitLab is released. This also includes the Omnibus package. As the installation of Omnibus based on GitLab does not take very long, the GitLab team has decided to install a new version of GitLab and preserve the old data of the old installation; this way, you don't go over an update process, but you will be guided through the installation process as if you're installing a new GitLab instance. So, when you're updating an Omnibus-based installation, you're not really updating but rather installing a newer version and reconfiguring it to use the old data. One thing you have to keep in mind is that making backups is very important. As any update can go wrong at some level, it's a good feeling to know that when stuff goes wrong, you always have a backup that you can use to get back up and running as quickly as possible. Updating GitLab from a source installation Updating used to be a lot of work; you had to open the update document to find out that you need to perform about 15 steps to upgrade your GitLab installation. To tackle this issue, the GitLab team has created a semiautomatic upgrader. When you run the upgrader, it will check whether there is a new minor version. If there is, it will start the upgrade process for you. 
It will perform database migrations and update config files for you. How to do it… Upgrade your source installation with the following steps: Log in to your server using SSH. We start by creating a backup just in case something goes wrong. Go to the folder of your GitLab instance: $ cd /home/git/gitlab Create the backup; this might take a little while depending on the number of repositories and the size of each individual repository: $ sudo -u git -H bundle exec rake gitlab:backup:create RAILS_ENV=production Stop the server: $ sudo service gitlab stop Run the upgrader: $ if [ -f bin/upgrade.rb ]; then sudo -u git -H ruby bin/upgrade.rb; else sudo -u git -H ruby script/upgrade.rb; fi Start the application: $ sudo service gitlab start && sudo service nginx restart Check whether everything went fine. We can use the self-check GitLab ships with: $ sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production If the GitLab check indicates that we need to upgrade the GitLab shell, we can do this by performing the following steps. Go to the shell directory: $ cd /home/git/gitlab-shell Fetch the latest code: $ sudo -u git -H git fetch Run the following command to set the pointer to the latest shell release. Change 1.9.4 to the current version number; you can find this number in the output of the check we ran earlier: $ sudo -u git -H git checkout v1.9.4 How it works… It is highly recommended that you create a backup before you run the upgrader, as every update can break something in your installation. It's a good feeling to know that when all goes wrong, you're prepared and can roll back the upgrade using the backup. Troubleshooting your GitLab installation When your GitLab instance does not work the way you expect it to, it's nice to have a way to check which parts of your installation are not working properly. In this article, we will take a look at the self-diagnostic tools provided by GitLab. How to do it… Learn how to troubleshoot your GitLab server with the following steps: Log in to your server using SSH. The first case shows troubleshooting for a GitLab source installation. Go to your gitlab folder: $ cd /home/git/gitlab To autodiagnose your installation, run the following command: $ sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production When there is a problem with your GitLab installation, it will be outputted in red text. The solution for the problem is also given; just follow the explanation given for the problem you run into. In this case, we have to run the following command: $ sudo -u git -H bundle exec rake gitlab:satellites:create RAILS_ENV=production If everything is green, your installation is in great shape! The next few steps concentrate on troubleshooting the GitLab Omnibus installation. Run the following command: $ sudo gitlab-rake gitlab:check When you have any problem with your GitLab installation, this will be outputted; the possible solution will also be shown. The solution that is given for this problem is the solution for the source installation. To fix the issue in the Omnibus installation, we need to alter the command a little: replace the $ sudo -u git -H bundle exec rake part with $ sudo gitlab-rake. So, the command will look as follows: $ sudo gitlab-rake gitlab:satellites:create RAILS_ENV=production If everything is green, your installation is in top shape!
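If you would rather run this self-check on a schedule than only when something feels broken, it is easy to wrap the Omnibus variant in a small script that cron or a monitoring system can call. The following is only a sketch; the log path is arbitrary, and note that, depending on your GitLab version, gitlab:check may not return a non-zero exit status for every warning, which is why the captured output is also searched for the "Try fixing it" hint that the check prints next to failing items:

#!/bin/bash
# Periodic GitLab self-check for an Omnibus installation (sketch).
LOG=/var/log/gitlab-selfcheck.log

# Capture the full check output for later inspection.
sudo gitlab-rake gitlab:check > "$LOG" 2>&1 || logger -t gitlab "gitlab:check exited with an error"

# The check prints 'Try fixing it' next to anything it considers broken.
if grep -qi "try fixing it" "$LOG"; then
    logger -t gitlab "gitlab:check reported problems, see $LOG"
    exit 1
fi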
In case you need to view the logs, you can run the following command: $ sudo gitlab-ctl tail To exit the log flow, use Ctrl + C. How it works… When you think your GitLab installation might not be in a good shape or you think you've found a bug, it's always a good idea to run the self-diagnostics for GitLab. It will tell you whether you've configured GitLab correctly and whether everything is still up to date. Here is a list of what will be checked: Is your database config correct? If your database is still running SQLite instead of PostgreSQL or MySQL, it will give you a warning. Are all the user groups configured correctly? Is the GitLab config present and up to date? Are the logs writable? Is the tmp directory writable? Is the init script present and up to date? Are all the projects namespaced? This is important if you've upgraded from an old version. Are all the satellites present? Is your Redis up to date? Is the correct Ruby version being used? Is the correct Git version being used? This is the first place you should start when walking into problems. If this does not give you the correct answers, you can open an issue on the GitLab repository (https://gitlab.com/gitlab-org/gitlab-ce). Make sure you post the output of this check as well, as you will most likely be asked to post it anyway. Creating a backup It's important that you have your source code secured so that when your laptop breaks down—or even worse, the office got destroyed—the most valuable part of your company (besides the employees, of course) is still intact. You've already taken an important step into the right direction; you set up a Git server so that your source code has multiple places to live. However, what if that server breaks down? This is where backups come into play! GitLab Omnibus makes it really easy to create backups. With just a simple command, everything in your system gets backed up: the repositories as well as the databases. It's all packed in a tar ball, so you can store it elsewhere, for example, Amazon S3 or just another server somewhere else. In this article, we not only created a backup, but also created a schema to automatically back up the creation using crontabs. This way, you can rest assured that all of your code gets backed up every night. How to do it… In the following steps, we will set up the backups for GitLab: First, log in to your server using SSH. The first few steps concentrate on the GitLab source installation. Go to your gitlab folder: cd /home/git/gitlab Run the backup command: bundle exec rake gitlab:backup:create RAILS_ENV=production The backup will now run. This might take a while depending on the number of repositories and the size of each repository. The backups will be stored in the /home/git/gitlab/tmp/backups directory. Having to create a backup by hand everyday is no fun, so let's automate the creation of backups using a cronjob file. Run the following command to open the cronjob file: $ sudo -u git crontab -e Add the following content to the end of the file: 0 2 * * * cd /home/git/gitlab && PATH=/usr/local/bin:/usr/bin:/bin bundle exec rake gitlab:backup:create RAILS_ENV=production Save the file, and the backups will be created everyday at 2 A.M. The next few steps talk about the GitLab Omnibus installation. To create the backup, run the following command: $ sudo gitlab-rake gitlab:backup:create A backup is now created in the $ /var/opt/gitlab/backups directory. 
Let's verify that our backup is actually there: $ ls /var/opt/gitlab/backups/ You should see at least one filename, such as 1404647816_gitlab_backup.tar. The number in the filename is a timestamp, so this might differ in your case. Now we know how to create the backup. Let's automate this via a cronjob file. Run the following command as the root user: $ crontab -e Add the following code to the end of the file to have the backup run every day at 2 A.M.: 0 2 * * * /opt/gitlab/bin/gitlab-rake gitlab:backup:create After you save the file, a backup will be created every day at 2 A.M. This is great, but there is a tiny catch; if you have the backups run for too long, it will take up all of your disk space. Let's fix this! Open this file location: /etc/gitlab/gitlab.rb. Let's have the backups only last for a week. After that, they will be destroyed. 7 days is 604,800 seconds. Add the following code to the bottom of the file: gitlab_rails['backup_keep_time'] = 604800 To have the changes take effect, we have to tell GitLab to reconfigure itself. Run the following command: $ sudo gitlab-ctl reconfigure Restoring a backup If your server breaks down, it is nice to have a backup. However, it's a pain when it's a full day's work to restore that backup; it's just a waste of time. Luckily, GitLab makes it super easy to restore the backup. Retrieve the backup from your external Amazon S3 storage or just your external hard drive, copy the file to your server, and run the backup restore command. It won't get any easier! Getting ready Make sure you have a recent backup of your GitLab instance. After you restore the backup, all the data created between the backup creation and the restoration of your backup will be lost. How to do it… Let's restore a backup using the following steps: Start with a login to your server using SSH. The next few steps concentrate on the GitLab source installation. Go to your gitlab folder: $ cd /home/git/gitlab Go to the backup folder of GitLab: $ cd /home/git/gitlab/tmp/backups Look at the filename of the most recent file and note the number that the filename starts with. Go back to your GitLab folder: $ cd /home/git/gitlab Now, run the following command and replace the 1234567 part with the number you took from the latest backup filename: $ bundle exec rake gitlab:backup:restore RAILS_ENV=production BACKUP=1234567 The next few steps concentrate on the GitLab Omnibus installation. Make sure your backup file is located at /var/opt/gitlab/backups: $ cp 1407564013_gitlab_backup.tar /var/opt/gitlab/backups/ Before we can restore the backup, we need to stop our instance. First, stop GitLab itself: $ sudo gitlab-ctl stop unicorn Next, stop the background worker: $ sudo gitlab-ctl stop sidekiq Now, we will restore our backup. You need to provide the timestamp of the backup you want to restore. The timestamp is the long number before the filename. Warning: this will overwrite all the data in your database! The following command depicts this: $ sudo gitlab-rake gitlab:backup:restore BACKUP=TIMESTAMP_OF_BACKUP Restoring the actual backup might take a little while depending on the size of your database. Restart your GitLab server again: $ sudo gitlab-ctl start Importing an existing repository It's quite simple to import your repositories from somewhere else. All you need to do is create a new project and select the repository to be imported. In this article, we will take a look at how this is done. 
For this article, we will import the repository hosted on GitHub at https://github.com/gitlabhq/gitlab-shell. How to do it… In the following steps, we will import a repository: Log in to your GitLab instance. Click on New project. Enter the project name as GitLab Shell. Click on Import existing repository?. Enter the URL for the repository we want to import: https://github.com/gitlabhq/gitlab-shell Now, click on Create Project. Importing the existing repository might take a while depending on the size of the repository. After the importing is done, you will be redirected to the project page. Let's check whether it's actually an imported repository. Click on the Network menu item. If everything is fine, you should see the graph in the following screenshot: How it works… There is really nothing magical about importing a repository. All GitLab does is clone the URL you give it to its own satellite. After this is done, the satellite will be linked to your project, and you're done! What if your repository is private and not publicly accessible? You can import it by adding the user credentials in the URL. Don't worry; this information is not stored anywhere! So, if we have the same repository as the one we used earlier but it has credentials, it will look like what is shown in https://username:password@github.com/gitlabhq/gitlab-shell. Summary In this article, we learned about the update processes for GitLab. We also got a gist of the ways that we can use to troubleshoot problems faced. The article explains how we can restore and back up an existing repository. Resources for Article: Further resources on this subject: Searching and Resolving Conflicts [Article] Issues and Wikis in GitLab [Article] Configuration [Article]

API with MongoDB and Node.js

Packt
22 Dec 2014
26 min read
In this article by Fernando Monteiro, author of the book Learning Single-page Web Application Development, we will see how to build a solid foundation for our API. Our main aim is to discuss the techniques to build rich web applications, with the SPA approach. We will be covering the following topics in this article: The working of an API Boilerplates and generators The speakers API concept Creating the package.json file The Node server with server.js The model with the Mongoose schema Defining the API routes Using MongoDB in the cloud Inserting data with the Postman Chrome extension (For more resources related to this topic, see here.) The working of an API An API works through communication between different codes, thus defining specific behavior of certain objects on an interface. That is, the API will connect several functions on one website (such as search, images, news, authentications, and so on) to enable it to be used in other applications. Operating systems also have APIs, and they still have the same function. Windows, for example, has APIs such as the Win16 API, Win32 API, or Telephony API, in all its versions. When you run a program that involves some process of the operating system, it is likely that we make a connection with one or more Windows APIs. To clarify the concept of an API, we will give go through some examples of how it works. On Windows, it works on an application that uses the system clock to display the same function within the program. It then associates a behavior to a given clock time in another application, for example, using the Time/Clock API from Windows to use the clock functionality on your own application. Another example, is when you use the Android SDK to build mobile applications. When you use the device GPS, you are interacting with the API (android.location) to display the user location on the map through another API, in this case, Google Maps API. The following is the API example: When it comes to web APIs, the functionality can be even greater. There are many services that provide their code, so that they can be used on other websites. Perhaps, the best example is the Facebook API. Several other websites use this service within their pages, for instance a like button, share, or even authentication. An API is a set of programming patterns and instructions to access a software application based on the Web. So, when you access a page of a beer store in your town, you can log in with your Facebook account. This is accomplished through the API. Using it, software developers and web programmers can create beautiful programs and pages filled with content for their users. Boilerplates and generators On a MEAN stack environment, our ecosystem is infinitely diverse, and we can find excellent alternatives to start the construction of our API. At hand, we have simple boilerplates to complex code generators that can be used with other tools in an integrated way, or even alone. Boilerplates are usually a group of tested code that provides the basic structure to the main goal, that is to create a foundation of a web project. Besides saving us from common tasks such as assembling the basic structure of the code and organizing the files, boilerplates already have a number of scripts to make life easier for the frontend. Let's describe some alternatives that we consider as good starting points for the development of APIs with the Express framework, MongoDB database, Node server, and AngularJS for the frontend. 
Some more accentuated knowledge of JavaScript might be necessary for the complete understanding of the concepts covered here; so we will try to make them as clearly as possible. It is important to note that everything is still very new when we talk about Node and all its ecosystems, and factors such as scalability, performance, and maintenance are still major risk factors. Bearing in mind also that languages such as Ruby on Rails, Scala, and the Play framework have a higher reputation in building large and maintainable web applications, but without a doubt, Node and JavaScript will conquer your space very soon. That being said, we present some alternatives for the initial kickoff with MEAN, but remember that our main focus is on SPA and not directly on MEAN stack. Hackathon starter Hackathon is highly recommended for a quick start to develop with Node. This is because the boilerplate has the main necessary characteristics to develop applications with the Express framework to build RESTful APIs, as it has no MVC/MVVM frontend framework as a standard but just the Bootstrap UI framework. Thus, you are free to choose the framework of your choice, as you will not need to refactor it to meet your needs. Other important characteristics are the use of the latest version of the Express framework, heavy use of Jade templates and some middleware such as Passport - a Node module to manage authentication with various social network sites such as Twitter, Facebook, APIs for LinkedIn, Github, Last.fm, Foursquare, and many more. They provide the necessary boilerplate code to start your projects very fast, and as we said before, it is very simple to install; just clone the Git open source repository: git clone --depth=1 https://github.com/sahat/hackathon-starter.git myproject Run the NPM install command inside the project folder: npm install Then, start the Node server: node app.js Remember, it is very important to have your local database up and running, in this case MongoDB, otherwise the command node app.js will return the error: Error connecting to database: failed to connect to [localhost: 27017] MEAN.io or MEAN.JS This is perhaps the most popular and currently available boilerplate. MEAN.JS is a fork of the original project MEAN.io; both are open source, with a very peculiar similarity, both have the same author. You can check for more details at http://meanjs.org/. However, there are some differences. We consider MEAN.JS to be a more complete and robust environment. It has a structure of directories, better organized, subdivided modules, and better scalability by adopting a vertical modules development. To install it, follow the same steps as previously: Clone the repository to your machine: git clone https://github.com/meanjs/mean.git Go to the installation directory and type on your terminal: npm install Finally, execute the application; this time with the Grunt.js command: grunt If you are on Windows, type the following command: grunt.cmd Now, you have your app up and running on your localhost. The most common problem when we need to scale a SPA is undoubtedly the structure of directories and how we manage all of the frontend JavaScript files and HTML templates using MVC/MVVM. Later, we will see an alternative to deal with this on a large-scale application; for now, let's see the module structure adopted by MEAN.JS: Note that MEAN.JS leaves more flexibility to the AngularJS framework to deal with the MVC approach for the frontend application, as we can see inside the public folder. 
Also, note the modules approach; each module has its own structure, keeping some conventions for controllers, services, views, config, and tests. This is very useful for team development, so keep all the structure well organized. It is a complete solution that makes use of additional modules such as passport, swig, mongoose, karma, among others. The Passport module Some things about the Passport module must be said; it can be defined as a simple, unobtrusive authentication module. It is a powerful middleware to use with Node; it is very flexible and also modular. It can also adapt easily within applications that use the Express. It has more than 140 alternative authentications and support session persistence; it is very lightweight and extremely simple to be implemented. It provides us with all the necessary structure for authentication, redirects, and validations, and hence it is possible to use the username and password of social networks such as Facebook, Twitter, and others. The following is a simple example of how to use local authentication: var passport = require('passport'), LocalStrategy = require('passport-local').Strategy, User = require('mongoose').model('User');   module.exports = function() { // Use local strategy passport.use(new LocalStrategy({ usernameField: 'username', passwordField: 'password' }, function(username, password, done) { User.findOne({    username: username }, function(err, user) { if (err) { return done(err); } if (!user) {    return done(null, false, {    message: 'Unknown user'    }); } if (!user.authenticate(password)) {    return done(null, false, {    message: 'Invalid password'    }); } return done(null, user); }); } )); }; Here's a sample screenshot of the login page using the MEAN.JS boilerplate with the Passport module: Back to the boilerplates topic; most boilerplates and generators already have the Passport module installed and ready to be configured. Moreover, it has a code generator so that it can be used with Yeoman, which is another essential frontend tool to be added to your tool belt. Yeoman is the most popular code generator for scaffold for modern web applications; it's easy to use and it has a lot of generators such as Backbone, Angular, Karma, and Ember to mention a few. More information can be found at http://yeoman.io/. Generators Generators are for the frontend as gem is for Ruby on Rails. We can create the foundation for any type of application, using available generators. Here's a console output from a Yeoman generator: It is important to bear in mind that we can solve almost all our problems using existing generators in our community. However, if you cannot find the generator you need, you can create your own and make it available to the entire community, such as what has been done with RubyGems by the Rails community. RubyGem, or simply gem, is a library of reusable Ruby files, labeled with a name and a version (a file called gemspec). Keep in mind the Don't Repeat Yourself (DRY) concept; always try to reuse an existing block of code. Don't reinvent the wheel. One of the great advantages of using a code generator structure is that many of the generators that we have currently, have plenty of options for the installation process. With them, you can choose whether or not to use many alternatives/frameworks that usually accompany the generator. The Express generator Another good option is the Express generator, which can be found at https://github.com/expressjs/generator. 
In all versions up to Express Version 4, the generator was already pre-installed and served as a scaffold to begin development. However, in the current version, it was removed and now must be installed as a supplement. They provide us with the express command directly in terminal and are quite useful to start the basic settings for utilization of the framework, as we can see in the following commands: create : . create : ./package.json create : ./app.js create : ./public create : ./public/javascripts create : ./public/images create : ./public/stylesheets create : ./public/stylesheets/style.css create : ./routes create : ./routes/index.js create : ./routes/users.js create : ./views create : ./views/index.jade create : ./views/layout.jade create : ./views/error.jade create : ./bin create : ./bin/www   install dependencies:    $ cd . && npm install   run the app:    $ DEBUG=express-generator ./bin/www Very similar to the Rails scaffold, we can observe the creation of the directory and files, including the public, routes, and views folders that are the basis of any application using Express. Note the npm install command; it installs all dependencies provided with the package.json file, created as follows: { "name": "express-generator", "version": "0.0.1", "private": true, "scripts": {    "start": "node ./bin/www" }, "dependencies": {    "express": "~4.2.0",    "static-favicon": "~1.0.0",    "morgan": "~1.0.0",    "cookie-parser": "~1.0.1",    "body-parser": "~1.0.0",    "debug": "~0.7.4",    "jade": "~1.3.0" } } This has a simple and effective package.json file to build web applications with the Express framework. The speakers API concept Let's go directly to build the example API. To be more realistic, let's write a user story similar to a backlog list in agile methodologies. Let's understand what problem we need to solve by the API. The user history We need a web application to manage speakers on a conference event. The main task is to store the following speaker information on an API: Name Company Track title Description A speaker picture Schedule presentation For now, we need to add, edit, and delete speakers. It is a simple CRUD function using exclusively the API with JSON format files. Creating the package.json file Although not necessarily required at this time, we recommend that you install the Webstorm IDE, as we'll use it throughout the article. Note that we are using the Webstorm IDE with an integrated environment with terminal, Github version control, and Grunt to ease our development. However, you are absolutely free to choose your own environment. From now on, when we mention terminal, we are referring to terminal Integrated WebStorm, but you can access it directly by the chosen independent editor, terminal for Mac and Linux and Command Prompt for Windows. Webstorm is very useful when you are using a Windows environment, because Windows Command Prompt does not have the facility to copy and paste like Mac OS X on the terminal window. Initiating the JSON file Follow the steps to initiate the JSON file: Create a blank folder and name it as conference-api, open your terminal, and place the command: npm init This command will walk you through creating a package.json file with the baseline configuration for our application. Also, this file is the heart of our application; we can control all the dependencies' versions and other important things like author, Github repositories, development dependencies, type of license, testing commands, and much more. 
Almost all commands are questions that guide you to the final process, so when we are done, we'll have a package.json file very similar to this: { "name": "conference-api", "version": "0.0.1", "description": "Sample Conference Web Application", "main": "server.js", "scripts": {    "test": "test" }, "keywords": [    "api" ], "author": "your name here", "license": "MIT" } Now, we need to add the necessary dependencies, such as Node modules, which we will use in our process. You can do this in two ways, either directly via terminal as we did here, or by editing the package.json file. Let's see how it works on the terminal first; let's start with the Express framework. Open your terminal in the api folder and type the following command: npm install express@4.0.0 –-save This command installs the Express module, in this case, Express Version 4, and updates the package.json file and also creates dependencies automatically, as we can see: { "name": "conference-api", "version": "0.0.1", "description": "Sample Conference Web Application", "main": "server.js", "scripts": {    "test": "test" }, "keywords": [    "api" ], "author": "your name here", "license": "MIT", "dependencies": {    "express": "^4.0.0" } } Now, let's add more dependencies directly in the package.json file. Open the file in your editor and add the following lines: { "name": "conference-api", "version": "0.0.1", "description": "Sample Conference Web Application", "main": "server.js", "scripts": {    "test": "test" }, "keywords": [    "api" ], "author": "your name here", "license": "MIT", "engines": {        "node": "0.8.4",        "npm": "1.1.49" }, "dependencies": {    "body-parser": "^1.0.1",    "express": "^4.0.0",    "method-override": "^1.0.0",    "mongoose": "^3.6.13",    "morgan": "^1.0.0",    "nodemon": "^1.2.0" }, } It's very important when you deploy your application using some services such as Travis Cl or Heroku hosting company. It's always good to set up the Node environment. Open the terminal again and type the command: npm install You can actually install the dependencies in two different ways, either directly into the directory of your application or globally with the -g command. This way, you will have the modules installed to use them in any application. When using this option, make sure that you are the administrator of the user machine, as this command requires special permissions to write to the root directory of the user. At the end of the process, we'll have all Node modules that we need for this project; we just need one more action. Let's place our code over a version control, in our case Git. More information about the Git can be found at http://git-scm.com however, you can use any version control as subversion or another. We recommend using Git, as we will need it later to deploy our application in the cloud, more specificly, on Heroku cloud hosting. At this time, our project folder must have the same structure as that of the example shown here: We must point out the utilization of an important module called the Nodemon module. Whenever a file changes it restarts the server automatically; otherwise, you will have to restart the server manually every time you make a change to a file, especially in a development environment that is extremely useful, as it constantly updates our files. Node server with server.js With this structure formed, we will start the creation of the server itself, which is the creation of a main JavaScript file. 
The most common name used is server.js, but it is also very common to use the app.js name, especially in older versions. Let's add this file to the root folder of the project and we will start with the basic server settings. There are many ways to configure our server, and probably you'll find the best one for yourself. As we are still in the initial process, we keep only the basics. Open your editor and type in the following code: // Import the Modules installed to our server var express   = require('express'); var bodyParser = require('body-parser');   // Start the Express web framework var app       = express();   // configure app app.use(bodyParser());   // where the application will run var port     = process.env.PORT || 8080;   // Import Mongoose var mongoose   = require('mongoose');   // connect to our database // you can use your own MongoDB installation at: mongodb://127.0.0.1/databasename mongoose.connect('mongodb://username:password@kahana.mongohq.com:10073/node-api');   // Start the Node Server app.listen(port); console.log('Magic happens on port ' + port); Realize that the line-making connection with MongoDB on our localhost is commented, because we are using an instance of MongoDB in the cloud. In our case, we use MongoHQ, a MongoDB-hosting service. Later on, will see how to connect with MongoHQ. Model with the Mongoose schema Now, let's create our model, using the Mongoose schema to map our speakers on MongoDB. // Import the Mongoose module. var mongoose     = require('mongoose'); var Schema       = mongoose.Schema;   // Set the data types, properties and default values to our Schema. var SpeakerSchema   = new Schema({    name:           { type: String, default: '' },    company:       { type: String, default: '' },    title:         { type: String, default: '' },    description:   { type: String, default: '' },    picture:       { type: String, default: '' },    schedule:       { type: String, default: '' },    createdOn:     { type: Date,   default: Date.now} }); module.exports = mongoose.model('Speaker', SpeakerSchema); Note that on the first line, we added the Mongoose module using the require() function. Our schema is pretty simple; on the left-hand side, we have the property name and on the right-hand side, the data type. We also we set the default value to nothing, but if you want, you can set a different value. The next step is to save this file to our project folder. For this, let's create a new directory named server; then inside this, create another folder called models and save the file as speaker.js. At this point, our folder looks like this: The README.md file is used for Github; as we are using the Git version control, we host our files on Github. Defining the API routes One of the most important aspects of our API are routes that we take to create, read, update, and delete our speakers. 
Our routes are based on the HTTP verb used to access our API, as shown in the following examples: To create record, use the POST verb To read record, use the GET verb To update record, use the PUT verb To delete records, use the DELETE verb So, our routes will be as follows: Routes Verb and Action /api/speakers GET retrieves speaker's records /api/speakers/ POST inserts speakers' record /api/speakers/:speaker_id GET retrieves a single record /api/speakers/:speaker_id PUT updates a single record /api/speakers/:speaker_id DELETE deletes a single record Configuring the API routes: Let's start defining the route and a common message for all requests: var Speaker     = require('./server/models/speaker');   // Defining the Routes for our API   // Start the Router var router = express.Router();   // A simple middleware to use for all Routes and Requests router.use(function(req, res, next) { // Give some message on the console console.log('An action was performed by the server.'); // Is very important using the next() function, without this the Route stops here. next(); });   // Default message when access the API folder through the browser router.get('/', function(req, res) { // Give some Hello there message res.json({ message: 'Hello SPA, the API is working!' }); }); Now, let's add the route to insert the speakers when the HTTP verb is POST: // When accessing the speakers Routes router.route('/speakers')   // create a speaker when the method passed is POST .post(function(req, res) {   // create a new instance of the Speaker model var speaker = new Speaker();   // set the speakers properties (comes from the request) speaker.name = req.body.name; speaker.company = req.body.company; speaker.title = req.body.title; speaker.description = req.body.description; speaker.picture = req.body.picture; speaker.schedule = req.body.schedule;   // save the data received speaker.save(function(err) {    if (err)      res.send(err);   // give some success message res.json({ message: 'speaker successfully created!' }); }); }) For the HTTP GET method, we need this: // get all the speakers when a method passed is GET .get(function(req, res) { Speaker.find(function(err, speakers) {    if (err)      res.send(err);      res.json(speakers); }); }); Note that in the res.json() function, we send all the object speakers as an answer. 
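As a small aside, the GET handler shown previously could be extended to support simple filtering through a query-string parameter, which becomes handy once the collection grows. This is only a sketch, and the company filter is an illustrative choice rather than a requirement of the speakers API; it would replace the .get() block in the same router.route('/speakers') chain:

// get the speakers, optionally filtered, for example GET /api/speakers?company=Newaeonweb
.get(function(req, res) {
  // Build an optional Mongoose filter from the query string
  var query = {};
  if (req.query.company) {
    query.company = req.query.company;
  }
  Speaker.find(query, function(err, speakers) {
    if (err)
      return res.send(err);
    res.json(speakers);
  });
});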
Now, we will see the use of different routes in the following steps: To retrieve a single record, we need to pass speaker_id, as shown in our previous table, so let's build this function: // on accessing speaker Route by id router.route('/speakers/:speaker_id')   // get the speaker by id .get(function(req, res) { Speaker.findById(req.params.speaker_id, function(err,    speaker) {    if (err)      res.send(err);      res.json(speaker);    }); }) To update a specific record, we use the PUT HTTP verb and then insert the function: // update the speaker by id .put(function(req, res) { Speaker.findById(req.params.speaker_id, function(err,     speaker) {      if (err)      res.send(err);   // set the speakers properties (comes from the request) speaker.name = req.body.name; speaker.company = req.body.company; speaker.title = req.body.title; speaker.description = req.body.description; speaker.picture = req.body.picture; speaker.schedule = req.body.schedule;   // save the data received speaker.save(function(err) {    if (err)      res.send(err);      // give some success message      res.json({ message: 'speaker successfully       updated!'}); });   }); }) To delete a specific record by its id: // delete the speaker by id .delete(function(req, res) { Speaker.remove({    _id: req.params.speaker_id }, function(err, speaker) {    if (err)      res.send(err);   // give some success message res.json({ message: 'speaker successfully deleted!' }); }); }); Finally, register the Routes on our server.js file: // register the route app.use('/api', router); All necessary work to configure the basic CRUD routes has been done, and we are ready to run our server and begin creating and updating our database. Open a small parenthesis here, for a quick step-by-step process to introduce another tool to create a database using MongoDB in the cloud. There are many companies that provide this type of service but we will not go into individual merits here; you can choose your preference. We chose Compose (formerly MongoHQ) that has a free sandbox for development, which is sufficient for our examples. Using MongoDB in the cloud Today, we have many options to work with MongoDB, from in-house services to hosting companies that provide Platform as a Service (PaaS) and Software as a Service (SaaS). We will present a solution called Database as a Service (DbaaS) that provides database services for highly scalable web applications. Here's a simple step-by-step process to start using a MongoDB instance with a cloud service: Go to https://www.compose.io/. Create your free account. On your dashboard panel, click on add Database. On the right-hand side, choose Sandbox Database. Name your database as node-api. Add a user to your database. Go back to your database title, click on admin. Copy the connection string. The string connection looks like this: mongodb://<user>:<password>@kahana.mongohq.com:10073/node-api. Let's edit the server.js file using the following steps: Place your own connection string to the Mongoose.connect() function. Open your terminal and input the command: nodemon server.js Open your browser and place http://localhost:8080/api. You will see a message like this in the browser: { Hello SPA, the API is working! } Remember the api folder was defined on the server.js file when we registered the routes: app.use('/api', router); But, if you try to access http://localhost:8080/api/speakers, you must have something like this: [] This is an empty array, because we haven't input any data into MongoDB. 
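The next section uses the Postman Chrome extension to insert data, but if you prefer to stay in the terminal, the same request can be issued with curl; the field values here are just sample data:

# POST a sample speaker as x-www-form-urlencoded data
curl -X POST http://localhost:8080/api/speakers \
  -d "name=Fernando Monteiro" \
  -d "company=Newaeonweb" \
  -d "title=MongoDB" \
  -d "schedule=10:20" \
  -d "picture=fernando.jpg"

# Read the collection back
curl http://localhost:8080/api/speakers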
We use an extension for the Chrome browser called JSONView. This way, we can view the formatted and readable JSON files. You can install this for free from the Chrome Web Store. Inserting data with Postman To solve our empty database and before we create our frontend interface, let's add some data with the Chrome extension Postman. By the way, it's a very useful browser interface to work with RESTful APIs. As we already know that our database is empty, our first task is to insert a record. To do so, perform the following steps: Open Postman and enter http://localhost:8080/api/speakers. Select the x-www-form-urlencoded option and add the properties of our model: var SpeakerSchema   = new Schema({ name:           { type: String, default: '' }, company:       { type: String, default: '' }, title:         { type: String, default: '' }, description:   { type: String, default: '' }, picture:       { type: String, default: '' }, schedule:       { type: String, default: '' }, createdOn:     { type: Date,   default: Date.now} }); Now, click on the blue button at the end to send the request. With everything going as expected, you should see message: speaker successfully created! at the bottom of the screen, as shown in the following screenshot: Now, let's try http://localhost:8080/api/speakers in the browser again. Now, we have a JSON file like this, instead of an empty array: { "_id": "53a38ffd2cd34a7904000007", "__v": 0, "createdOn": "2014-06-20T02:20:31.384Z", "schedule": "10:20", "picture": "fernando.jpg", "description": "Lorem ipsum dolor sit amet, consectetur     adipisicing elit, sed do eiusmod...", "title": "MongoDB", "company": "Newaeonweb", "name": "Fernando Monteiro" } When performing the same action on Postman, we see the same result, as shown in the following screenshot: Go back to Postman, copy _id from the preceding JSON file and add to the end of the http://localhost:8080/api/speakers/53a38ffd2cd34a7904000005 URL and click on Send. You will see the same object on the screen. Now, let's test the method to update the object. In this case, change the method to PUT on Postman and click on Send. The output is shown in the following screenshot: Note that on the left-hand side, we have three methods under History; now, let's perform the last operation and delete the record. This is very simple to perform; just keep the same URL, change the method on Postman to DELETE, and click on Send. Finally, we have the last method executed successfully, as shown in the following screenshot: Take a look at your terminal, you can see four messages that are the same: An action was performed by the server. We configured this message in the server.js file when we were dealing with all routes of our API. router.use(function(req, res, next) { // Give some message on the console console.log('An action was performed by the server.'); // Is very important using the next() function, without this the Route stops here. next(); }); This way, we can monitor all interactions that take place at our API. Now that we have our API properly tested and working, we can start the development of the interface that will handle all this data. Summary In this article, we have covered almost all modules of the Node ecosystem to develop the RESTful API. Resources for Article: Further resources on this subject: Web Application Testing [article] A look into responsive design frameworks [article] Top Features You Need to Know About – Responsive Web Design [article]

What is continuous delivery and DevOps?

Packt
22 Dec 2014
10 min read
In this article by Paul Swartout, author of Continuous Delivery and DevOps: A Quickstart guide Second Edition, we learn that continuous delivery (or CD, as it's more commonly referred to) and DevOps are fast becoming the next big thing in relation to the delivery and support of software. Strictly speaking, that should be the next big things, as CD and DevOps are actually two complementary yet separate approaches. (For more resources related to this topic, see here.) Continuous delivery, as the name suggests, is a way of working whereby quality products, normally software assets, can be built, tested and shipped in quick succession - thus delivering value much sooner than traditional approaches. DevOps is a way of working whereby developers and IT system operators work closely, collaboratively and in harmony towards a common goal, with few or no organizational barriers or boundaries between them. Let's focus on each separately: CD is a refinement of traditional agile techniques such as scrum and lean. The main refinements are the ability to decrease the time between software deliveries and the ability to do this many, many times - continuously, in fact. Agile delivery techniques encourage you to time-box the activities / steps needed to deliver software: definition, analysis, development, testing and sign-off, but this doesn't necessarily mean that you deliver as frequently as you would like. Most agile techniques encourage you to have "potentially shippable code" at the end of your iteration / sprint. CD encourages you to ship the finished code, in small chunks, as frequently as possible. To achieve this, one must look at the process of writing software in a slightly different way. Instead of focusing on delivery of a feature or a project in one go, you need to look at how you can deliver it in smaller, incremental chunks. This can be easier said than done. As stated above, DevOps is a highly collaborative way of working whereby developers and system operators work in harmony with few or no organizational barriers between them towards a common goal. This may not sound too radical; however, when you consider how mainstream organizations function - especially those within government, corporate or financial sectors - and take into account the many barriers that are in place between the development and the operational teams, this can be quite a radical shift. A simple way to imagine this is to consider how a small start-up software house operates. They have a limited amount of resource and a minimum level of bureaucracy. Developing, shipping and supporting software is normally seamless - sometimes being done by the same people. Now look at a corporate business with R&D and operations teams. These two parts of the organization are normally separated by rules, regulation and red tape - in some cases they will also be separated by other parts of the organization (QA teams, transition teams, and so on). DevOps encourages one to break through these barriers and create an environment whereby those creating the software and those supporting the software work together in close synergy, and the lines between them become blurred. DevOps is more to do with hearts and minds than anything else, so it can be quite tough to implement when long-established bad habits are in place. Do I have to use both? You don't necessarily need to have implemented both CD and DevOps to speed up the delivery of quality software, but it is true to say that both complement each other very well.
If you want to continuously deliver quality software you need to have a pretty slick and streamlined process for getting the software into the production environment. This will need very close working relationships between the development teams and the operations teams. The more they work as one unit, the more seamless the process. CD will give you the tools and best of breed practices to deliver quality software quickly. DevOps will help you establish the behaviors, culture and ways of working to fully utilize CD. If you have the opportunity to implement both, it would be foolish not to at least try. Can CD and DevOps bring quantifiable business benefit? As with any business the idea is that you make more money than you spend. The quicker you can ship and sell your service / software / widget, the quicker you generate income. If you use this same model and apply it to a typical software project you may well notice that you are investing a vast amount of effort and cash simply getting your software complete and ready to ship – this is all cost. That's not the only thing that costs, releasing the software into the production environment is not free either. It can sometimes be more costly – especially if you need large QA teams, and release teams and change management teams and project coordination teams and change control boards staffed by senior managers and bug fixing teams and … I think you get the idea. In simple terms being able to develop, test and ship small changes frequently (say many times per day or week) will drastically reduce the cost of delivery and allow the business to start earning money sooner. Implementing CD and / or DevOps will help you in realizing this goal. Other business benefits include; reduction in risk of release failures, reduction in delivering redundant changes (those that are no longer needed but released anyway), increased productivity (people spend less time releasing product and more time creating products), increases competitiveness and many others. How does it fit with other product delivery methodologies? CD and DevOps should be seen as a set of tools that can be utilized to best fit your business and needs. This is nothing new, any mature agile methodology should be used in this way. As such you should be able to “bolt” elements of CD and DevOps into your existing process. For example if you currently use scrum as your product delivery methodology and have implemented CD, you could tweak your definition of done (DoD) to be “it is live” and incorporate checks against automated testing, continuous integration, and automated deployment within your acceptance criteria for each story. This therefore encourages the scrum team to work in such a way as to want to deliver their product to the live platform rather than have it sat potentially shippable waiting for someone to pick it up and release it – eventually. There may be some complexity when it comes to more traditional waterfall product delivery methodologies such as PRINCE2 or similar however there is always room for improvement – for example the requirements gathering stage could include developers and operational team members to flesh out and understand any issues around releasing and supporting the solution when it goes live. It may not seem much but these sorts of behaviors encourage closer working and collaboration. All in all CD and DevOps should be seen as complementary to your ways of working rather than a direct replacement – at least to begin with. Is it too radical for most businesses? 
For some organizations being able to fully implement the CD and DevOps utopia may be very complex and complicated, especially where rules, regulations and corporate governance are strongly embedded. That isn't to say that it can't be done. Even the most cumbersome and sloth like institutions can and do benefit from utilizing CD and DevOps – the UK government for example. It does need more time and effort to prove the value of implementing all / some elements of CD and DevOps and those that are proposing such a shift really need to do their homework as there may well be high degrees of resistance to overcome. However as more and more corporates, government organizations and other large business embrace CD and DevOps, the body of proof to convince the doubters becomes easier to find. Look at your business and examine the pain points in the process for delivering your goods / services / software. If you can find a specific problem (or problems) which can be solved by implementing CD or DevOps then you will have a much clearer and more convincing business case. Where do I start? As with any change, you should start small and build up – incrementally. If you see an opportunity, say a small project or some organizational change within your IT department, you should look into introducing elements of CD and DevOps. For some this may be in the form of kicking off a project to examine / build tooling to allow for automated testing, continuous integration or automated deployments. For some it may be to break down some of the barriers between development and operational teams during the development of a specific project / product. You may be able to introduce CD and / or DevOps in one big bang but just consider how things go when you try and do that with a large software release – you don't want CD and DevOps to fail so maybe big bang is not the best way. All in all it depends on your needs, the maturity of your organization and what problems you need CD and DevOps to solve. If you are convinced CD and DevOps will bring value and solve specific problems then the most important thing to do is start looking at ways to convince others. Do I just need to implement some tooling to get the benefits? This is a misconception which is normally held by those management types who have lost their connection to reality. They may have read something about CD and DevOps in some IT management periodical and decided that you just need to implement Jenkins and some automated tests. CD and DevOps do utilize tools but only where they bring value or solve a specific problem. Tooling alone will not remove ingrained business issues. Being able to build and test software quickly is good but without open, honest and collaborative ways of working to ensure the code can be shipped into production just as quickly you will end up with another problem – a backlog of software which needs to be released. Sometimes tweaking an existing process through face to face discussion is all the tooling that is needed – for example changes to the change process to allow for manual release of software as soon as it's ready. Where can I find more information? There are many forms of information available in relation to CD and DevOps one of which is a small book entitled Continuous Delivery and DevOps: A Quickstart guide Second Edition which is written by yours truly and available here (https://www.packtpub.com/application-development/continuous-delivery-and-devops-quickstart-guide-second-edition). 
Summary This article has covered quite a lot in a few pages. The book it is drawn from is, by no means, the definitive opus for CD and DevOps; it is merely a collection of suggestions based on experience and observations. If you think CD and DevOps are not for you, you are deluded. You'll circumvent elephant-filled rivers and other unforeseen challenges as you near the journey's end. Resources for Article: Further resources on this subject: Continuous Delivery and DevOps FAQs [Article] Ansible – An Introduction [Article] Choosing the right flavor of Debian (Simple) [Article]
Recursive directives

Packt
22 Dec 2014
13 min read
In this article by Matt Frisbie, the author of AngularJS Web Application Development Cookbook, we will see recursive directives. The power of directives can also be effectively applied when consuming data in a more unwieldy format. Consider the case in which you have a JavaScript object that exists in some sort of recursive tree structure. The view that you will generate for this object will also reflect its recursive nature and will have nested HTML elements that match the underlying data structure. (For more resources related to this topic, see here.)

Getting ready

Suppose you had a recursive data object in your controller as follows:

(app.js)

angular.module('myApp', [])
.controller('MainCtrl', function($scope) {
  $scope.data = {
    text: 'Primates',
    items: [
      {
        text: 'Anthropoidea',
        items: [
          {
            text: 'New World Anthropoids'
          },
          {
            text: 'Old World Anthropoids',
            items: [
              {
                text: 'Apes',
                items: [
                  {
                    text: 'Lesser Apes'
                  },
                  {
                    text: 'Greater Apes'
                  }
                ]
              },
              {
                text: 'Monkeys'
              }
            ]
          }
        ]
      },
      {
        text: 'Prosimii'
      }
    ]
  };
});

How to do it…

As you might imagine, iteratively constructing a view, or only partially using directives to accomplish this, will become extremely messy very quickly. Instead, it would be better if you were able to create a directive that would seamlessly break apart the data recursively, and define and render the sub-HTML fragments cleanly. By cleverly using directives and the $compile service, this exact directive functionality is possible.

The ideal directive in this scenario will be able to handle the recursive object without any additional parameters or outside assistance in parsing and rendering the object. So, in the main view, your directive will look something like this:

<recursive value="nestedObject"></recursive>

The directive accepts an isolate scope with a two-way (=) binding to the parent scope object, which will remain structurally identical as the directive descends through the recursive object.

The $compile service

You will need to inject the $compile service in order to make the recursive directive work. The reason for this is that each level of the directive can instantiate directives inside it and convert them from an uncompiled template to real DOM material.

The angular.element() method

The angular.element() method can be thought of as the jQuery $() equivalent. It accepts a string template or DOM fragment and returns a jqLite object that can be modified, inserted, or compiled for your purposes. If the jQuery library is present when the application is initialized, AngularJS will use that instead of jqLite.
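As a quick, hedged illustration (this snippet is not part of the original recipe; the variable name and CSS class are made up for the example), angular.element() can wrap a plain HTML string, and the returned jqLite object can then be modified or compiled:

// A minimal sketch of angular.element() usage; assumes AngularJS 1.x is loaded.
// 'wrapped' and the 'node-label' class are illustrative names only.
var wrapped = angular.element('<span>{{ val.text }}</span>');

// jqLite exposes a subset of jQuery methods on the wrapped fragment:
wrapped.addClass('node-label');      // modify the element
wrapped.attr('title', 'tree node');  // set an attribute

// Inside a directive's link function (where $compile and scope are in hand),
// the fragment could then be compiled against that scope before insertion:
// $compile(wrapped)(scope);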
If you use the AngularJS template cache, retrieved templates will already exist as if you had called the angular.element() method on the template text.

The $templateCache

Inside a directive, it's possible to create a template using angular.element() and a string of HTML similar to an underscore.js template. However, it's completely unnecessary and quite unwieldy to use compared to AngularJS templates. When you declare a template and register it with AngularJS, it can be accessed through the injected $templateCache, which acts as a key-value store for your templates. The recursive template is as follows:

<script type="text/ng-template" id="recursive.html">
  <span>{{ val.text }}</span>
  <button ng-click="delSubtree()">delete</button>
  <ul ng-if="isParent" style="margin-left:30px">
    <li ng-repeat="item in val.items">
      <tree val="item" parent-data="val.items"></tree>
    </li>
  </ul>
</script>

The <span> and <button> elements are present at each instance of a node, and they present the data at that node as well as an interface to the click event (which we will define in a moment) that will destroy it and all its children. Following these, the conditional <ul> element renders only if the isParent flag is set in the scope, and it repeats through the items array, recursing the child data and creating new instances of the directive. Here, you can see the full definition of the directive element as used in the template:

<tree val="item" parent-data="val.items"></tree>

Not only does the directive take a val attribute for the local node data, but you can also see its parent-data attribute, which is the point of scope indirection that allows the tree structure. To make more sense of this, examine the following directive code:

(app.js)

.directive('tree', function($compile, $templateCache) {
  return {
    restrict: 'E',
    scope: {
      val: '=',
      parentData: '='
    },
    link: function(scope, el, attrs) {
      scope.isParent = angular.isArray(scope.val.items);
      scope.delSubtree = function() {
        if (scope.parentData) {
          scope.parentData.splice(
            scope.parentData.indexOf(scope.val),
            1
          );
        }
        scope.val = {};
      };

      el.replaceWith(
        $compile(
          $templateCache.get('recursive.html')
        )(scope)
      );
    }
  };
});

With all of this, if you provide the recursive directive with the data object provided at the beginning of this article, it will result in the following (presented here without the auto-added AngularJS comments and directives):

(index.html – uncompiled)

<div ng-app="myApp">
  <div ng-controller="MainCtrl">
    <tree val="data"></tree>
  </div>
  <script type="text/ng-template" id="recursive.html">
    <span>{{ val.text }}</span>
    <button ng-click="delSubtree()">delete</button>
    <ul ng-if="isParent" style="margin-left:30px">
      <li ng-repeat="item in val.items">
        <tree val="item" parent-data="val.items"></tree>
      </li>
    </ul>
  </script>
</div>

The recursive nature of the directive templates enables nesting, and when compiled using the recursive data object located in the wrapping controller, it will compile into the following HTML:

(index.html – compiled)

<div ng-controller="MainCtrl">
  <span>Primates</span>
  <button ng-click="delSubtree()">delete</button>
  <ul ng-if="isParent" style="margin-left:30px">
    <li ng-repeat="item in val.items">
      <span>Anthropoidea</span>
      <button ng-click="delSubtree()">delete</button>
      <ul ng-if="isParent" style="margin-left:30px">
        <li ng-repeat="item in val.items">
          <span>New World Anthropoids</span>
          <button ng-click="delSubtree()">delete</button>
        </li>
        <li ng-repeat="item in val.items">
          <span>Old World Anthropoids</span>
          <button ng-click="delSubtree()">delete</button>
          <ul ng-if="isParent" style="margin-left:30px">
            <li ng-repeat="item in val.items">
              <span>Apes</span>
              <button ng-click="delSubtree()">delete</button>
              <ul ng-if="isParent" style="margin-left:30px">
                <li ng-repeat="item in val.items">
                  <span>Lesser Apes</span>
                  <button ng-click="delSubtree()">delete</button>
                </li>
                <li ng-repeat="item in val.items">
                  <span>Greater Apes</span>
                  <button ng-click="delSubtree()">delete</button>
                </li>
              </ul>
            </li>
            <li ng-repeat="item in val.items">
              <span>Monkeys</span>
              <button ng-click="delSubtree()">delete</button>
            </li>
          </ul>
        </li>
      </ul>
    </li>
    <li ng-repeat="item in val.items">
      <span>Prosimii</span>
      <button ng-click="delSubtree()">delete</button>
    </li>
  </ul>
</div>

JSFiddle: http://jsfiddle.net/msfrisbie/ka46yx4u/

How it works…

The definition of the isolate scope through the nested directives described in the previous section allows all or part of the recursive object to be bound through parentData to the appropriate directive instance, all the while maintaining the nested connectedness afforded by the directive hierarchy. When a parent node is deleted, the lower directives are still bound to the data object, and the removal propagates through cleanly.

The meatiest and most important part of this directive is, of course, the link function. Here, the link function determines whether the node has any children (which simply checks for the existence of an array in the local data node) and declares the deleting method, which simply removes the relevant portion from the recursive object and cleans up the local node. Up until this point, there haven't been any recursive calls, and there shouldn't need to be. If your directive is constructed correctly, AngularJS data binding and inherent template management will take care of the template cleanup for you. This, of course, leads into the final line of the link function, which is broken up here for readability:

el.replaceWith(
  $compile(
    $templateCache.get('recursive.html')
  )(scope)
);

Recall that in a link function, the second parameter is the jqLite-wrapped DOM object that the directive is linking (here, the <tree> element). This exposes to you a subset of jQuery object methods, including replaceWith(), which you will use here. The top-level instance of the directive will be replaced by the recursively defined template, and this will carry down through the tree. At this point, you should have an idea of how the recursive structure comes together. The element parameter needs to be replaced with a recursively compiled template, and for this, you will employ the $compile service. This service accepts a template as a parameter and returns a function that you will invoke with the current scope inside the directive's link function. The template is retrieved from $templateCache by the recursive.html key, and then it's compiled. When the compiler reaches the nested <tree> directive, the recursive directive is realized all the way down through the data in the recursive object.
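To see the $compile and $templateCache interplay in isolation, here is a minimal, hedged sketch of a non-recursive directive built on the same compile-and-replace pattern; the greeting directive, the greeting.html template, and the file name sketch.js are hypothetical and not part of this recipe:

(sketch.js)

angular.module('myApp')
  .directive('greeting', function($compile, $templateCache) {
    return {
      restrict: 'E',
      scope: { name: '=' },
      link: function(scope, el) {
        // Fetch the raw template text registered under the 'greeting.html' key,
        // compile it into a linking function, bind it to this directive's scope,
        // and replace the <greeting> element with the resulting DOM.
        var template = $templateCache.get('greeting.html');
        el.replaceWith($compile(template)(scope));
      }
    };
  });

The corresponding template would be registered in the page as <script type="text/ng-template" id="greeting.html"><span>Hello, {{ name }}</span></script>, and the directive would be used as <greeting name="someScopeProperty"></greeting>. The recursive tree directive follows exactly this pattern, with the added twist that the cached template itself contains further <tree> elements.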
Summary

This article demonstrates the power of constructing a directive to convert a complex data object into a large DOM object. Relevant portions can be broken into individual templates, handled with distributed directive logic, and combined together in an elegant fashion to maximize modularity and reusability.

Resources for Article:

Further resources on this subject:
Working with Live Data and AngularJS [article]
Angular Zen [article]
AngularJS Project [article]
Combining Vector and Raster Datasets

Packt
22 Dec 2014
12 min read
This article by Michael Dorman, the author of Learning R for Geospatial Analysis, explores the interplay between vector and raster layers and the way it is implemented in the raster package. The way rasters and vector layers can be interchanged, and queried one according to the other, will be demonstrated through examples. (For more resources related to this topic, see here.)

Creating vector layers from a raster

The opposite operation to rasterization, which has been presented in the previous section, is the creation of vector layers from raster data. The procedure of extracting features of interest out of rasters, in the form of vector layers, is often necessary for reasons analogous to those underlying rasterization: when the data held in a raster is better represented using a vector layer, within the context of specific subsequent analysis or visualization tasks. Scenarios where we need to create points, lines, and polygons from a raster can all be encountered. In this section, we are going to see an example of each.

Raster-to-points conversion

In raster-to-points conversion, each raster cell center (excluding NA cells) is converted to a point. The resulting point layer has an attribute table with the values of the respective raster cells in it. Conversion to points can be done with the rasterToPoints function. This function has a parameter named spatial that determines whether the returned object is going to be a SpatialPointsDataFrame object or simply a matrix holding the coordinates and the respective cell values (spatial=FALSE, the default value). For our purposes, it is thus important to remember to specify spatial=TRUE.

As an example of a raster, let's create a subset of the raster r, with only layers 1-2, rows 1-3, and columns 1-3:

> u = r[[1:2]][1:3, 1:3, drop = FALSE]

To make the example more instructive, we will place NA in some of the cells and see how this affects the raster-to-points conversion:

> u[2, 3] = NA
> u[[1]][3, 2] = NA

Now, we will apply rasterToPoints to create a SpatialPointsDataFrame object named u_pnt out of u:

> u_pnt = rasterToPoints(u, spatial = TRUE)

Let's visually examine the result we got, with the first layer of u serving as the background:

> plot(u[[1]])
> plot(u_pnt, add = TRUE)

The graphical output is shown in the following screenshot:

We can see that a point has been produced at the center of each raster cell, except for the cell at position (2,3), where we assigned NA to both layers. However, at the (3,2) position, NA has been assigned to only one of the layers (the first one); therefore, a point feature has been generated there nevertheless. The attribute table of u_pnt has eight rows (since there are eight points) and two columns (corresponding to the raster layers):

> u_pnt@data
  layer.1 layer.2
1  0.4242  0.4518
2  0.3995  0.3334
3  0.4190  0.3430
4  0.4495  0.4846
5  0.2925  0.3223
6  0.4998  0.5841
7      NA  0.5841
8  0.7126  0.5086

We can see that the seventh point feature, the one corresponding to the (3,2) raster position, indeed contains an NA value corresponding to layer 1.

Raster-to-contours conversion

Creating points (see the previous section) and polygons (see the next section) from a raster is relatively straightforward. In the former case, points are generated at cell centroids, while in the latter, rectangular polygons are drawn according to cell boundaries. On the other hand, lines can be created from a raster using various algorithms designed for more specific purposes.
Two common procedures where lines are generated based on a raster are constructing contours (lines connecting locations of equal value on the raster) and finding least-cost paths (lines going from one location to another along the easiest route, where the cost of passage is defined by raster values). In this section, we will see an example of how to create contours (readers interested in least-cost path calculation can refer to the gdistance package, which provides this capability in R). As an example, we will create contours from the DEM of Haifa (dem).

Creating contours can be done using the rasterToContour function. This function accepts a RasterLayer object and returns a SpatialLinesDataFrame object with the contour lines. The rasterToContour function internally uses the base function contourLines, and arguments can be passed to the latter as part of the rasterToContour function call. For example, using the levels parameter, we can specify the breaks where contours will be generated (rather than letting them be determined automatically).

The raster dem consists of elevation values ranging between -14 meters and 541 meters:

> range(dem[], na.rm = TRUE)
[1] -14 541

Therefore, we may choose to generate six contour lines, at the 0, 100, 200, ..., 500 meter levels:

> dem_contour = rasterToContour(dem, levels = seq(0, 500, 100))

Now, we will plot the resulting SpatialLinesDataFrame object on top of the dem raster:

> plot(dem)
> plot(dem_contour, add = TRUE)

The graphical output is shown in the following screenshot:

Mount Carmel is densely covered with elevation contours compared to the plains surrounding it, which are mostly within the 0-100 meter elevation range and thus have only a few contour lines. Let's take a look at the attribute table of dem_contour:

> dem_contour@data
    level
C_1     0
C_2   100
C_3   200
C_4   300
C_5   400
C_6   500

Indeed, the layer consists of six line features, one for each break we specified with the levels argument.

Raster-to-polygons conversion

As mentioned previously, raster-to-polygons conversion involves the generation of rectangular polygons in the place of each raster cell (once again, excluding NA cells). Similar to the raster-to-points conversion, the resulting attribute table contains the respective raster values for each polygon created. The conversion to polygons is most useful with categorical rasters, when we would like to generate polygons defining certain areas in order to exploit the analysis tools this type of data is associated with (such as extraction of values from other rasters, geometry editing, and overlay).

Creation of polygons from a raster can be performed with a function whose name the reader may have already guessed: rasterToPolygons. A useful option in this function is to immediately dissolve the resulting polygons according to their attribute table values; that is, all polygons having the same value are dissolved into a single feature. This functionality internally utilizes the rgeos package, and it can be triggered by specifying dissolve=TRUE.

In our next example, we will visually compare the average NDVI time series of the Lahav and Kramim forests (see earlier), based on all of our Landsat (three dates) and MODIS (280 dates) satellite images. In this article, we will only prepare the necessary data by going through the following intermediate steps:

1. Creating the Lahav and Kramim forests polygonal layer.
2. Extracting NDVI values from the satellite images.
3. Creating a data.frame object that can be passed to graphical functions later.
Commencing with the first step, using l_rec_focal_clump, we will first create a polygonal layer holding all NDVI>0.2 patches, and then subset only those two polygons corresponding to the Lahav and Kramim forests. The former is achieved using rasterToPolygons with dissolve=TRUE, converting the patches in l_rec_focal_clump to 507 individual polygons in a new SpatialPolygonsDataFrame that we hereby name pol:

> pol = rasterToPolygons(l_rec_focal_clump, dissolve = TRUE)

Plotting pol will show that we have quite a few large patches and many small ones. Since the Lahav and Kramim forests are relatively large, to make things easier, we can omit all polygons with an area less than or equal to 1 km2:

> pol$area = gArea(pol, byid = TRUE) / 1000^2
> pol = pol[pol$area > 1, ]

The attribute table shows that we are left with eight polygons, with area sizes of 1-10 km2. The clumps column, by the way, is where the original l_rec_focal_clump raster value (the clump ID) has been kept ("clumps" is the name of the l_rec_focal_clump raster layer from which the values came).

> pol@data
    clumps    area
112      2  1.2231
114    200  1.3284
137    221  1.9314
203    281  9.5274
240    314  6.7842
371    432  2.0007
445      5 10.2159
460     56  1.0998

Let's make a map of pol:

> plotRGB(l_00, r = 3, g = 2, b = 1, stretch = "lin")
> plot(pol, border = "yellow", lty = "dotted", add = TRUE)

The graphical output is shown in the following screenshot:

The preceding screenshot shows the continuous NDVI>0.2 patches, which are 1 km2 or larger, within the studied area. Two of these, as expected, are the forests we would like to examine. How can we select them? Obviously, we could export pol to a Shapefile and select the features of interest interactively in GIS software (such as QGIS), then import the result back into R to continue our analysis. The raster package also offers some capabilities for interactive selection (that we do not cover here); for example, a function named click can be used to obtain the properties of the pol features we click on in a graphical window such as the one shown in the preceding screenshot. However, given the purpose of this book, we will try to write code that makes the selection automatically, without further user input.

To write code that makes the selection, we must choose a certain criterion (either spatial or nonspatial) that separates the features of interest. In this case, for example, we can see that the pol features we wish to select are those closest to Lahav Kibbutz. Therefore, we can utilize the towns point layer (see earlier) to find the distance of each polygon from Lahav Kibbutz, and select the two most proximate ones.

Using the gDistance function, we will first find out the distances between each polygon in pol and each point in towns:

> dist_towns = gDistance(towns, pol, byid = TRUE)
> dist_towns
              1         2
112 14524.94060 12697.151
114  5484.66695  7529.195
137  3863.12168  5308.062
203    29.48651  1119.090
240  1910.61525  6372.634
371 11687.63594 11276.683
445 12751.21123 14371.268
460 14860.25487 12300.319

The returned matrix, named dist_towns, contains the pairwise distances, with rows corresponding to the pol features and columns corresponding to the towns features. Since Lahav Kibbutz corresponds to the first towns feature (column "1"), we can already see that the fourth and fifth pol features (rows "203" and "240") are the most proximate ones, thus corresponding to the Lahav and Kramim forests. We could subset both forests by simply using their IDs: pol[c("203","240"), ].
However, as always, we are looking for general code that will select, in this case, the two closest features irrespective of the specific IDs or row indices. For this purpose, we can use the order function, which we have not encountered so far. This function, given a numeric vector, returns the element indices in increasing order according to the element values. For example, applying order to the first column of dist_towns, we can see that the smallest element in this column is in the fourth row, the second smallest is in the fifth row, the third smallest is in the third row, and so on:

> dist_order = order(dist_towns[, 1])
> dist_order
[1] 4 5 3 2 6 7 1 8

We can use this result to select the relevant features of pol as follows:

> forests = pol[dist_order[1:2], ]

The subset SpatialPolygonsDataFrame, named forests, now contains only the two features from pol corresponding to the Lahav and Kramim forests.

> forests@data
    clumps   area
203    281 9.5274
240    314 6.7842

Let's visualize forests within the context of the other data we have by now. We will plot, once again, l_00 as the RGB background and pol on top of it. In addition, we will plot forests (in red) and the location of Lahav Kibbutz (as a red point). We will also add labels for each feature in pol, corresponding to its distance (in meters) from Lahav Kibbutz:

> plotRGB(l_00, r = 3, g = 2, b = 1, stretch = "lin")
> plot(towns[1, ], col = "red", pch = 16, add = TRUE)
> plot(pol, border = "yellow", lty = "dotted", add = TRUE)
> plot(forests, border = "red", lty = "dotted", add = TRUE)
> text(gCentroid(pol, byid = TRUE),
+   round(dist_towns[, 1]),
+   col = "White")

The graphical output is shown in the following screenshot:

The preceding screenshot demonstrates that we did indeed correctly select the features of interest. We can also assign the forest names to the attribute table of forests, relying on our knowledge that the first feature of forests (ID "203") is larger and more proximate to Lahav Kibbutz, and thus corresponds to the Lahav forest, while the second feature (ID "240") corresponds to Kramim.

> forests$name = c("Lahav", "Kramim")
> forests@data
    clumps   area   name
203    281 9.5274  Lahav
240    314 6.7842 Kramim

We now have a polygonal layer named forests, with two features delineating the Lahav and Kramim forests, named accordingly in the attribute table. In the next section, we will proceed with extracting the NDVI data for these forests.

Summary

In this article, we closed the gap between the two main spatial data types (rasters and vector layers). We now know how to make the conversion from a vector layer to a raster and vice versa, and we can transfer the geometry and data components from one data model to another when the need arises. We also saw how raster values can be extracted from a raster according to a vector layer, which is a fundamental step in many analysis tasks involving raster data.

Resources for Article:

Further resources on this subject:
Data visualization [article]
Machine Learning in Bioinformatics [article]
Specialized Machine Learning Topics [article]