How-To Tutorials


30 Oct 2013

11 min read

Using Location Data with PhoneGap


An introduction to Geolocation

The term geolocation refers to the process of identifying the real-world geographic location of an object. Devices that can detect the user's position are becoming more common each day, and we are now used to getting content based on our location (geo targeting). Using the Global Positioning System (GPS), a space-based satellite navigation system that provides location and time information consistently across the globe, you can now get the accurate location of a device.

During the early 1970s, the US military created Navstar, a defense navigation satellite system. Navstar laid the basis for the GPS infrastructure used today by billions of devices. Since 1978 more than 60 GPS satellites have been successfully placed in orbit around the Earth (refer to http://en.wikipedia.org/wiki/List_of_GPS_satellite_launches for a detailed report about the past and planned launches).

The location of a device is represented through a point. This point comprises two components: latitude and longitude. Modern devices can determine location information through several methods, including:

- Global Positioning System (GPS)
- IP address
- GSM/CDMA cell IDs
- Wi-Fi and Bluetooth MAC address

Each approach delivers the same information; what changes is the accuracy of the device's position. The GPS satellites continuously transmit information that a receiver can parse: for example, the general health of the GPS array, roughly where all of the satellites are in orbit, the precise orbit or path of the transmitting satellite, and the time of the transmission. The receiver calculates its own position by timing the signals sent by the satellites in the array that are visible. The process of measuring the distance from a point to a group of satellites to locate a position is known as trilateration.
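As a rough illustration of the distance measurement behind trilateration, the sketch below multiplies a signal's travel time by the speed of light. The function name and the sample travel time are assumptions made for this example; a real receiver also has to correct for clock error and atmospheric delay, which are ignored here.

```javascript
// Speed of light in a vacuum, in meters per second.
var SPEED_OF_LIGHT = 299792458;

// Hypothetical helper: distance (in meters) covered by a signal that was
// sent at `sentAt` and received at `receivedAt`, both given in seconds.
function distanceFromTravelTime(sentAt, receivedAt) {
    return SPEED_OF_LIGHT * (receivedAt - sentAt);
}

// A travel time of 0.067 seconds corresponds to roughly 20,000 km.
var meters = distanceFromTravelTime(0, 0.067);
console.log('Distance: ' + Math.round(meters / 1000) + ' km');
```

With distances to several visible satellites, the receiver can narrow its position down to the point where those distances agree.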
The distance to each satellite is determined from the speed of light, treated as a constant, and the time at which the signal left the satellite.

The emerging trend in mobile development is GPS-based "people discovery" apps such as Highlight, Sonar, Banjo, and Foursquare. Each app has different features and has been built for different purposes, but all of them share the same killer feature: using location as a piece of metadata in order to filter information according to the user's needs.

The PhoneGap Geolocation API

The Geolocation API is not a part of the HTML5 specification (it is defined in a separate W3C specification), but it is tightly integrated with mobile development. The PhoneGap Geolocation API and the W3C Geolocation API mirror each other; both define the same methods and the same arguments. Several devices already implement the W3C Geolocation API; on those devices you can use native support instead of the PhoneGap API.

As per the W3C specification, the user has to explicitly allow the website or the app to use the device's current position.

The Geolocation API is exposed through the geolocation object, a child of the navigator object, and consists of the following three methods:

- getCurrentPosition() returns the device position.
- watchPosition() watches for changes in the device position.
- clearWatch() stops the watcher for the device's position changes.

The watchPosition() and clearWatch() methods work in the same way as the setInterval() and clearInterval() methods; the first one returns an identifier that is passed in to the second one. The getCurrentPosition() and watchPosition() methods mirror each other and take the same arguments: a success callback function, a failure callback function, and an optional configuration object. The configuration object specifies the maximum age of a cached value of the device's position, sets a timeout after which the method will fail, and states whether the application wants the most accurate results available.
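The setInterval()/clearInterval() analogy can be sketched as follows. This is a minimal sketch, not part of the PhoneGap API: the helper name watchUntilAccurate and the 50-meter threshold are assumptions made for illustration. The geolocation object is passed in as a parameter so that the same function works with navigator.geolocation on a device or with a stand-in object in a test.

```javascript
// Hypothetical helper: keep watching the position until a fix with an
// accuracy of `maxErrorMeters` or better arrives, then stop the watcher.
function watchUntilAccurate(geo, maxErrorMeters, onFix) {
    var watchId = geo.watchPosition(function (position) {
        if (position.coords.accuracy <= maxErrorMeters) {
            // The identifier returned by watchPosition() stops the watcher.
            geo.clearWatch(watchId);
            onFix(position);
        }
    }, function (error) {
        geo.clearWatch(watchId);
        console.log('Geolocation error ' + error.code + ': ' + error.message);
    }, { maximumAge: 3000, timeout: 5000, enableHighAccuracy: true });
    return watchId;
}

// On a device this could be called as:
// watchUntilAccurate(navigator.geolocation, 50, function (position) {
//     console.log('Accurate fix at ' + position.timestamp);
// });
```

Stopping the watcher as soon as the fix is good enough avoids keeping the positioning hardware busy longer than the app needs.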
For example, the following call passes all three options:

    var options = { maximumAge: 3000, timeout: 5000, enableHighAccuracy: true };
    navigator.geolocation.watchPosition(onSuccess, onFailure, options);

Only the first argument is mandatory, but it's recommended to always handle the failure case.

The success handler function receives a Position object as its argument. By accessing its properties you can read the device's coordinates and the creation timestamp of the object that stores the coordinates.

    function onSuccess(position) {
        // coords is an object, so log its individual properties
        console.log('Coordinates: ' + position.coords.latitude + ', ' + position.coords.longitude);
        console.log('Timestamp: ' + position.timestamp);
    }

The coords property of the Position object contains a Coordinates object; by far the most important properties of this object are longitude and latitude. Using those properties it's possible to start integrating positioning information as relevant metadata in your app.

The failure handler receives a PositionError object as its argument. Using the code and the message properties of this object you can gracefully handle every possible error.

    function onFailure(error) {
        console.log('message: ' + error.message);
        console.log('code: ' + error.code);
    }

The message property returns a detailed description of the error and the code property returns an integer; the possible values are represented through the following pseudo constants:

- PositionError.PERMISSION_DENIED: the user denied the app permission to use the device's current position
- PositionError.POSITION_UNAVAILABLE: the position of the device cannot be determined
- PositionError.TIMEOUT: the specified timeout elapsed before the implementation could successfully acquire a new Position object

If you want to recover the last available position when the POSITION_UNAVAILABLE error is returned, you have to write a custom plugin that uses the platform-specific API. Android and iOS have this feature. You can find a detailed example at http://stackoverflow.com/questions/10897081/retrieving-last-known-geolocation-phonegap.

JavaScript doesn't support constants in the way that Java and other object-oriented programming languages do. With the term "pseudo constants", I refer to those values that should never change in a JavaScript app.

One of the most common tasks to perform with the device position information is to show the device location on a map. You can quickly perform this task by integrating Google Maps in your app; the only requirement is a valid API key. To get the key, use the following steps:

1. Visit the APIs console at https://code.google.com/apis/console and log in with your Google account.
2. Click the Services link on the left-hand menu.
3. Activate the Google Maps API v3 service.

Time for action – showing device position with Google Maps

Get ready to add a map renderer to the PhoneGap default app template. Refer to the following steps:

Open the command-line tool and create a new PhoneGap project named MapSample.

    $ cordova create ~/the/path/to/your/source/mapsample com.gnstudio.pg.MapSample MapSample

Add the Geolocation API plugin using the command line.

    $ cordova plugins add https://git-wip-us.apache.org/repos/asf/cordova-plugin-geolocation.git

Go to the www folder, open the index.html file, and add a div element with the id value map inside the main div of the app, below the #deviceready one.

    <div id='map'></div>

Add a new script tag to include the Google Maps JavaScript library.

    <script type="text/javascript" src="https://maps.googleapis.com/maps/api/js?key=YOUR_API_KEY&sensor=true"></script>

Go to the css folder and define a new rule inside the index.css file to give the div element and its content an appropriate size.

    #map {
        width: 280px;
        height: 230px;
        display: block;
        margin: 5px auto;
        position: relative;
    }

Go to the js folder, open the index.js file, and define a new function named initMap.
    initMap: function(lat, long){
        // The code needed to show the map and the
        // device position will be added here
    }

In the body of the function, define an options object in order to specify how the map has to be rendered.

    var options = {
        zoom: 8,
        center: new google.maps.LatLng(lat, long),
        mapTypeId: google.maps.MapTypeId.ROADMAP
    };

Add to the body of the initMap function the code that initializes the rendering of the map and shows a marker representing the device's current position over it.

    var map = new google.maps.Map(document.getElementById('map'), options);
    var markerPoint = new google.maps.LatLng(lat, long);
    var marker = new google.maps.Marker({
        position: markerPoint,
        map: map,
        // double quotes, because the title contains an apostrophe
        title: "Device's Location"
    });

Define a function to use as the success handler and call from its body the initMap function previously defined.

    onSuccess: function(position){
        var coords = position.coords;
        app.initMap(coords.latitude, coords.longitude);
    }

Define another function to act as a failure handler that notifies the user that something went wrong.

    onFailure: function(error){
        navigator.notification.alert(error.message, null);
    }

Go into the deviceready function and add, as the last statement, the call to the Geolocation API needed to recover the device's position.

    navigator.geolocation.getCurrentPosition(app.onSuccess, app.onFailure, {timeout: 5000, enableHighAccuracy: false});

Open the command-line tool, build the app, and then run it on your testing devices.

    $ cordova build
    $ cordova run android

What just happened?

You integrated Google Maps inside an app. The map is an interactive map most users are familiar with: the most common gestures are already working and the Google Street View controls are already enabled.

To successfully load the Google Maps API on iOS, it's mandatory to whitelist the googleapis.com and gstatic.com domains.
Open the .plist file of the project as source code (right-click on the file and then Open As | Source Code) and add the following array of domains:

    <key>ExternalHosts</key>
    <array>
        <string>*.googleapis.com</string>
        <string>*.gstatic.com</string>
    </array>

Other Geolocation data

In the previous example, you only used the latitude and longitude properties of the position object that you received. There are other attributes that can be accessed as properties of the Coordinates object:

- altitude, the height of the device, in meters, above sea level.
- accuracy, the accuracy level of the latitude and longitude, in meters; it can be used to show a radius of accuracy when mapping the device's position.
- altitudeAccuracy, the accuracy of the altitude in meters.
- heading, the direction of the device in degrees clockwise from true north.
- speed, the current ground speed of the device in meters per second.

Latitude and longitude are the best supported of these properties, and the ones that will be most useful when communicating with remote APIs. The other properties are mainly useful if you're developing an application for which Geolocation is a core component of its standard functionality, such as apps that use this data to create a flow of information contextualized to the geolocation data. The accuracy property is the most important of these additional properties because, as an application developer, you typically won't know which particular sensor is giving you the location, and you can use the accuracy property as a range in your queries to external services.

There are several APIs that allow you to discover interesting data related to a place; among these the most interesting are the Google Places API and the Foursquare API. The Google Places and Foursquare online documentation is very well organized and it's the right place to start if you want to dig deeper into these topics.
You can access the Google Places docs at https://developers.google.com/maps/documentation/javascript/places and Foursquare at https://developer.foursquare.com/. The itinero reference app for this article implements both the APIs. In the next example, you will look at how to integrate Google Places inside the RequireJS app. In order to include the Google Places API inside an app, all you have to do is add the libraries parameter to the Google Maps API call. The resulting URL should look similar to http://maps.google.com/maps/api/js?key=SECRET_KEY&sensor=true&libraries=places. The itinero app lets users create and plan a trip with friends. Once the user provides the name of the trip, the name of the country to be visited, and the trip mates and dates, it's time to start selecting the travel, eat, and sleep options. When the user selects the Eat option, the Google Places data provider will return bakeries, take-out places, groceries, and so on, close to the trip's destination. The app will show on the screen a list of possible places the user can select to plan the trip. For a complete list of the types of place searches supported by the Google API, refer to the online documentation at https://developers.google.com/places/documentation/supported_types.
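The idea of using the accuracy property as a range in queries to external services can be sketched as follows. This is only a sketch under stated assumptions: the helper name buildNearbyRequest and the minimum radius are hypothetical, and the location, radius, and types field names reflect my understanding of the Places library's nearby search request, so check them against the online documentation before relying on them.

```javascript
// Hypothetical helper: turn a Coordinates-like object into a plain request
// object for a nearby place search. The search radius never drops below
// `minRadius` meters and grows when the position fix is less accurate.
function buildNearbyRequest(coords, types, minRadius) {
    var radius = Math.max(coords.accuracy || 0, minRadius);
    return {
        location: { lat: coords.latitude, lng: coords.longitude },
        radius: Math.ceil(radius),
        types: types
    };
}

// Example with made-up coordinates and an "Eat" style selection.
var request = buildNearbyRequest(
    { latitude: 45.07, longitude: 7.68, accuracy: 65 },
    ['bakery', 'meal_takeaway'],
    500
);
console.log(JSON.stringify(request));
```

Keeping the request a plain object makes the accuracy logic easy to check on its own, before the object is handed to whichever place search service the app uses.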


Packt

30 Oct 2013

7 min read

The DHTMLX Grid



Packt

30 Oct 2013

14 min read

The Dialog Widget


Wijmo additions to the dialog widget at a glance

By default, the dialog window includes the pin, toggle, minimize, maximize, and close buttons. Pinning the dialog to a location on the screen disables the dragging feature on the title bar. The dialog can still be resized. Maximizing the dialog makes it take up the area inside the browser window. Toggling it expands or collapses it so that the dialog contents are shown or hidden with the title bar remaining visible. If these buttons cramp your style, they can be turned off with the captionButtons option. You can see how the dialog is presented in the browser from the following screenshot:

Wijmo adds API beyond what jQuery UI offers for changing the behavior of the dialog. The new API is mostly for the buttons in the title bar and for managing window stacking. Window stacking determines which windows are drawn on top of other ones. Clicking on a dialog raises it above other dialogs and changes their window stacking settings. The following table shows the options, events, and methods added in Wijmo.

    Options              Events            Methods
    captionButtons       blur              disable
    contentUrl           buttonCreating    enable
    disabled             stateChanged      getState
    expandingAnimation                     maximize
    stack                                  minimize
    zIndex                                 pin
                                           refresh
                                           reset
                                           restore
                                           toggle
                                           widget

The contentUrl option allows you to specify a URL to load within the window. The expandingAnimation option is applied when the dialog is toggled from a collapsed state to an expanded state. The stack and zIndex options determine whether the dialog sits on top of other dialogs. Similar to the blur event on input elements, the blur event for the dialog is fired when the dialog loses focus. The buttonCreating event is fired when buttons are created, and its handler can modify the buttons on the title bar. The disable method disables the event handlers for the dialog. It prevents the default button actions and disables dragging and resizing. The widget method returns the dialog HTML element.
The methods maximize, minimize, pin, refresh, reset, restore, and toggle are also available as buttons on the title bar. The best way to see what they do is to play around with them. In addition, the getState method is used to find the dialog state and returns either maximized, minimized, or normal. Similarly, the stateChanged event is fired when the state of the dialog changes.

A method is invoked by passing its name as a string to the wijdialog method. To disable button interactions, pass the string disable:

    $("#dialog").wijdialog("disable");

Many of the methods come in pairs, and enable and disable are one such pair. Calling enable enables the buttons again. Another pair is restore/minimize. minimize hides the dialog in a tray at the bottom left of the screen. restore sets the dialog back to its normal size and displays it again.

The most important option for usability is the captionButtons option. Although users are likely familiar with the minimize, resize, and close buttons, the pin and toggle buttons are not featured in common desktop environments. Therefore, you will want to choose which buttons are visible depending on how you use the dialog box in your project. To turn off a button on the title bar, set its visible option to false. A dialog window that, like the jQuery UI default, has only the close button can be created with:

    $("#dialog").wijdialog({
        captionButtons: {
            pin: { visible: false },
            refresh: { visible: false },
            toggle: { visible: false },
            minimize: { visible: false },
            maximize: { visible: false }
        }
    });

The other options for each button are click, iconClassOff, and iconClassOn. The click option specifies an event handler for the button. The buttons come with default actions, however, so you will want to use different icons for custom actions. That's where the iconClass options come in. iconClassOn defines the CSS class for the button icon when it is loaded. iconClassOff is the class for the button icon after clicking.
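Because every hidden button needs its own visible: false entry, the captionButtons object can get repetitive. The sketch below builds it from a list of the buttons to keep. The helper name captionButtonsShowing is hypothetical and not part of Wijmo; the six button names are the ones used in this article's examples.

```javascript
// Title bar button names as they appear in this article's examples.
var CAPTION_BUTTONS = ['pin', 'refresh', 'toggle', 'minimize', 'maximize', 'close'];

// Hypothetical helper: return a captionButtons object in which only the
// named buttons are visible and every other button is hidden.
function captionButtonsShowing(visibleNames) {
    var options = {};
    CAPTION_BUTTONS.forEach(function (name) {
        options[name] = { visible: visibleNames.indexOf(name) !== -1 };
    });
    return options;
}

// With jQuery and Wijmo loaded on the page, this could be used as:
// $('#dialog').wijdialog({ captionButtons: captionButtonsShowing(['close']) });
console.log(JSON.stringify(captionButtonsShowing(['close'])));
```

Listing the visible buttons rather than the hidden ones keeps the intent readable when the dialog needs only one or two controls.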
For a list of available jQuery UI icons and their classes, see http://jquery-ui.googlecode.com/svn/tags/1.6rc5/tests/static/icons.html. Our next example uses ui-icon-zoomin, ui-icon-zoomout, and ui-icon-lightbulb. They can be found by toggling the text for the icons on the web page as shown in the preceding screenshot.

Adding custom buttons

jQuery UI's dialog API lacks an option for configuring the buttons shown on the title bar. Wijmo not only comes with useful default buttons, but also lets you override them easily.

    <!DOCTYPE HTML>
    <html>
    <head>
    ...
    <style>
    .plus { font-size: 150%; }
    </style>
    <script id="scriptInit" type="text/javascript">
    $(document).ready(function () {
        $('#dialog').wijdialog({
            autoOpen: true,
            captionButtons: {
                pin: { visible: false },
                refresh: { visible: false },
                toggle: {
                    visible: true,
                    click: function () { $('#dialog').toggleClass('plus'); },
                    iconClassOn: 'ui-icon-zoomin',
                    iconClassOff: 'ui-icon-zoomout'
                },
                minimize: { visible: false },
                maximize: {
                    visible: true,
                    click: function () { alert('To enlarge text, click the zoom icon.'); },
                    iconClassOn: 'ui-icon-lightbulb'
                },
                close: {
                    visible: true,
                    // close the dialog itself; self.close would refer to window.close
                    click: function () { $('#dialog').wijdialog('close'); },
                    iconClassOn: 'ui-icon-close'
                }
            }
        });
    });
    </script>
    </head>
    <body>
    <div id="dialog" title="Basic dialog">
    <p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate</p>
    </div>
    </body>
    </html>

We create a dialog window, passing in the captionButtons option. The pin, refresh, and minimize buttons have visible set to false so that the title bar is initialized without them. The final output looks as shown in the following screenshot:

In addition, the toggle and maximize buttons are modified and given custom behaviors.
The toggle button toggles the font size of the text by applying or removing a CSS class. Its default icon, set with iconClassOn, indicates that clicking on it will zoom in on the text. Once clicked, the icon changes to a zoom out icon. Likewise, the behavior and appearance of the maximize button have been changed. In the position where the maximize icon was displayed in the title bar previously, there is now a lightbulb icon with a tip. Although this method of adding new buttons to the title bar seems clumsy, it is the only option that Wijmo currently offers. Adding buttons in the content area is much simpler. The buttons option specifies the buttons to be displayed in the dialog window content area below the title bar. For example, to display a simple confirmation button: $('#dialog').wijdialog({buttons: {ok: function () { $(this).wijdialog('close') }}}); The text displayed on the button is ok and clicking on the button hides the dialog. Calling $('#dialog').wijdialog('open') will show the dialog again. Configuring the dialog widget's appearance Wijmo offers several options that change the dialog's appearance including title, height, width, and position. The title of the dialog can be changed either by setting the title attribute of the div element of the dialog, or by using the title option. To change the dialog's theme, you can use CSS styling on the wijmo-wijdialog and wijmo-wijdialog-captionbutton classes: <!DOCTYPE HTML> <html> <head> ... 
<style> .wijmo-wijdialog { /*rounded corners*/ -webkit-border-radius: 12px; border-radius: 12px; background-clip: padding-box; /*shadow behind dialog window*/ -moz-box-shadow: 3px 3px 5px 6px #ccc; -webkit-box-shadow: 3px 3px 5px 6px #ccc; box-shadow: 3px 3px 5px 6px #ccc; /*fade contents from dark gray to gray*/ background-image: -webkit-gradient(linear, left top, left bottom, from(#444444), to(#999999)); background-image: -webkit-linear-gradient(top, #444444, #999999); background-image: -moz-linear-gradient(top, #444444, #999999); background-image: -o-linear-gradient(top, #444444, #999999); background-image: linear-gradient(to bottom, #444444, #999999); background-color: transparent; text-shadow: 1px 1px 3px #888; } </style> <script id="scriptInit" type="text/javascript"> $(document).ready(function () { $('#dialog').wijdialog({width: 350}); }); </script> </head> <body> <div id="dialog" title="Subtle gradients"> <p>Loremipsum dolor sitamet, consectetueradipiscingelit. Aeneancommodo ligula eget dolor.Aeneanmassa. Cum sociisnatoquepenatibusetmagnis dis parturient montes, nasceturridiculus mus. Donec quam felis, ultriciesnec, pellentesqueeu, pretiumquis, sem. Nullaconsequatmassaquisenim. Donecpedejusto, fringillavel, aliquetnec, vulputate </p> </div> </body> </html> We now add rounded boxes, a box shadow, and a text shadow to the dialog box. This is done with the .wijmo-wijdialog class. Since many of the CSS3 properties have different names on different browsers, the browser specific properties are used. For example, -webkit-box-shadow is necessary on Webkit-based browsers. The dialog width is set to 350 px when initialized so that the title text and buttons all fit on one line. Loading external content Wijmo makes it easy to load content in an iFrame. 
Simply pass a URL with the contentUrl option: $(document).ready(function () { $("#dialog").wijdialog({captionButtons: { pin: { visible: false }, refresh: { visible: true }, toggle: { visible: false }, minimize: { visible: false }, maximize: { visible: true }, close: { visible: false } }, contentUrl: "http://wijmo.com/demo/themes/" }); }); This will load the Wijmo theme explorer in a dialog window with refresh and maximize/restore buttons. This output can be seen in the following screenshot: The refresh button reloads the content in the iFrame, which is useful for dynamic content. The maximize button resizes the dialog window. Form Components Wijmo form decorator widgets for radio button, checkbox, dropdown, and textbox elements give forms a consistent visual style across all platforms. There are separate libraries for decorating the dropdown and other form elements, but Wijmo gives them a consistent theme. jQuery UI lacks form decorators, leaving the styling of form components to the designer. Using Wijmo form components saves time during development and presents a consistent interface across all browsers. Checkbox The checkbox widget is an excellent example of the style enhancements that Wijmo provides over default form controls. The checkbox is used if multiple choices are allowed. The following screenshot shows the different checkbox states: Wijmo adds rounded corners, gradients, and hover highlighting to the checkbox. Also, the increased size makes it more usable. Wijmo checkboxes can be initialized to be checked. The code for this purpose is as follows: <!DOCTYPE HTML> <html> <head> ... 
    <script id="scriptInit" type="text/javascript">
    $(document).ready(function () {
        $("#checkbox3").wijcheckbox({checked: true});
        $(":input[type='checkbox']:not(:checked)").wijcheckbox();
    });
    </script>
    <style>
    div { display: block; margin-top: 2em; }
    </style>
    </head>
    <body>
    <div><input type='checkbox' id='checkbox1' /><label for='checkbox1'>Unchecked</label></div>
    <div><input type='checkbox' id='checkbox2' /><label for='checkbox2'>Hover</label></div>
    <div><input type='checkbox' id='checkbox3' /><label for='checkbox3'>Checked</label></div>
    </body>
    </html>

In this instance, checkbox3 is set to Checked as it is initialized. You will not get the same result if one of the checkboxes is initialized twice. Here, we avoid that by selecting only the checkboxes that are not checked after checkbox3 is set to be Checked.

Radio buttons

Radio buttons, in contrast with checkboxes, allow only one of several options to be selected. In addition, they are customized through the HTML markup rather than a JavaScript API. To illustrate, the checked option is set by the checked attribute:

    <input type="radio" checked />

jQuery UI offers a button widget for radio buttons, as shown in the following screenshot, which in my experience causes confusion as users think that they can select multiple options:

The Wijmo radio buttons are closer in appearance to regular radio buttons, so users expect the same behavior, as shown in the following screenshot:

Wijmo radio buttons are initialized by calling the wijradio method on radio button elements:

    <!DOCTYPE html>
    <html>
    <head>
    ...
<script id="scriptInit" type="text/javascript">$(document).ready(function () { $(":input[type='radio']").wijradio({ changed: function (e, data) { if (data.checked) { alert($(this).attr('id') + ' is checked') } } }); }); </script> </head> <body> <div id="radio"> <input type="radio" id="radio1" name="radio"/><label for="radio1">Choice 1</label> <input type="radio" id="radio2" name="radio" checked="checked"/><label for="radio2">Choice 2</label> <input type="radio" id="radio3" name="radio"/><label for="radio3">Choice 3</label> </div> </body> </html> In this example, the changed option, which is also available for checkboxes, is set to a handler. The handler is passed a jQuery.Event object as the first argument. It is just a JavaScript event object normalized for consistency across browsers. The second argument exposes the state of the widget. For both checkboxes and radio buttons, it is an object with only the checked property. Dropdown Styling a dropdown to be consistent across all browsers is notoriously difficult. Wijmo offers two options for styling the HTML select and option elements. When there are no option groups, the ComboBox is the better widget to use. For a dropdown with nested options under option groups, only the wijdropdown widget will work. As an example, consider a country selector categorized by continent: <!DOCTYPE HTML> <html> <head> ... 
<script id="scriptInit" type="text/javascript"> $(document).ready(function () { $('select[name=country]').wijdropdown(); $('#reset').button().click(function(){ $('select[name=country]').wijdropdown('destroy') }); $('#refresh').button().click(function(){ $('select[name=country]').wijdropdown('refresh') }) }); </script> </head> <body> <button id="reset"> Reset </button> <button id="refresh"> Refresh </button> <select name="country" style="width:170px"> <optgroup label="Africa"> <option value="gam">Gambia</option> <option value="mad">Madagascar</option> <option value="nam">Namibia</option> </optgroup> <optgroup label="Europe"> <option value="fra">France</option> <option value="rus">Russia</option> </optgroup> <optgroup label="North America"> <option value="can">Canada</option> <option value="mex">Mexico</option> <option selected="selected" value="usa">United States</option> </optgroup> </select> </body> </html> The select element's width is set to 170 pixels so that when the dropdown is initialized, both the dropdown menu and items have a width of 170 pixels. This allows the North America option category to be displayed on a single line, as shown in the following screenshot. Although the dropdown widget lacks a width option, it takes the select element's width when it is initialized. To initialize the dropdown, call the wijdropdown method on the select element: $('select[name=country]').wijdropdown(); The dropdown element uses the blind animation to show the items when the menu is toggled. Also, it applies the same click animation as on buttons to the slider and menu: To reset the dropdown to a select box, I've added a reset button that calls the destroy method. If you have JavaScript code that dynamically changes the styling of the dropdown, the refresh method applies the Wijmo styles again. Summary The Wijmo dialog widget is an extension of the jQuery UI dialog. In this article, the features unique to Wijmo's dialog widget are explored and given emphasis. 
I showed you how to add custom buttons, how to change the dialog appearance, and how to load content from other URLs in the dialog. We also learned about Wijmo's form components. A checkbox is used when multiple items can be selected. Wijmo's checkbox widget has style enhancements over the default checkboxes. Radio buttons are used when only one item is to be selected. While jQuery UI only supports button sets on radio buttons, Wijmo's radio buttons are much more intuitive. Wijmo's dropdown widget should only be used when there are nested or categorized <select> options. The ComboBox comes with more features when the structure of the options is flat. Resources for Article: Further resources on this subject: Wijmo Widgets [Article] jQuery Animation: Tips and Tricks [Article] Building a Custom Version of jQuery [Article]


Packt

30 Oct 2013

10 min read

Highlights of Greenplum


(For more resources related to this topic, see here.) Big Data analytics – platform requirements Organizations are striving towards becoming more data driven and leverage data to gain the competitive advantage. It is inevitable that any current business intelligence infrastructure needs to be upgraded to include Big Data technologies and analytics needs to be embedded into every core business process. The following diagram depicts a matrix that connects requirements from low storage/cost to high storage/cost information management systems and analytics applications. The following section lists all the capabilities that an integrated platform for Big Data analytics should have: A data integration platform that can integrate data from any source, of any type, and highly voluminous in nature. This includes efficient data extraction, data cleansing, transformation, and loading capabilities. A data storage platform that can hold structured, unstructured, and semistructured data with a capability to slice and dice data to any degree, discarding the format. In short, while we store data, we should be able to use the best suited platform for a given data format (for example: structured data to use relational store, semi-structured data to use NoSQL store, and unstructured data to use a file store) and still be able to join data across platforms to run analytics. Support for running standard analytics functions and standard analytical tools on data that has characteristics described previously. Modular and elastically scalable hardware that wouldn't force changes to architecture/design with growing needs to handle bigger data and more complex processing requirements. A centralized management and monitoring system. Highly available and fault tolerant platform that can repair itself in times of any hardware failure seamlessly. Support for advanced visualizations to communicate insights in an effective way. 
A collaboration platform that can help end users perform the functions of loading, exploring, and visualizing data, and other workflow aspects as an end-to-end process. Core components The following figure depicts core software components of Greenplum UAP: In this section, we will take a brief look at what each component is and take a deep dive into their functions in the sections to follow. Greenplum Database Greenplum Database is a shared nothing, massively parallel processing solution built to support next generation data warehousing and Big Data analytics processing. It stores and analyzes voluminous structured data. It comes in a software-only version that works on commodity servers (this being its unique selling point) and additionally also is available as an appliance (DCA) that can take advantage of large clusters of powerful servers, storage, and switches. GPDB (Greenplum Database) comes with a parallel query optimizer that uses a cost-based algorithm to evaluate and select optimal query plans. Its high-speed interconnection supports continuous pipelining for data processing. In its new distribution under Pivotal, Greenplum Database is called Pivotal (Greenplum) Database. Shared nothing, massive parallel processing (MPP) systems, and elastic scalability Until now, our applications have been benchmarked for certain performance and the core hardware and its architecture determined its readiness for further scalability that came at a cost, be it in terms of changes to the design or hardware augmentation. With growing data volumes, scalability and total cost of ownership is becoming a big challenge and the need for elastic scalability has become prime. This section compares shared disk, shared memory, and shared nothing data architectures and introduces the concept of massive parallel processing. Greenplum Database and HD components implement shared nothing data architecture with master/worker paradigm demonstrating massive parallel processing capabilities. 
Shared disk data architecture Have a look at the following figure which gives an idea about shared disk data architecture: Shared disk data architecture refers to an architecture where there is a data disk that holds all the data and each node in the cluster accesses this data for processing. Any data operations can be performed by any node at a given point in time and in case two nodes attempt persisting/writing a tuple at the same time, to ensure consistency, a disk-based lock or intended lock communication is passed on thus affecting the performance. Further with increase in the number of nodes, contention at the database level increases. These architectures are write limited as there is a need to handle the locks across the nodes in the cluster. Even in case of the reads, partitioning should be implemented effectively to avoid complete table scans. Shared memory data architecture Have a look at the following figure which gives an idea about shared memory data architecture: In memory, data grids come under the shared memory data architecture category. In this architecture paradigm, data is held in memory that is accessible to all the nodes within the cluster. The major advantage with this architecture is that there would be no disk I/O involved and data access is very quick. This advantage comes with an additional need for loading and synchronizing data in memory with the underlying data store. The memory layer seen in the following figure can be distributed and local to the compute nodes or can exist as data node. Shared nothing data architecture Though an old paradigm, shared nothing data architecture is gaining traction in the context of Big Data. Here the data is distributed across the nodes in the cluster and every processor operates on the data local to itself. The location where data resides is referred to as data node and where the processing logic resides is called compute node. It can happen that both nodes, compute and data, are physically one. 
These nodes within the cluster are connected using high-speed interconnects. The following figure depicts two aspects of the architecture, the one on the left represents data and computes decoupled processes and the other to the right represents data and computes processes co-located: One of the most important aspects of shared nothing data architecture is the fact that there will not be any contention or locks that would need to be addressed. Data is distributed across the nodes within the cluster using a distribution plan that is defined as a part of the schema definition. Additionally, for higher query efficiency, partitioning can be done at the node level. Any requirement for a distributed lock would bring in complexity and an efficient distribution and partitioning strategy would be a key success factor. Reads are usually the most efficient relative to shared disk databases. Again, the efficiency is determined by the distribution policy, if a query needs to join data across the nodes in the cluster, users would see a temporary redistribution step that would bring required data elements together into another node before the query result is returned. Shared nothing data architecture thus supports massive parallel processing capabilities. Some of the features of shared nothing data architecture are as follows: It can scale extremely well on general purpose systems It provides automatic parallelization in loading and querying any database It has optimized I/O and can scan and process nodes in parallel It supports linear scalability, also referred to as elastic scalability, by adding a new node to the cluster, additional storage, and processing capability, both in terms of load performance and query performance is gained The Greenplum high-availability architecture In addition to primary Greenplum system components, we can also optionally deploy redundant components for high availability and avoiding single point of failure. 
The following components need to be implemented for data redundancy: Mirror segment instances: A mirror segment always resides on a different host than its primary segment. Mirroring provides you with a replica of the database contained in a segment. This may be useful in the event of disk/hardware failure. The metadata regarding the replica is stored on the master server in system catalog tables. Standby master host: For a fully redundant Greenplum Database system, a mirror of the Greenplum master can be deployed. A backup Greenplum master host serves as a warm standby in cases when the primary master host becomes unavailable. The standby master host is synchronized periodically and kept up-to-date using transaction replication log process that runs on the standby master to keep the master host and standby in sync. In the event of master host failure the standby master is activated and constructed using the transaction logs. Dual interconnect switches: A highly available interconnect can be achieved by deploying redundant network interfaces on all Greenplum hosts and a dual Gigabit Ethernet. The default configuration is to have one network interface per primary segment instance on a segment host (both the interconnects are by default 10Gig in DCA). External tables External tables in Greenplum refer to those database tables that help Greenplum Database access data from a source that is outside of the database. We can have different external tables for different formats. Greenplum supports fast, parallel, as well as nonparallel data loading and unloading. The external tables act as an interfacing point to external data source and give an impression of a local data source to the accessing function. File-based data sources are supported by external tables. 
The following file formats can be loaded onto external tables:

Regular file-based source (supports Text, CSV, and XML data formats): file:// or gpfdist:// protocol
Web-based file source (supports Text, CSV, OS commands, and scripts): http:// protocol
Hadoop-based file source (supports Text and custom/user-defined formats): gphdfs:// protocol

Following is the syntax for the creation and deletion of readable and writable external tables.

To create a read-only external table:

CREATE EXTERNAL (WEB) TABLE <<table name>> (<<column definitions>>)
LOCATION (<<file paths>>) | EXECUTE '<<command>>'
FORMAT '<<format name, for example: TEXT>>' (DELIMITER '<<delimiter>>');

To create a writable external table:

CREATE WRITABLE EXTERNAL (WEB) TABLE <<table name>> (<<column definitions>>)
LOCATION (<<file paths>>) | EXECUTE '<<command>>'
FORMAT '<<format name, for example: TEXT>>' (DELIMITER '<<delimiter>>');

To drop an external table:

DROP EXTERNAL (WEB) TABLE <<table name>>;

Following are examples of using the file:// and gphdfs:// protocols:

CREATE EXTERNAL TABLE test_load_file
( id int, name text, date date, description text )
LOCATION (
  'file://filehost:6781/data/folder1/*',
  'file://filehost:6781/data/folder2/*',
  'file://filehost:6781/data/folder3/*.csv'
)
FORMAT 'CSV' (HEADER);

In the preceding example, data is loaded from three different locations on the file server; also, as you can see, the wildcard notation can differ for each location. When the files are located on HDFS, the following notation needs to be used (in the following example, the file is '|' delimited):

CREATE EXTERNAL TABLE test_load_file
( id int, name text, date date, description text )
LOCATION ( 'gphdfs://hdfshost:8081/data/filename.txt' )
FORMAT 'TEXT' (DELIMITER '|');

Summary

In this article, we have learned about Greenplum UAP and Greenplum Database. This article also gives information about the core components of Greenplum UAP.
Resources for Article: Further resources on this subject: Making Big Data Work for Hadoop and Solr [Article] Big Data Analysis [Article] Core Data iOS: Designing a Data Model and Building Data Objects [Article]


Packt

30 Oct 2013

16 min read

IBM SPSS Modeler – Pushing the Limits


(For more resources related to this topic, see here.) Using the Feature Selection node creatively to remove or decapitate perfect predictors In this recipe, we will identify perfect or near perfect predictors in order to insure that they do not contaminate our model. Perfect predictors earn their name by being correct 100 percent of the time, usually indicating circular logic and not a prediction of value. It is a common and serious problem. When this occurs we have accidentally allowed information into the model that could not possibly be known at the time of the prediction. Everyone 30 days late on their mortgage receives a late letter, but receiving a late letter is not a good predictor of lateness because their lateness caused the letter, not the other way around. The rather colorful term decapitate is borrowed from the data miner Dorian Pyle. It is a reference to the fact that perfect predictors will be found at the top of any list of key drivers ("caput" means head in Latin). Therefore, to decapitate is to remove the variable at the top. Their status at the top of the list will be capitalized upon in this recipe. The following table shows the three time periods; the past, the present, and the future. It is important to remember that, when we are making predictions, we can use information from the past to predict the present or the future but we cannot use information from the future to predict the future. This seems obvious, but it is common to see analysts use information that was gathered after the date for which predictions are made. As an example, if a company sends out a notice after a customer has churned, you cannot say that the notice is predictive of churning. 
Customer | Contract Start (past) | Expiration (now) | Outcome (future) | Renewal Date (future)
Joe | January 1, 2010 | January 1, 2012 | Renewed | January 2, 2012
Ann | February 15, 2010 | February 15, 2012 | Out of Contract | Null
Bill | March 21, 2010 | March 21, 2012 | Churn | NA
Jack | April 5, 2010 | April 5, 2012 | Renewed | April 9, 2012
New Customer | 24 Months Ago | Today | ??? | ???

Getting ready

We will start with a blank stream, and will be using the cup98lrn reduced vars2.txt data set.

How to do it...

To identify perfect or near-perfect predictors in order to ensure that they do not contaminate our model:

1. Build a stream with a Source node, a Type node, and a Table node, then force instantiation by running the Table node.
2. Force TARGET_B to be flag and make it the target.
3. Add a Feature Selection Modeling node and run it.
4. Edit the resulting generated model and examine the results. In particular, focus on the top of the list.
5. Review what you know about the top variables, and check to see if any could be related to the target by definition or could possibly be based on information that actually postdates the information in the target.
6. Add a CHAID Modeling node, set it to run in Interactive mode, and run it.
7. Examine the first branch, looking for any child node that might be perfectly predicted; that is, look for child nodes whose members are all found in one category.
8. Repeat steps 6 and 7 for the first several variables.
9. Set any problematic variables (identified in steps 5 and/or 7) to None in the Type node.

How it works...

Which variables need decapitation? The problem is information that, although it was known at the time that you extracted it, was not known at the time of decision. In this case, the time of decision is the moment at which the potential donor decided to donate or not to donate. Was the amount, TARGET_D, known before the decision was made to donate? Clearly not. No information that dates after the information in the target variable can ever be used in a predictive model.
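The check in step 7 can also be sketched outside Modeler. The following Python sketch is an illustration only: it is not the Feature Selection or CHAID algorithm, and the field names and toy records are hypothetical. It flags any field that has a category whose records all share a single target value, which is the "all members in one category" symptom described in the steps.

```python
from collections import defaultdict


def pure_categories(records, field, target, min_rows=2):
    """Return {category: target_value} for categories of `field` whose
    records all share one target value (a 'pure' child node)."""
    values_seen = defaultdict(set)
    row_counts = defaultdict(int)
    for row in records:
        values_seen[row[field]].add(row[target])
        row_counts[row[field]] += 1
    return {
        category: next(iter(values))
        for category, values in values_seen.items()
        # Ignore tiny categories: a single row is trivially pure.
        if len(values) == 1 and row_counts[category] >= min_rows
    }


def suspicious_fields(records, fields, target):
    """Map each field with at least one pure category to those categories."""
    flagged = {}
    for field in fields:
        pure = pure_categories(records, field, target)
        if pure:
            flagged[field] = pure
    return flagged


# Hypothetical toy data: the late letter is sent *because* of lateness,
# so it predicts the target perfectly and should be removed.
records = [
    {"late_letter": "yes", "region": "north", "late": 1},
    {"late_letter": "yes", "region": "south", "late": 1},
    {"late_letter": "no", "region": "north", "late": 0},
    {"late_letter": "no", "region": "south", "late": 0},
]

flagged = suspicious_fields(records, ["late_letter", "region"], "late")
print(flagged)
```

A flagged field is only a candidate for removal. As in step 5, you still need to know the data to decide whether the field postdates the target.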
This recipe is built of the following foundation—variables with this problem will float up to the top of the Feature Selection results. They may not always be perfect predictors, but perfect predictors always must go. For example, you might find that, if a customer initially rejects or postpones a purchase, there should be a follow up sales call in 90 days. They are recorded as rejected offer in the campaign, and as a result most of them had a follow up call in 90 days after the campaign. Since a couple of the follow up calls might not have happened, it won't be a perfect predictor, but it still must go. Note that variables such as RFA_2 and RFA_2A are both very recent information and highly predictive. Are they a problem? You can't be absolutely certain without knowing the data. Here the information recorded in these variables is calculated just prior to the campaign. If the calculation was made just after, they would have to go. The CHAID tree almost certainly would have shown evidence of perfect prediction in this case. There's more... Sometimes a model has to have a lot of lead time; predicting today's weather is a different challenge than next year's prediction in the farmer's almanac. When more lead time is desired you could consider dropping all of the _2 series variables. What would the advantage be? What if you were buying advertising space and there was a 45 day delay for the advertisement to appear? If the _2 variables occur between your advertising deadline and your campaign you might have to use information attained in the _3 campaign. Next-Best-Offer for large datasets Association models have been the basis for next-best-offer recommendation engines for a long time. Recommendation engines are widely used for presenting customers with cross-sell offers. For example, if a customer purchases a shirt, pants, and a belt; which shoes would he also likely buy? 
This type of analysis is often called market-basket analysis as we are trying to understand which items customers purchase in the same basket/transaction. Recommendations must be very granular (for example, at the product level) to be usable at the check-out register, website, and so on. For example, knowing that female customers purchase a wallet 63.9 percent of the time when they buy a purse is not directly actionable. However, knowing that customers who purchase a specific purse (for example, SKU 25343) also purchase a specific wallet (for example, SKU 98343) 51.8 percent of the time can be the basis for future recommendations.

Product level recommendations require the analysis of massive data sets (that is, millions of rows). Usually, this data is in the form of sales transactions where each line item (that is, row of data) represents a single product. The line items are tied together by a single transaction ID. IBM SPSS Modeler association models support both tabular and transactional data. The tabular format requires each product to be represented as a column. As most product level recommendations would involve thousands of products, this format is not practical. The transactional format uses the transactional data directly and requires only two inputs, the transaction ID and the product/item.

Getting ready

This example uses the files transactions.sav and scoring.csv.

How to do it...

To recommend the next best offer for large datasets:

1. Start with a new stream by navigating to File | New Stream.
2. Go to File | Stream Properties from the IBM SPSS Modeler menu bar. On the Options tab change the Maximum members for nominal fields to 50000. Click on OK.
3. Add a Statistics File source node to the upper left of the stream. Set the file field by navigating to transactions.sav. On the Types tab, change the Product_Code field to Nominal and click on the Read Values button. Click on OK.
4. Add a CARMA Modeling node connected to the Statistics File source node in step 3. On the Fields tab, click on Use custom settings and check the Use transactional format check box. Select Transaction_ID as the ID field and Product_Code as the Content field.
5. On the Model tab of the CARMA Modeling node, change the Minimum rule support (%) to 0.0 and the Minimum rule confidence (%) to 5.0. Click on the Run button to build the model. Double-click the generated model to ensure that you have approximately 40,000 rules.
6. Add a Var File source node to the middle left of the stream. Set the file field by navigating to scoring.csv. On the Types tab, click on the Read Values button. Click on the Preview button to preview the data. Click on OK to dismiss all dialogs.
7. Add a Sort node connected to the Var File node in step 6. Choose Transaction_ID and Line_Number (with Ascending sort) by clicking the down arrow on the right of the dialog. Click on OK.
8. Connect the Sort node in step 7 to the generated model (replacing the current link).
9. Add an Aggregate node connected to the generated model.
10. Add a Merge node connected to the generated model. Connect the Aggregate node in step 9 to the Merge node. On the Merge tab, choose Keys as the Merge Method, select Transaction_ID, and click on the right arrow. Click on OK.
11. Add a Select node connected to the Merge node in step 10. Set the condition to Record_Count = Line_Number. Click on OK. At this point, the stream should look as follows:
12. Add a Table node connected to the Select node in step 11. Right-click on the Table node and click on Run to see the next-best-offer for the input data.

How it works...

In steps 1-5, we set up the CARMA model to use the transactional data (without needing to restructure the data). CARMA was selected over Apriori for its improved performance and stability with large data sets. For recommendation engines, the settings for the Model tab are somewhat arbitrary and are driven by the practical limitations of the number of rules generated.
Lowering the thresholds for confidence and rule support generates more rules. Having more rules can have a negative impact on scoring performance but will result in more (albeit weaker) recommendations. Rule Support How many transactions contain the entire rule (that is, both antecedents ("if" products) and consequents ("then" products)) Confidence If a transaction contains all the antecedents ("if" products), what percentage of the time does it contain the consequents ("then" products) In step 5, when we examine the model we see the generated Association Rules with the corresponding rules support and confidences. In the remaining steps (7-12), we score a new transaction and generate 3 next-best-offers based on the model containing the Association Rules. Since the model was built with transactional data, the scoring data must also be transactional. This means that each row is scored using the current row and the prior rows with the same transaction ID. The only row we generally care about is the last row for each transaction where all the data has been presented to the model. To accomplish this, we count the number of rows for each transaction and select the line number that equals the total row count (that is, the last row for each transaction). Notice that the model returns 3 recommended products, each with a confidence, in order of decreasing confidence. A next-best-offer engine would present the customer with the best option first (or potentially all three options ordered by decreasing confidence). Note that, if there is no rule that applies to the transaction, nulls will be returned in some or all of the corresponding columns. There's more... In this recipe, you'll notice that we generate recommendations across the entire transactional data set. 
By using all transactions, we are creating generalized next-best-offer recommendations; however, we know that we can probably segment (that is, cluster) our customers into different behavioral groups (for example, fashion-conscious shoppers, value shoppers, and so on). Partitioning the transactions by behavioral segment and generating separate models for each segment will result in rules that are more accurate and actionable for each group. The biggest challenge with this approach is that you will have to identify the customer segment for each customer before making recommendations (that is, scoring). A unified approach would be to use the general recommendations for a customer until a customer segment can be assigned, and then use the segmented models.

Correcting a confusion matrix for an imbalanced target variable by incorporating priors

Classification models generate probabilities and a predicted class value. When there is a significant imbalance in the proportion of True values in the target variable, the confusion matrix as seen in the Analysis node output can show that the model has all predicted class values equal to the False value, leading an analyst to conclude that the model is not effective and needs to be retrained. Most often, the conventional wisdom is to use a Balance node to balance the proportion of True and False values in the target variable, thus eliminating the problem in the confusion matrix.

However, in many cases, the classifier is working fine without the Balance node; it is the interpretation of the model that is biased. Each model generates a probability that the record belongs to the True class, and the predicted class is derived from this value by applying a threshold of 0.5. Often, no record has a propensity that high, resulting in every predicted class value being assigned False.
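The effect of the 0.5 threshold can be seen in a small Python sketch. This is an illustration under stated assumptions and not Modeler output: the propensities and target values below are hypothetical toy numbers, chosen so that one record in twenty is True.

```python
def classify(propensities, threshold):
    """Label a record 1 when its propensity reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in propensities]


# Hypothetical toy data: 20 records, only one of which is True (prior = 0.05).
targets = [0] * 19 + [1]
propensities = [0.01] * 10 + [0.04] * 5 + [0.08] * 4 + [0.30]

# The prior is simply the proportion of True values in the target.
prior = sum(targets) / len(targets)

default_labels = classify(propensities, 0.5)   # conventional threshold
prior_labels = classify(propensities, prior)   # threshold set to the prior

print("prior probability:", prior)
print("records labeled 1 at threshold 0.5:", sum(default_labels))
print("records labeled 1 at the prior:", sum(prior_labels))
```

With the conventional threshold no record is labeled 1, even though the model clearly ranks the last record highest. With the prior as the threshold, the highest-scoring records are labeled 1, at the cost of some false positives.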
In this recipe we learn how to adjust the predicted class for classification problems with imbalanced data by incorporating the prior probability of the target variable.

Getting ready

This recipe uses the datafile cup98lrn_reduced_vars3.sav and the stream Recipe – correct with priors.str.

How to do it...

To incorporate prior probabilities when there is an imbalanced target variable:

1. Open the stream Recipe – correct with priors.str by navigating to File | Open Stream.
2. Make sure the datafile points to the correct path to the datafile cup98lrn_reduced_vars3.sav.
3. Open the generated model TARGET_B, and open the Settings tab. Note that compute Raw Propensity is checked. Close the generated model.
4. Duplicate the generated model by copying and pasting the node in the stream. Connect the duplicated model to the original generated model.
5. Add a Type node to the stream and connect it to the generated model. Open the Type node and scroll to the bottom of the list. Note that the fields related to the two models have not yet been instantiated. Click on Read Values so that they are fully instantiated.
6. Insert a Filler node and connect it to the Type node.
7. Open the Filler node and, in the variable list, select $N1-TARGET_B. Inside the Condition section, type '$RP1-TARGET_B' >= TARGET_B_Mean. Click on OK to dismiss the Filler node (after exiting the Expression Builder).
8. Insert an Analysis node to the stream.
9. Open the Analysis node and click on the check box for Coincidence Matrices. Click on OK.
10. Run the stream to the Analysis node. Notice that the coincidence matrix (confusion matrix) for $N-TARGET_B has no predictions with value = 1, but the coincidence matrix for the second model, the one adjusted by step 7 ($N1-TARGET_B), has more than 30 percent of the records labeled as value = 1.

How it works...

Classification algorithms do not generate categorical predictions; they generate probabilities, likelihoods, or confidences.
For this data set, the target variable, TARGET_B, has two values: 1 and 0. The classifier output will be a number between 0 and 1. To convert the probability to a 1 or 0 label, the probability is thresholded, and the default threshold in Modeler (as in most predictive analytics software) is 0.5. This recipe changes that default threshold to the prior probability.

The proportion of TARGET_B = 1 values in the data is 5.1 percent, and therefore this is the classic imbalanced target variable problem. One solution to this problem is to resample the data so that the proportions of 1s and 0s are equal, normally achieved through use of the Balance node in Modeler. One can create the Balance node by running a Distribution node for TARGET_B and using the Generate | Balance node (reduce) option. The justification for balancing the sample is that, if one doesn't do it, all the records will be classified with value = 0.

However, the classification decisions all have value 0 not because the Neural Network isn't working properly, but because of where the threshold sits. Consider the histogram of predictions from the Neural Network shown in the following screenshot. Notice that the maximum value of the predictions is less than 0.4, and the center of density is about 0.05. The actual shape of the histogram and the maximum predicted value depend on the Neural Network; some may have maximum values slightly above 0.5. If the threshold for the classification decision is set to 0.5, since no neural network predicted confidence is greater than 0.5, all of the classification labels will be 0. However, if one sets the threshold to the TARGET_B prior probability, 0.051, many of the predictions will exceed that value and be labeled as 1. We can see the result of the new threshold by color-coding the histogram of the previous figure with the new class label, in the following screenshot.

This recipe used a Filler node to modify the existing predicted target value.
The categorical prediction from the Neural Network that is being changed is $N1-TARGET_B. The $ variables are special field names that are used automatically in the Analysis node and Evaluation node. It's possible to construct one's own $ fields with a Derive node, but it is safer to modify the one that's already in the data.

There's more...

The procedure defined in this recipe works for other modeling algorithms as well, including logistic regression. Decision trees are a different matter. Consider the following screenshot. This result, stating that the C5 tree didn't split at all, is a consequence of the imbalanced target variable.

Rather than balancing the sample, there are other ways to get a tree built. For C&RT or Quest trees, go to the Build Options, select the Costs & Priors item, and select Equal for all classes for priors: equal priors. This option forces C&RT to treat the two classes mathematically as if their counts were equal. It is equivalent to running the Balance node to boost samples so that there are equal numbers of 0s and 1s. However, it does so without adding records to the data, which would slow down training; equal priors is purely a mathematical reweighting.

The C5 tree doesn't have the option of setting priors. An alternative, one that will work not only with C5 but also with C&RT, CHAID, and Quest trees, is to change the Misclassification Costs so that the cost of classifying a one as a zero is 20, approximately the ratio of the 95 percent 0s to 5 percent 1s.
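The arithmetic behind the cost of 20 can be checked with a short Python sketch. It uses the 5.1 percent proportion quoted earlier; the function names are our own, and the second function is the textbook expected-cost rule for a probabilistic classifier, shown as an assumption for illustration rather than a description of how the Modeler tree algorithms apply costs internally.

```python
def cost_ratio(positive_rate):
    """Ratio of 0s to 1s in the data, used as the false-negative cost."""
    return (1.0 - positive_rate) / positive_rate


def cost_threshold(cost_false_negative, cost_false_positive=1.0):
    """Probability above which predicting 1 has the lower expected cost.

    Predict 1 when p * cost_false_negative >= (1 - p) * cost_false_positive,
    which rearranges to p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


ratio = cost_ratio(0.051)          # roughly 18.6, rounded up to 20 in the text
threshold = cost_threshold(20.0)   # 1 / 21, roughly 0.048

print(f"ratio of 0s to 1s: {ratio:.1f}")
print(f"implied probability threshold: {threshold:.3f}")
```

Under that textbook rule, a cost of 20 implies a threshold close to the 0.051 prior, which is why changing the costs and thresholding at the prior lead to similar decisions.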


Packt

30 Oct 2013

3 min read

What is Drupal?


Drupal is currently used as a CMS in the following domains: Arts, Banking and Financial, Beauty and Fashion, Blogging, Community, E-Commerce, Education, Entertainment, Government, Health Care, Legal Industry, Manufacturing and Energy, Media, Music, Non-Profit, Publishing, Social Networking, and Small business.

The diversity that Drupal offers is the reason for its growing popularity. Drupal is written in PHP. PHP is an open source server-side scripting language, and it has changed the technological landscape to a great extent. The Economist, Examiner.com, and The White House websites have been developed in Drupal.

System requirements

Disk space: A minimum installation requires 15 MB. 60 MB is needed for a website with many contributed modules and themes installed. Keep in mind that you need much more for the database, files uploaded by the users, media, backups, and other files.

Web server: Apache, Nginx, or Microsoft IIS.

Database:
Drupal 6: MySQL 4.1 or higher, PostgreSQL 7.1
Drupal 7: MySQL 5.0.15 or higher with PDO, PostgreSQL 8.3 or higher with PDO, SQLite 3.3.7 or higher
Microsoft SQL Server and Oracle are supported by additional modules.

PHP:
Drupal 6: PHP 4.4.0 or higher (5.2 recommended)
Drupal 7: PHP 5.2.5 or higher (5.3 recommended)
Drupal 8: PHP 5.3.10 or higher

How to create multiple websites using Drupal

One of the greatest features of Drupal is the multisite feature. Multisite allows you to share a single Drupal installation (including core code, contributed modules, and themes) among several sites. This is helpful when upgrading, because the shared code only has to be upgraded once. Each site will have its own content, settings, enabled modules, and enabled theme.

When to use the multisite feature?

If the sites are similar in functionality (they use the same modules or the same Drupal distribution), you should use the multisite feature. If the functionality is different, don't use multisite.

To create a new site using a shared Drupal code base you must complete the following steps:

1. Create a new database for the site (if there is already an existing database you can also use this by defining a prefix in the installation procedure).
2. Create a new subdirectory of the 'sites' directory with the name of your new site (see below for information on how to name the subdirectory).
3. Copy the file sites/default/default.settings.php into the subdirectory you created in the previous step.
4. Rename the new file to settings.php.
5. Adjust the permissions of the new site directory.
6. Make symbolic links if you are using a subdirectory such as packtpub.com/subdir and not a subdomain such as subd.example.com.
7. In a Web browser, navigate to the URL of the new site and continue with the standard Drupal installation procedure.

Summary

This article briefly discusses the Drupal platform and the requirements for installing it.

Resources for Article:

Further resources on this subject:
Drupal Web Services: Twitter and Drupal [Article]
Drupal and Ubercart 2.x: Install a Ready-made Drupal Theme [Article]
Drupal 7 Module Development: Drupal's Theme Layer [Article]
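The file-system part of the multisite steps listed in this article (creating the site subdirectory, then copying and renaming default.settings.php) can be scripted. The following Python sketch is a minimal illustration under stated assumptions: the function name is our own, the site name subd.example.com is only an example, and the demo runs against a throwaway directory tree rather than a real Drupal installation. Creating the database, adjusting permissions, and adding symbolic links are deliberately left out because they depend on your server.

```python
import shutil
import tempfile
from pathlib import Path


def create_site_directory(drupal_root, site_name):
    """Create sites/<site_name> and copy default.settings.php into it
    as settings.php. Returns the path of the new settings file."""
    sites_dir = Path(drupal_root) / "sites"
    template = sites_dir / "default" / "default.settings.php"
    site_dir = sites_dir / site_name
    # Fail loudly if the site already exists instead of overwriting it.
    site_dir.mkdir(parents=True, exist_ok=False)
    settings = site_dir / "settings.php"
    shutil.copyfile(template, settings)
    return settings


def demo():
    """Run the sketch against a temporary stand-in for a Drupal root."""
    with tempfile.TemporaryDirectory() as root:
        default_dir = Path(root) / "sites" / "default"
        default_dir.mkdir(parents=True)
        (default_dir / "default.settings.php").write_text("<?php\n")
        settings = create_site_directory(root, "subd.example.com")
        return settings.relative_to(root).as_posix()


print(demo())
```

After the directory exists, the remaining steps (permissions, optional symbolic links, and the browser-based installer) proceed as described in the list.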


Packt

30 Oct 2013

17 min read

APEX Plug-ins



Packt

30 Oct 2013

11 min read

Advanced Data Operations


(For more resources related to this topic, see here.) Recipe 1 – handling multi-valued cells It is a common problem in many tables: what do you do if multiple values apply to a single cell? For instance, consider a Clients table with the usual name, address, and telephone fields. A typist is adding new contacts to this table, when he/she suddenly discovers that Mr. Thompson has provided two addresses with a different telephone number for each of them. There are essentially three possible reactions to this: Adding only one address to the table: This is the easiest thing to do, as it eliminates half of the typing work. Unfortunately, this implies that half of the information is lost as well, so the completeness of the table is in danger. Adding two rows to the table: While the table is now complete, we now have redundant data. Redundancy is also dangerous, because it leads to error: the two rows might accidentally be treated as two different Mr. Thompsons, which can quickly become problematic if Mr. Thompson is billed twice for his subscription. Furthermore, as the rows have no connection, information updated in one of them will not automatically propagate to the other. Adding all information to one row: In this case, two addresses and two telephone numbers are added to the respective fields. We say the field is overloaded with regard to its originally envisioned definition. At first sight, this is both complete yet not redundant, but a subtle problem arises. While humans can perfectly make sense of this information, automated processes cannot. Imagine an envelope labeler, which will now print two addresses on a single envelope, or an automated dialer, which will treat the combined digits of both numbers as a single telephone number. The field has indeed lost its precise semantics. Note that there are various technical solutions to deal with the problem of multiple values, such as table relations. 
However, if you are not in control of the data model you are working with, you'll have to choose one of the preceding solutions. Luckily, OpenRefine is able to offer the best of both worlds. Since it is also an automated piece of software, it needs to be informed whether a field is multi-valued before it can perform sensible operations on it.

In the Powerhouse Museum dataset, the Categories field is multi-valued, as each object in the collection can belong to different categories. Before we can perform meaningful operations on this field, we have to tell OpenRefine to treat it a little differently. Suppose we want to give the Categories field a closer look to check how many different categories there are and which categories are the most prominent. First, let's see what happens if we try to create a text facet on this field by clicking on the dropdown next to Categories and navigating to Facet | Text Facet as shown in the following screenshot.

This doesn't work as expected because there are too many combinations of individual categories. OpenRefine simply gives up, saying that there are 14,805 choices in total, which is above the limit for display. While you can increase the maximum value by clicking on Set choice count limit, we strongly advise against this. First of all, it would make OpenRefine painfully slow as it would offer us a list of 14,805 possibilities, which is too large for an overview anyway. Second, it wouldn't help us at all because OpenRefine would only list the combined field values (such as Hen eggs | Sectional models | Animal Samples and Products). This does not allow us to inspect the individual categories, which is what we're interested in.

To solve this, leave the facet open, but go to the Categories dropdown again and select Edit Cells | Split multi-valued cells… as shown in the following screenshot:

OpenRefine now asks What separator currently separates the values?
As we can see in the first few records, the values are separated by a vertical bar, also called the pipe character. Therefore, enter a vertical bar (|) in the dialog. If you are not able to find the corresponding key on your keyboard, try selecting the character from one of the Categories cells and copying it so you can paste it in the dialog. Then, click on OK.

After a few seconds, you will see that OpenRefine has split the cell values, and the Categories facet on the left now displays the individual categories. By default, it shows them in alphabetical order, but we will get more valuable insights if we sort them by the number of occurrences. This is done by changing the Sort by option from name to count, revealing the most popular categories.

One thing we can do now, which we couldn't do when the field was still multi-valued, is change the name of a single category across all records. For instance, to change the name of Clothing and Dress, hover over its name in the created Categories facet and click on the edit link, as you can see in the following screenshot:

Enter a new name such as Clothing and click on Apply. OpenRefine changes all occurrences of Clothing and Dress into Clothing, and the facet is updated to reflect this modification.

Once you are done editing the separate values, it is time to merge them back together. Go to the Categories dropdown, navigate to Edit cells | Join multi-valued cells…, and enter the separator of your choice. This does not need to be the same separator as before, and multiple characters are also allowed. For instance, you could opt to separate the fields with a comma followed by a space.

Recipe 3 – clustering similar cells

Thanks to OpenRefine, you don't have to worry about inconsistencies that slipped in during the creation process of your data.
If you have been investigating the various categories after splitting the multi-valued cells, you might have noticed that the same category labels do not always have the same spelling. For instance, there is Agricultural Equipment and Agricultural equipment (capitalization differences), Costumes and Costume (pluralization differences), and various other issues. The good news is that these can be resolved automatically. Well, almost, but OpenRefine definitely makes it a lot easier.

The process of finding the same items with slightly different spelling is called clustering. After you have split multi-valued cells, you can click on the Categories dropdown and navigate to Edit cells | Cluster and edit…. OpenRefine presents you with a dialog box where you can choose between different clustering methods, each of which can use various similarity functions. When the dialog opens, key collision and fingerprint have been chosen as default settings. After some time (this can take a while, depending on the project size), OpenRefine will execute the clustering algorithm on the Categories field. It lists the found clusters in rows along with the spelling variations in each cluster and the proposed value for the whole cluster, as shown in the following screenshot:

Note that OpenRefine does not automatically merge the values of the cluster. Instead, it wants you to confirm whether the values indeed point to the same concept. This prevents similar names that have a different meaning from accidentally ending up as the same value.

Before we start making decisions, let's first understand what all of the columns mean. The Cluster Size column indicates how many different spellings of a certain concept were found. The Row Count column indicates how many rows contain any of the found spellings. In Values in Cluster, you can see the different spellings and how many rows contain a particular spelling.
Furthermore, these spellings are clickable, so you can indicate which one is correct. If you hover over the spellings, a Browse this cluster link appears, which you can use to inspect all items in the cluster in a separate browser tab. The Merge column contains a checkbox. If you check it, all values in that cluster will be changed to the value in the New Cell Value column when you click on one of the Merge Selected buttons. You can also manually choose a new cell value if the automatic value is not the best choice. So, let's perform our first clustering operation. I strongly advise you to scroll carefully through the list to avoid clustering values that don't belong together. In this case, however, the algorithm hasn't acted too aggressively: in fact, all suggested clusters are correct. Instead of manually ticking the Merge? checkbox on every single one of them, we can just click on Select All at the bottom. Then, click on the Merge Selected & Re-Cluster button, which will merge all the selected clusters but won't close the window yet, so we can try other clustering algorithms as well. OpenRefine immediately reclusters with the same algorithm, but no other clusters are found since we have merged all of them. Let's see what happens when we try a different similarity function. From the Keying Function menu, click on ngram fingerprint. Note that we get an additional parameter, Ngram Size, which we can experiment with to obtain less or more aggressive clustering. We see that OpenRefine has found several clusters again. It might be tempting to click on the Select All button again, but remember we warned to carefully inspect all rows in the list. Can you spot the mistake? Have a closer look at the following screenshot: Indeed, the clustering algorithm has decided that Shirts and T-shirts are similar enough to be merged. Unfortunately, this is not true. So, either manually select all correct suggestions, or deselect the ones that are not. 
Then, click on the Merge Selected & Re-Cluster button. Apart from trying different similarity functions, we can also try totally different clustering methods. From the Method menu, click on nearest neighbor. We again see new clustering parameters appear (Radius and Block Chars), but we will use their default settings for now. OpenRefine again finds several clusters, but now, it has been a little too aggressive. In fact, several suggestions are wrong, such as the Lockets / Pockets / Rockets cluster. Some other suggestions, such as "Photocopiers" and "Photocopier", are fine. In this situation, it might be best to manually pick the few correct ones among the many incorrect clusters.

Assuming that all clusters have been identified, click on the Merge Selected & Close button, which will apply merging to the selected items and take you back into the main OpenRefine window. If you look at the data now or use a text facet on the Categories field, you will notice that the inconsistencies have disappeared.

What are clustering methods?

OpenRefine offers two different clustering methods, key collision and nearest neighbor, which fundamentally differ in how they function. With key collision, the idea is that a keying function is used to map a field value to a certain key. Values that are mapped to the same key are placed inside the same cluster. For instance, suppose we have a keying function which removes all spaces; then, A B C, AB C, and ABC will be mapped to the same key: ABC. In practice, the keying functions are constructed in a more sophisticated and helpful way.

Nearest neighbor, on the other hand, is a technique in which each unique value is compared to every other unique value using a distance function. For instance, if we count every modification as one unit, the distance between Boot and Bots is 2: one addition and one deletion. This corresponds to an actual distance function in OpenRefine, namely levenshtein.
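To make the difference between the two methods concrete, here is a minimal Python sketch. It is not OpenRefine's implementation: the function names (simple_key, key_collision_clusters, nearest_neighbor_pairs) and the radius parameter are illustrative assumptions, and simple_key is only a simplified keying function that lowercases, strips punctuation, and sorts the unique tokens.

```python
import re
from collections import defaultdict


def simple_key(value):
    """Simplified keying function (an assumption, not OpenRefine's own):
    lowercase, drop punctuation, then sort the unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def key_collision_clusters(values, keyer=simple_key):
    """Key collision: values whose keys are identical end up in one cluster."""
    groups = defaultdict(set)
    for value in values:
        groups[keyer(value)].add(value)
    # Only groups with more than one spelling are interesting clusters.
    return [sorted(group) for group in groups.values() if len(group) > 1]


def levenshtein(a, b):
    """Edit distance where each insertion, deletion, or substitution costs 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nearest_neighbor_pairs(values, radius=1):
    """Nearest neighbor: compare every unique value with every other one and
    keep the pairs whose distance does not exceed the (hypothetical) radius."""
    unique = sorted(set(values))
    return [(a, b)
            for i, a in enumerate(unique)
            for b in unique[i + 1:]
            if levenshtein(a, b) <= radius]
```

With these definitions, key_collision_clusters groups Agricultural Equipment with Agricultural equipment but leaves Costumes and Costume apart, while nearest_neighbor_pairs pairs Photocopier with Photocopiers and also Lockets with Pockets. This mirrors why nearest neighbor suggestions need a closer manual check.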
In practice, it is hard to predict which combination of method and function is the best for a given field. Therefore, it is best to try out the various options, each time carefully inspecting whether the clustered values actually belong together. The OpenRefine interface helps you by putting the various options in the order they are most likely to help: for instance, trying key collision before nearest neighbor.

Summary

In this article, we learned how to handle multi-valued cells and how to cluster similar cells in OpenRefine. Multi-valued cells are a common problem in many tables. This article showed us what to do if multiple values apply to a single cell. Since OpenRefine is an automated piece of software, it needs to be informed whether a field is multi-valued before it can perform sensible operations on it. This article also showed an example of how to go about it.

It also shed light on clustering methods. OpenRefine offers two different clustering methods, key collision and nearest neighbor, which fundamentally differ in how they function. With key collision, the idea is that a keying function is used to map a field value to a certain key. Values that are mapped to the same key are placed inside the same cluster.


Packt

30 Oct 2013

16 min read

Getting Started with Pentaho Data Integration


Pentaho Data Integration and Pentaho BI Suite

Before introducing PDI, let’s talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:

Analysis: The analysis engine serves multidimensional analysis. It’s provided by the Mondrian OLAP server.

Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources.

Data Mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka Project.

Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). The Community Dashboard Framework (CDF), a plugin developed by the community and integrated in the Pentaho BI Suite, allows the creation of interesting dashboards including charts, reports, analysis views, and other Pentaho content, without much effort.

Data Integration: Data integration is used to integrate scattered information from different sources (applications, databases, files, and so on), and make the integrated information available to the final user.

All of this functionality can be used standalone but also integrated. In order to run analysis, reports, and so on, integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services, for example, authentication, scheduling, security, and web services. This set of software and services forms a complete BI Platform, which makes Pentaho Suite the world’s leading open source Business Intelligence Suite.

Exploring the Pentaho Demo

The Pentaho BI Platform Demo is a pre-configured installation that allows you to explore several capabilities of the Pentaho platform.
It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kinds of scale replicas of vehicles. The following screenshot is a sample dashboard available in the demo:

The Pentaho BI Platform Demo is free and can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. You can find out more about Pentaho BI Suite Community Edition at http://community.pentaho.com/projects/bi_platform. There is also an Enterprise Edition of the platform with additional features and support. You can find more on this at www.pentaho.org.

Pentaho Data Integration

Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception: Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle. The name Kettle didn’t come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now. It came from KDE Extraction, Transportation, Transformation, and Loading Environment, since the tool was planned to be written on top of KDE, a Linux desktop environment, as mentioned in the introduction of the article.

In April 2006, the Kettle project was acquired by the Pentaho Corporation and Matt Casters, the Kettle founder, also joined the Pentaho team as a Data Integration Architect. When Pentaho announced the acquisition, James Dixon, Chief Technology Officer, said:

"We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community."
By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project. From that moment on, the tool has grown without pause. Every few months a new release is available, bringing to the users improvements in performance, existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included, among other changes, enhancements for large-scale environments and multilingual capabilities.

February 2007: Almost seven months after the last major revision, PDI 2.4 is released including remote execution and clustering support, enhanced database support, and a single designer for jobs and transformations, the two main kinds of elements you design in Kettle.

May 2007: PDI 2.5 is released including many new features; the most relevant being the advanced error handling.

November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel had also changed completely.

October 2008: PDI 3.1 arrives, bringing a tool which was easier to use, and with a lot of new functionality as well.

April 2009: PDI 3.2 is released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge amount of bug fixes. The main change in this version was the incorporation of dynamic clustering.

June 2010: PDI 4.0 was released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements such as the mouseover assistance that you will experiment with soon.

November 2010: PDI 4.1 is released with many bug fixes.
August 2011: PDI 4.2 comes to light not only with a large amount of bug fixes, but also with a lot of improvements and new features. In particular, several of them were related to the work with repositories.

April 2012: PDI 4.3 is released also with a lot of fixes, and a bunch of improvements and new features.

November 2012: PDI 4.4 is released. This version incorporates a lot of enhancements and new features. In this version there is a special emphasis on Big Data: the ability of reading, searching, and in general transforming large and complex collections of datasets.

2013: PDI 5.0 will be released, delivering interesting low-level features such as step load balancing, job transactions, and restartability.

Using PDI in real-world scenarios

Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data. In fact, PDI does not only serve as a data integrator or an ETL tool. PDI is such a powerful tool that it is common to see it used for these and for many other purposes. Here you have some examples.

Loading data warehouses or datamarts

The loading of a data warehouse or a datamart involves many steps, and there are many variants depending on business area or business rules. But in every case, without exception, the process involves the following steps:

Extracting information from one or different databases, text files, XML files, and other sources. The extract process may include the task of validating and discarding data that doesn’t match expected patterns or rules.

Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.

Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.
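The three steps above can be sketched outside of Kettle in a few lines of Python. This is only an illustration of the extract, transform, and load stages under stated assumptions: the sample data, the customer_totals table, and the function names are hypothetical, and an in-memory SQLite database stands in for the target.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from files or databases.
SOURCE_CSV = """id,name,amount
1,Alice,10.50
2,Bob,not-a-number
3,Carol,4.25
3,Carol,5.75
"""


def extract(text):
    """Extract: read rows and discard those that fail a simple validation rule."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            float(row["amount"])
        except ValueError:
            continue  # discard data that doesn't match the expected pattern
        rows.append(row)
    return rows


def transform(rows):
    """Transform: convert data types and summarize the amounts per customer."""
    totals = {}
    for row in rows:
        key = (int(row["id"]), row["name"])
        totals[key] = totals.get(key, 0.0) + float(row["amount"])
    return [(cid, name, total) for (cid, name), total in sorted(totals.items())]


def load(connection, records, overwrite=True):
    """Load: write into the target table, either replacing or appending."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals "
        "(id INTEGER, name TEXT, total REAL)")
    if overwrite:
        connection.execute("DELETE FROM customer_totals")
    connection.executemany(
        "INSERT INTO customer_totals VALUES (?, ?, ?)", records)
    connection.commit()


def run_etl(connection, text=SOURCE_CSV, overwrite=True):
    """Run the three stages in order against the given connection."""
    load(connection, transform(extract(text)), overwrite=overwrite)
```

Calling run_etl(sqlite3.connect(":memory:")) loads one summarized row per valid customer. The overwrite flag corresponds to the choice described above between replacing the existing information and adding to it on each run.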
Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with Kettle:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP (Enterprise Resource Planning) application and a CRM (Customer Relationship Management) application, though they’re not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data. Some conversions, validation, and transport of data have to be done. Kettle is meant to do all of those tasks.

Data cleansing

It’s important and even critical that data be correct and accurate for the efficiency of business, to generate trustworthy conclusions in data mining or statistical studies, and to succeed when integrating data. Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying if the data meets certain rules, discarding or correcting those which don’t follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible thanks to its vast set of transformation and validation capabilities.

Migrating information

Think of a company of any size which uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of its budget. So they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch, nor type the information by hand.
Kettle makes the migration possible thanks to its ability to interact with most kinds of sources and destinations such as plain files, commercial and free databases, and spreadsheets, among others.

Exporting data

Data may need to be exported for numerous reasons:

To create detailed business reports
To allow communication between different departments within the same company
To deliver data from your legacy systems to obey government regulations, and so on

Kettle has the power to take raw data from the source and generate these kinds of ad-hoc reports.

Integrating PDI along with other Pentaho tools

The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a dataflow. Some examples are pre-processing data for an online report, sending mails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on. The use of PDI integrated with other tools is beyond the scope of this article. If you are interested, you can find more information on this subject in the Pentaho Data Integration 4 Cookbook by Packt Publishing at http://www.packtpub.com/pentaho-data-integration-4-cookbook/book.

Installing PDI

In order to work with PDI, you need to install the software. It’s a simple task, so let’s do it now.

Time for action – installing PDI

These are the instructions to install PDI, for whatever operating system you may be using. The only prerequisite to install the tool is to have JRE 6.0 installed. If you don’t have it, please download it from www.javasoft.com and install it before proceeding. Once you have checked the prerequisite, follow these steps:

Go to the download page at http://sourceforge.net/projects/pentaho/files/Data Integration.

Choose the newest stable release. At this time, it is 4.4.0, as shown in the following screenshot:

Download the file that matches your platform. The preceding screenshot should help you.
Unzip the downloaded file in a folder of your choice, for example, c:/util/kettle or /home/pdi_user/kettle.

If your system is Windows, you are done. Under Unix-like environments, you have to make the scripts executable. Assuming that you chose /home/pdi_user/kettle as the installation folder, execute:

cd /home/pdi_user/kettle
chmod +x *.sh

In Mac OS you have to give execute permissions to the JavaApplicationStub file. Look for this file; it is located in Data Integration 32-bit.app/Contents/MacOS, or Data Integration 64-bit.app/Contents/MacOS depending on your system.

What just happened?

You have installed the tool in just a few minutes. Now, you have all you need to start working.

Launching the PDI graphical designer – Spoon

Now that you’ve installed PDI, you must be eager to do some stuff with data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let’s launch Spoon and see what it looks like.

Time for action – starting and customizing Spoon

In this section, you are going to launch the PDI graphical designer, and get familiarized with its main features.

Start Spoon. If your system is Windows, run Spoon.bat. You can just double-click on the Spoon.bat icon, or Spoon if your Windows system doesn’t show extensions for known file types. Alternatively, open a command window (by selecting Run in the Windows start menu and executing cmd) and run Spoon.bat in the terminal. In other platforms such as Unix, Linux, and so on, open a terminal window and type spoon.sh. If you didn’t make spoon.sh executable, you may type sh spoon.sh. Alternatively, if you work on Mac OS, you can execute the JavaApplicationStub file, or click on the Data Integration 32-bit.app or Data Integration 64-bit.app icon.

As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click on the Cancel button.

A small window labeled Spoon tips... appears. You may want to navigate through various tips before starting.
When you are done, close the window and proceed.

Finally, the main window shows up. A Welcome! window appears with some useful links for you to see. Close the window. You can open it later from the main menu.

From the Tools menu, click on Options.... A window appears where you can change various general and visual characteristics. Uncheck the highlighted checkboxes, as shown in the following screenshot:

Select the Look & Feel tab. Change the Grid size and Preferred Language settings as shown in the following screenshot:

Click on the OK button.

Restart Spoon in order to apply the changes. You should not see the repository dialog, or the Welcome! window. You should see the following screenshot full of French words instead:

What just happened?

You ran Spoon, the graphical designer of PDI, for the first time. Then you applied some custom configuration. In the Options window, you chose not to show the repository dialog or the Welcome! window at startup. From the Look & Feel configuration window, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. These changes were applied as you restarted the tool, not before.

The second time you launched the tool, the repository dialog didn’t show up. When the main window appeared, all of the visible texts were shown in French, which was the selected language, and instead of the Welcome! window, there was a blank screen. You didn’t see the effect of the change in the Grid option. You will see it only after creating or opening a transformation or job, which will occur very soon!

Spoon

Spoon, the tool you’re exploring in this section, is PDI’s desktop design tool. With Spoon, you design, preview, and test all your work, that is, Transformations and Jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots.

Setting preferences in the Options window

In the earlier section, you changed some preferences in the Options window.
There are several look and feel characteristics you can modify beyond those you changed. Feel free to experiment with these settings. Remember to restart Spoon in order to see the changes applied.

In particular, please take note of the following suggestion about the configuration of the preferred language. If you choose a preferred language other than English, you should select a different language as an alternative. If you do so, every name or description not translated to your preferred language will be shown in the alternative language.

One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! window has many useful links, which are all related to the tool: wiki pages, news, forum access, and more. It’s worth exploring them. You don’t have to change the settings again to see the Welcome! window. You can open it by navigating to Help | Welcome Screen.

Storing transformations and jobs in a repository

The first time you launched Spoon, you chose not to work with repositories. After that, you configured Spoon to stop asking you for the Repository option. You must be curious about what the repository is and why we decided not to use it. Let’s explain it.

As we said, the results of working with PDI are transformations and jobs. In order to save the transformations and jobs, PDI offers two main methods:

Database repository: When you use the database repository method, you save jobs and transformations in a relational database specially designed for this purpose.

Files: The files method consists of saving jobs and transformations as regular XML files in the filesystem, with extension KJB and KTR respectively.

It’s not allowed to mix the two methods in the same project. That is, it makes no sense to mix jobs and transformations in a database repository with jobs and transformations stored in files. Therefore, you must choose the method when you start the tool.
By clicking on Cancel in the repository window, you are implicitly saying that you will work with the files method.

Why did we choose not to work with repositories? Or, in other words, to work with the files method? Mainly for two reasons:

Working with files is more natural and practical for most users.

Working with a database repository requires minimal database knowledge, and that you have access to a database engine from your computer. Although it would be an advantage for you to have both preconditions, maybe you haven’t got both of them.

There is a third method called File repository, which is a mix of the two above: it’s a repository of jobs and transformations stored in the filesystem. Between the File repository and the files method, the latter is the more broadly used. Therefore, throughout this article we will use the files method.

Creating your first transformation

Until now, you’ve seen the very basic elements of Spoon. You must be waiting to do some interesting task beyond looking around. It’s time to create your first transformation.


Packt

30 Oct 2013

17 min read

Performance Testing and Load Balancing


Initial and on-going performance measurement

Performance measurements begin prior to system deployment. In terms of a failover cluster of Hyper-V systems, it begins prior to creating any virtual machines. Your first goal is to obtain baselines. The term baseline has different meanings in different contexts; in this case it means gathering data on a system during a known healthy period. Its purpose is to serve as a point of comparison for later data gathering operations.

The first set of performance measurements you take will be with no virtual machines. Once you have reached your target deployment level, you will obtain another. These will be your baselines. All future performance measurements will be compared to these in order to determine how your systems are working.

Microsoft provides a thorough document for performance tuning of Windows Server 2012. These concepts carry forward to R2 and many apply to Hyper-V Server as well. Download it from the following site: http://download.microsoft.com/download/0/0/B/00BE76AF-D340-4759-8ECD-C80BC53B6231/performance-tuning-guidelines-windows-server-2012.docx

General performance measurement

Baselines and ongoing performance evaluations tend to be fairly generic in nature. They can be carried out in a number of ways. This section will examine two of them. The first is the free Server Performance Advisor (SPA) provided by Microsoft. The second is the Performance Monitor tool built into Windows operating systems.

Server Performance Advisor

This tool can be run quickly to determine the performance characteristics of a new system and on a schedule to track the performance trends of an active system.

Do not install or run Server Performance Advisor directly on a Hyper-V host or any guests that are to be measured. Doing so adds a load that will make the results inaccurate.

The following instructions can be used to quickly set up SPA to run in a basic environment.
They assume that you'll be running the application with a domain account that has administrative privileges on the systems to be measured.

To scan a system that has an active firewall, run the following cmdlet:

Enable-NetFirewallRule -DisplayName "Performance Logs and Alerts (TCP-In)"

Server Performance Advisor is published on the developer center, which is accessible at http://msdn.microsoft.com/en-us/library/windows/hardware/hh367834.aspx. For best results, this tool should be run from a remote computer rather than on the host being measured. It can be run from any modern Windows system. It requires a connection to an installation of Microsoft SQL Server 2008 R2 or newer. The Express edition is perfectly acceptable. The latest version can be obtained at no charge from the Microsoft download center at http://search.microsoft.com/en-us/downloadresults.aspx?q=sql%20express.

There is another requirement that's listed on the download page but not in the included documentation. The CAB file that SPA is delivered in must be extracted with its directory structure intact. If you use Windows Explorer to open the CAB, it will not extract the files properly. Use the built-in extrac32 tool according to the directions (they're on the download page) or use another extraction application that can reproduce the proper folder structure.

The final prerequisite you must satisfy is the creation of a folder to hold the results. This folder can be in any location on the system you'll be running SPA from, and it can have any name. This folder must be shared. Determine the domain account that you'll be running SPA with and give that account full permissions to the folder and its share.

All that's left is to run SPA. In the folder where you extracted the CAB's contents, run SPAConsole.exe. When it opens, choose File and then New Project to get started. The first screen is just a basic introductory screen.
Click on Next and you'll see the following screen, which has been filled in with examples:

The previous entries direct the application to create a database on the local computer, in this case an instance of SQL Server Express. For a large environment with many systems to scan, it is recommended to use SQL Server Standard instead. The database name can be anything you like; this one has been named to reflect that it will contain data on the first Hyper-V cluster in the sample organization. Be aware that this will create a new database on the selected server. Once you have selected the database server, instance, and name, click on Next to move to the following screen:

This screen allows you to select the advisor packs that you'd like to make available in this project. Even though you only need the Hyper-V advisor and perhaps the CoreOS advisor, it's best to select all three. The interface sometimes hangs if only a subset is selected. You won't be required to use all three during a scan. Click on Next. This will bring you to the final screen:

On this screen, enter the servers that you want to scan. The File Share Location is a file share that will hold the results of the scan. As with the SQL database, it's not required to be on the same system as the scanner. Servers can be added to the list later. You can use Test Configuration to ensure that the indicated servers are reachable. Once you're happy with the entries, click on Finish.

You'll be returned to the main screen of SPA. Now, you should see the host(s) that you selected for this project. Select their checkboxes, and then press the Run Analysis button in the lower-right corner. Here, you'll be able to select the actual advisor packs that you want to use. At the bottom of the screen, you'll be able to enter how long you want the scan to run and, if you wish to collect numerous data points over a period of time, how often you want it to run.
Click on OK when you're satisfied with your selections and the data collection process will begin. Once it is complete, you can click on the small down arrow in the Analysis Result column of one of the hosts. This will show three buttons, indicated in the following screenshot: These buttons are, from left to right: View Latest Report : See the report from the latest analysis. This is the screen you're most likely to be interested in after a one-time scan of a new system. It will show warnings for any items and settings it finds that might impede optimal performance of Hyper-V. It can also compare one report against another and export result sets to XML. Find Reports : Search through all result sets for this host according to the criteria that you choose. View Charts : These are detailed charts that examine and graph very specific performance metrics of the host. The wording of the Logical Processor count limit when Hyper-V is enabled warning is misleading. The management operating system is restricted to using 64 logical processors, but Hyper-V itself can still schedule guest processes up to the maximum of 320 logical processors. The first two buttons are very simple to understand and you should have no trouble navigating them on your own. Do remember to check the various tabs inside the report. The third button, View Charts , brings you to a tool that isn't as easy to decipher. You'll begin by picking a range of dates, and assuming that you've got more than one report to chart, you'll get a screen that looks something like the following screenshot: The sheer amount of data shown can make this difficult to interpret. In the lower section, you'll notice that there is a large number of performance counters. Select only those that you're actually interested in viewing and you'll find that the chart becomes much easier to understand. To deselect all items, select the first item and press Ctrl + A , and then press the Space bar. 
The items marked as 90% remove all utilization above the 90 percent mark. These are assumed to be momentary spikes that can skew the outcome in a way that makes the data meaningless. Compare these to the same metrics marked as Max. Use the Pick Series button at the bottom of the window if you wish to reduce the number of selectable items. This button is more useful on the other two tabs; in fact, they'll have no data to display if you do not select an item. As indicated, these two tabs show the way that the selected metrics have been trending over a specified period of time. These can show you how your systems behave differently during the day or across a week. Comparing these reports against those generated by other servers can help you to determine how your guests should be load balanced.

Performance Monitor

The built-in Performance Monitor tool is much more powerful than most others, but it's up to you to choose what to measure. One of its major strengths is that there's nothing to install. All you need is a Windows system with a GUI. As with Server Performance Advisor, it is not recommended that you run this on a Hyper-V host or guest that you are going to measure.

There are two ways to run Performance Monitor. One is as a real-time tool that graphs the monitored performance counters as they occur. The second is as a collector that gathers metrics and stores them for trend analysis. The features that differentiate Performance Monitor from Server Performance Advisor are:

- Real-time graphing
- Precise selection of metrics
- No software downloads required
- No database system required
- Performance logs can be opened on any Windows system

Performance Monitor is found in Administrative Tools. Depending on how your system is configured, Administrative Tools may be found on the Start screen or menu. It's available in the Control Panel in all versions of Windows. It's also available under the Performance node of Computer Management.
If you will be running it for real-time graphing, ensure that you start it with an account that has administrative privileges on the target system. For collectors, you'll be able to specify the account to run it under. You may also need to modify the firewall as indicated under the Server Performance Advisor section mentioned earlier.

Real-time monitoring with Performance Monitor

To start a real-time monitoring session, expand the Monitoring Tools node and click on Performance Monitor. In the center pane, click on the button with the green plus, which will open the Add Counters window. In this window, you'll want to change the counters' source to the target computer. Your screen should look like the following screenshot:

Navigate through the various counters in the upper list box. When you click on one, it will show the instances of that counter that are available to be monitored. Double-click on an instance or highlight it and click on the Add >> button to move it to the list box on the right. These are the objects that will be tracked. When you are satisfied with your selection, click on OK. See Step 4 in the next section for a screenshot of this window and more information about its contents.

You will be returned to the main screen. The display will be updated every second. Each counter you picked will be displayed as a line of a different color. The legend will be shown at the bottom. You can uncheck an item to hide it from the running display; however, its counter will still be monitored.

Using the buttons across the top of the graph pane, you can modify the output. Most of the options are self-explanatory; change them until the display suits your needs. You can switch the graph from its default line output to a histogram or to a running digital display. Click on the Highlight button and then select a counter to make it stand out against the others.
Several of the buttons open various tabs on the Performance Monitor Properties window where you can change many settings, such as the delay between samples. Of interest here is the Source tab, which will be used in the next section. Trend tracking with Performance Monitor The second use for Performance Monitor is to pull performance statistics across a span of time. In active deployments, it can be used to track the performance of Hyper-V hosts. You'll create scheduled gathering of data collector sets for this. What makes Performance Monitor especially useful for this is that a single collector set can gather from all the hosts in your cluster simultaneously. Before you start, ensure that the Performance Monitor console is not connected to the target computer system as it would be for a real-time monitor. For instance, if you are using Computer Management as shown in the first screenshot in this article, the tree root should say Computer Management (Local) and not contain the name of another system. The first reason is that running and managing the collector sets creates a small drain on the system's resources. Second, you're going to be running collectors against multiple systems and it's better to use a single remote computer for those purposes. Third, it's easiest to look at the results of performance logs on the system that took them. Otherwise, you have to move them around. Look under the Data Collector Sets tree item. There are a number of predefined collector sets and you can add more. Just right-click on the User Defined node and choose New and Data Collector Set. The following steps will take you through the creation of a collector set: On the first screen, come up with a name for the set, then choose to manually create the set, then click on Next : This wizard will create a data collector named DataCollector01 which cannot be renamed. 
If you wish, you can skip through the wizard to the end, delete the generic collector, and then create new ones with friendlier names.

On the second screen, choose to create performance counter data logs:

On the third screen, you can change how often the collector polls for data. As you can see in the following screenshot, the default is every 15 seconds:

Click on the Add… button in the previous screen to pick the counters that you want to poll. This is the same screen that you see when selecting counters in the real-time screen. Enter the name of the computer you want to poll data from in the Select counters from computer text box. Upon pressing Tab or Enter or clicking on another control, it will load the counters from that system. Select the counters and instances that you desire and click on Add >>. You can monitor counters from multiple computer systems in the same collector set if you like, but you may also choose to use one collector per computer within the set.

Remember that you'll want to select Hyper-V related counters for CPU, memory, and networking or you'll be retrieving counters from the parent partition only. Physical disk counters are read from the management operating system. You cannot retrieve statistics for pass-through disks by setting performance counters on the management operating system.

If you click on the Add >> button and nothing happens, it is because instances are required but didn't load. Click on another counter and then back on the desired counter until the instances are displayed.

On clicking OK, you'll be returned to the previous screen, which will now be populated with the counters that you chose. Ensure everything looks as you wish and click on Next.

You'll now be asked for a location to save the logs to. Although it will allow you to enter a UNC path, logfile creation is usually unsuccessful anywhere but on the local system. You may place them in a local folder that is shared for easy accessibility from other systems, if you wish.
The final screen will have you provide the credentials that the set will use. If you leave it on its default setting, it will use the Local System account that will not have the necessary rights to run the collection on the target computer. You have two choices: you can add the computer account of the collector machine to the Performance Log Users group on all target machines or you can use an account that is a member of that group on all machines. For the purposes of this step-through, we're just going to use the domain administrator account: Before clicking on Finish , you are encouraged to set the radio button to Open properties for this data collector set . This will allow you to jump straight to the properties window where you can schedule the scan. Alternately, you can open the properties window by right-clicking on the completed collector set and clicking on Properties . In the properties window, change the options as you like. The Schedule tab is where you establish the Start and End times. You can create multiple schedules for a collector set: If you want to use a separate collector in this set for another host, right-click on the new Collector Set in the left pane and click on New and Data Collector . The wizard is very nearly identical to the one you just completed. You aren't required to follow a schedule. You can manually start and stop collector sets by right-clicking on the menu in the left pane. Once the collector has begun its work, you can go back to the real-time monitor screen and open the Performance Monitor Properties window to the Source tab. Select the logfile that you instructed the collector set to use. The display will switch to the static output of the logfile. However, it will be blank because by default, no counters are selected. Add counters with the green plus button just as you did with the real-time display. This time, you'll only be able to choose from counters that are contained in the logfiles. 
You can now manipulate the log contents as you did with the live display. Note that you can view a log of an actively running collector, but the screen will not update in real time. Selecting counters practically If you use the exact counters as shown in the example, you'll notice that some of them aren't very useful. For instance, the number of processors in a host is highly unlikely to change during a monitoring session, although the number of virtual processors might. Not all of the available counters are well documented, but there is a Show description check box on the counter selection screen that provides a bit of information. Also, some of the counters you can pull don't compare well from one host to another. In the sample, we instructed SV-HYPERV1 to monitor the amount of data traveling across the virtual adapter in SV-DC1. This is useful data in its own right, but probably in isolation, not as a comparator. Of course, if the virtual machine migrates to another host, it will no longer be readable. You may find the aggregate counters to be more useful than specific virtual machine counters. The counters that are truly useful are simply too numerous to make a meaningful list out of, and not all counters are universally useful in all organizations. The four generic categories you're likely to be interested in are CPU, disk, memory, and networking. Be judicious about selecting counters that look at specific highly available virtual machines. Alternative ways to read performance logs Performance logs can be confusing, especially when you first encounter them. There are a number of tools on the market designed to aid you. One free and popular tool is the Performance Analysis of Logs ( PAL ) Tool. It is a free and open source tool downloadable from Codeplex at http://pal.codeplex.com.
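If a collector's log has been saved in (or converted to) comma-separated format, it can also be read with a short script. The following Python sketch is only an illustration and is not part of the walkthrough above: the sample data, host name, and function names are made up, and the nearest-rank 90th percentile is used as a stand-in for the idea behind SPA's 90% series, not as SPA's actual calculation.

```python
import csv
import io
import math

# Made-up sample in the comma-separated layout that Performance Monitor writes:
# column 0 is the timestamp, every other column is one counter path.
SAMPLE_LOG = r'''"(PDH-CSV 4.0) (GMT Standard Time)(0)","\\SV-HYPERV1\Hyper-V Hypervisor Logical Processor(_Total)\% Total Run Time","\\SV-HYPERV1\Memory\Available MBytes"
"10/29/2013 09:00:00.000","12.5","20480"
"10/29/2013 09:00:15.000","14.0","20470"
"10/29/2013 09:00:30.000","13.0","20465"
"10/29/2013 09:00:45.000","15.5","20450"
"10/29/2013 09:01:00.000","97.0","20100"
"10/29/2013 09:01:15.000","13.5","20440"
"10/29/2013 09:01:30.000","12.0","20445"
"10/29/2013 09:01:45.000","16.0","20430"
"10/29/2013 09:02:00.000","14.5","20435"
"10/29/2013 09:02:15.000","13.0","20440"
'''


def load_counters(text):
    """Return {counter_path: [float samples]} from a CSV counter log."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, samples = rows[0], rows[1:]
    data = {name: [] for name in header[1:]}  # skip the timestamp column
    for row in samples:
        for name, cell in zip(header[1:], row[1:]):
            if cell.strip():  # a blank cell means the sample was missed
                data[name].append(float(cell))
    return data


def summarize(values):
    """Average, maximum, and nearest-rank 90th percentile of one counter."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.9 * len(ordered)))
    return {
        "avg": sum(ordered) / len(ordered),
        "max": ordered[-1],
        "p90": ordered[rank - 1],
    }


if __name__ == "__main__":
    for path, values in load_counters(SAMPLE_LOG).items():
        print(path, summarize(values))
```

Running the same summary over a baseline log and a current log gives two sets of figures that can be compared side by side. In the sample, a single spike dominates the maximum, while the 90th percentile stays close to the typical load.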


Packt

29 Oct 2013

9 min read

Introducing BeagleBoard


We'll first have a quick overview of the features of BeagleBoard (with a focus on the latest xM version), an open source hardware platform born for audio, video, and digital signal processing. Then we will introduce the concept of rapid prototyping and explain what we can do with the BeagleBoard support tools from MATLAB® and Simulink® by MathWorks®. Finally, this article ends with a summary.

Unlike most approaches, which involve coding and compiling on a Linux PC and require intensive manual configuration at the command line, the rapid prototyping approach presented in this article is Windows-based. It uses a Windows PC for embedded software development through user-friendly graphical interaction and relieves the developer from intensive coding, so that you can concentrate on your application and algorithms and have the BeagleBoard run your inspiration. First of all, let's begin with a quick overview of this article.

A quick overview of the BeagleBoard's functionality

We can create a number of exciting projects to demonstrate how to build a prototype of an embedded audio, video, and digital signal processing system rapidly without intensive programming and coding. The main projects include:

- Installing Linux for BeagleBoard from a Windows PC
- Developing C/C++ with Eclipse on a Windows PC
- Automatic embedded code generation for BeagleBoard
- Serial communication and digital I/O application: infrared motion detection
- Audio application: voice recognition
- Video application: motion detection

These projects provide the workflow of building an embedded system. With the help of various online documents you can learn about setting up the development environment, writing software on a host PC running Microsoft Windows, and compiling the code into standalone ARM executables for the BeagleBoard running Linux.
Then you can learn the skills of rapid prototyping embedded audio and video systems via the BeagleBoard support tools from Simulink by MathWorks. The main features of these techniques include: Open source hardware A Windows-based friendly development environment Rapid prototyping and easy learning without intensive coding These features will save you from intensive coding and will also relieve the pressure on you to build an embedded audio/video processing system without learning the complicated embedded Linux. The rapid prototyping techniques presented allow you to concentrate on your brilliant concept and algorithm design, rather than being distracted by the complicated embedded system and low-level manual programming. This is beneficial for students and academics who are primarily interested in the development of audio/video processing algorithms, and want to build an embedded prototype for proof-of-concept quickly. BeagleBoard-xM BeagleBoard, the brainchild of a small group of Texas Instruments (TI) engineers and volunteers, is a pocket-sized, low-cost, fan-less, single-board computer containing TI Open Multimedia Application Platform 3 (OMAP3) System on a chip (SoC) processor, which integrates a 1 GHz ARM core and a TI's Digital Signal Processor (DSP) together. Since many consumer electronics devices nowadays run some form of embedded Linux-based environment and usually are on an ARM-based platform, the BeagleBoard was proposed as an inexpensive development kit for hobbyists, academics, and professionals for high-performance, ARM-based embedded system learning and evaluation. As an open hardware embedded computer with open source software development in mind, the BeagleBoard was created for audio, video, and digital signal processing with the purpose of meeting the demands of those who want to get involved with embedded system development and build their own embedded devices or solutions. 
Furthermore, by utilizing standard interfaces, the BeagleBoard comes with all of the expandability of today's desktop machines. Developers can easily bring their own peripherals and turn the pocket-sized BeagleBoard into a single-board computer with many additional features. The following figure shows the PCB layout and major components of the latest xM version of the BeagleBoard.

The BeagleBoard-xM (referred to as BeagleBoard in this article unless specified otherwise) is an 8.25 x 8.25 cm (3.25" x 3.25") circuit board that includes the following components:

- CPU: TI's DM3730 processor, which houses a 1 GHz ARM Cortex-A8 superscalar core and a TI C64x+ DSP core. The power of the 32-bit ARM and the C64x+ DSP, plus a large amount of onboard DDR RAM, gives the BeagleBoard the capacity to deal with computationally intensive tasks, such as audio and video processing.
- Memory: 512 MB MDDR SDRAM running at 166 MHz. The processor and the 512 MB RAM come in a .44 mm Package on Package (POP) package, where the memory is mounted on top of the processor.
- microSD card slot: This provides the main nonvolatile memory. The SD card is where we install our operating system, and it acts as a hard disk. The BeagleBoard is shipped with a 4 GB microSD card containing factory-validated software (actually, an Angstrom distribution of embedded Linux tailored for BeagleBoard). Of course, this storage can be easily expanded by using, for example, a USB portable hard drive.
- USB 2.0 On-The-Go (OTG) mini port: This port can be used as a communication link to a host PC and as a power source, drawing power from the PC over the USB cable.
- 4-port USB 2.0 hub: Four USB Type A connectors with full LS/FS/HS support. Each port can provide power on/off control and up to 500 mA as long as the input DC to the BeagleBoard is at least 3 A.
- RS232 port: A single RS232 port via UART3 of the DM3730 processor is provided by a DB9 connector on the BeagleBoard-xM.
A USB-to-serial cable can be plugged directly into the DB9 connector. By default, when the BeagleBoard boots, system information will be sent to the RS232 port and you can log in to the BeagleBoard through it.
- 10/100 M Ethernet: The Ethernet port features auto-MDIX, which works with both crossover and straight-through cables.
- Stereo audio output and input: The BeagleBoard has a hardware-accelerated audio encoding and decoding (CODEC) chip and provides stereo in and out ports via two 3.5 mm jacks to support external audio devices, such as headphones, powered speakers, and microphones (either stereo or mono).
- Video interfaces: S-video and Digital Visual Interface (DVI-D) outputs, an LCD port, and a camera port.
- Joint Test Action Group (JTAG) connector, a reset button, a user button, and many developer-friendly expansion connectors. The user button can be used as an application button.

To get going, we need to power the BeagleBoard through either the USB OTG mini port, which provides only up to 500 mA and is enough to run the board alone, or a 5 V power source, which is needed to run it with external peripherals. The BeagleBoard boots from the microSD card once the power is on. Various alternative software images are available on the BeagleBoard website, so we can replace the factory default images and have the BeagleBoard run many other popular embedded operating systems (like Android and Windows CE). The off-the-shelf expansion via standard interfaces on the BeagleBoard allows developers to choose the components and operating systems they prefer and build their own embedded solutions or a desktop-like system as shown below:

BeagleBoard for rapid prototyping

A rapid prototyping approach allows you to quickly create a working implementation of your proof-of-concept and verify your audio or video applications on hardware early, which overcomes barriers in the design-implementation-validation loops and helps you find the right solution for your applications.
Rapid prototyping not only reduces the development time from concept to product, but also allows you to identify defects and mistakes in system and algorithm design at an early stage. Prototyping your concept and evaluating its performance on a target hardware platform gives you confidence in your design, and promotes its success in applications. The powerful BeagleBoard equipped with many standard interfaces provides a good hardware platform for rapid embedded system prototyping. On the other hand, the rapid prototyping tool, the BeagleBoard Support from Simulink package, provided by MathWorks with graphic user interface (GUI) allows developers to easily implement their concept and algorithm graphically in Simulink, and then directly run the algorithms at the BeagleBoard. In short, you design algorithms in MATLAB/Simulink and see them perform as a standalone application on the BeagleBoard. In this way, you can concentrate on your brilliant concept and algorithm design, rather than being distracted by the complicated embedded system and low-level manual programming. The prototyping tool reduces the steep learning curve of embedded systems and helps hobbyists, students, and academics who have a great idea, but have little background knowledge of embedded systems. This feature is particularly useful to those who want to build a prototype of their applications in a short time. MathWorks introduced the BeagleBoard support package for rapid prototyping in 2010. Since the release of MATLAB 2012a, support for the BeagleBoard-xM has been integrated into Simulink and is also available in the student version of MATLAB and Simulink. Your rapid prototyping starts with modeling your systems and implementing algorithms in MATLAB and Simulink. From your models, you can automatically generate algorithmic C code along with processor-specific, real-time scheduling code and peripheral drivers, and run them as standalone executables on embedded processors in real time. 
The following steps provide an overview of the workflow for BeagleBoard rapid prototyping in MATLAB/Simulink:

1. Create algorithms for various applications in Simulink and MATLAB with a user-friendly GUI. The applications can be audio processing (for example, digital amplifiers), computer vision applications (for example, object tracking), control systems (for example, flight control), and so on.
2. Verify and improve the algorithms by simulation. With intensive simulation, it is expected that most defects, errors, and mistakes in the algorithms will be identified. The algorithms are then easily modified and updated to fix the identified issues.
3. Run the algorithms as standalone applications on the BeagleBoard.
4. Perform interactive parameter tuning, signal monitoring, and performance optimization of the applications running on the BeagleBoard.

Summary

In this article, we have familiarized ourselves with the BeagleBoard and rapid prototyping by using MATLAB/Simulink. We have also looked at some of the features of rapid prototyping and the basic steps in rapid prototyping in MATLAB/Simulink.


How-To Tutorials

Packt

29 Oct 2013

7 min read

Multiserver Installation


The prerequisites for Zimbra

Let us dive into the prerequisites for Zimbra:

- Zimbra supports only 64-bit LTS versions of Ubuntu, release 10.04 and above. If you would like to use a 32-bit version, you should use Ubuntu 8.04.x LTS with Zimbra 7.2.3.
- Having a clean and freshly installed system is preferred for Zimbra; it requires a dedicated system and there is no need to install components such as Apache and MySQL since the Zimbra server contains all the components it needs. Note that installing Zimbra with another service (such as a web server) on the same server can cause operational issues.
- The dependencies (libperl5.14, libgmp3c2, build-essential, sqlite3, sysstat, and ntp) should be installed beforehand.
- Configure a fixed IP address on the server.
- Have a domain name and a well-configured DNS (A and MX entries) that points to the server.
- The system clocks should be synced on all servers.
- Configure the file /etc/resolv.conf on all servers to point at the server on which we installed bind (it can be installed on any Zimbra server or on a separate server). We will explain this point in detail later.

Preparing the environment

Before starting the Zimbra installation process, we should prepare the environment. In the first part of this section, we will see the different possible configurations and then, in the second part, we will present the assumptions needed to apply the chosen configuration.

Multiserver configuration examples

One of the greatest advantages of Zimbra is its scalability; we can deploy it for a small business with few mail accounts as well as for a huge organization with thousands of mail accounts. There are many possible configuration options; the following are the most commonly used:

- Small configuration: All Zimbra components are installed on only one server.
- Medium configuration: Here, LDAP and message store are installed on one server and Zimbra MTA on a separate server.
Note that we can add more Zimbra MTA servers to scale more easily for large incoming or outgoing e-mail volumes.
- Large configuration: In this case, LDAP will be installed on a dedicated server and we will have multiple mailbox and MTA servers, so we can scale more easily for a large number of users.
- Very large configuration: The difference between this configuration and the large one is the existence of an additional LDAP server, so we will have a master LDAP and its replica.

We choose the medium configuration, so we will install LDAP and mailbox on one server and MTA on the other server. Install the different servers in the following order (for the medium configuration, 1 and 2 are combined into a single step):

1. First of all, install and configure the LDAP server.
2. Then, install and configure the Zimbra mailbox servers.
3. Finally, install the Zimbra MTA servers and finish the whole installation configuration.

New installations of Zimbra limit spam/ham training to the first installed MTA. If you uninstall or move this MTA, you should enable spam/ham training on another MTA, as one host should have this enabled to run zmtrainsa --cleanup. To do this, execute the following command:

    zmlocalconfig -e zmtrainsa_cleanup_host=TRUE

Assumptions

In this article, we will use some specific information as input in the Zimbra installation process, which, in most cases, will be different for each user. Therefore, we will note the most frequently used values in this section. Remember that you should specify your own values rather than using the arbitrary values that I have provided.
The following is the list of assumptions used:

- OS version: ubuntu-12.04.2-server-amd64
- Zimbra version: zcs-8.0.3_GA_5664.UBUNTU12_64.20130305090204
- MTA server name: mta
- MTA hostname: mta.zimbra-essentials.com
- Internet domain: zimbra-essentials.com
- MTA server IP address: 172.16.126.141
- MTA server IP subnet mask: 255.255.255.0
- MTA server IP gateway: 172.16.126.1
- Internal DNS server: 172.16.126.11
- External DNS server: 8.8.8.8
- MTA admin ID: abdelmonam
- MTA admin password: Z!mbra@dm1n
- Zimbra admin password: zimbrabook
- LDAP server name: ldap
- LDAP hostname: ldap.zimbra-essentials.com
- LDAP server IP address: 172.16.126.140
- LDAP server IP subnet mask: 255.255.255.0
- LDAP server IP gateway: 172.16.126.1
- Internal DNS server: 172.16.126.11
- External DNS server: 8.8.8.8
- LDAP admin ID: abdelmonam
- LDAP admin password: Z!mbra@dm1n

To follow the steps described in the next sections, especially each time we need to perform a configuration, you should know how to use the vi editor. If not, you should develop your skills with the vi editor or use another editor instead. You can find good basic training for the vi editor at http://www.cs.colostate.edu/helpdocs/vi.html

System requirements

For the various system requirements, please refer to the following link: http://www.zimbra.com/docs/os/8.0.0/multi_server_install/wwhelp/wwhimpl/common/html/wwhelp.htm#href=ZCS_Multiserver_Open_8.0.System_Requirements_for_VMware_Zimbra_Collaboration_Server_8.0.html&single=true

If you are using another version of Zimbra, please check the correct requirements on the Zimbra website.

Ubuntu server installation

First of all, choose the appropriate language. Choose Install Ubuntu Server and then press Enter.
When the installation prompts you to provide a hostname, configure only a one-word hostname; in the Assumptions section, we've chosen ldap for the LDAP and mailstore server and mta for the MTA server—don't give the fully qualified domain name (for example, mta.zimbra-essentials.com). On the next screen that calls for the domain name, assign it zimbra-essentials.com (without the hostname). The hard disk setup is simple if you are using a single drive; however, in the case of a server, it's not the best way to do things. There are a lot of options for partitioning your drives. In our case, we just make a little partition (2x RAM) for swapping, and what remains will be used for the whole system. Others can recommend separate partitions for mailstore, system, and so on. Feel free to use the recommendation you want depending on your IT architecture; use your own judgment here or ask your IT manager. After finishing the partitioning task, you will be asked to enter the username and password; you can choose what you want except admin and zimbra. When asked if you want to encrypt the home directory, select No and then press Enter. Press Enter to accept an empty entry for the HTTP proxy. Choose Install security updates automatically and then press Enter. On the Software Selection screen, you must select the DNS Server and the OpenSSH Server choices for installation; no other options. This will authorize remote administration (SSH) and mandatorily set up bind9 for a split DNS. For bind9, you can install it on only one server, which is what we've done in this article. Select Yes and then press Enter to install the GRUB boot loader to the master boot record. The installation should have completed successfully. 
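The installer answers discussed above can be checked before you type them in. The following Python sketch is only an illustration of those rules (a one-word hostname, a separate domain name, swap sized at twice the RAM, and a username other than admin or zimbra); the function names are hypothetical and nothing here is part of the Ubuntu installer or of Zimbra.

```python
import re

# User names that the walkthrough says must not be chosen at install time.
RESERVED_USERS = {"admin", "zimbra"}


def is_short_hostname(name):
    """True if name is a single DNS label with no dots, such as 'mta'."""
    return re.fullmatch(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?", name) is not None


def build_fqdn(hostname, domain):
    """Join the one-word hostname and the domain entered on the next screen."""
    if not is_short_hostname(hostname):
        raise ValueError(f"hostname must be a single word, got {hostname!r}")
    return f"{hostname}.{domain}"


def swap_size_mb(ram_mb):
    """Swap partition size used in this walkthrough: twice the RAM."""
    return 2 * ram_mb


def is_allowed_username(user):
    """True unless the name is one of the reserved ones."""
    return user.lower() not in RESERVED_USERS


if __name__ == "__main__":
    print(build_fqdn("mta", "zimbra-essentials.com"))
    print(build_fqdn("ldap", "zimbra-essentials.com"))
    print(swap_size_mb(4096), "MB of swap for 4096 MB of RAM")
    print(is_allowed_username("abdelmonam"))
```

Passing a fully qualified name such as mta.zimbra-essentials.com as the hostname raises an error, which mirrors the advice to enter only the short name on the hostname screen.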
Preparing Ubuntu for Zimbra installation

To prepare Ubuntu for the Zimbra installation, perform the following steps:

1. Log in to the newly installed system and update and upgrade Ubuntu using the following commands:

    sudo apt-get update
    sudo apt-get upgrade

2. Install the dependencies as follows:

    sudo apt-get install libperl5.14 libgmp3c2 build-essential sqlite3 sysstat ntp

3. Zimbra recommends (but does not require) disabling and removing AppArmor:

    sudo /etc/init.d/apparmor stop
    sudo /etc/init.d/apparmor teardown
    sudo update-rc.d -f apparmor remove
    sudo aptitude remove apparmor apparmor-utils

4. Set the static IP for your server as follows:

   Open the network interfaces file using the following command:

    sudo vi /etc/network/interfaces

   Then replace the following line:

    iface eth0 inet dhcp

   With:

    iface eth0 inet static
    address 172.16.126.14
    netmask 255.255.255.0
    gateway 172.16.126.1
    network 172.16.126.0
    broadcast 172.16.126.255

   Restart the network process by typing in the following:

    sudo /etc/init.d/networking restart

Sanity test! To verify that your network is configured properly, type in ifconfig and ensure that the settings are correct. Then try to ping any working website (such as google.com) to see if that works.

On each server, take care to set that server's own static IP address in the address line (172.16.126.140 for the LDAP server and 172.16.126.141 for the MTA server).

Summary

In this article, we covered the prerequisites for a Zimbra multiserver installation and prepared the environment for installing the Zimbra server in a multiserver setup.
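The network and broadcast values in the static interface configuration follow directly from the address and the netmask. As a quick way to double-check them before editing /etc/network/interfaces, here is a minimal sketch in JavaScript; the function names are illustrative only and are not part of any Ubuntu or Zimbra tooling.

```javascript
// Convert a dotted-quad IPv4 string to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip
    .split('.')
    .reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// Convert an unsigned 32-bit integer back to dotted-quad notation.
function intToIp(n) {
  return [24, 16, 8, 0].map((shift) => (n >>> shift) & 255).join('.');
}

// network   = address AND netmask
// broadcast = network OR (inverted netmask)
function networkAndBroadcast(address, netmask) {
  const addr = ipToInt(address);
  const mask = ipToInt(netmask);
  const network = (addr & mask) >>> 0;
  const broadcast = (network | (~mask >>> 0)) >>> 0;
  return { network: intToIp(network), broadcast: intToIp(broadcast) };
}

// Values taken from the assumptions list (LDAP server).
console.log(networkAndBroadcast('172.16.126.140', '255.255.255.0'));
```

With the 255.255.255.0 netmask used in this article, both servers share the same network and broadcast values, so only the address line differs between them.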


Packt

29 Oct 2013

5 min read

The OpenFlow Controllers


SDN controllers

The decoupled control and data plane architecture of software-defined networking (SDN), as depicted in the following figure, and of OpenFlow in particular, can be compared to an operating system and computer hardware. The OpenFlow controller (similar to the operating system) provides a programmatic interface to the OpenFlow switches (similar to the computer hardware). Using this programmatic interface, network applications, referred to as Net Apps, can be written to perform control and management tasks and to offer new functionality. The control plane in SDN, and in OpenFlow in particular, is logically centralized, and Net Apps are written as if the network were a single system.

With a reactive control model, the OpenFlow switches must consult an OpenFlow controller each time a decision must be made, such as when a new packet flow reaches an OpenFlow switch (that is, a Packet_in event). With flow-based control granularity, there is a small performance delay because the first packet of each new flow is forwarded to the controller for a decision (for example, forward or drop), after which further traffic within that flow is forwarded at line rate within the switching hardware. While the first-packet delay is negligible in many cases, it may be a concern if the central OpenFlow controller is geographically remote or if most flows are short-lived (for example, single-packet flows).

An alternative, proactive approach is also possible in OpenFlow: the controller pushes policy rules out to the switches in advance. While this simplifies the control, management, and policy enforcement tasks, the bindings between the controller and the OpenFlow switches must be closely maintained. The first important concern with this centralized control is the scalability of the system, and the second is the placement of controllers.
A recent study of several OpenFlow controller implementations (NOX-MT, Maestro, and Beacon), conducted on a large emulated network with 100,000 hosts and up to 256 switches, revealed that all of these controllers were able to handle at least 50,000 new flow requests per second in each of the experimental scenarios. Furthermore, new OpenFlow controllers under development, such as Mc-Nettle (http://haskell.cs.yale.edu/nettle/mcnettle/), target powerful multicore servers and are being designed to scale up to large data center workloads (for example, 20 million flow requests per second and up to 5,000 switches).

In packet switching networks, traditionally, each packet contains the information a network switch needs to make an individual routing decision. However, most applications send data as a flow of many individual packets. The control granularity in OpenFlow is at the scale of flows, not packets. When controlling individual flows, the decision made for the first packet of a flow can be applied to all the subsequent packets of that flow within the data plane (the OpenFlow switches). The overhead may be further reduced by grouping flows together, such as all traffic between two hosts, and making control decisions on the aggregated flows.

The role of controller in SDN approach

Multiple controllers may be used to reduce the latency or to increase the scalability and fault tolerance of the OpenFlow (SDN) deployment. OpenFlow allows multiple controllers to be connected to a switch, which lets backup controllers take over in the event of a failure. Onix and HyperFlow take the idea further by attempting to maintain a logically centralized, but physically distributed, control plane. This decreases the lookup overhead by enabling communication with local controllers, while still allowing applications to be written with a simplified central view of the network.
The main potential downside of this approach is the difficulty of maintaining a consistent state across the overall distributed system. Net Apps that believe they have an accurate view of the network may act incorrectly because of inconsistency in the global network state.

Recalling the operating system analogy, an OpenFlow controller acts as a network operating system and should implement at least two interfaces: a southbound interface that allows OpenFlow switches to communicate with the controller, and a northbound interface that presents a programmable application programming interface (API) to network control and management applications (that is, Net Apps). The OpenFlow protocol is an early implementation of the SDN southbound interface.

External control and management systems/software or network services may wish to extract information about the underlying network, enforce policies, or control an aspect of the network behavior. In addition, a primary OpenFlow controller may need to share policy information with a backup controller, or to communicate with other controllers across multiple control domains. While the southbound interface (for example, OpenFlow or ForCES, http://datatracker.ietf.org/wg/forces/charter/) is well defined and can be considered a de facto standard, there is no widely accepted standard for northbound interactions, and they are more likely to be implemented on a use-case basis for particular applications.
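The reactive, flow-based control model described in this article can be sketched in a few lines of JavaScript. This is a simplified illustration under stated assumptions, not the OpenFlow protocol or any real controller's API: the ReactiveSwitch class, the controllerDecide policy, and the flow key format are hypothetical names chosen for this example.

```javascript
// Hypothetical controller policy: decide what to do with a new flow.
// Illustrative rule only: drop traffic to port 23, forward everything else.
function controllerDecide(flowKey) {
  return flowKey.endsWith(':23') ? 'drop' : 'forward';
}

class ReactiveSwitch {
  constructor(controller) {
    this.controller = controller; // function consulted on a table miss
    this.flowTable = new Map();   // flow key -> cached action
    this.packetInCount = 0;       // packets that had to go to the controller
  }

  // In this sketch, a flow is identified by source, destination,
  // and destination port.
  static flowKey(pkt) {
    return `${pkt.src}->${pkt.dst}:${pkt.dstPort}`;
  }

  handlePacket(pkt) {
    const key = ReactiveSwitch.flowKey(pkt);
    if (!this.flowTable.has(key)) {
      // Table miss: the first packet of the flow is sent to the controller
      // (the Packet_in event) and the decision is cached in the flow table.
      this.packetInCount += 1;
      this.flowTable.set(key, this.controller(key));
    }
    // Table hit: later packets of the flow never reach the controller.
    return this.flowTable.get(key);
  }
}

const sw = new ReactiveSwitch(controllerDecide);
const web = { src: '10.0.0.1', dst: '10.0.0.2', dstPort: 80 };
sw.handlePacket(web);
sw.handlePacket(web);
sw.handlePacket(web);
console.log(sw.packetInCount); // only the first packet consulted the controller
```

Only the first packet of each flow pays the controller round trip, which is why the delay matters mostly when flows are very short or the controller is far away. A proactive variant would fill flowTable before any traffic arrives.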


Packt

29 Oct 2013

10 min read

Understanding WebSockets and Server-sent Events in Detail


Encoders and decoders in Java API for WebSockets

As seen in the previous chapter, the class-level annotation @ServerEndpoint indicates that a Java class is a WebSocket endpoint at runtime. The value attribute is used to specify a URI mapping for the endpoint. Additionally, the user can add encoders and decoders attributes to encode application objects into WebSocket messages and decode WebSocket messages into application objects.

The following table summarizes the @ServerEndpoint annotation and its attributes:

Annotation: @ServerEndpoint
Description: This class-level annotation signifies that the Java class is a WebSockets server endpoint.

Attribute: value
Description: The value is the URI with a leading '/'.

Attribute: encoders
Description: The encoders attribute contains a list of Java classes that act as encoders for the endpoint. The classes must implement the Encoder interface.

Attribute: decoders
Description: The decoders attribute contains a list of Java classes that act as decoders for the endpoint. The classes must implement the Decoder interface.

Attribute: configurator
Description: The configurator attribute allows the developer to plug in their implementation of ServerEndpointConfig.Configurator, which is used when configuring the server endpoint.

Attribute: subprotocols
Description: The subprotocols attribute contains a list of subprotocols that the endpoint can support.

In this section we shall look at providing encoder and decoder implementations for our WebSockets endpoint.

The preceding diagram shows how encoders take an application object and convert it to a WebSockets message. Decoders take a WebSockets message and convert it to an application object. Here is a simple example in which a client sends a WebSockets message to a WebSockets Java endpoint that is annotated with @ServerEndpoint and configured with an encoder and a decoder class. The decoder decodes the incoming WebSockets message into an application object, and the endpoint sends the same object back to the client. The encoder converts that object into a WebSockets message.
This sample is also included in the code bundle for the book. Here is the code to define ServerEndpoint with values for encoders and decoders:

    import javax.websocket.OnClose;
    import javax.websocket.OnMessage;
    import javax.websocket.OnOpen;
    import javax.websocket.Session;
    import javax.websocket.server.ServerEndpoint;

    @ServerEndpoint(value = "/book",
                    encoders = {MyEncoder.class},
                    decoders = {MyDecoder.class})
    public class BookCollection {

        // Called for each incoming message, already decoded into a Book.
        @OnMessage
        public void onMessage(Book book, Session session) {
            try {
                // Echo the book back; MyEncoder turns it into a text message.
                session.getBasicRemote().sendObject(book);
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }

        @OnOpen
        public void onOpen(Session session) {
            System.out.println("Opening socket" + session.getBasicRemote());
        }

        @OnClose
        public void onClose(Session session) {
            System.out.println("Closing socket" + session.getBasicRemote());
        }
    }

In the preceding code snippet, you can see that the class BookCollection is annotated with @ServerEndpoint. The value="/book" attribute provides the URI mapping for the endpoint. The @ServerEndpoint annotation also takes the encoders and decoders to be used during the WebSocket transmission. Once a WebSocket connection has been established, a session is created and the method annotated with @OnOpen is called. When the WebSocket endpoint receives a message, the method annotated with @OnMessage is called. In our sample, the method simply sends the book object using Session.getBasicRemote(), which gets a reference to the RemoteEndpoint and sends the message synchronously.

Encoders can be used to convert a custom user-defined object into a text message, TextStream, BinaryStream, or BinaryMessage format. An implementation of an encoder class for text messages is as follows:

    import javax.websocket.EncodeException;
    import javax.websocket.Encoder;
    import javax.websocket.EndpointConfig;

    public class MyEncoder implements Encoder.Text<Book> {

        @Override
        public String encode(Book book) throws EncodeException {
            return book.getJson().toString();
        }

        // init() and destroy() are declared by the Encoder interface;
        // this encoder has nothing to set up or release.
        @Override
        public void init(EndpointConfig config) { }

        @Override
        public void destroy() { }
    }

As shown in the preceding code, the encoder class implements Encoder.Text<Book>. The overridden encode method converts a book into a JSON string, which is then sent as the message; the init() and destroy() lifecycle methods are left empty.
(More on JSON APIs is covered in detail in the next chapter.)

Decoders can be used to decode WebSockets messages into custom user-defined objects. They can decode from text, TextStream, binary, or BinaryStream format. Here is the code for a decoder class:

    import java.io.StringReader;
    import javax.websocket.DecodeException;
    import javax.websocket.Decoder;
    import javax.websocket.EndpointConfig;

    public class MyDecoder implements Decoder.Text<Book> {

        @Override
        public Book decode(String string) throws DecodeException {
            javax.json.JsonObject jsonObject =
                javax.json.Json.createReader(new StringReader(string)).readObject();
            return new Book(jsonObject);
        }

        // Returns true only if the text parses as a JSON object.
        @Override
        public boolean willDecode(String string) {
            try {
                javax.json.Json.createReader(new StringReader(string)).readObject();
                return true;
            } catch (Exception ex) {
                return false;
            }
        }

        // init() and destroy() are declared by the Decoder interface;
        // this decoder has nothing to set up or release.
        @Override
        public void init(EndpointConfig config) { }

        @Override
        public void destroy() { }
    }

In the preceding code snippet, apart from the empty init() and destroy() lifecycle methods, Decoder.Text requires two methods to be overridden. The willDecode() method checks whether the decoder can handle the incoming message. The decode() method decodes the string into an object of type Book by using the JSON-P API javax.json.Json.createReader().

The following code snippet shows the user-defined class Book:

    import java.io.StringReader;
    import java.io.StringWriter;
    import javax.json.Json;
    import javax.json.JsonObject;

    public class Book {

        JsonObject jsonObject;

        public Book() {}

        public Book(JsonObject json) {
            this.jsonObject = json;
        }

        // Builds a Book by parsing a JSON string.
        public Book(String message) {
            jsonObject = Json.createReader(new StringReader(message)).readObject();
        }

        public JsonObject getJson() {
            return jsonObject;
        }

        public void setJson(JsonObject json) {
            this.jsonObject = json;
        }

        public String toString() {
            StringWriter writer = new StringWriter();
            Json.createWriter(writer).write(jsonObject);
            return writer.toString();
        }
    }

The Book class is a user-defined class that wraps the JSON object sent by the client. Here is an example of how the JSON details are sent to the WebSockets endpoint from JavaScript:

    var json = JSON.stringify({
        "name": "Java 7 JAX-WS Web Services",
        "author": "Deepak Vohra",
        "isbn": "123456789"
    });

    function addBook() {
        websocket.send(json);
    }

The client sends the message using websocket.send(), which causes the onMessage() method of BookCollection.java to be invoked.
BookCollection.java returns the same book to the client. In the process, the decoder decodes the WebSockets message when it is received. To send back the same Book object, the encoder first encodes the Book object into a WebSockets message, which is then sent to the client.

The Java WebSocket Client API

An earlier chapter, WebSockets and Server-sent Events, covered the Java WebSockets client API. Any POJO can be transformed into a WebSockets client by annotating it with @ClientEndpoint. Additionally, the user can add encoders and decoders attributes to the @ClientEndpoint annotation to encode application objects into WebSockets messages and decode WebSockets messages into application objects.

The following table shows the @ClientEndpoint annotation and its attributes:

Annotation: @ClientEndpoint
Description: This class-level annotation signifies that the Java class is a WebSockets client that will connect to a WebSockets server endpoint.

Attribute: encoders
Description: The encoders attribute contains a list of Java classes that act as encoders for the endpoint. The classes must implement the Encoder interface.

Attribute: decoders
Description: The decoders attribute contains a list of Java classes that act as decoders for the endpoint. The classes must implement the Decoder interface.

Attribute: configurator
Description: The configurator attribute allows the developer to plug in their implementation of ClientEndpointConfig.Configurator, which is used when configuring the client endpoint.

Attribute: subprotocols
Description: The subprotocols attribute contains a list of subprotocols that the endpoint can support.

Sending different kinds of message data: blob/binary

Using JavaScript, we can traditionally send JSON or XML as strings. However, HTML5 allows applications to work with binary data to improve performance. WebSockets supports two kinds of binary data:

Binary Large Objects (blob)
arraybuffer

A WebSocket can work with only one of the formats at any given time.
Using the binaryType property of a WebSocket, you can switch between using blob or arraybuffer:

    websocket.binaryType = "blob";
    // receive some blob data
    websocket.binaryType = "arraybuffer";
    // now receive ArrayBuffer data

The following code snippets show how to receive binary data over WebSockets and display an image sent by a server:

    websocket.binaryType = 'arraybuffer';

The preceding code snippet sets the binaryType property of websocket to arraybuffer.

    websocket.onmessage = function(msg) {
        var arrayBuffer = msg.data;
        var bytes = new Uint8Array(arrayBuffer);
        var image = document.getElementById('image');
        image.src = 'data:image/png;base64,' + encode(bytes);
    }

When onmessage is called, arrayBuffer is initialized to msg.data. The Uint8Array type represents an array of 8-bit unsigned integers. The image.src value is inlined using the data URI scheme; encode() is assumed to be a helper, not shown here, that Base64-encodes the bytes.

Security and WebSockets

WebSockets are secured using the web container security model. A WebSockets developer can declare whether access to the WebSocket server endpoint needs to be authenticated, who can access it, and whether it needs an encrypted connection.

A WebSockets endpoint that is mapped to a ws:// URI is protected in the deployment descriptor by the http:// URI with the same hostname, port, and path, since the initial handshake is made over the HTTP connection. So, WebSockets developers can assign an authentication scheme, user roles, and a transport guarantee to any WebSockets endpoint.

We will take the same sample that we saw in the earlier chapter, WebSockets and Server-sent Events, and make it a secure WebSockets application.
Here is the web.xml for a secure WebSocket endpoint:

    <web-app version="3.0"
             xmlns="http://java.sun.com/xml/ns/javaee"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd">
        <security-constraint>
            <web-resource-collection>
                <web-resource-name>BookCollection</web-resource-name>
                <url-pattern>/index.jsp</url-pattern>
                <http-method>PUT</http-method>
                <http-method>POST</http-method>
                <http-method>DELETE</http-method>
                <http-method>GET</http-method>
            </web-resource-collection>
            <user-data-constraint>
                <description>SSL</description>
                <transport-guarantee>CONFIDENTIAL</transport-guarantee>
            </user-data-constraint>
        </security-constraint>
    </web-app>

As you can see in the preceding snippet, we used <transport-guarantee>CONFIDENTIAL</transport-guarantee>. The Java EE specification followed by application servers provides different levels of transport guarantee on the communication between clients and the application server. The three levels are:

Data Confidentiality (CONFIDENTIAL): We use this level to guarantee that all communication between client and server goes through the SSL layer and that connections won't be accepted over a non-secure channel.

Data Integrity (INTEGRAL): We can use this level when full encryption is not required but we want our data to be transmitted to and from a client in such a way that, if anyone changed the data, we could detect the change.

Any type of connection (NONE): We can use this level to make the container accept connections on both HTTP and HTTPS.

The following steps should be followed for setting up SSL and running our sample to show a secure WebSockets application deployed in GlassFish.
1. Generate the server certificate:

    keytool -genkey -alias server-alias -keyalg RSA -keypass changeit -storepass changeit -keystore keystore.jks

2. Export the generated server certificate in keystore.jks into the file server.cer:

    keytool -export -alias server-alias -storepass changeit -file server.cer -keystore keystore.jks

3. Create the trust-store file cacerts.jks and add the server certificate to the trust store:

    keytool -import -v -trustcacerts -alias server-alias -file server.cer -keystore cacerts.jks -keypass changeit -storepass changeit

4. Change the following JVM options so that they point to the location and name of the new keystore. Add this in domain.xml under java-config:

    <jvm-options>-Djavax.net.ssl.keyStore=${com.sun.aas.instanceRoot}/config/keystore.jks</jvm-options>
    <jvm-options>-Djavax.net.ssl.trustStore=${com.sun.aas.instanceRoot}/config/cacerts.jks</jvm-options>

5. Restart GlassFish. If you go to https://localhost:8181/helloworld-ws/, you can see the secure WebSocket application.

Here is how the headers look under Chrome Developer Tools:

1. Open the Chrome browser and click on View and then on Developer Tools.
2. Click on Network.
3. Select book under element name and click on Frames.

As you can see in the preceding screenshot, since the application is secured using SSL, the WebSockets URI also contains wss://, which means WebSockets over SSL.

So far we have seen the encoders and decoders for WebSockets messages. We also covered how to send binary data using WebSockets. Additionally, we have demonstrated a sample showing how to secure a WebSockets-based application. We shall now cover the best practices for WebSocket-based applications.
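The image example in the binary data section calls an encode() function that this article does not define. Since the result is placed in a data:image/png;base64 URI, it has to produce Base64 text. The following is one possible sketch of such a helper, written here as an assumption about what the original sample used rather than a copy of it.

```javascript
// Base64-encode a Uint8Array so that it can be embedded in a data: URI.
// Hypothetical stand-in for the encode() helper used by the onmessage handler.
function encode(bytes) {
  const alphabet =
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
  let out = '';
  // Every group of 3 input bytes becomes 4 output characters.
  for (let i = 0; i < bytes.length; i += 3) {
    const hasSecond = i + 1 < bytes.length;
    const hasThird = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = hasSecond ? bytes[i + 1] : 0;
    const b2 = hasThird ? bytes[i + 2] : 0;
    out += alphabet[b0 >> 2];
    out += alphabet[((b0 & 3) << 4) | (b1 >> 4)];
    // Missing bytes in the final group are marked with '=' padding.
    out += hasSecond ? alphabet[((b1 & 15) << 2) | (b2 >> 6)] : '=';
    out += hasThird ? alphabet[b2 & 63] : '=';
  }
  return out;
}

console.log(encode(new Uint8Array([77, 97, 110]))); // "TWFu"
```

For large images, receiving the data as a blob and using URL.createObjectURL() in the browser avoids the Base64 step entirely; the manual encoder above simply keeps the article's data URI approach working.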


Packt

29 Oct 2013

6 min read

Starting an instance


When you add a new instance to your project, it is automatically started. After a few moments, it will be ready for logging in using SSH. To create and start an instance, run the following command:

    % gcutil --project=<project_id> addinstance <instance_name>

gcutil will interactively collect the necessary details on the command line. For our example, we create a new instance called hello-gce:

    % gcutil --project=packt-gce-starter addinstance hello-gce
    INFO: Waiting for insert of instance hello-gce. Sleeping for 3s.
    [omitted]

The omitted output shows information on the zone, image, and machine type selected during the interactive setup and provides a return code to indicate whether the operation was successful.

The instance name must contain only lowercase letters, numbers, or dashes, and it must start with a letter.

If you wish to create an instance non-interactively, you will have to set a few additional parameters:

Option: --machine-type
Description: The machine type to host the instance. gcutil listmachinetypes displays a list of available machine types, and gcutil getmachinetype provides details on a specific machine type.

Option: --image
Description: The name of the image to install, from the project's images collection. gcutil listimages displays a list of available images, and gcutil getimage provides details on a specific image.

A comprehensive list of options can be displayed by running the following command:

    % gcutil help addinstance

After creation, your instance will also be displayed in the Google Cloud Console and in the output of the following command:

    % gcutil --project=<project_id> listinstances

By default, every instance has a network setup that allows the virtual machine to communicate with other machines in the same network and with the rest of the world via the Internet. Note, however, that incoming communication is restricted by the default firewall to SSH traffic; see The Firewall object section for details.
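The naming rule stated above is easy to check before calling gcutil. A minimal sketch in JavaScript follows; isValidInstanceName is a hypothetical helper, and it encodes only the rule given in this article. GCE may enforce further limits (for example, on length) that are not covered here.

```javascript
// Rule from the text: only lowercase letters, digits, or dashes,
// and the first character must be a letter.
function isValidInstanceName(name) {
  return /^[a-z][a-z0-9-]*$/.test(name);
}

console.log(isValidInstanceName('hello-gce')); // true
console.log(isValidInstanceName('Hello-GCE')); // false: uppercase letters
console.log(isValidInstanceName('1st-node'));  // false: starts with a digit
```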
Information on a newly created instance

Checking the status of an instance

After starting an instance, or for routine checks during the management of your GCE infrastructure, you may want to check the status of a given virtual machine. This can be done either via the Web UI, as shown in the preceding screenshot, or by calling:

    % gcutil --project=<project_id> getinstance <instance_name>

For our test instance, this will yield something like:

    % gcutil --project=packt-gce-starter getinstance hello-gce
    +------------------------+-------------------------------------+
    | property               | value                               |
    +------------------------+-------------------------------------+
    | name                   | hello-gce                           |
    | description            |                                     |
    | creation-time          | 2013-05-27T11:02:54.825-07:00       |
    | machine                | machineTypes/n1-standard-1          |
    |                        |                                     |
    | status                 | RUNNING                             |
    | status-message         |                                     |
    |                        |                                     |
    | disk                   | 0                                   |
    | type                   | EPHEMERAL                           |
    | mode                   | READ_WRITE                          |
    |                        |                                     |
    | network-interface      |                                     |
    | network                | networks/default                    |
    | ip                     | 10.240.17.7                         |
    | access-configuration   | External NAT                        |
    | type                   | ONE_TO_ONE_NAT                      |
    | external-ip            | 192.158.30.140                      |
    +------------------------+-------------------------------------+

Note the status line within the above output; it says RUNNING. This indicates that the machine is ready to be used.

GCE instance states and transitions

Every instance in GCE has a defined status lifecycle, as shown in the preceding figure, and the following states are known:

Status: PROVISIONING
Description: Resources are being reserved for the instance, but the virtual machine is not running yet.

Status: STAGING
Description: Resources have been acquired for the instance, and the virtual machine is being prepared for launch.

Status: RUNNING
Description: The instance is booting up or running.

Status: STOPPED
Description: The instance has either been shut down or it failed. Subsequently, it will either reboot (changing to PROVISIONING) or stop (changing to TERMINATED).

Status: TERMINATED
Description: The instance has either been shut down or it failed, and rebooting the virtual machine is not an option. This status is permanent, and the instance must be deleted and recreated.

Logging in to your instance

As mentioned before, the default network setup allows you to connect to your instance via the SSH protocol. GCE automatically handles key management for you and your project members. To relieve you from the hassle of key handling, gcutil wraps SSH and takes care of setting up password-less authentication correctly. To log in to your instance via SSH, run the following command from your workstation computer:

    % gcutil --project=<project_id> ssh <instance_name>

For our example, you would do the following:

    % gcutil --project=packt-gce-starter ssh hello-gce
    INFO: Zone for 'hello-gce' detected as u'europe-west1-a'.
    INFO: Running command line: ssh -o UserKnownHostsFile=/dev/null -o CheckHostIP=no -o StrictHostKeyChecking=no -i /Users/packt/.ssh/google_compute_engine -A -p 22 packt@192.158.30.140 --
    [omitted]
    packt@hello-gce:~$

As mentioned before, GCE takes care of SSH through gcutil. In fact, gcutil does a lot! Most importantly, it checks whether you already have a public/private key pair and, if not, creates one for you. It also takes care of uploading your public key to the Google Cloud Console and associating it with your Google user account. In addition, it automatically injects your public key into every instance so that you can log in directly, even if you did not set up a user account within the OS image. If you wish to use a different SSH client, you will have to manage the usage and upload of the correct key manually.

gcutil stores the generated key pair in your home folder under the hidden directory ~/.ssh.
There, you will find (among others) two files:

google_compute_engine (your private key)
google_compute_engine.pub (your public key)

Even if you do not wish to use gcutil for SSH'ing into your virtual machines, you should go through its setup routine at least once so that your keys are generated and uploaded to the Google Cloud Console; otherwise, you will not be able to log in.

One ring to rule them all

During initial key ring generation, gcutil will ask you to enter and repeat a passphrase to protect your SSH key. Although you can leave the passphrase empty, we strongly recommend not doing so. If you do not protect your key with a passphrase, anyone who gets hold of your workstation computer can easily copy your key and will then have full administrative access to your whole GCE infrastructure!

Summary

In this article, we learned how to get started with creating and running your infrastructure in the Cloud, along with a few concepts that make up a well-performing GCE system.
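The instance lifecycle from the status table in this article can also be written down as a small transition map, which is handy when scripting checks around gcutil getinstance. The sketch below is an interpretation, not GCE's definition: the STOPPED transitions come from the table, while the PROVISIONING to STAGING to RUNNING to STOPPED order is an assumption inferred from the state descriptions. All names are hypothetical.

```javascript
// Allowed next states for each status (partly assumed, see the note above).
const TRANSITIONS = {
  PROVISIONING: ['STAGING'],
  STAGING: ['RUNNING'],
  RUNNING: ['STOPPED'],
  STOPPED: ['PROVISIONING', 'TERMINATED'], // reboot or stop, per the table
  TERMINATED: [],                          // permanent: delete and recreate
};

// True if the map allows moving directly from one status to another.
function canTransition(from, to) {
  return (TRANSITIONS[from] || []).includes(to);
}

// Only RUNNING means the machine is ready to be used.
function isUsable(status) {
  return status === 'RUNNING';
}

console.log(canTransition('STOPPED', 'PROVISIONING')); // true
console.log(canTransition('TERMINATED', 'RUNNING'));   // false
console.log(isUsable('STAGING'));                      // false
```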
