Understanding and Creating Simple SSRS Reports

Packt
27 Mar 2015
14 min read
In this article by Deepak Agarwal and Chhavi Aggarwal, authors of the book Microsoft Dynamics AX 2012 R3 Reporting Cookbook, we will cover the following topics:

- Grouping in a report
- Adding ranges to a report
- Deploying a report
- Creating a menu item for a report
- Creating a report using a query in Warehouse Management

Reports are a basic necessity for any business process, as they aid in making critical decisions by analyzing all the data together in a customized manner. Reports come in many types, such as ad hoc, analytical, transactional, and general statements, and can use images, pie charts, and other graphical representations. These reports help the user take the required actions. Microsoft SQL Server Reporting Services (SSRS) is the primary reporting tool of Dynamics AX 2012 R2 and R3.

This article will help you understand the development of SSRS reports in AX 2012 R3 by developing and designing reports using simple steps, broken down into smaller recipes. You will design a report using queries with simple formatting, and then deploy the report to the reporting server to make it available to users, easily accessible from inside the rich client.

Reporting overview

Microsoft SQL Server Reporting Services (SSRS) is the most important feature of Dynamics AX 2012 R2 and R3 reporting. It is the best way to generate analytical, highly scalable, transactional, and cost-effective reports, and it makes reports easy to customize so that you get exactly what you want to see. SSRS provides a complete reporting platform that enables the development, design, deployment, and delivery of interactive reports. SSRS reports are designed and customized in Visual Studio (VS), have extensive reporting capabilities, and can easily be exported to Excel, Word, and PDF formats.

Dynamics AX 2012 has extensive reporting options such as Excel, Word, PowerPivot, Management Reporter, and, most importantly, SSRS reports. While there are many ways to generate reports, SSRS remains the prominent way to generate analytical and transactional reports. SSRS reports were first integrated in AX 2009, and today they have replaced the legacy reporting system in AX 2012.

SSRS reports can be developed using classes and queries. In this article, we will discuss query-based reports. In a query-based report, a query is used as the data source to fetch the data from Dynamics AX 2012 R3. We add grouping and ranges in the query to filter the data. We use the auto design reporting feature to create a report, which is then deployed to the reporting server. After deploying the report, a menu item is attached to it in Dynamics AX R3 so that the user can open the report from AX R3.

Through the recipes in this article, we will build a vendor master report. This report will list all the vendors under each vendor group. It will use a query data source to fetch data from Dynamics AX and then create an auto design-based report. So that this report can be accessed from the rich client, it will then be deployed to the reporting server and attached to a menu item in AX.

Here are some important links to get started with this article:

- Install the Reporting Services extensions: https://technet.microsoft.com/en-us/library/dd362088.aspx
- Install Visual Studio Tools: https://technet.microsoft.com/en-us/library/dd309576.aspx
- Connect Microsoft Dynamics AX to the new Reporting Services instance: https://technet.microsoft.com/en-us/library/hh389773.aspx
- Before you install the Reporting Services extensions, see https://technet.microsoft.com/en-us/library/ee355041.aspx

Grouping in reports

Grouping means putting things into groups. Grouping data simplifies the structure of the report and makes it more readable, and it also helps you find details when required. We can group the data in the query as well as in the auto design node in Visual Studio. In this recipe, we will structure the report by grouping the VendorMaster report on the vendor group to make the report more readable.

How to do it...

In this recipe, we will add fields under the grouping node of the dataset created earlier in Visual Studio. The fields that have been added to the grouping node will be added and shown automatically in the SSRS report.

1. Go to the dataset and select the VendGroup field.
2. Drag and drop it onto the Groupings node under the VendorMaster auto design. This will automatically create a new grouping node and add the VendGroup field to the group.
3. Each grouping has a header row where even fields that don't belong to the group, but need to be displayed in the grouped node, can be added. This groups the records and also acts as a header, as seen in the following screenshot:

How it works…

Grouping can also be done on multiple fields. Use the row header to specify the fields that must be displayed in the header. A grouping can be added manually, but dragging and dropping avoids a number of manual tasks, such as setting the row header.

Adding ranges to the report

Ranges are very important and useful while developing an SSRS report in AX 2012 R3. They help to show only limited data in the report, filtered on the given ranges. The user can filter the data in a report on the basis of the field added as a range. The range must be specified in the query. In this recipe, we will show how to filter the data by using a query field as a range.

How to do it...

In this recipe, we will add fields under the Ranges node in the query that we made in the previous recipe. By adding a field as a range, you can filter the data on the basis of VendGroup and show only the limited data in the report.

1. Open the PKTVendorDetails query in the AOT.
2. Drag the VendGroup and Blocked fields to the Ranges node in the AOT and save your query.
3. In the Visual Studio project, right-click on Datasets and select Refresh.
4. Under the parameter node, VendorMaster_DynamicParameter collectively represents any parameter that will be added dynamically through the ranges. This parameter must be set to true to make additional ranges available at runtime. This adds a Select button to the report dialog, which the user can use to specify additional ranges other than those already added.
5. Right-click on the VendorMaster auto design and select Preview. The preview should display the range that was added in the query.
6. Click on the Select button and set the VendGroup value to 10. Click on the OK button, and then select the Report tab, as shown in the following screenshot:
7. Save your changes and rebuild the report from Solution Explorer. Then, deploy the solution.

How it works…

The report dialog uses the query service UI builder to translate the ranges and to expose additional ranges through the query.

Dynamic parameter: The dynamic parameter collectively represents all the parameters that are added at runtime.
It adds the Select button to the dialog, from where the user can invoke an advanced query filter window. From this filter window, more ranges and sorting can be added. The dynamic parameter is available per dataset and can be enabled or disabled by setting the Dynamic Filters property to True or False.

The Report Wizard in AX 2012 still uses MorphX reports to auto-create reports using the wizard. The auto report option is available on every form that uses a new AX SSRS report.

Deploying a report

SSRS is a server-side solution, so reports need to be deployed in Dynamics AX 2012 R3. Until the reports are deployed, the user will not be able to see them, or any changes made to them, either from Visual Studio or from the Dynamics AX rich client. Reports can be deployed in multiple ways, and the developer must decide which to use. In this recipe, we will show you how to deploy reports using the following:

- Microsoft Dynamics AX R3
- Microsoft Visual Studio
- Microsoft PowerShell

Getting ready

In order to deploy reports, you must have the permissions and rights to deploy them to SQL Server Reporting Services. You must also have permission to access the Report Manager configuration. Before deploying reports using Microsoft PowerShell, you must ensure that Windows PowerShell 2.0 is installed.

How to do it...

Microsoft Dynamics AX R3 supports the following ways to deploy SSRS reports. For each of the following deployment locations, let's have a look at the steps that need to be followed:

- Microsoft Dynamics AX R3: Reports can be deployed individually from a developer workspace in Microsoft Dynamics AX. In the AOT, expand the SSRS Reports node, expand the Reports node, select the particular report that needs to be deployed, right-click on the report, and then click on Deploy Element. The developer can deploy as many reports as needed, but only one at a time. Reports are deployed for all the translated languages.
- Microsoft Visual Studio: Individual reports can be deployed using Visual Studio. Open Visual Studio. In Solution Explorer, right-click on the reporting project that contains the report that you want to deploy, and click on Deploy. The reports are deployed for the neutral (invariant) language only.
- Microsoft PowerShell: This is used to deploy the default reports that ship with Microsoft Dynamics AX R3. Using Windows PowerShell, you can deploy multiple reports at the same time. Visit http://msdn.microsoft.com/en-us/library/dd309703.aspx for details on how to deploy reports using PowerShell.

To verify whether a report has been deployed, open Report Manager in the browser and open the Dynamics AX folder. The PKTVendorDetails report should be found in the list of reports. You can find the Report Manager URL under System administration | Setup | Business intelligence | Reporting Services | Report servers. The report can also be previewed from Reporting Services: open Reporting Services and click on the name of the report to preview it.

How it works

Report deployment is the process of moving all the information related to a report to a central location, which is the server, from where it can be made available to the end user. The following list indicates the typical set of actions performed during deployment:

- The RDL file is copied to the server.
- The business logic is placed in the server location in the form of a DLL.
- Deployment ensures that the RDL and the business logic are cross-referenced to each other.

The MorphX IDE from AX 2009 is still available, and any custom reports designed in it can be imported. This support is only for the purpose of backward compatibility; in AX 2012 R3, there is no concept of new MorphX reports.

Creating a menu item for a report

The final step of developing a report in AX 2012 R3 is creating a menu item inside AX so that users can open the report from the UI. This recipe will show you how to create a new menu item for a report and set its major properties. It will also show you how to add this menu item to a module so that business users can access the report.

How to do it...

You create the new menu item under the Menu Items node in the AOT. In this recipe, an output menu item is created and linked to the SSRS report.

1. Go to AOT | Menu Items | Output, right-click, and select New Menu Item.
2. Name it PKTVendorMasterDetails and set the properties as highlighted in the following screenshot:
3. Open the menu item to run the report. A dialog appears with the Vendor hold and Group ranges added to the query, followed by a Select button. The Select button is similar to the MorphX reports option, where the user can specify additional conditions. To disable the Select option, go to the Dynamic Filters property in the dataset of the query and set it to False. The report output should appear as seen in the following screenshot:

How it works…

The report viewer in Dynamics AX is actually a form with an embedded browser control. The browser constructs the report URL at runtime and navigates to it. Unlike in AX 2009, rendering a report does not hold up the AX client; the user can use other parts of the application while the report is rendering. This is particularly beneficial for end users, as they can proceed with other tasks while the report executes. The permission setup is important, as it helps in controlling access to a report; however, SSRS reports inherit user permissions from the AX setup itself.

Creating a report using a query in Warehouse Management

In Dynamics AX 2012 R3, Warehouse Management is a new module. In the earlier versions of AX (2012 and R2), there was a single module for Inventory and Warehouse Management; in AX R3, it is a separate module. AX queries are the simplest and fastest way to create SSRS reports in Microsoft Dynamics AX R3. In this recipe, we will develop an SSRS report on Warehouse Management.

In AX R3, Warehouse Management is integrated with bar-coding devices such as RF-SMART, which supports purchase and receiving processes: picking, packing and shipping, transferring and stock counts, issuing materials for production orders, and reporting production as well. AX R3 also supports workflow for the Warehouse Management module, which is used to optimize picking, packing, and loading of goods for delivery to customers.

Getting ready

To work through this recipe, Visual Studio must be installed on your system to design and deploy the report. You must have permission to access all the rights of the reporting server, and the reporting extensions must be installed.

How to do it...

Similar to other modules, Warehouse Management has its tables prefixed with "WHS". We start the recipe by creating a query that uses WHSRFMenuTable and WHSRFMenuLine as the data sources.
We will provide a range on Menus in the query. After creating the query, we will create an SSRS report in Visual Studio, use that query as the data source, and generate the report on Warehouse Management.

1. Open the AOT, add a new query, and name it PKTWarehouseMobileDeviceMenuDetails.
2. Add the WHSRFMenuTable table. Go to Fields and set the Dynamic property to Yes.
3. Add the WHSRFMenuLine table and set the Relations property to Yes. This will create an auto relation inherited from the table's relation node. Go to Fields and set the Dynamic property to Yes.
4. Now open Visual Studio and add a new Dynamics AX report model project. Name it PKTWarehouseMobileDeviceMenuDetails.
5. Add a new report to this project and name it PKTWarehouseMobileDeviceDetails.
6. Add a new dataset and name it MobileDeviceDetails.
7. Select the PKTWarehouseMobileDeviceMenuDetails query in the Dataset property. Select all fields from both tables. Click on OK.
8. Now drag and drop this dataset onto the design node. It will automatically create an auto design. Rename it MobileMenuDetails.
9. In the properties, set the layout property to ReportLayoutStyleTemplate.
10. Now preview your report.

How it works

When we start creating an SSRS report, Visual Studio must be connected to Microsoft Dynamics AX R3. If the Microsoft Dynamics AX option is visible in Visual Studio while creating the new project, then the reporting extensions are installed; otherwise, we need to install the reporting extensions properly.

Summary

This article walked you through the basics of SSRS reports and created a simple report using queries. It should also help you understand the basic characteristics of reports.


Storm for Real-time High Velocity Computation

Packt
27 Mar 2015
10 min read
In this article by Shilpi Saxena, author of the book Real-time Analytics with Storm and Cassandra, we will cover the following topics:

- What's possible with data analysis?
- Real-time analytics: why is it becoming the need of the hour?
- Why Storm: the power of high-speed distributed computations

We will get you to think about some interesting problems along the lines of an Air Traffic Controller (ATC), credit card fraud detection, and so on. First and foremost, you will understand what big data is. Big data is the buzzword of the software industry, but it's much more than buzz; it really is a huge amount of data.

What is big data?

Big data is equal to volume, veracity, variety, and velocity. These are described as follows:

- Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information (for example, converting 12 terabytes of tweets created each day into improved product sentiment analysis, or converting 350 billion annual meter readings to better predict power consumption).
- Velocity: Sometimes, 2 minutes is too late. For time-sensitive processes, such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value (for example, scrutinizing 5 million trade events created each day to identify potential fraud, or analyzing 500 million call detail records daily in real time to predict customer churn faster).
- Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files, and many more. New insights are found when analyzing these data types together (for example, monitoring hundreds of live video feeds from surveillance cameras to target points of interest, or exploiting the 80 percent growth in images, videos, and documents to improve customer satisfaction).

Now that I have described big data, let's have a quick look at where this data is generated and how it comes into existence. The following figure demonstrates a quick snapshot of what can happen in one second in the world of the Internet and social media. We need the power to process all this data at the same rate at which it is generated in order to gain meaningful insight from it.

That power of computation comes with the Storm and Cassandra combination. This technological combination lets us cater to the following use cases:

- Credit card fraud detection
- Security breaches
- Bandwidth allocation
- Machine failures
- Supply chain
- Personalized content
- Recommendations

Getting acquainted with a few problems that require a distributed computing solution

Let's do a deep dive and identify some of the problems that require distributed solutions.

Real-time business solution for credit or debit card fraud detection

Let's get acquainted with the problem depicted in the following figure: when we make a transaction using plastic money and swipe our debit or credit card for payment, the duration within which the bank has to validate or reject the transaction is less than 5 seconds.
During these 5 seconds, the transaction details have to be encrypted and travel over a secure network from the servicing bank to the issuing bank; at the issuing bank, the entire fuzzy logic for acceptance or decline of the transaction has to be computed, and the result has to travel back over the secure network. Challenges such as network latency and delay can be optimized only to some extent, so to complete the transaction in less than 5 seconds, one has to design an application that is able to churn through a considerable amount of data and generate results within 1 to 2 seconds.

The Aircraft Communications Addressing and Reporting System

This is another typical use case that cannot be implemented without a reliable real-time processing system in place. These systems use satellite communication (SATCOM) and, as per the following figure, they gather voice and packet data from all phases of flight in real time and are able to generate analytics and alerts on the same data in real time. Take the example from the preceding case: a flight encounters some real hazardous weather, say, electrical storms on its route; that information is sent through satellite links and voice or data gateways to the air traffic controller, which in real time detects the situation and raises alerts to deviate the routes of all other flights passing through that area.

Healthcare

This is another very important domain where real-time analytics over high-volume, high-velocity data has equipped healthcare professionals with accurate and exact information in real time so that they can take informed, life-saving actions. The preceding figure depicts the use case where doctors can take informed action to handle the medical situation of their patients. Data is collated from a historic patient database, a drug database, and patient records. Once the data is collected, it is processed, and live statistics and key parameters of the patient are plotted against the same collated data. This data can be used to further generate reports and alerts to aid healthcare professionals in real time.

Other applications

There are a variety of other applications where the power of real-time computing can either optimize operations or help people take informed decisions. It has become a great utility and aid in the following industries:

- Manufacturing
- Application performance monitoring
- Customer relationship management
- Transportation industry
- Network optimization

Complexity of existing solutions

Now that we understand the power that real-time solutions can bring to various industry verticals, let's explore and find out what options we have to process vast amounts of data being generated at a very fast pace.

The Hadoop solution

The Hadoop solution is a tried, tested, and proven solution in the industry, in which we use MapReduce jobs in a clustered setup to execute jobs and generate results. MapReduce is a programming paradigm where we process large data sets by using a mapper function that processes a key-value pair and generates intermediate output, again in the form of a key-value pair. A reduce function then operates on the mapper output, merges the values associated with the same intermediate key, and generates the result.
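To make this paradigm concrete, the following is a minimal sketch of the classic word count job written against the Hadoop MapReduce Java API (the org.apache.hadoop.mapreduce packages); it is illustrative only, and the HDFS input and output paths are assumed to be passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits a (word, 1) pair for every token in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word (the intermediate key).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The mapper emits a (word, 1) pair for every token, and the reducer sums the values for each intermediate key, which is exactly the flow described above and in the figure that follows.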
The preceding figure demonstrates the simple word count MapReduce job, where:

- There is a huge big data store, which can go up to petabytes or even zettabytes
- Blocks of the input data are split and replicated onto the nodes in the Hadoop cluster
- Each mapper job counts the number of words in the data blocks allocated to it
- Once the mappers are done, the words (which are actually the keys) and their counts are sent to the reducers
- The reducers combine the mapper output and the results are generated

Big data, as we know, did provide a solution for processing and generating results out of humongous volumes of data, but it is predominantly a batch processing system and has almost no utility in real-time use cases.

A custom solution

Here we talk about a solution of the kind Twitter used before the advent of Storm. A simplistic version of the problem could be that you need a real-time count of the tweets by each user; Twitter solved the problem using the mechanism shown in the following figure. Here is how it works in detail:

- A fire hose or queue is created, onto which all the tweets are pushed.
- A set of worker nodes reads from the queue, deciphers the tweet JSON, and maintains the count of tweets by each user across the different workers. At this first set of workers, the tweets are distributed equally amongst the workers, so they are sharded randomly.
- These workers assimilate their first-level counts into the next set of queues.
- A second level of workers picks from these queues (the ones mentioned at level 1). Here, the sharding is not random; an algorithm is in place that ensures that the tweet count of one user always goes to the same worker.
- The counts are then dumped into a data store.

The queue-worker solution has the following drawbacks:

- It is very complex and specific to the use case
- Redeployment and reconfiguration is a huge task
- Scaling is very tedious
- The system is not fault tolerant

Paid solutions

Well, this is always an option. A lot of big companies have invested in products that let us do this kind of computing, but that comes at a heavy license cost. A few solutions, to name some, are from companies such as:

- IBM
- Oracle
- Vertica
- GigaSpaces

Open real-time processing tools

There are a few other technologies that have some similar traits and features, such as S4 from Yahoo, but it lacks guaranteed processing. Spark is essentially a batch processing system with some micro-batching features, which could be utilized as near real-time. So, finally, after evaluating all these options against the preceding problems, we still find Storm to be the best open source candidate to handle these use cases.

Storm persistence

Storm processes streaming data at very high velocity. Cassandra complements Storm's processing ability by providing support for writing to and reading from NoSQL at a very high rate. There are a variety of APIs available for connecting to Cassandra. In general, the APIs we are talking about are wrappers written over the core Thrift API, which offer various CRUD operations over a Cassandra cluster using programmer-friendly packages.

Thrift protocol: This is the most basic and core API for access to Cassandra. It is the RPC protocol, which provides a language-neutral interface and thus the flexibility to communicate using Python, Java, and so on. Please note that almost all the other APIs we discuss use Thrift under the hood. It is simple to use and provides basic functionality out of the box, such as ring discovery and native access.
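For contrast with the raw Thrift interface, here is a minimal sketch of connecting to a cluster and issuing a query with the DataStax Java Driver, which is discussed along with the other client libraries in the next section; it assumes a 2.x/3.x version of the driver, a single node reachable at 127.0.0.1, and no authentication.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraQuickstart {
    public static void main(String[] args) {
        // Build a Cluster object pointing at one contact point; the driver
        // discovers the rest of the ring and handles pooling and reconnection.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // assumed local node
                .build();
             Session session = cluster.connect()) {

            // Run a simple CQL query against the system keyspace.
            ResultSet rs = session.execute("SELECT release_version FROM system.local");
            Row row = rs.one();
            System.out.println("Connected to Cassandra " + row.getString("release_version"));
        }
    }
}
```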
Complex features such as retry, connection pooling, and so on are not supported by Thrift out of the box. There are a variety of libraries that have extended Thrift and added these much-required features; we'd like to touch upon a few widely used ones in this article.

Hector: This has the privilege of being one of the most stable and extensively used APIs for Java-based client applications to access Cassandra. As said earlier, it uses Thrift underneath, so it can't essentially offer any feature or functionality not supported by the Thrift protocol. The reason for its widespread use is the number of essential features ready to use and available out of the box:

- It has an implementation for connection pooling
- It has a ring discovery feature with the add-on of automatic failover support
- It has retry support for downed hosts in the Cassandra ring

DataStax Java Driver: This one is a recent addition to the stack of client access options to Cassandra, and hence gels well with newer versions of Cassandra. Here are its salient features:

- Connection pooling
- Reconnection policies
- Load balancing
- Cursor support

Astyanax: This is a very recent addition to the bouquet of Cassandra client APIs and has been developed by Netflix, which definitely makes it more fabled than the others. Let's have a look at its credentials to see where it qualifies:

- It supports all the Hector functions and is much easier to use
- It promises better connection pooling than Hector
- It has better failover handling than Hector
- It gives some out-of-the-box, database-like features (now that's big news); at the API level, it provides functionality it calls Recipes, which offers parallel all-row query execution, messaging queue functionality, an object store, and pagination
- It has numerous frequently required utilities, such as a JSON writer and a CSV importer

Summary

In this article, we reviewed what big data is, how it is analyzed, the applications in which it is used, the complexity of existing solutions, and the persistence options for Storm.


An Overview of Horizon View Architecture and its Components

Packt
27 Mar 2015
31 min read
In this article by Peter von Oven and Barry Coombs, authors of the book Mastering VMware Horizon 6, we will introduce you to the architecture and architectural components that make up the core VMware Horizon solution, concentrating on the virtual desktop elements of Horizon with Horizon View Standard. This article will cover the core Horizon View functionality of brokering virtual desktop machines that are hosted on the VMware vSphere platform. We will discuss the role of each of the Horizon View components, explain how they fit into the overall infrastructure and the benefits they bring, and then take a deep dive into how Horizon View works.

Introducing the key Horizon components

To start with, we are going to introduce, at a high level, the main components that make up the Horizon View product. All of the VMware Horizon components described are included as part of the licensed product, and the features that are available to you depend on whether you have the View Standard Edition, the Advanced Edition, or the Enterprise Edition. Horizon licensing also includes ESXi and vCenter licensing to support the deployment of the core hosting infrastructure. You can deploy as many ESXi hosts and vCenter Servers as you require to host the desktop infrastructure. The key elements of Horizon View are outlined in the following diagram:

In the next section, we are going to start drilling down deeper into the architecture of how these high-level components fit together and how they work.

A high-level architectural overview

The Horizon View architecture is pretty straightforward to understand, as its foundations lie in the standard VMware vSphere products (ESXi and vCenter). So, if you have the necessary skills and experience of working with this platform, then you are already halfway there. Horizon View builds on the vSphere infrastructure, taking advantage of some of the features of the ESXi hypervisor and vCenter Server. Horizon View requires adding a number of virtual machines to perform the various View roles and functions. An overview of the View architecture is shown in the following diagram:

View components run as applications that are installed on the Microsoft Windows Server operating system, so they could actually run on physical hardware as well. However, there are a great number of benefits available when you run them as virtual machines, such as delivering HA and DR, as well as the typical cost savings that can be achieved through virtualization. The following sections will cover each of these roles/components of the View architecture in greater detail.

The Horizon View Connection Server

The Horizon View Connection Server, sometimes referred to as the Connection Broker or View Manager, is the central component of the View infrastructure. Its primary role is to connect a user to their virtual desktop by performing user authentication and then delivering the appropriate desktop resources based on the user's profile and entitlements. When logging on to your virtual desktop, it is the Connection Server that you are communicating with.

How does the Connection Server work?

A user typically connects to their virtual desktop from their device by launching the View Client.
Once the View Client has launched, the user enters the address details of the View Connection Server, which in turn responds by asking them to provide their network login details (their Active Directory (AD) domain username and password). It's worth noting that Horizon View now supports different AD functional levels. These are detailed in the following screenshot:

These credentials are authenticated with AD and, if successful, the user is able to continue the logon process. Depending on what they are entitled to, the user could see a launch screen that displays a number of different desktop shortcuts available for login. These desktops represent the desktop pools that the user has been entitled to use. A pool is basically a collection of virtual desktops; for example, it could be a pool for the marketing department where the desktops contain specific applications/software for that department.

Once the user is authenticated, View Manager makes a call to the vCenter Server to create a virtual desktop machine, and then vCenter makes a call to View Composer (if you are using linked clones) to start the build process of the virtual desktop if there is not one already available. Once built, the virtual desktop is displayed/delivered within the View Client window, using the chosen display protocol (PCoIP or RDP). The process is described in detail in the following diagram:

There are other ways to deploy VDI solutions that do not require a connection broker and allow a user to connect directly to a virtual desktop; in fact, there might be a specific use case for doing this, such as having a large number of branches, where having local infrastructure allows trading to continue in the event of a WAN outage or poor network communication with the branch. VMware has a solution for what's referred to as a "brokerless View": the VMware Horizon View Agent Direct-Connection Plugin. However, don't forget that, in a Horizon View environment, the View Connection Server provides greater functionality and does much more than just connecting users to desktops.

The Horizon View Connection Server runs as an application on a Windows Server, which in turn could be either a physical or a virtual machine. Running as a virtual machine has many advantages; for example, it means that you can easily add high-availability features, which are key if you think about it, as you could potentially have hundreds of virtual user desktops running on a single host server. Along with managing the connections for the users, the Connection Server also works with vCenter Server to manage the virtual desktop machines. For example, when using linked clones and powering on virtual desktops, these tasks might be initiated by the Connection Server, but they are executed at the vCenter Server level.

Minimum requirements for the Connection Server

To install the View Connection Server, you need to meet the following minimum requirements, whether running on physical or virtual machines:

- Hardware requirements: The following screenshot shows the hardware required:
- Supported operating systems: The View Connection Server must be installed on one of the following operating systems:

The Horizon View Security Server

The Horizon View Security Server is another instance and another version of the View Connection Server but, this time, it sits within your DMZ so that you can allow end users to securely connect to their virtual desktop machine from an external network or the Internet.
You cannot install the View Security Server on the same machine that is already running as a View Connection Server or any of the other Horizon View components.

How does the Security Server work?

The user login process at the start is the same as when using a View Connection Server for internal access but, now, we have added an extra security layer with the Security Server. The idea is that users can access their desktop externally without needing a VPN connection to the network first. The process is described in detail in the following diagram:

The Security Server is paired with a View Connection Server, which is configured by the use of a one-time password during installation. It's a bit like pairing your phone's Bluetooth with the hands-free kit in your car. When the user logs in from the View Client, they access the View Connection Server, which in turn authenticates the user against AD. If the View Connection Server is configured as a PCoIP gateway, then it will pass the connection and addressing information to the View Client. This connection information will allow the View Client to connect to the View Security Server using PCoIP. This is shown in the diagram by the green arrow (1). The View Security Server will then forward the PCoIP connection to the virtual desktop machine (2), creating the connection for the user. The virtual desktop machine is displayed/delivered within the View Client window (3), using the chosen display protocol (PCoIP or RDP).

The Horizon View Replica Server

The Horizon View Replica Server, as the name suggests, is a replica or copy of a View Connection Server that is used to add high availability to your Horizon View environment. Having a replica of your View Connection Server means that, if the Connection Server fails, users are still able to connect to their virtual desktop machines. You will need to change the IP address or update the DNS record to match this server if you are not using a load balancer.

How does the Replica Server work?

So, the first question is, what actually gets replicated? The View Connection Broker stores all of its information relating to the end users, desktop pools, virtual desktop machines, and other View-related objects in an Active Directory Application Mode (ADAM) database. Then, using the Lightweight Directory Access Protocol (LDAP), in a method similar to the one AD uses for replication, this View information gets copied from the original View Connection Server to the Replica Server. As the Connection Server and the Replica Server are now identical to each other, if your Connection Server fails, you essentially have a backup that steps in and takes over so that end users can still continue to connect to their virtual desktop machines. Just as with the other components, you cannot install the Replica Server role on the same machine that is running as a View Connection Server or any of the other Horizon View components.

Persistent or nonpersistent desktops

In this section, we are going to talk about the different types of desktop assignments that can be deployed with Horizon View; these can have an impact on storage requirements and also on the way in which desktops are provisioned to end users. One of the questions that always gets asked is whether to have a dedicated (persistent) or a floating (nonpersistent) desktop assignment.
Desktops can either be individual virtual machines that are dedicated to a user on a 1:1 basis (as in a physical desktop deployment, where each user effectively has their own desktop), or the user can receive a new, vanilla desktop that gets provisioned, personalized, and assigned at the time of login, chosen at random from a pool of available desktops. This is the model used to build the user's desktop. The two options are described in more detail as follows:

- Persistent desktop: Users are allocated a desktop that retains all of their documents, applications, and settings between sessions. The desktop is statically assigned the first time that the user connects and is then used for all subsequent sessions. No other user is permitted access to the desktop.
- Nonpersistent desktop: Users might be connected to different desktops from the pool each time that they connect. Environmental or user data does not persist between sessions and is delivered as the user logs on to their desktop. The desktop is refreshed or reset when the user logs off.

In most use cases, a nonpersistent configuration is the best option. The key reason is that, in this model, you don't need to build all the desktops upfront for each user; you only need to power on a virtual desktop as and when it's required. All users start with the same basic desktop, which then gets personalized before delivery. This helps with concurrency rates. For example, you might have 5,000 people in your organization, but only 2,000 ever log in at the same time; therefore, you only need to have 2,000 virtual desktops available. Otherwise, you would have to build a desktop for each one of the 5,000 users who might ever log in, resulting in more server infrastructure and certainly a lot more storage capacity. We will talk about storage in the next section.

The other thing that we often see some confusion over is the difference between dedicated and floating desktops, and how linked clones fit in. Just to make it clear, linked clones and full clones are not what we are talking about when we refer to dedicated and floating desktops. Cloning operations refer to how a desktop is built, whereas the terms persistent and nonpersistent refer to how a desktop is assigned to a user. Dedicated and floating desktops are purely about user assignment and whether a user has a dedicated desktop or one allocated from a pool on demand. Linked clones and full clones are features of Horizon View, which uses View Composer to create a desktop image for each user from a master or parent image. This means that, regardless of having a floating or dedicated desktop assignment, the virtual desktop machine could still be either a linked or a full clone.

So, here's a summary of the benefits:

- It is operationally efficient. All users start from a single desktop image, or a smaller number of desktop images, so organizations reduce the amount of image and patch management.
- It is storage-efficient. The amount of storage required to host the nonpersistent desktop images will be smaller than keeping separate instances of unique user desktops.

In the next section, we are going to cover an in-depth overview of Horizon View Composer and linked clones, and the advantages the technology delivers.

Horizon View Composer and linked clones

One of the main reasons a virtual desktop project fails to deliver, or doesn't even get out of the starting blocks, is heavy infrastructure costs, often down to the storage requirements.
The storage requirements are often seen as a huge cost burden, which can be attributed to the fact that people approach this in the same way they would approach a physical desktop environment's requirements. This would mean that each user gets their own dedicated virtual desktop and the hard disk space that comes with it, albeit a virtual disk; this then gets scaled out for the entire user population, so each user is allocated a virtual desktop with some storage.

Let's take an example. If you had 1,000 users and allocated 250 GB per user's desktop, you would need 250 TB of storage for the virtual desktop environment. That's a lot of storage just for desktops, and it could result in significant infrastructure costs; the cost of deploying that amount of storage in the data center could render the project cost-ineffective compared to a physical desktop deployment.

A new approach to deploying storage for a virtual desktop environment is needed, and this is where linked clone technology comes into play. In a nutshell, linked clones are designed to reduce the amount of disk space required, and to simplify the deployment and management of images to multiple virtual desktop machines, making it a centralized and much easier process.

Linked clone technology

Starting at a high level, a clone is a copy of an existing or parent virtual machine. This parent virtual machine (VM) is typically your gold build from which you want to create new virtual desktop machines. When a clone is created, it becomes a separate, new virtual desktop machine with its own unique identity. This process is not unique to Horizon View; it's actually a function of vSphere and vCenter, and in the case of Horizon View, we add in another component, View Composer, to manage the desktop images. There are two types of clones that we can deploy: a full clone or a linked clone. We will explain the difference in the next sections.

Full clones

As the name implies, a full clone disk is an exact, full-sized copy of the parent machine. Once the clone has been created, the virtual desktop machine is unique, with its own identity, and has no links back to the parent virtual machine from which it was cloned. It can operate as a fully independent virtual desktop in its own right and is not reliant on its parent virtual machine. However, as it is a full-sized copy, be aware that it will take up the same amount of storage as its parent virtual machine, which leads back to our earlier discussion about storage capacity requirements. Using full clones will require larger amounts of storage capacity and will possibly lead to higher infrastructure costs.

Before you completely dismiss the idea of using full-clone virtual desktop machines, there are some use cases that rely on this model. For example, if you use VMware Mirage to deliver a base layer or application layer, it only works today with full-clone, dedicated Horizon View virtual desktop machines. If you have software developers, they probably need to install specialist tools and test code onto a desktop, and therefore need to "own" their desktop. Or perhaps the applications that you run in your environment need a dedicated desktop due to the way the applications are licensed.

Linked clones

Having now discussed full clones, we are going to talk about deploying virtual desktop machines with linked clones.
In a linked clone deployment, a delta disk is created and then used by the virtual desktop machine to store the data differences between its own operating system and the operating system of its parent virtual desktop machine. Unlike the full clone method, the linked clone is not a full copy of the virtual disk. The term linked clone refers to the fact that the linked clone will always look to its parent in order to operate, as it continues to read from the replica disk. Basically, the replica is a copy of a snapshot of the parent virtual desktop machine.

The linked clone itself could potentially grow to the same size as the replica disk if you allow it to. However, you can set limits on how big it can grow and, should it start to get too big, you can refresh the virtual desktops that are linked to it. This essentially starts the cloning process again from the initial snapshot. Immediately after a linked clone virtual desktop is deployed, the difference between the parent virtual machine and the newly created virtual desktop machine is extremely small, which reduces the storage capacity requirements compared to those of a full clone. This is how linked clones are more space-efficient than their full clone brothers.

The underlying technology behind linked clones is more like a snapshot than a clone, but with one key difference: View Composer. With View Composer, you can have more than one active snapshot linked to the parent virtual machine disk. This allows you to create multiple virtual desktop images from just one parent. Best practice would be to deploy an environment with linked clones so as to reduce the storage requirements. However, as we previously mentioned, there are some use cases where you will need to use full clones.

One thing to be aware of, which still relates to storage, is that, rather than capacity, we are now talking about performance. All linked clone virtual desktops are going to be reading from one replica and, therefore, will drive a high number of Input/Output Operations Per Second (IOPS) on the storage where the replica lives. Depending on your desktop pool design, you are fairly likely to have more than one replica, as you would typically have more than one datastore. This, in turn, depends on the number of users, who will drive the design of the solution. In Horizon View, you are able to choose the location where the replica lives. One of the recommendations is that the replica should sit on fast storage, such as a local SSD. Alternative solutions would be to deploy some form of storage acceleration technology to drive the IOPS.

Horizon View also has its own integrated solution called View Storage Accelerator (VSA), or Content Based Read Cache (CBRC). This feature allows you to allocate up to 2 GB of memory from the underlying ESXi host server to be used as a cache for the most commonly read blocks. As we are talking about booting desktop operating systems, the same blocks are required repeatedly; as these can be retrieved from memory, the process is accelerated. Another solution is View Composer Array Integration (VCAI), which allows the process of building linked clones to be offloaded to the storage array and its native snapshot mechanism, rather than taking CPU cycles from the host server. There are also a number of other third-party solutions that resolve the storage performance bottleneck, such as Atlantis Computing with their ILIO product, Nutanix, Nimble, and Tintri, to name a few.
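As a rough, back-of-the-envelope illustration of why linked clones are more space-efficient than full clones, here is a small Java sketch; the desktop count, parent image size, per-clone delta, and datastore count are assumed figures for illustration only, not values taken from the text.

```java
public class VdiStorageEstimate {
    public static void main(String[] args) {
        // Assumed figures for illustration; substitute your own measurements.
        int desktops = 1_000;           // concurrent desktops to provision
        double parentImageGb = 40.0;    // size of the parent (gold) image
        double deltaPerCloneGb = 2.0;   // typical linked clone delta if refreshed regularly
        int datastores = 4;             // one read-only replica per datastore

        // Full clones: every desktop carries a full copy of the parent image.
        double fullCloneTotalGb = desktops * parentImageGb;

        // Linked clones: one replica per datastore plus a small delta disk per desktop.
        double linkedCloneTotalGb = (datastores * parentImageGb) + (desktops * deltaPerCloneGb);

        System.out.printf("Full clones:   %,.0f GB%n", fullCloneTotalGb);
        System.out.printf("Linked clones: %,.0f GB%n", linkedCloneTotalGb);
    }
}
```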
In the next section, we will take a deeper look at how linked clones work.

How do linked clones work?

The first step is to create your master virtual desktop machine image, which should contain not only the operating system, core applications, and settings, but also the Horizon View Agent components. This virtual desktop machine will become your parent VM, or gold image. This image can now be used as the template from which to create any new virtual desktop machines; note, however, that the gold image or parent image cannot be a vSphere VM template. An overview of the linked clone process is shown in the following diagram:

Once you have created the parent virtual desktop or gold image (1), you then take a snapshot (2). When you create your desktop pool, this snapshot is selected, becomes the replica (3), and is set to read-only. Each virtual desktop is linked back to this replica, hence the term linked clone. When you start creating your virtual desktops, you create linked clones that are unique copies for each user. Try not to create too many snapshots of your parent VM; I would recommend having just a handful, otherwise this could impact the performance of your desktops and make it a little harder to know which snapshot is which.

What does View Composer build?

During the image building process, and once the replica disk has been created, View Composer creates a number of other virtual disks, including the linked clone (operating system) disk itself. These are described in the following sections.

Linked clone disk

Not wanting to state the obvious, the main disk that gets created is the linked clone disk itself. This linked clone disk is basically an empty virtual disk container that is attached to the virtual desktop machine as the user logs in and the desktop starts up. This disk will start off small in size, but will grow over time, depending on the block changes that are requested from the replica disk by the virtual desktop machine's operating system. These block changes are stored in the linked clone disk, which is sometimes referred to as the delta disk, or differential disk, due to the fact that it stores all the delta changes relative to the parent VM. As mentioned before, the linked clone disk can grow to a maximum size equal to the parent VM but, following best practice, you would never let this happen. Typically, you can expect the linked clone disk to only increase to a few hundred MB. We will cover this in the Understanding the linked clone process section later.

The replica disk is set as read-only and is used as the primary disk. Any writes and/or block changes that are requested by the virtual desktop are written to or read from the linked clone disk. It is a recommended best practice to allocate tier-1 storage, such as local SSD drives, to host the replica, as all virtual desktops in the cluster will be referencing this single read-only VMDK file as their base image. Keeping it high in the stack improves performance by reducing the overall storage IOPS required in a VDI workload. As we mentioned at the start of this section, storage costs are seen as being expensive for VDI. Linked clones reduce the burden of storage capacity, but they do drive the requirement to deliver a huge number of IOPS from a single LUN.

Persistent disk or user data disk

The persistent disk feature of View Composer allows you to configure a separate disk that contains just the user data and user settings, and not the operating system.
This allows any user data to be preserved when you update or make changes to the operating system disk, such as during a recompose action. It's worth noting that the persistent disk is referenced by the VM name and not the username, so bear this in mind if you want to attach the disk to another VM. This disk is also used to store the user's profile. With this in mind, you need to size it accordingly, ensuring that it is large enough to store any user profile-type data.

Disposable disk

With the disposable disk option, Horizon View creates what is effectively a temporary disk that gets deleted every time the user powers off their virtual desktop machine. If you think about how the Windows desktop operating system operates and the files it creates, there are several files that are used on a temporary basis; Temporary Internet Files and the Windows pagefile are two such examples. As these are only temporary files, why would you want to keep them? With Horizon View, these types of files are redirected to the disposable disk and then deleted when the VM is powered off.

Horizon View provides the option to have a disposable disk for each virtual desktop. This disposable disk is used to contain temporary files that will get deleted when the virtual desktop is powered off. These are files that you don't want to store on the main operating system disk, as they would consume unnecessary disk space; for example, the pagefile, Windows system temporary files, and VMware log files. Note that here we are talking about temporary system files and not user files. A user's temporary files are still stored on the user data disk so that they can be preserved. Many applications use the Windows temp folder to store installation CAB files, which can be referenced post-installation. Having said that, you might want to delete the temporary user data to reduce the desktop image size, in which case you could ensure that the user's temporary files are directed to the disposable disk.

Internal disk

Finally, we have the internal disk. The internal disk is used to store important configuration information, such as the computer account password, that would be needed to join the virtual desktop machine back to the domain if you refreshed the linked clones. It is also used to store Sysprep and QuickPrep configuration details. In terms of disk space, the internal disk is relatively small, averaging around 20 MB. By default, the user will not see this disk in Windows Explorer, as it contains important configuration information that you wouldn't want them to delete.

Understanding the linked clone process

There are several complex steps performed by View Composer and View Manager when a user launches a virtual desktop session. So, what's the process to build a linked clone desktop, and what goes on behind the scenes? When a user logs into Horizon View and requests a desktop, View Manager, using vCenter and View Composer, will create a virtual desktop machine. This process is described in the following sections.

Creating and provisioning a new desktop

An entry for the virtual desktop machine is created in the Active Directory Application Mode (ADAM) database before it is put into provisioning mode:

1. The linked clone virtual desktop machine is created by View Composer.
2. A machine account is created in AD with a randomly generated password.
3. View Composer checks for a replica disk and creates one if one does not already exist.
4. A linked clone is created by a vCenter Server API call from View Composer.
5. An internal disk is created to store the configuration information and machine account password.

Customizing the desktop

Now that you have a newly created linked clone virtual desktop machine, the next phase is to customize it. The customization steps are as follows:

1. The virtual desktop machine is switched to customization mode.
2. The virtual desktop machine is customized by vCenter Server using the customizeVM_Task command and is joined to the domain with the information you entered in the View Manager console.
3. The linked clone virtual desktop is powered on.
4. The View Composer Agent on the linked clone virtual desktop machine starts up for the first time and joins the machine to the domain, using the NetJoinDomain command and the machine account password that was created on the internal disk.
5. The linked clone virtual desktop machine is now Sysprep'd.
6. Once complete, View Composer tells View Agent that customization has finished, and View Agent tells View Manager that the customization process has finished.
7. The linked clone virtual desktop machine is powered off and a snapshot is taken.
8. The linked clone virtual desktop machine is marked as provisioned and is now available for use.

When a linked clone virtual desktop machine is powered on with the View Composer Agent running, the agent tracks any changes that are made to the machine account password. Any changes will be updated and stored on the internal disk. In many AD environments, the machine account password is changed periodically. If the View Composer Agent detects a password change, it updates the machine account password on the internal disk that was created with the linked clone. This is important because, during a refresh operation, the linked clone virtual desktop machine is reverted to the snapshot taken after customization; the agent will then be able to reset the machine account password to the latest one. The linked clone process is depicted in the following diagram:

Additional features and functions of linked clones

There are a number of other management functions that you can perform on a linked clone disk from View Composer; these are outlined in this section and are needed in order to deliver the ongoing management of the virtual desktop machines.

Recomposing a linked clone

Recomposing a linked clone virtual desktop machine or desktop pool allows you to perform updates to the operating system disk, such as updating the image with the latest patches or software updates. You can only perform updates on the same version of an operating system, so you cannot use the recompose feature to migrate from one operating system to another, such as going from Windows XP to Windows 7. As we covered in the What does View Composer build? section, we have separate disks for items such as the user's data. These disks are not affected during a recompose operation, so all user-specific data on them is preserved. When you initiate the recompose operation, View Composer essentially starts the linked clone building process over again; a new operating system disk is created, which then gets customized, and a snapshot, like the ones shown in the preceding sections, is taken. During the recompose operation, the MAC address of the network interface and the Windows SID are not preserved. There are some management tools and security-type solutions that might not work due to this change. However, the UUID will remain the same.
The recompose process is described in the following steps:

1. View Manager puts the linked clone into maintenance mode.
2. View Manager calls the View Composer resync API for the linked clones being recomposed, directing View Composer to use the new base image and snapshot.
3. If there isn't yet a replica for the base image and snapshot in the target datastore for the linked clone, View Composer creates the replica in the target datastore (unless a separate datastore is being used for replicas, in which case the replica is created in the replica datastore).
4. View Composer destroys the current OS disk for the linked clone and creates a new OS disk linked to the new replica.
5. The rest of the recompose cycle is identical to the customization phase of the provisioning and customization cycles.

The following diagram shows a graphical representation of the recompose process. Before the process begins, the first thing you need to do is update your Gold Image (1) with the patch updates or new applications you want to deploy to the virtual desktops. As described in the preceding steps, a snapshot is then taken (2) to create the new replica, Replica V2 (3). The existing OS disk is destroyed, but the User Data disk (4) is maintained during the recompose process:

Refreshing a linked clone

By carrying out a refresh of a linked clone virtual desktop, you are effectively reverting it to its initial state, that is, to the snapshot that was taken after it completed the customization phase. This process only applies to the operating system disk; no other disks are affected. An example use case for refresh operations would be refreshing a nonpersistent desktop two hours after logoff, to return it to its original state and make it available for the next user.

The refresh process performs the following tasks:

1. The linked clone virtual desktop is switched into maintenance mode.
2. View Manager reverts the linked clone virtual desktop to the snapshot taken after customization was completed: vdm-initial-checkpoint.
3. The linked clone virtual desktop starts up, and the View Composer Agent detects whether the machine account password needs to be updated. If the password on the internal disk is newer than the one in the registry, the agent updates the machine account password using the one on the internal disk.

One of the reasons why you would perform a refresh operation is if the linked clone OS disk starts to become bloated. As we previously discussed, the OS linked clone disk can grow to the full size of its parent image. This means it would be taking up more disk space than is really necessary, which kind of defeats the objective of linked clones. The refresh operation effectively resets the linked clone to a small delta between it and its parent image. The following diagram shows a representation of the refresh operation:

The linked clone on the left-hand side of the diagram (1) has started to grow in size. Refreshing reverts it back to the snapshot, as if it were a new virtual desktop, as shown on the right-hand side of the diagram (2).

Rebalancing operations with View Composer

The rebalance operation in View Composer is used to evenly distribute the linked clone virtual desktop machines across multiple datastores in your environment. You would perform this task in the event that one of your datastores was becoming full while others have ample free space. It might also help with the performance of that particular datastore.
For example, if you had 10 virtual desktop machines in one datastore and only two in another, running a rebalance operation would potentially even this out and leave you with six virtual desktop machines per datastore. You must use the View Administrator console to initiate the rebalance operation in View Composer; if you simply vMotion any of your virtual desktop machines yourself, View Composer will not be able to keep track of them.

On the other hand, if you have six virtual desktop machines on one datastore and seven on another, it is highly likely that initiating a rebalance operation will have no effect and no virtual desktop machines will be moved, as doing so would have no benefit. A virtual desktop machine will only be moved to another datastore if the target datastore has significantly more spare capacity than the source.

The rebalance process is described in the following steps:

1. The linked clone is switched to maintenance mode.
2. The virtual machines to be moved are identified based on the free space in the available datastores.
3. The operating system disk and persistent disk are disconnected from the virtual desktop machine.
4. The detached operating system disk and persistent disk are moved to the target datastore.
5. The virtual desktop machine is moved to the target datastore.
6. The operating system disk and persistent disk are reconnected to the linked clone virtual desktop machine.
7. View Composer resynchronizes the linked clone virtual desktop machines.
8. View Composer checks for the replica disk in the datastore and creates one if one does not already exist, as per the provisioning steps covered earlier in this article.
9. As with the recompose operation, the operating system disk for the linked clone is deleted, and a new one is created and then customized.

The following diagram shows the rebalance operation:

Summary

In this article, we discussed the Horizon View architecture and the different components that make up the complete solution. We covered the key technologies, such as how linked clones work to optimize storage.

Resources for Article:

Further resources on this subject:

Importance of Windows RDS in Horizon View [article]
Backups in the VMware View Infrastructure [article]
Design, Install, and Configure [article]
System Center Reporting

Packt
27 Mar 2015
21 min read
This article by the lead author Samuel Erskine, along with the co-authors Dieter Gasser, Kurt Van Hoecke, and Nasira Ismail, of the book Microsoft System Center Reporting Cookbook, discusses the drivers of organizational reporting, the general requirements for planning business-valued reports, the steps for planning the inputs your report data sources depend on, how you plan to view a report, the components of the System Center product, and preparing your environment for self-service Business Intelligence (BI).

A report is only as good as the accuracy of its data source. A data source is populated and updated by an input channel. In this article, we will cover the following recipes:

Understanding the goals of reporting
Planning and optimizing dependent data inputs
Planning report outputs
Understanding the reporting schemas of System Center components
Configuring Microsoft Excel for System Center data analysis

(For more resources related to this topic, see here.)

Understanding the goals of reporting

This recipe discusses the drivers of organizational reporting and the general requirements for planning business-valued reports.

Getting ready

To prepare for this recipe, you need to be ready to make a difference with all the rich data available to you in the area of reporting. This may require a mindset change; be prepared.

How to do it...

The key to successfully identifying what needs to be reported is a clear understanding of what you or the report requestor is trying to measure, and why. Reporting is driven by a number of organizational needs, which may fall into one or more of these sample categories:

Information to support a business case
Audit and compliance driven request
Budget planning and forecasting
Current operational service level

These categories are examples of the business needs you must understand. Understanding the business need behind a report increases the value of the report. For example, let us expand on the preceding business scenarios and map them to the System Center product in the following table:

Business/organizational objective | Objective details | System Center product
Information to support a business case | Provide a count of computers out of warranty to justify the request to buy additional computers. | System Center Configuration Manager
Audit and compliance driven request | Provide the security compliance state of all Windows servers. Provide a list of attempted security breaches by month. | System Center Configuration Manager, System Center Operations Manager
Budget planning and forecasting | How much storage should we plan to invest in next year's budget, based on the last 3 years' usage data? | System Center Operations Manager
Operational service level | How many incidents were resolved without second-tier escalation? | System Center Service Manager

In the majority of cases, the requestor does not provide the System Center administrator with the business objective. Use the preceding table as an example to guide your understanding of a report request.

How it works...

Reporting is a continual life cycle that begins with a request for information and should ultimately satisfy a real need. The typical life cycle of a request is illustrated in the following figure:

The life cycle stages are:

Report conception
Report request
Report creation
Report enhancement/retirement

This recipe focuses on the report conception stage, which is the most important stage of the life cycle.
This is due to the fact that a report with a clear business objective will deliver the following:

Focused activities: A report that has a clear objective will reduce the risk of the wasted effort usually associated with unclear requirements.
Direct or indirect business benefit: The reports you create, for example using System Center data, should ultimately benefit the business.

An additional benefit of this stage of report planning is knowing when a report is no longer required. This reduces the need to manage and support a report that has no value or use.

Planning and optimizing dependent data inputs

A report is only as good as the accuracy of its data source. A data source is populated and updated by an input channel. This recipe discusses and provides steps for planning the inputs your report data source(s) depend on.

Getting ready

Review the Understanding the goals of reporting recipe as a primer for this recipe.

How to do it...

The inputs of a report depend on the type of output you intend to produce and the definition of the accepted fields in the data source. An example is a report that provides a total count of computers in a System Center Configuration Manager environment. This report requires an input field that stores a numeric value for computers in the database. Here are the recommended steps you must take to prepare and optimize the data inputs for a report:

1. Identify the data source or sources.
2. Document the source data type properties.
3. Document the process used to populate the data sources (manual or automated process).
4. Agree on the authoritative source if there is more than one source for the same data.
5. Identify and document relationships between sources.
6. Document steps 1 to 5.

The following table provides a practical example of these steps for a report on the total count of computers by Windows operating system. Workgroup computers and computers not in the Active Directory domain are out of the scope of this report request:

Report input type | Details | Notes
Data source | Asset database | Populated manually by the purchase order team
Data source | Active Directory | Automatically populated. An Orchestrator runbook performs a scheduled clean-up of disabled objects
Data source | System Center Configuration Manager | Requires an agent and is currently not used to manage servers
Authoritative source | Active Directory | Based on the report scope
Data source relationship | Microsoft System Center Configuration Manager is configured to discover all systems in the Active Directory domain | Alternative source for the report using the All Systems collection

Plan to document the specific fields you need from the authoritative data source, for example, in a table similar to the following:

Required data | Description
Computer name | The fully qualified domain name of the computer
Operating system | Friendly operating system name
Operating system environment | Server or workstation
Date created in data source | Date the computer joined the domain
Last logon date | Date the computer last updated its attributes in Active Directory

The steps provided discuss an example of identifying input sources and the fields you plan to use in a requested report.

Optimizing report inputs

Once the required data for your reports has been identified and documented, you must test it for validity and consistency. Data sources that are populated by automated processes tend to be less prone to consistency errors.
Conversely, data sources based on manual entry are prone to errors (for example, incorrect spelling when typing text into the forms used to populate the data source). Here are typical recommended practices for improving consistency in manually and automatically populated data sources:

Automated (for example, agent based):
- Implement agent health checks and remediation.
- Include last agent update information in reports.

Manual entry:
- Avoid free text fields, except for description or notes.
- Use a list picker.
- Implement mandatory constraints on required fields (for example, a request for an e-mail address should only accept the correct format for e-mail addresses).

How it works...

The reports you create and manage are only as accurate as the original data source. There may be one or more sources available for a report. The process discussed in this recipe provides steps on how to narrow down the list of requirements. The list must include the data source and the specific data fields that contain the data for the proposed report(s). These input fields are populated by manual processes, automated processes, or a combination of both.

The final part of the recipe discussed an example of how to optimize the inputs you select. These steps will assist in answering one of the typical questions raised about reports: "Can we trust this information?" If you have performed these steps, the answer will be: "Yes, and this is why and how."

Planning report outputs

The preceding recipe, Planning and optimizing dependent data inputs, discussed what you need for a report. This recipe builds on the preceding recipes with a focus on how you plan to view a report (the output).

Getting ready

Plan to review the Understanding the goals of reporting and Planning and optimizing dependent data inputs recipes.

How to do it...

The type of report output depends on the input you query from the target data source(s). Typically, the output type is defined by the requestor of the report and may be in one or more of these formats:

List of items (tables)
Charts (2D, 3D, and other formats supported by the reporting program)
Geographic representation
Dials and gauges
A combination of all the listed formats

Here is an example of the steps you must perform to plan and agree on the reporting output(s):

1. Request the target format from the initiator of the report.
2. Check that the data source supports the requested output.
3. Create a sample dataset from the source.
4. Create a sample output in the requestor's format(s).
5. Agree on a final format or combination of formats with the requestor.

The steps to plan the output of reports are illustrated in the following figure:

These are the minimal steps you must perform to plan for outputs.

How it works...

The steps in this recipe are focused on scoping the output of the report. The scope provides you with the following:

Ensuring the output is defined before working on a large set of data
Validating that the data source can support the requested output
Avoiding scope creep, because the output is agreed upon and signed off

The objective is to ensure that the request can be satisfied based on what is available, not on what is desired. The process also provides the additional benefit of identifying any gaps in the data before embarking on the actual report creation.

There's more...

When planning report outputs, you may not always have access to the actual source data. Even when such access is possible, the recommended practice is not to work directly with the original source, to avoid negatively impacting it during the planning stage.
In either case, there are other options available to you. One of these options is using a spreadsheet program such as Microsoft Excel.

Mock up using Excel

One approach to testing and validating report outputs is the use of Microsoft Excel. You can create a representation of the input source data, including the data types (numbers, text, and formulas). The data can either be a sample you create yourself or an extract from the original source of the data. The added benefit is that the spreadsheet can serve as part of the portfolio of documentation for the report.

Understanding the reporting schemas of System Center components

The reporting schema of the System Center product is specific to each component. The components of the System Center product are listed in the following table:

System Center component | Description
Configuration Manager | This is configuration life cycle management. It is primarily targeted at client management; however, this is not a technical limitation, and it is also used to manage servers. This component provides configuration management capabilities, which include but are not limited to deploying operating systems, performing hardware and software inventory, and performing application life cycle management.
Data Protection Manager | This component delivers the capabilities to provide continual protection (backup and recovery) services for servers and clients.
Orchestrator | This is the automation component of the product. It is a platform to connect the different vendor products in a heterogeneous environment in order to provide task automation and business-process automation.
Operations Manager | This component provides data center and client monitoring. Monitoring and remediation are performed at the component and deep application levels.
Service Manager | This provides IT service management capabilities. The capabilities are aligned with the Information Technology Infrastructure Library (ITIL) and the Microsoft Operations Framework (MOF).
Virtual Machine Manager | This is the component to manage virtualization. The capabilities span the management of private, public, and hybrid clouds.

This recipe discusses the reporting capabilities of each of these components.

Getting ready

You must have a fully deployed configuration of one or more of the System Center product components. Your deployment must include the reporting option provided for the specific component.

How to do it...

The reporting capability of all the System Center components is rooted in their use of Microsoft SQL databases. The reporting database for each of the components is listed in the following table:

System Center component | Default installation reporting database | Additional information
Configuration Manager | CM_<Site Code> | There is one database for each Configuration Manager site.
Data Protection Manager | DPMDB_<DPM Server Name> | This is the default database for the DPM server. Additional information is written to the Operations Manager database if this optional integration is configured.
Orchestrator | Orchestrator | This is the default name when you install Orchestrator.
Operations Manager | OperationsManagerDW | You must install the reporting components to create and populate this database.
Service Manager | DWDataMart | This is the default reporting database. You have the option to configure two additional databases, known as OMDataMart and CMDataMart. Additionally, SQL Analysis Services creates a database called DWASDataBase that uses DWDataMart as a source.
Virtual Machine Manager | VirtualManagerDB | This is the default database for the VMM server. Additional information is written to the Operations Manager database if this optional integration is configured.

Use the steps in the following sections to view the schema of the reporting database of each of the System Center components.

Configuration Manager

Use the following steps:

1. Identify the database server and instance of the Configuration Manager site.
2. Use Microsoft SQL Server Management Studio (MSSMS) to connect to the database server. You must connect with a user account with the appropriate permission to view the Configuration Manager database.
3. Navigate to Databases | CM_<site code> | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Configuration Manager component. Note that not all the views are listed in the screenshot.

Data Protection Manager

Use the following steps:

1. Identify the database server and SQL instance of the Data Protection Manager environment.
2. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Data Protection Manager database.
3. Navigate to Databases | DPMDB_<Server Name> | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Data Protection Manager component. Note that not all the views are shown in the screenshot.

Orchestrator

Use the following steps:

1. Identify the database server and instance of the Orchestrator server.
2. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Orchestrator database.
3. Navigate to Databases | Orchestrator | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Orchestrator component.

Operations Manager

Use the following steps:

1. Identify the database server and instance of the Operations Manager management group.
2. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Operations Manager data warehouse reporting database.
3. Navigate to Databases | OperationsManagerDW | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Operations Manager component. Note that not all the views are listed in the screenshot.

Service Manager

Use the following steps:

1. Identify the database server and instance of the Service Manager data warehouse management group.
2. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Service Manager data warehouse database.
3. Navigate to Databases | DWDataMart | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Service Manager component. Note that not all the views are listed in the screenshot.

Virtual Machine Manager

Perform the following steps:

1. Identify the database server and instance of the Virtual Machine Manager server.
2. Use MSSMS to connect to the database server. You must connect with a user account with the appropriate permission to view the Virtual Machine Manager database.
3. Go to Databases | VirtualManagerDB | Views, as shown in the following screenshot:

The views listed form the reporting schema for the System Center Virtual Machine Manager component. Note that not all the views are listed in the screenshot.
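If you prefer a query to browsing the object tree, the same list of views can be pulled with standard T-SQL from any of the reporting databases above. This is a minimal sketch; run it in a query window connected to the database you are exploring (the database name shown is just an example):

USE OperationsManagerDW;  -- replace with the reporting database you are exploring
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.VIEWS
ORDER BY TABLE_SCHEMA, TABLE_NAME;

The INFORMATION_SCHEMA.VIEWS catalog view is part of standard SQL Server, so this works unchanged against each of the databases listed in the preceding table.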
How it works...

The procedure provided is a simplified approach to gaining a baseline of what may seem rather complicated if you are new to, or have limited experience with, SQL databases. The views for each respective component are a consistent representation of the data that you can retrieve by writing reports. Each view is created from one or more tables in the database. The recommended practice is to target report construction at the views, as Microsoft ensures that these views remain consistent even when the underlying tables change.

An analogy helps in understanding the schema. Imagine the task of preparing a meal for dinner. The meal requires ingredients and a process to prepare it; then, you need to present the output on a plate. The following table compares this scenario to the respective schema concepts:

Attributes of the meal | Attributes of the schema
Raw ingredients | Database tables
Packaged single or combined ingredients available from a supermarket shelf | SQL Server views that retrieve data from one table or a combination of tables
Preparing the meal | Writing SQL queries using one view or a combination (join) of views
Presenting the meal on a plate | The report(s) in various formats

In addition to using MSSMS, as described earlier, Microsoft supplies schema information for the components in the online documentation. This information is specific to each product and varies in the depth of its content. The See also section of this recipe provides useful links to the available information published for the schemas.

There's more...

It is important to understand the schema of the System Center components, but equally important are the processes that populate the databases. The data population process differs by component, but the result is the same: data is automatically inserted into the respective reporting database. The schema is a map to find the data, but the information itself is provided by the agents and processes that transfer it into the databases.

Components with multiple databases

System Center Service Manager and Operations Manager have a similar architecture. Data is initially written to the operational database and then transferred to the data warehouse. The operational database information is typically what is available to view in the console. The operational information is, however, not the best candidate for reporting, as it is constantly changing. Additionally, performing queries against the operational database can result in performance issues. You may view the schema of these databases using a process similar to the one described earlier, but using them for reporting purposes is not recommended.

See also

The official online documentation for the schema is updated when Microsoft makes changes to the product, and it should be a point of reference, at http://technet.microsoft.com/en-US/systemcenter/bb980621.

Configuring Microsoft Excel for System Center data analysis

This recipe is focused on preparing your environment for self-service Business Intelligence (BI).

Getting ready

Self-service BI in Microsoft Excel is made available by enabling or installing add-ins. You must download the add-ins from their respective official sites:

Power Query: Download Microsoft Power Query for Excel from http://www.microsoft.com/en-gb/download/details.aspx?id=39379.
PowerPivot: PowerPivot is available in the Office Professional Plus and Office 365 Professional Plus editions, and in the standalone edition of Excel 2013.
Power View: Power View is also available in the Office Professional Plus and Office 365 Professional Plus editions, and in the standalone edition of Excel 2013.
Power Map: At the time of writing this article, this add-in is available as Power Map Preview for Excel 2013, and can be downloaded from http://www.microsoft.com/en-us/download/details.aspx?id=38395.

How to do it...

The tasks discussed in this recipe are as follows:

Installing the Power Query add-in
Installing the Power Map add-in
Enabling PowerPivot and Power View in Microsoft Excel

Installing the Power Query add-in

The Power Query add-in must be installed using an MSI installer package that is available at the Microsoft Download Center. The installer deploys the bits and enables the add-in in your Excel installation. The functionality in this add-in is regularly improved by Microsoft; search for Microsoft Power Query for Excel in the Download Center for the latest version. The add-in can be downloaded for the 32-bit and 64-bit Microsoft Excel versions. Follow these steps to install the Power Query add-in:

1. Review the system requirements on the download page and update your system if required. Note that when you initiate the setup, you may be prompted to install additional components if you do not have all the requirements installed.
2. Right-click on the MSI installer and click on Install.
3. Click on Next on the Welcome page.
4. Accept the License Agreement and click on Next.
5. Accept the default destination installation folder, or click on Change to select a different one. Click on Next.
6. On the Ready to Install Microsoft Power Query for Excel page, click on Install. The installation progress is displayed.
7. Click on Finish on the Installation Completed page.

The Power Query tab is available on the Excel ribbon after this installation.

Installing the Power Map add-in

The Power Map add-in must be installed using an executable (.exe) installer package that is available at the Microsoft Download Center. The functionality in this add-in is also regularly improved by Microsoft; search for Microsoft Power Map for Excel in the Download Center for the latest version. Follow these steps to install the Power Map add-in:

1. Review the system requirements on the download page and update your system if required.
2. Double-click on the EXE installer (Microsoft Power Map Preview for Excel) and click on Yes if you get the User Account Control dialog prompt.
3. When prompted to install the Visual C++ 2013 Runtime Libraries (x86), check to agree to the terms, click on Install, and click on Close once it completes.
4. Click on Next on the Welcome page.
5. Click on the I Agree radio button on the License Agreement page, and then click on Next.
6. Accept the default folder or click on Browse to select a different destination installation folder. Make your selection on who the installation should be made available to: Everyone or Just me. Click on Next.
7. On the Confirm Installation page, click on Next. The installation progress is displayed.
8. Click on Close on the Installation Completed page.

The Power Map task will be made available in the Insert tab on the Excel ribbon after this installation.

Enabling PowerPivot and Power View in Microsoft Excel

Perform the following steps in Microsoft Excel to enable PowerPivot and Power View:

1. In the File menu, select Options.
2. In the Add-Ins tab, select COM Add-Ins from the Manage: dropdown at the bottom and click on the Go... button, as shown in this screenshot:
3. Select the Power add-ins from the list of available add-ins, as shown in the following screenshot:
4. Click on OK to complete the procedure of enabling the add-ins in Microsoft Excel.

After you've enabled the required add-ins, the different add-in tasks and tabs should be available on the Excel ribbon, as shown in this screenshot:

This procedure can be used to enable or disable all the available Excel add-ins. You are now ready to explore System Center data, create queries, and perform analysis on the data.

How it works...

The add-ins for Microsoft Excel provide additional functionality to gather and analyze System Center data. Wizards and interfaces become available to combine different sources, and a common language, Data Analysis Expressions (DAX), can be used for calculations and for performing different forms of visualization.

The steps discussed in this recipe are required for the use of the Power BI features and functionality in Microsoft Excel. You followed the steps to install Power Query and Power Map, and you enabled PowerPivot and Power View. These add-ins provide the foundation for self-service Business Intelligence using Microsoft Excel.

See also

Additional (enhanced) functionality and integrations are available when you use Microsoft SQL Server or SharePoint; these are not discussed in this article. Refer to http://office.microsoft.com for more information on them.

Summary

In this article, we covered the goals of reporting and how to plan and optimize dependent data inputs. We also discussed the planning of report outputs, the reporting schemas of the System Center components, and configuring Microsoft Excel for System Center data analysis.

Resources for Article:

Further resources on this subject:

Adding and Importing Configuration items in System Center 2012 Service Manager [article]
Mobility [article]
Upgrading from Previous Versions [article]
Puppet and OS Security Tools

Packt
27 Mar 2015
17 min read
This article by Jason Slagle, author of the book Learning Puppet Security, covers using Puppet to manage SELinux and auditd. We have learned a lot so far about using Puppet to secure your systems, as well as how to use it to make groups of systems more secure. However, in all of that, we've not yet covered some of the basic OS-level functions that are available to secure a system. In this article, we'll review several of those functions. (For more resources related to this topic, see here.)

SELinux is a powerful tool in the security arsenal. Most administrators' experience with it is along the lines of "how can I turn that off?" This is born out of frustration with the poor documentation for the tool, as well as the tedious nature of its configuration. While Puppet cannot help you with the documentation (which is getting better all the time), it can help you with some of the other challenges that SELinux can bring; that is, ensuring that the proper contexts and policies are in place on the systems being managed.

In this article, we'll cover the following topics related to OS-level security tools:

A brief introduction to SELinux and auditd
The built-in Puppet support for SELinux
Community modules for SELinux
Community modules for auditd

At the end of this article, you should have enough skills so that you no longer need to disable SELinux. However, if you still need to do so, it is certainly possible via the modules presented here.

Introducing SELinux and auditd

During the course of this article, we'll explore the SELinux framework for Linux and see how to automate it using Puppet. As part of the process, we'll also review auditd, the logging and auditing framework for Linux. Using Puppet, we can automate the configuration of these often-neglected security tools, and even move the configuration of these tools for various services into the modules that configure those services.

The SELinux framework

SELinux is a security system for Linux originally developed by the United States National Security Agency (NSA). It is an in-kernel protection mechanism designed to provide Mandatory Access Controls (MACs) to the Linux kernel.

SELinux isn't the only MAC framework for Linux. AppArmor is an alternative MAC framework that has been included in the Linux kernel since version 2.6.36. We chose SELinux since it is the default framework used under Red Hat Linux, which we're using for our examples. More information on AppArmor can be found at http://wiki.apparmor.net/index.php/Main_Page.

These access controls work by confining processes to the minimal set of files and network access that they require to run. By doing this, the controls limit the amount of collateral damage that can be done by a process that becomes compromised.

SELinux was first merged into the Linux mainline kernel for the 2.6.0 release. It was introduced into Red Hat Enterprise Linux with version 4, and into Ubuntu in version 8.04. With each successive release of these operating systems, support for SELinux grows and it becomes easier to use.

SELinux has a couple of core concepts that we need to understand to configure it properly. The first is the concept of types and contexts. A type in SELinux is a grouping of similar things. Files used by Apache may be of the httpd_sys_content_t type, for instance, which is a type that all content served by HTTP would have. The httpd process itself is of type httpd_t.
These types are applied to objects, which represent discrete things such as files and ports, and the type becomes part of the context of that object. The context of an object represents the object's user, role, type, and, optionally, data on multilevel security. For this discussion, the type is the most important component of the context.

Using a policy, we grant access from the subject, which represents a running process, to various objects that represent files, network ports, memory, and so on. We do that by creating a policy that allows a subject to access the types it requires to function.

SELinux has three modes that it can operate in. The first of these modes is disabled. As the name implies, the disabled mode runs without any SELinux enforcement. The second mode is called permissive. In permissive mode, SELinux will log any access violations, but will not act on them. This is a good way to get an idea of where you need to modify your policy, or tune Booleans, to get proper system operation. The final mode, enforcing, will deny actions that do not have a policy in place. Under Red Hat Linux variants, this is the default SELinux mode. By default, Red Hat 6 runs SELinux with a targeted policy in enforcing mode. This means that, for the targeted daemons, SELinux will enforce its policy by default.

An example is in order here to explain this well. So far, we've been operating with SELinux disabled on our hosts. The first step in experimenting with SELinux is to turn it on. We'll set it to permissive mode at first, while we gather some information. To do this, after starting our master VM, we'll need to modify the SELinux configuration and reboot. While it's possible to change from enforcing mode to either permissive or disabled mode without a reboot, going back requires us to reboot.

Let's edit the /etc/sysconfig/selinux file and set the SELINUX variable to permissive on our Puppet master. Remember to start the Vagrant machine and SSH in first, as necessary. Once this is done, the file should look as follows:

Once this is complete, we need to reboot. To do so, run the following command:

sudo shutdown -r now

Wait for the system to come back online. Once the machine is back up and you SSH back into it, run the getenforce command. It should return Permissive, which means SELinux is running but not enforcing.

Now, we can make sure our master is running and take a look at its context. If it's not running, you can start the service with the sudo service puppetmaster start command. We'll use the -Z flag on the ps command to examine the SELinux data. Many commands, such as ps and ls, use the -Z flag to view the SELinux data. Go ahead and run the following command to view the SELinux data for the running puppetmaster:

ps -efZ|grep puppet

When you do this, you'll see an output such as the following:

unconfined_u:system_r:initrc_t:s0 puppet 1463 1 1 11:41 ? 00:00:29 /usr/bin/ruby /usr/bin/puppet master

If you take a look at the first part of the output line, you'll see that Puppet is running in the unconfined_u:system_r:initrc_t context. This is actually somewhat of a bug, and a result of the Puppet policy on CentOS 6 being out of date. We should actually be running under the system_u:system_r:puppetmaster_t:s0 context, but the policy is for a much older version of Puppet, so it runs unconfined.

Let's take a look at the sshd process to see what it looks like as well.
To do so, we'll just grep for sshd instead:

ps -efZ|grep sshd

The output is as follows:

system_u:system_r:sshd_t:s0-s0:c0.c1023 root 1206 1 0 11:40 ? 00:00:00 /usr/sbin/sshd

This is the more traditional output one would expect. The sshd process is running under the system_u:system_r:sshd_t context, which corresponds to the system user, the system role, and the sshd type. The user and role are SELinux constructs that allow role-based access controls. SELinux users do not map to system users, but allow us to set a policy based on the SELinux user object, which enables role-based access control based on the SELinux user. The unconfined user seen previously is a user on which policy is not enforced.

Now, we can take a look at some objects. Running ls -lZ /etc/ssh results in the following:

As you can see, each of the files belongs to a context that includes the system user, as well as the object role. They are split between the etc type for configuration files and the sshd_key type for keys. The SSH policy allows the sshd process to read both of these file types. Other policies, say for NTP, would potentially allow the ntpd process to read the etc types, but it would not be able to read the sshd_key files.

This very fine-grained control is the power of SELinux. However, with great power comes very complex configuration, and configuration that doesn't happen correctly can be confusing to untangle. For instance, a file given the wrong type can potentially impact the system if not dealt with, and Puppet can help us manage this. Fortunately, in permissive mode, violations are logged as data that we can use to assist us with this. This leads us to the second half of the system that we wish to discuss, which is auditd.

In the meantime, there is a bunch of information on SELinux available on its website at http://selinuxproject.org/page/Main_Page. There's also a very funny, but informative, resource describing SELinux available at https://people.redhat.com/duffy/selinux/selinux-coloring-book_A4-Stapled.pdf.

The auditd framework for audit logging

SELinux does a great job of limiting access to system components; however, reporting what enforcement took place was not one of its objectives. Enter auditd. The auditd daemon is an auditing framework developed by Red Hat. It is a complete auditing system that uses rules to indicate what to audit. This can be used to log SELinux events, as well as much more.

Under the hood, auditd has hooks in the kernel to watch system calls and other processes. Using the rules, you can configure logging for any of these events. For instance, you can create a rule that monitors writes to the /etc/passwd file. This would allow you to see if any users were added to the system. We can also add monitoring of files such as lastlog and wtmp to monitor login activity. We'll explore this example later when we configure auditd.

To quickly see how a rule works, we'll manually configure a quick rule that will log when the wtmp file is edited. This will add some system logging around users logging in. To do this, let's edit the /etc/audit/audit.rules file to add a rule to monitor this. Edit the file and add the following lines:

-w /var/log/wtmp -p wa -k logins
-w /etc/passwd -p wa -k password

We'll take a look at what the preceding lines do. Both lines start with a -w clause; this indicates the file that we are monitoring. Second, we have the -p clause; this sets which file operations we monitor, in this case, write and append operations.
Finally, with the -k entries, we're setting a keyword that is logged and can be filtered on. These rules should go at the end of the file. Once this is done, reload auditd with the following command:

sudo service auditd restart

Once this is complete, go ahead and log in with another SSH session, then simply log back out. Once this is done, take a look at the /var/log/audit/audit.log file. You should see content like the following:

type=SYSCALL msg=audit(1416795396.816:482): arch=c000003e syscall=2 success=yes exit=8 a0=7fa983c446aa a1=1 a2=2 a3=7fff3f7a6590 items=1 ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"
type=SYSCALL msg=audit(1416795420.057:485): arch=c000003e syscall=2 success=yes exit=7 a0=7fa983c446aa a1=1 a2=2 a3=8 items=1 ppid=1206 pid=2202 auid=500 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=51 comm="sshd" exe="/usr/sbin/sshd" subj=system_u:system_r:sshd_t:s0-s0:c0.c1023 key="logins"

There are tons of fields in this output, including the SELinux context, the user ID, and so on. Of particular interest is the auid field, which is the audit user ID. For commands run via the sudo command, this will still contain the user ID of the user who called sudo, which makes it a great way to log commands performed via sudo.

Auditd also logs SELinux failures. They get logged under the type AVC. These access vector cache logs will be placed in the auditd log file when a SELinux violation occurs.

Much like SELinux, auditd is somewhat complicated, and its intricacies are beyond the scope of this book. You can get more information at http://people.redhat.com/sgrubb/audit/.
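Once events are being recorded, you don't have to grep the raw log. The audit package ships with the ausearch and aureport utilities, which can filter on the keyword we set with -k. A couple of quick examples, using the logins key from the rule we added above:

# Show all events tagged with the 'logins' key from our wtmp rule
sudo ausearch -k logins

# Summarize login-related events in report form
sudo aureport -l

These tools parse the same /var/log/audit/audit.log file shown previously, so they are a convenient way to answer "who logged in and when" without reading the raw records.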
SELinux and Puppet

Puppet has direct support for several features of SELinux. There are two native Puppet types for SELinux: selboolean and selmodule. These types support setting SELinux Booleans and installing SELinux policy modules, respectively.

SELinux Booleans are variables that impact how SELinux behaves. They are set to allow various functions to be permitted. For instance, you set a SELinux Boolean to true to allow the httpd process to access network ports. SELinux modules are groupings of policies. They allow policies to be loaded in a more granular way. The Puppet selmodule type allows Puppet to load these modules.

The selboolean type

The targeted SELinux policy that most distributions use is based on the SELinux reference policy. One of the features of this policy is the use of Boolean variables that control actions of the policy. There are over 200 of these Booleans on a Red Hat 6-based machine. We can investigate them by installing the policycoreutils-python package on the operating system. You can do this by executing the following command:

sudo yum install policycoreutils-python

Once installed, we can run the semanage boolean -l command to get a list of the Boolean values, along with their descriptions. The output of this will look as follows:

As you can see, there is a very large number of settings that can be reconfigured simply by setting the appropriate Boolean value. The selboolean Puppet type supports managing these Boolean values. The type is fairly simple, accepting the following parameters:

Parameter | Description
name | This contains the name of the Boolean to be set. It defaults to the title.
persistent | This checks whether to write the value to disk for the next boot.
provider | This is the provider for the type. Usually, the default getsetsebool provider is accepted.
value | This contains the value of the Boolean, on or off.

Usage of this type is rather simple. We'll show an example that sets the puppetmaster_use_db parameter to the on value. If we were using the SELinux Puppet policy, this would allow the master to talk to a database. For our use, it's a simple unused variable that we can use for demonstration purposes. As a reminder, the SELinux policy for Puppet on CentOS 6 is outdated, so setting the Boolean does not impact the version of Puppet we're running. It does, however, serve to show how a Boolean is set.

To do this, we'll create a sample role and profile for our Puppet master. This is something that would likely exist in a production environment to manage the configuration of the master. In this example, we'll simply build a small profile and role for the master. Let's start with the profile. Copy over the profiles module we've slowly been building up, and let's add a puppetmaster.pp profile. To do so, edit the profiles/manifests/puppetmaster.pp file and make it look as follows:

class profiles::puppetmaster {
  selboolean { 'puppetmaster_use_db':
    value      => on,
    persistent => true,
  }
}

Then, we'll move on to the role. Copy the roles, and edit the roles/manifests/puppetmaster.pp file there and make it look as follows:

class roles::puppetmaster {
  include profiles::puppetmaster
}

Once this is done, we can apply it to our host. Edit the /etc/puppet/manifests/site.pp file. We'll apply the puppetmaster role to the puppetmaster machine, as follows:

node 'puppet.book.local' {
  include roles::puppetmaster
}

Now, we'll run Puppet and get the following output:

As you can see, it set the value to on when run. Using this method, we can set any of the SELinux Boolean values we need for our system to operate properly. More information on SELinux Booleans, including how to obtain a list of them, can be found at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security-Enhanced_Linux/sect-Security-Enhanced_Linux-Working_with_SELinux-Booleans.html.

The selmodule type

The other native type inside Puppet is a type to manage SELinux modules. Modules are compiled collections of SELinux policy. They're loaded into the kernel using the selmodule command. This Puppet type provides support for this mechanism. The available parameters are as follows:

Parameter | Description
name | This contains the name of the module; it defaults to the title
ensure | This is the desired state: present or absent
provider | This specifies the provider for the type; it should be selmodule
selmoduledir | This is the directory that contains the module to be installed
selmodulepath | This provides the complete path to the module to be installed if it is not present in selmoduledir
syncversion | This checks whether to resync the module if a new version is found, much like ensure => latest

Using this type, we can take our compiled module, serve it onto the system with Puppet, and then ensure that it gets installed on the system. This lets us centrally manage the module with Puppet; a minimal sketch of the resource follows.
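The following is an illustrative sketch only; the module name and path are hypothetical, and the full compile-and-install example comes later:

# Assumes mymodule.pp (a compiled SELinux policy package, not a Puppet
# manifest) has already been placed in the directory below, for example
# by a file resource
selmodule { 'mymodule':
  ensure       => present,
  selmoduledir => '/usr/share/selinux/targeted',
  syncversion  => true,
}

With syncversion set, Puppet reloads the module whenever a newer compiled version appears in the directory, which pairs naturally with a file resource that delivers the policy package.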
We'll see a fuller example, in which a policy is compiled and then installed, later on, so we won't walk through a complete example here. Instead, we'll move on to talk about the last SELinux-related component in Puppet.

File parameters for SELinux

The final internal support for SELinux comes in the form of parameters on the file type. The file type parameters are as follows:

Parameter | Description
selinux_ignore_defaults | By default, Puppet will use the matchpathcon function to set the context of a file. This overrides that behavior if set to true.
selrange | This sets the SELinux range component. We've not really covered this; it's not used in most mainstream distributions at the time this book was written.
selrole | This sets the SELinux role on the file.
seltype | This sets the SELinux type on the file.
seluser | This sets the SELinux user on the file.

Usually, if you place files in the correct location (the expected location for a service) on the filesystem, Puppet will get the SELinux properties correct via its use of the matchpathcon function. This function (which also has a matching utility) applies a default context based on the policy settings. Setting the context manually is used in cases where you're storing data outside the normal location; for instance, you might be storing web data under /opt.

The preceding types and providers provide the basics that allow you to manage SELinux on a system. We'll now take a look at a couple of community modules that build on these types and create a more in-depth solution.

Summary

This article looked at what SELinux and auditd are, and gave brief examples of how they can be used. We looked at what they can do, and how they can be used to secure your systems. After this, we looked at the specific support for SELinux in Puppet: the two built-in types that support it, as well as the SELinux parameters on the file type. Then, we took a look at one of the several community modules for managing SELinux. Using such a module, we can store the policies as text instead of compiled blobs.

Resources for Article:

Further resources on this subject:

The anatomy of a report processor [Article]
Module, Facts, Types and Reporting tools in Puppet [Article]
Designing Puppet Architectures [Article]
Cassandra Architecture

Packt
26 Mar 2015
35 min read
This article by Nishant Neeraj, the author of the book Mastering Apache Cassandra - Second Edition, aims to give you a perspective on the evolution of the NoSQL paradigm. It starts with a discussion of the common problems that an average developer faces when an application starts to scale up and the software components cannot keep up with it. Then, we'll see what can be taken as a rule of thumb in the NoSQL world: the CAP theorem, which says to choose any two out of consistency, availability, and partition-tolerance. As we discuss this further, we will realize how much more important it is to serve the customers (availability) than to be correct (consistency) all the time. However, we cannot afford to be wrong (inconsistent) for a long time: customers wouldn't like to see that items are in stock but the checkout is failing. Cassandra comes into the picture with its tunable consistency. (For more resources related to this topic, see here.)

Problems in the RDBMS world

RDBMS is a great approach. It keeps data consistent, it's good for OLTP (http://en.wikipedia.org/wiki/Online_transaction_processing), it provides a good query grammar, and it manipulates data in ways supported by all the popular programming languages. It has been tremendously successful for the last 40 years (the relational data model was proposed in its first incarnation by Codd, E.F. (1970) in his research paper A Relational Model of Data for Large Shared Data Banks). However, in the early 2000s, big companies such as Google and Amazon, which have a gigantic load on their databases to serve, started to feel bottlenecked by RDBMS, even with helper services such as Memcache on top. As a response to this, Google came up with BigTable (http://research.google.com/archive/bigtable.html), and Amazon with Dynamo (http://www.cs.ucsb.edu/~agrawal/fall2009/dynamo.pdf).

If you have ever used an RDBMS for a complicated web application, you must have faced problems such as slow queries due to complex joins, expensive vertical scaling, problems in horizontal scaling, and indexing that takes a long time. At some point, you may have chosen to replicate the database, but there was still some locking, and this hurts the availability of the system. This means that, under a heavy load, locking will cause the user's experience to deteriorate. Although replication gives some relief, a busy slave may not catch up with the master (or there may be a connectivity glitch between the master and the slave). Consistency of such systems cannot be guaranteed.

Consistency, the property of a database to remain in a consistent state before and after a transaction, is one of the promises made by relational databases. It seems that one may need to compromise on consistency in a relational database for the sake of scalability.

With the growth of the application, the demand to scale the backend becomes more pressing, and the developer teams may decide to add a caching layer (such as Memcached) on top of the database. This will take some load off the database, but now the developers will need to maintain the object states in two places: the database and the caching layer. Although some Object Relational Mappers (ORMs) provide a built-in caching mechanism, they have their own issues, such as a larger memory requirement, and the mapping code often pollutes the application code.
In order to achieve more from an RDBMS, we need to start denormalizing the database to avoid joins, and keep aggregates in the columns to avoid statistical queries. Sharding, or horizontal scaling, is another way to distribute the load. Sharding in itself is a good idea, but it adds too much manual work, and the knowledge of sharding creeps into the application code. Sharded databases also make operational tasks (backup, schema alteration, and adding indexes) difficult. To find out more about the hardships of sharding, visit http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/.

There are ways to loosen up consistency by providing various isolation levels, but concurrency is just one part of the problem. Maintaining relational integrity, difficulties in managing data that cannot be accommodated on one machine, and difficult recovery were all making traditional database systems hard to accept in the rapidly growing big data world. Companies needed a tool that could reliably support hundreds of terabytes of data on ever-failing commodity hardware. This led to the advent of modern databases such as Cassandra, Redis, MongoDB, Riak, HBase, and many more. These modern databases promised to support very large datasets that were hard to maintain in SQL databases, with relaxed constraints on consistency and relational integrity.

Enter NoSQL

NoSQL is a blanket term for the databases that solve the scalability issues common among relational databases. This term, in its modern meaning, was first coined by Eric Evans. It should not be confused with the database named NoSQL (http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page). NoSQL solutions provide scalability and high availability, but may not guarantee ACID (atomicity, consistency, isolation, and durability) in transactions. Many of the NoSQL solutions, including Cassandra, sit on the other extreme from ACID, named BASE, which stands for basically available, soft-state, eventual consistency.

Wondering where the name NoSQL came from? Read Eric Evans' blog at http://blog.sym-link.com/2009/10/30/nosql_whats_in_a_name.html.

The CAP theorem

In 2000, Eric Brewer (http://en.wikipedia.org/wiki/Eric_Brewer_%28scientist%29), in his keynote speech at the ACM Symposium, said, "A distributed system requiring always-on, highly-available operations cannot guarantee the illusion of coherent, consistent single-system operation in the presence of network partitions, which cut communication between active servers."

This was his conjecture, based on his experience with distributed systems. The conjecture was later formally proved by Nancy Lynch and Seth Gilbert in 2002 (Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, published in ACM SIGACT News, Volume 33, Issue 2 (2002), pages 51 to 59, available at http://webpages.cs.luc.edu/~pld/353/gilbert_lynch_brewer_proof.pdf).

Let's try to understand this. Say we have a distributed system in which data is replicated at two distinct locations, and two conflicting requests arrive, one at each location, at the time of a communication link failure between the two servers. If the system (the cluster) is obliged to be highly available (a mandatory response, even when some components of the system are failing), one of the two responses will be inconsistent with what a system with no replication (no partitioning, a single copy) would have returned. To understand this better, let's take an example to learn the terminology.
Let's say you are planning to read George Orwell's book Nineteen Eighty-Four over the Christmas vacation. A day before the holidays start, you logged into your favorite online bookstore to find out there is only one copy left. You add it to your cart, but then you realize that you need to buy something else to be eligible for free shipping. You start to browse the website for any other item that you might buy. To make the situation interesting, let's say there is another customer who is trying to buy Nineteen Eighty-Four at the same time. Consistency A consistent system is defined as one that responds with the same output for the same request at the same time, across all the replicas. Loosely, one can say a consistent system is one where each server returns the right response to each request. In our example, we have only one copy of Nineteen Eighty-Four. So only one of the two customers is going to get the book delivered from this store. In a consistent system, only one can check out the book from the payment page. As soon as one customer makes the payment, the number of Nineteen Eighty-Four books in stock will get decremented by one, and one quantity of Nineteen Eighty-Four will be added to the order of that customer. When the second customer tries to check out, the system says that the book is not available any more. Relational databases are good for this task because they comply with the ACID properties. If both the customers make requests at the same time, one customer will have to wait till the other customer is done with the processing, and the database is made consistent. This may add a few milliseconds of wait to the customer who came later. An eventual consistent database system (where consistency of data across the distributed servers may not be guaranteed immediately) may have shown availability of the book at the time of check out to both the customers. This will lead to a back order, and one of the customers will be paid back. This may or may not be a good policy. A large number of back orders may affect the shop's reputation and there may also be financial repercussions. Availability Availability, in simple terms, is responsiveness; a system that's always available to serve. The funny thing about availability is that sometimes a system becomes unavailable exactly when you need it the most. In our example, one day before Christmas, everyone is buying gifts. Millions of people are searching, adding items to their carts, buying, and applying for discount coupons. If one server goes down due to overload, the rest of the servers will get even more loaded now, because the request from the dead server will be redirected to the rest of the machines, possibly killing the service due to overload. As the dominoes start to fall, eventually the site will go down. The peril does not end here. When the website comes online again, it will face a storm of requests from all the people who are worried that the offer end time is even closer, or those who will act quickly before the site goes down again. Availability is the key component for extremely loaded services. Bad availability leads to bad user experience, dissatisfied customers, and financial losses. Partition-tolerance Network partitioning is defined as the inability to communicate between two or more subsystems in a distributed system. 
This can be due to someone walking carelessly in a data center and snapping the cable that connects a machine to the cluster, or it may be a network outage between two data centers, dropped packets, or a wrong configuration. Partition-tolerance is the property of a system that can keep operating during a network partition. In a distributed system, a network partition is a phenomenon where, due to a network failure or any other reason, one part of the system cannot communicate with the other part(s) of the system. An example of a network partition is a system that has some nodes in subnet A and some in subnet B, and, due to a faulty switch between these two subnets, the machines in subnet A are not able to send messages to or receive messages from the machines in subnet B. The network is allowed to lose arbitrarily many messages sent from one node to another. This means that even if the cable between two nodes is chopped, the system will still respond to the requests. The following figure shows the database classification based on the CAP theorem:

An example of a partition-tolerant system is a system with real-time data replication and no centralized master(s). So, for example, in a system where data is replicated across two data centers, the availability will not be affected even if a data center goes down.

The significance of the CAP theorem

Once you decide to scale up, the first thing that comes to mind is vertical scaling, which means using beefier servers with more RAM, more powerful processor(s), and bigger disks. For further scaling, you need to go horizontal. This means adding more servers. Once your system becomes distributed, the CAP theorem comes into play, which means that, in a distributed system, you can choose only two out of consistency, availability, and partition-tolerance. So, let's see how choosing two out of the three options affects the system behavior, as follows:

CA system: In this system, you drop partition-tolerance for consistency and availability. This happens when you put everything related to a transaction on one machine or on a system that fails like an atomic unit, such as a rack. This system will have serious problems in scaling.

CP system: The opposite of a CA system is a CP system. In a CP system, availability is sacrificed for consistency and partition-tolerance. What does this mean? If the system is available to serve the requests, the data will be consistent. In the event of a node failure, some data will not be available. A sharded database is an example of such a system.

AP system: An available and partition-tolerant system is like an always-on system that is at risk of producing conflicting results in the event of a network partition. This is good for user experience; your application stays available, and inconsistency in rare events may be alright for some use cases. In our example, it may not be such a bad idea to back order a few unfortunate customers due to inconsistency of the system, compared to having a lot of users leave without making any purchases because of the system's poor availability.

Eventually consistent system (also known as a BASE system): The AP system makes more sense when viewed from an uptime perspective: it's simple and provides a good user experience. But an inconsistent system is not good for anything, and certainly not good for business. It may be acceptable that one customer for the book Nineteen Eighty-Four gets a back order. But if it happens more often, the users would be reluctant to use the service.
It would be great if the system could fix itself (read: repair) as soon as the first inconsistency is observed; or, maybe there are processes dedicated to fixing the inconsistency of a system when a partition failure is fixed or a dead node comes back to life. Such systems are called eventually consistent systems. The following figure shows the life of an eventually consistent system:

Quoting Wikipedia, "[In a distributed system] given a sufficiently long time over which no changes [in system state] are sent, all updates can be expected to propagate eventually through the system and the replicas will be consistent". (The page on eventual consistency is available at http://en.wikipedia.org/wiki/Eventual_consistency.) Eventually consistent systems are also called BASE systems, a made-up term to represent that these systems sit on one end of a spectrum that has traditional databases with ACID properties on the opposite end. Cassandra is one such system that provides high availability and partition-tolerance at the cost of consistency, which is tunable. The preceding figure shows a partition-tolerant, eventually consistent system.

Cassandra

Cassandra is a distributed, decentralized, fault-tolerant, eventually consistent, linearly scalable, and column-oriented data store. This means that Cassandra is made to be easily deployed over a cluster of machines located in geographically different places. There is no central master server, so there is no single point of failure and no bottleneck; data is replicated, and a faulty node can be replaced without any downtime. It's eventually consistent. It is linearly scalable, which means that with more nodes, the requests served per second per node will not go down. Also, the total throughput of the system will increase with each node being added. And finally, it's column-oriented, much like a map (or better, a map of sorted maps) or a table with flexible columns where each column is essentially a key-value pair. So, you can add columns as you go, and each row can have a different set of columns (key-value pairs). It does not provide any relational integrity; it is up to the application developer to perform relation management.

So, if Cassandra is so good at everything, why doesn't everyone drop whatever database they are using and jump-start with Cassandra? This is a natural question. Some applications require strong ACID compliance, such as a booking system. If you are a person who goes by statistics, you'd ask how Cassandra fares against other existing data stores. Tilmann Rabl and others, in their paper Solving Big Data Challenges for Enterprise Application Performance Management (http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf), said that, "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from one to 12 nodes. This comes at the price of a high write and read latency. Cassandra's performance is best for high insertion rates". If you go through the paper, Cassandra wins in almost all the criteria. Equipped with proven concepts of distributed computing, made to serve reliably from commodity servers, and simple and easy to maintain, Cassandra is one of the most scalable, fastest, and most robust NoSQL databases. So, the next natural question is, "What makes Cassandra so blazing fast?". Let's dive deeper into the Cassandra architecture, but first, the short sketch that follows shows what this flexible, column-oriented model looks like in CQL.
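The following is a minimal CQL sketch of the model described above; the keyspace, table, and column names are illustrative assumptions, not taken from any project in this article:

-- A keyspace is the container for tables; replication is set per keyspace.
CREATE KEYSPACE store
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- A table (column family) with a composite primary key:
-- 'title' is the partition key and 'format' is a clustering column,
-- so all rows for one title live in the same partition, sorted by format.
CREATE TABLE store.books_in_stock (
    title  text,
    format text,
    price  decimal,
    stock  int,
    PRIMARY KEY ((title), format)
);

INSERT INTO store.books_in_stock (title, format, price, stock)
VALUES ('Nineteen Eighty-Four', 'paperback', 9.99, 1);

-- In cqlsh, the consistency level of subsequent requests is tunable.
CONSISTENCY QUORUM;
SELECT stock FROM store.books_in_stock WHERE title = 'Nineteen Eighty-Four';

Because the first column of the primary key acts as the partition key, each partition behaves like the sorted map of key-value pairs described earlier, and the per-request consistency level is what makes Cassandra's consistency tunable.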
Understanding the architecture of Cassandra Cassandra is a relative latecomer in the distributed data-store war. It takes advantage of two proven and closely similar data-store mechanisms, namely Bigtable: A Distributed Storage System for Structured Data, 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) and Amazon Dynamo: Amazon's Highly Available Key-value Store, 2007 (http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf). The following diagram displays the read throughputs that show linear scaling of Cassandra: Like BigTable, it has a tabular data presentation. It is not tabular in the strictest sense. It is rather a dictionary-like structure where each entry holds another sorted dictionary/map. This model is more powerful than the usual key-value store and it is named a table, formerly known as a column family. The properties such as eventual consistency and decentralization are taken from Dynamo. Now, assume a column family is a giant spreadsheet, such as MS Excel. But unlike spreadsheets, each row is identified by a row key with a number (token), and unlike spreadsheets, each cell may have its own unique name within the row. In Cassandra, the columns in the rows are sorted by this unique column name. Also, since the number of partitions is allowed to be very large (1.7*1038), it distributes the rows almost uniformly across all the available machines by dividing the rows in equal token groups. Tables or column families are contained within a logical container or name space called keyspace. A keyspace can be assumed to be more or less similar to database in RDBMS. A word on max number of cells, rows, and partitions A cell in a partition can be assumed as a key-value pair. The maximum number of cells per partition is limited by the Java integer's max value, which is about 2 billion. So, one partition can hold a maximum of 2 billion cells. A row, in CQL terms, is a bunch of cells with predefined names. When you define a table with a primary key that has just one column, the primary key also serves as the partition key. But when you define a composite primary key, the first column in the definition of the primary key works as the partition key. So, all the rows (bunch of cells) that belong to one partition key go into one partition. This means that every partition can have a maximum of X rows, where X = (2*10­9/number_of_columns_in_a_row). Essentially, rows * columns cannot exceed 2 billion per partition. Finally, how many partitions can Cassandra hold for each table or column family? As we know, column families are essentially distributed hashmaps. The keys or row keys or partition keys are generated by taking a consistent hash of the string that you pass. So, the number of partitioned keys is bounded by the number of hashes these functions generate. This means that if you are using the default Murmur3 partitioner (range -263 to +263), the maximum number of partitions that you can have is 1.85*1019. If you use the Random partitioner, the number of partitions that you can have is 1.7*1038. Ring representation A Cassandra cluster is called a ring. The terminology is taken from Amazon Dynamo. Cassandra 1.1 and earlier versions used to have a token assigned to each node. Let's call this value the initial token. 
Each node is responsible for storing all the rows with token values (a token is basically a hash value of a row key) ranging from the previous node's initial token (exclusive) to the node's initial token (inclusive). This way, the first node, the one with the smallest initial token, will have a range from the token value of the last node (the node with the largest initial token) to the first token value. So, if you jump from node to node, you will make a circle, and this is why a Cassandra cluster is called a ring.

Let's take an example. Assume that there is a hashing algorithm (partitioner) that generates tokens from 0 to 127 and you have four Cassandra machines to create a cluster. To allocate an equal load, we need to assign each of the four nodes an equal number of tokens. So, the first machine will be responsible for tokens 1 to 32, the second will hold 33 to 64, the third 65 to 96, and the fourth 97 to 127 and 0. If you mark each node with the maximum token number that it can hold, the cluster looks like a ring, as shown in the following figure:

Token ownership and distribution in a balanced Cassandra ring

Virtual nodes

In Cassandra 1.1 and previous versions, when you created a cluster or added a node, you had to manually assign its initial token. This is extra work that the database should handle internally. Apart from this, adding and removing nodes requires manually resetting token ownership for some or all nodes. This is called rebalancing. Yet another problem was replacing a node. In the event of replacing a node with a new one, the data (the rows that the to-be-replaced node owns) needs to be copied to the new machine from a replica of the old machine. For a large database, this could take a while because we are streaming from one machine. To solve all these problems, Cassandra 1.2 introduced virtual nodes (vnodes). The following figure shows 16 vnodes distributed over four servers:

Without vnodes, each node is responsible for a single continuous range. In the case of a replication factor of 2 or more, the data is also stored on machines other than the one responsible for the range. (The replication factor (RF) represents the number of copies of a table that exist in the system. So, RF=2 means there are two copies of each record for the table.) In this case, one can say one machine, one range. With vnodes, each machine can have multiple smaller ranges, and these ranges are automatically assigned by Cassandra. How does this solve those issues? Let's see. Say you have a 30-node cluster and a node with 256 vnodes has to be replaced. If the vnodes are well distributed randomly across the cluster, each of the remaining 29 physical nodes will have 8 or 9 vnodes (256/29) that are replicas of the vnodes on the dead node. In older versions, with a replication factor of 3, the data had to be streamed from three replicas (10 percent utilization). In the case of vnodes, all the nodes can participate in getting the new node up.

The other benefit of using vnodes is that you can have a heterogeneous ring where some machines are more powerful than others, and change the vnode settings such that the stronger machines will take proportionally more data than the others. This was still possible without vnodes, but it needed some tricky calculation and rebalancing. So, let's say you have a cluster of machines with similar hardware specifications and you have decided to add a new server that is twice as powerful as any machine in the cluster.
Ideally, you would want it to work twice as hard as any of the old machines. With vnodes, you can achieve this by setting num_tokens to twice the value used on the old machines in the new machine's cassandra.yaml file. Now, it will be allotted double the load when compared to the old machines.

Yet another benefit of vnodes is faster repair. Node repair requires the creation of a Merkle tree for each range of data that a node holds. The data gets compared with the data on the replica nodes, and, if needed, a data re-sync is done. Creation of a Merkle tree involves iterating through all the data in the range, followed by streaming it. For a large range, the creation of a Merkle tree can be very time-consuming, while the data transfer might be much faster. With vnodes, the ranges are smaller, which means faster data validation (by comparing with other nodes). Since the Merkle tree creation process is broken into many smaller steps (as there are many small ranges on a physical node), the data transmission does not have to wait until the whole big range finishes. Also, the validation uses all the other machines instead of just a couple of replica nodes.

As of Cassandra 2.0.9, the default setting for vnodes is on, with 256 vnodes per machine by default. If, for some reason, you do not want to use vnodes and want to disable this feature, comment out the num_tokens variable, and uncomment and set the initial_token variable in cassandra.yaml. If you are starting with a new cluster or migrating an old cluster to the latest version of Cassandra, vnodes are highly recommended. The number of vnodes that you specify on a Cassandra node represents the number of vnodes on that machine. So, the total number of vnodes in a cluster is the sum of all the vnodes across all the nodes. One can always imagine a Cassandra cluster as a ring of lots of vnodes.

How Cassandra works

Diving into the various components of Cassandra without having any context is a frustrating experience. It is hard to see why you should study SSTables, MemTables, and log-structured merge (LSM) trees without being able to see how they fit into the functionality and performance guarantees that Cassandra gives. So, first we will see Cassandra's write and read mechanisms. It is possible that some of the terms we encounter during this discussion may not be immediately understandable. A rough overview of the Cassandra components is shown in the following figure:

Main components of the Cassandra service

The main class of the storage layer is StorageProxy. It handles all the requests. The messaging layer is responsible for inter-node communication, such as gossip. Apart from this, process-level structures keep a rough idea about the actual data containers and where they live. There are four data buckets that you need to know about. MemTable is a hash table-like structure that stays in memory. It contains actual cell data. SSTable is the disk version of MemTables. When MemTables are full, they are persisted to the hard disk as SSTables. The commit log is an append-only log of all the mutations that are sent to the Cassandra cluster. Mutations can be thought of as update commands. So, insert, update, and delete operations are mutations, since they mutate the data. The commit log lives on the disk and helps to replay uncommitted changes. These three are basically core data. Then there are bloom filters and the index. The bloom filter is a probabilistic data structure that lives in memory.
They both live in memory and contain information about the location of data in the SSTable. Each SSTable has one bloom filter and one index associated with it. The bloom filter helps Cassandra to quickly detect which SSTable does not have the requested data, while the index helps to find the exact location of the data in the SSTable file. With this primer, we can start looking into how write and read works in Cassandra. We will see more explanation later. Write in action To write, clients need to connect to any of the Cassandra nodes and send a write request. This node is called the coordinator node. When a node in a Cassandra cluster receives a write request, it delegates the write request to a service called StorageProxy. This node may or may not be the right place to write the data. StorageProxy's job is to get the nodes (all the replicas) that are responsible for holding the data that is going to be written. It utilizes a replication strategy to do this. Once the replica nodes are identified, it sends the RowMutation message to them, the node waits for replies from these nodes, but it does not wait for all the replies to come. It only waits for as many responses as are enough to satisfy the client's minimum number of successful writes defined by ConsistencyLevel. ConsistencyLevel is basically a fancy way of saying how reliable a read or write you want to be. Cassandra has tunable consistency, which means you can define how much reliability is wanted. Obviously, everyone wants a hundred percent reliability, but it comes with latency as the cost. For instance, in a thrice-replicated cluster (replication factor = 3), a write time consistency level TWO, means the write will become successful only if it is written to at least two replica nodes. This request will be faster than the one with the consistency level THREE or ALL, but slower than the consistency level ONE or ANY. The following figure is a simplistic representation of the write mechanism. The operations on node N2 at the bottom represent the node-local activities on receipt of the write request: The following steps show everything that can happen during a write mechanism: If the failure detector detects that there aren't enough live nodes to satisfy ConsistencyLevel, the request fails. If the failure detector gives a green signal, but writes time-out after the request is sent due to infrastructure problems or due to extreme load, StorageProxy writes a local hint to replay when the failed nodes come back to life. This is called hinted hand off. One might think that hinted handoff may be responsible for Cassandra's eventual consistency. But it's not entirely true. If the coordinator node gets shut down or dies due to hardware failure and hints on this machine cannot be forwarded, eventual consistency will not occur. The anti-entropy mechanism is responsible for consistency, rather than hinted hand-off. Anti-entropy makes sure that all replicas are in sync. If the replica nodes are distributed across data centers, it will be a bad idea to send individual messages to all the replicas in other data centers. Rather, it sends the message to one replica in each data center with a header, instructing it to forward the request to other replica nodes in that data center. Now the data is received by the node which should actually store that data. The data first gets appended to the commit log, and pushed to a MemTable appropriate column family in the memory. 
When the MemTable becomes full, it gets flushed to the disk in a sorted structure named SSTable. With lots of flushes, the disk gets plenty of SSTables. To manage SSTables, a compaction process runs. This process merges data from smaller SSTables to one big sorted file. Read in action Similar to a write case, when StorageProxy of the node that a client is connected to gets the request, it gets a list of nodes containing this key based on the replication strategy. The node's StorageProxy then sorts the nodes based on their proximity to itself. The proximity is determined by the snitch function that is set up for this cluster. Basically, the following types of snitches exist: SimpleSnitch: A closer node is the one that comes first when moving clockwise in the ring. (A ring is when all the machines in the cluster are placed in a circular fashion with each machine having a token number. When you walk clockwise, the token value increases. At the end, it snaps back to the first node.) PropertyFileSnitch: This snitch allows you to specify how you want your machines' location to be interpreted by Cassandra. You do this by assigning a data center name and rack name for all the machines in the cluster in the $CASSANDRA_HOME/conf/cassandra-topology.properties file. Each node has a copy of this file and you need to alter this file each time you add or remove a node. This is what the file looks like: # Cassandra Node IP=Data Center:Rack 192.168.1.100=DC1:RAC1 192.168.2.200=DC2:RAC2 10.0.0.10=DC1:RAC1 10.0.0.11=DC1:RAC1 10.0.0.12=DC1:RAC2 10.20.114.10=DC2:RAC1 10.20.114.11=DC2:RAC1 GossipingPropertyFileSnitch: The PropertyFileSnitch is kind of a pain, even when you think about it. Each node has the locations of all nodes manually written and updated every time a new node joins or an old node retires. And then, we need to copy it on all the servers. Wouldn't it be better if we just specify each node's data center and rack on just that one machine, and then have Cassandra somehow collect this information to understand the topology? This is exactly what GossipingPropertyFileSnitch does. Similar to PropertyFileSnitch, you have a file called $CASSANDRA_HOME/conf/cassandra-rackdc.properties, and in this file you specify the data center and the rack name for that machine. The gossip protocol makes sure that this information gets spread to all the nodes in the cluster (and you do not have to edit properties of files on all the nodes when a new node joins or leaves). Here is what a cassandra-rackdc.properties file looks like: # indicate the rack and dc for this node dc=DC13 rack=RAC42 RackInferringSnitch: This snitch infers the location of a node based on its IP address. It uses the third octet to infer rack name, and the second octet to assign data center. If you have four nodes 10.110.6.30, 10.110.6.4, 10.110.7.42, and 10.111.3.1, this snitch will think the first two live on the same rack as they have the same second octet (110) and the same third octet (6), while the third lives in the same data center but on a different rack as it has the same second octet but the third octet differs. Fourth, however, is assumed to live in a separate data center as it has a different second octet than the three. EC2Snitch: This is meant for Cassandra deployments on Amazon EC2 service. EC2 has regions and within regions, there are availability zones. For example, us-east-1e is an availability zone in the us-east region with availability zone named 1e. 
This snitch infers the region name (us-east, in this case) as the data center and availability zone (1e) as the rack. EC2MultiRegionSnitch: The multi-region snitch is just an extension of EC2Snitch where data centers and racks are inferred the same way. But you need to make sure that broadcast_address is set to the public IP provided by EC2 and seed nodes must be specified using their public IPs so that inter-data center communication can be done. DynamicSnitch: This Snitch determines closeness based on a recent performance delivered by a node. So, a quick responding node is perceived as being closer than a slower one, irrespective of its location closeness, or closeness in the ring. This is done to avoid overloading a slow performing node. DynamicSnitch is used by all the other snitches by default. You can disable it, but it is not advisable. Now, with knowledge about snitches, we know the list of the fastest nodes that have the desired row keys, it's time to pull data from them. The coordinator node (the one that the client is connected to) sends a command to the closest node to perform a read (we'll discuss local reads in a minute) and return the data. Now, based on ConsistencyLevel, other nodes will send a command to perform a read operation and send just the digest of the result. If we have read repairs (discussed later) enabled, the remaining replica nodes will be sent a message to compute the digest of the command response. Let's take an example. Let's say you have five nodes containing a row key K (that is, RF equals five), your read ConsistencyLevel is three; then the closest of the five nodes will be asked for the data and the second and third closest nodes will be asked to return the digest. If there is a difference in the digests, full data is pulled from the conflicting node and the latest of the three will be sent. These replicas will be updated to have the latest data. We still have two nodes left to be queried. If read repairs are not enabled, they will not be touched for this request. Otherwise, these two will be asked to compute the digest. Depending on the read_repair_chance setting, the request to the last two nodes is done in the background, after returning the result. This updates all the nodes with the most recent value, making all replicas consistent. Let's see what goes on within a node. Take a simple case of a read request looking for a single column within a single row. First, the attempt is made to read from MemTable, which is rapid fast, and since there exists only one copy of data, this is the fastest retrieval. If all required data is not found there, Cassandra looks into SSTable. Now, remember from our earlier discussion that we flush MemTables to disk as SSTables and later when the compaction mechanism wakes up, it merges those SSTables. So, our data can be in multiple SSTables. The following figure represents a simplified representation of the read mechanism. The bottom of the figure shows processing on the read node. The numbers in circles show the order of the event. BF stands for bloom filter. Each SSTable is associated with its bloom filter built on the row keys in the SSTable. Bloom filters are kept in the memory and used to detect if an SSTable may contain (false positive) the row data. Now, we have the SSTables that may contain the row key. The SSTables get sorted in reverse chronological order (latest first). Apart from the bloom filter for row keys, there exists one bloom filter for each row in the SSTable. 
This secondary bloom filter is created to detect whether the requested column names exist in the SSTable. Now, Cassandra will take SSTables one by one from younger to older, and use the index file to locate the offset for each column value for that row key and the bloom filter associated with the row (built on the column name). On the bloom filter being positive for the requested column, it looks into the SSTable file to read the column value. Note that we may have a column value in other yet-to-be-read SSTables, but that does not matter, because we are reading the most recent SSTables first, and any value that was written earlier to it does not matter. So, the value gets returned as soon as the first column in the most recent SSTable is allocated. Summary By now, you should be familiar with all the nuts and bolts of Cassandra. We have discussed how the pressure to make data stores to web scale inspired a rather not-so-common database mechanism to become mainstream, and how the CAP theorem governs the behavior of such databases. We have seen that Cassandra shines out among its peers. Then, we dipped our toes into the big picture of Cassandra read and write mechanisms. This left us with lots of fancy terms. It is understandable that it may be a lot to take in for someone new to NoSQL systems. It is okay if you do not have complete clarity at this point. As you start working with Cassandra, tweaking it, experimenting with it, and going through the Cassandra mailing list discussions or talks, you will start to come across stuff that you have read in this article and it will start to make sense, and perhaps you may want to come back and refer to this article to improve your clarity. It is not required that you understand this article fully to be able to write queries, set up clusters, maintain clusters, or do anything else related to Cassandra. A general sense of this article will take you far enough to work extremely well with Cassandra-based projects. How does this knowledge help us in building an application? Isn't it just about learning Thrift or CQL API and getting going? You might be wondering why you need to know about the compaction and storage mechanism, when all you need to do is to deliver an application that has a fast backend. It may not be obvious at this point why you are learning this, but as we move ahead with developing an application, we will come to realize that knowledge about underlying storage mechanism helps. Later, if you will deploy a cluster, performance tuning, maintenance, and integrating with other tools such as Apache Hadoop, you may find this article useful. Resources for Article: Further resources on this subject: About Cassandra [article] Replication [article] Getting Up and Running with Cassandra [article]
article-image-testing-android-sdk
Packt
26 Mar 2015
25 min read
Save for later

Testing with the Android SDK

In this article by Paul Blundell, the author of the book Learning Android Application Testing, we start digging a bit deeper to recognize the building blocks available to create more useful tests. We will be covering the following topics:

Common assertions
View assertions
Other assertion types
Helpers to test User Interfaces
Mock objects
Instrumentation
TestCase class hierarchies
Using external libraries

We will be analyzing these components and showing examples of their use when applicable. The examples in this article are intentionally split from the original Android project that contains them. This is done to let you concentrate and focus only on the subject being presented, though the complete examples in a single project can be downloaded as explained later. Right now, we are interested in the trees and not the forest. Along with the examples presented, we will be identifying reusable common patterns that will help you in the creation of tests for your own projects.

(For more resources related to this topic, see here.)

The demonstration application

A very simple application has been created to demonstrate the use of some of the tests in this article. The source for the application can be downloaded from XXXXXXXXXXXXX. The following screenshot shows this application running:

When reading the explanation of the tests in this article, at any point, you can refer to the demo application that is provided in order to see the test in action. The simple application shown previously has a clickable link, a text input, a button to click, and a defined layout UI; we can test these one by one.

Assertions in depth

Assertions are methods that check for a condition that can be evaluated. If the condition is not met, the assertion method will throw an exception, thereby aborting the execution of the test. The JUnit API includes the class Assert. This is the base class of all the TestCase classes, and it holds several assertion methods that are useful for writing tests. These inherited methods test for a variety of conditions and are overloaded to support different parameter types. They can be grouped together in the following different sets, depending on the condition checked:

assertEquals
assertTrue
assertFalse
assertNull
assertNotNull
assertSame
assertNotSame
fail

The condition tested is pretty obvious and is easily identifiable by the method name. Perhaps the ones that deserve some attention are assertEquals() and assertSame(). The former, when used on objects, asserts that both objects passed as parameters are equal, by calling the objects' equals() method. The latter asserts that both parameters refer to the same object. If, in some case, equals() is not implemented by the class, then assertEquals() and assertSame() will do the same thing. When one of these assertions fails inside a test, an AssertionFailedError is thrown, and this indicates that the test has failed.

Occasionally, during the development process, you might need to create a test that you are not implementing at that precise time. However, you want to flag that the creation of the test was postponed. In such cases, you can use the fail() method, which always fails and uses a custom message that indicates the condition:

public void testNotImplementedYet() {
    fail("Not implemented yet");
}

Still, there is another common use for fail() that is worth mentioning. If we need to test whether a method throws an exception, we can surround the code with a try-catch block and force a fail if the exception was not thrown.
For example:

public void testShouldThrowException() {
    try {
        MyFirstProjectActivity.methodThatShouldThrowException();
        fail("Exception was not thrown");
    } catch (Exception ex) {
        // do nothing
    }
}

JUnit4 has the annotation @Test(expected=Exception.class), and this supersedes the need for using fail() when testing exceptions. With this annotation, the test will only pass if the expected exception is thrown.

Custom message

It is worth knowing that all assert methods provide an overloaded version including a custom String message. Should the assertion fail, this custom message will be printed by the test runner, instead of a default message. The premise behind this is that, sometimes, the generic error message does not reveal enough details, and it is not obvious how the test failed. This custom message can be extremely useful to easily identify the failure once you are looking at the test report, so it's highly recommended as a best practice to use this version. The following is an example of a simple test that uses this recommendation:

public void testMax() {
    int a = 10;
    int b = 20;

    int actual = Math.max(a, b);

    String failMsg = "Expected: " + b + " but was: " + actual;
    assertEquals(failMsg, b, actual);
}

In the preceding example, we can see another practice that will help you organize and understand your tests easily. This is the use of explicit names for variables that hold the actual values. There are other libraries available that have better default error messages and also a more fluent interface for testing. One of these that is worth looking at is Fest (https://code.google.com/p/fest/).

Static imports

Though basic assertion methods are inherited from the Assert base class, some other assertions need specific imports. To improve the readability of your tests, there is a pattern of statically importing the assert methods from the corresponding classes. Instead of having this:

public void testAlignment() {
    int margin = 0;
    ...
    android.test.ViewAsserts.assertRightAligned(errorMsg, editText, margin);
}

We can simplify it by adding the static import:

import static android.test.ViewAsserts.assertRightAligned;

public void testAlignment() {
    int margin = 0;
    assertRightAligned(errorMsg, editText, margin);
}

View assertions

The assertions introduced earlier handle a variety of types as parameters, but they are only intended to test simple conditions or simple objects. For example, we have assertEquals(short expected, short actual) to test short values, assertEquals(int expected, int actual) to test integer values, assertEquals(Object expected, Object actual) to test any Object instance, and so on. Usually, while testing user interfaces in Android, you will need more sophisticated methods, which are mainly related to Views. In this respect, Android provides a class with plenty of assertions in android.test.ViewAsserts (see http://developer.android.com/reference/android/test/ViewAsserts.html for more details), which test relationships between Views and their absolute and relative positions on the screen. These methods are also overloaded to provide different conditions. Among the assertions, we can find the following: assertBaselineAligned: This asserts that two Views are aligned on their baseline; that is, their baselines are on the same y location. assertBottomAligned: This asserts that two views are bottom aligned; that is, their bottom edges are on the same y location.
assertGroupContains: This asserts that the specified group contains a specific child once and only once. assertGroupIntegrity: This asserts the specified group's integrity. The child count should be >= 0 and each child should be non-null. assertGroupNotContains: This asserts that the specified group does not contain a specific child. assertHasScreenCoordinates: This asserts that a View has a particular x and y position on the visible screen. assertHorizontalCenterAligned: This asserts that the test View is horizontally center aligned with respect to the reference view. assertLeftAligned: This asserts that two Views are left aligned; that is, their left edges are on the same x location. An optional margin can also be provided. assertOffScreenAbove: This asserts that the specified view is above the visible screen. assertOffScreenBelow: This asserts that the specified view is below the visible screen. assertOnScreen: This asserts that a View is on the screen. assertRightAligned: This asserts that two Views are right-aligned; that is, their right edges are on the same x location. An optional margin can also be specified. assertTopAligned: This asserts that two Views are top aligned; that is, their top edges are on the same y location. An optional margin can also be specified. assertVerticalCenterAligned: This asserts that the test View is vertically center-aligned with respect to the reference View. The following example shows how you can use ViewAssertions to test the user interface layout: public void testUserInterfaceLayout() {    int margin = 0;    View origin = mActivity.getWindow().getDecorView();    assertOnScreen(origin, editText);    assertOnScreen(origin, button);    assertRightAligned(editText, button, margin); } The assertOnScreen method uses an origin to start looking for the requested Views. In this case, we are using the top-level window decor View. If, for some reason, you don't need to go that high in the hierarchy, or if this approach is not suitable for your test, you may use another root View in the hierarchy, for example View.getRootView(), which, in our concrete example, would be editText.getRootView(). Even more assertions If the assertions that are reviewed previously do not seem to be enough for your tests' needs, there is still another class included in the Android framework that covers other cases. This class is MoreAsserts (http://developer.android.com/reference/android/test/MoreAsserts.html). These methods are also overloaded to support different parameter types. Among the assertions, we can find the following: assertAssignableFrom: This asserts that an object is assignable to a class. assertContainsRegex: This asserts that an expected Regex matches any substring of the specified String. It fails with the specified message if it does not. assertContainsInAnyOrder: This asserts that the specified Iterable contains precisely the elements expected, but in any order. assertContainsInOrder: This asserts that the specified Iterable contains precisely the elements expected, but in the same order. assertEmpty: This asserts that an Iterable is empty. assertEquals: This is for some Collections not covered in JUnit asserts. assertMatchesRegex: This asserts that the specified Regex exactly matches the String and fails with the provided message if it does not. assertNotContainsRegex: This asserts that the specified Regex does not match any substring of the specified String, and fails with the provided message if it does. 
assertNotEmpty: This asserts that some Collections not covered in JUnit asserts are not empty. assertNotMatchesRegex: This asserts that the specified Regex does not exactly match the specified String, and fails with the provided message if it does. checkEqualsAndHashCodeMethods: This is a utility used to test the equals() and hashCode() results at once. This tests whether equals() that is applied to both objects matches the specified result. The following test checks for an error during the invocation of the capitalization method called via a click on the UI button: @UiThreadTest public void testNoErrorInCapitalization() { String msg = "capitalize this text"; editText.setText(msg);   button.performClick();   String actual = editText.getText().toString(); String notExpectedRegexp = "(?i:ERROR)"; String errorMsg = "Capitalization error for " + actual; assertNotContainsRegex(errorMsg, notExpectedRegexp, actual); } If you are not familiar with regular expressions, invest some time and visit http://developer.android.com/reference/java/util/regex/package-summary.html because it will be worth it! In this particular case, we are looking for the word ERROR contained in the result with a case-insensitive match (setting the flag i for this purpose). That is, if for some reason, capitalization doesn't work in our application, and it contains an error message, we can detect this condition with the assertion. Note that because this is a test that modifies the user interface, we must annotate it with @UiThreadTest; otherwise, it won't be able to alter the UI from a different thread, and we will receive the following exception: INFO/TestRunner(610): ----- begin exception ----- INFO/TestRunner(610): android.view.ViewRoot$CalledFromWrongThreadException: Only the original thread that created a view hierarchy can touch its views. INFO/TestRunner(610):     at android.view.ViewRoot.checkThread(ViewRoot.java:2932) [...] INFO/TestRunner(610):     at android.app. Instrumentation$InstrumentationThread.run(Instrumentation.java:1447) INFO/TestRunner(610): ----- end exception ----- The TouchUtils class Sometimes, when testing UIs, it is helpful to simulate different kinds of touch events. These touch events can be generated in many different ways, but probably android.test.TouchUtils is the simplest to use. This class provides reusable methods to generate touch events in test cases that are derived from InstrumentationTestCase. The featured methods allow a simulated interaction with the UI under test. The TouchUtils class provides the infrastructure to inject the events using the correct UI or main thread, so no special handling is needed, and you don't need to annotate the test using @UIThreadTest. 
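For instance, a minimal sketch of clicking the demo application's button through TouchUtils could look like the following; the activity class name and view IDs are assumptions made for illustration, not taken from the sample project:

import android.test.ActivityInstrumentationTestCase2;
import android.test.TouchUtils;
import android.widget.Button;
import android.widget.EditText;

public class ClickButtonTest extends ActivityInstrumentationTestCase2<MainActivity> {

    private EditText editText;
    private Button button;

    public ClickButtonTest() {
        super(MainActivity.class); // MainActivity is an assumed activity name
    }

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        MainActivity activity = getActivity();
        editText = (EditText) activity.findViewById(R.id.edit_text); // assumed view IDs
        button = (Button) activity.findViewById(R.id.button);
    }

    public void testClickDoesNotProduceError() {
        // TouchUtils injects the click on the correct thread,
        // so no @UiThreadTest annotation is needed here
        TouchUtils.clickView(this, button);

        String actual = editText.getText().toString();
        assertFalse("Unexpected error after click", actual.contains("ERROR"));
    }
}

The same clickView() call works for any View reference obtained in setUp(), which keeps the test free of thread-handling boilerplate.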
TouchUtils supports the following: Clicking on a View and releasing it Tapping on a View (touching it and quickly releasing) Long-clicking on a View Dragging the screen Dragging Views The following test represents a typical usage of TouchUtils:    public void testListScrolling() {        listView.scrollTo(0, 0);          TouchUtils.dragQuarterScreenUp(this, activity);        int actualItemPosition = listView.getFirstVisiblePosition();          assertTrue("Wrong position", actualItemPosition > 0);    } This test does the following: Repositions the list at the beginning to start from a known condition Scrolls the list Checks for the first visible position to see that it was correctly scrolled Even the most complex UIs can be tested in that way, and it would help you detect a variety of conditions that could potentially affect the user experience. Mock objects We have seen the mock objects provided by the Android testing framework, and evaluated the concerns about not using real objects to isolate our tests from the surrounding environment. Martin Fowler calls these two styles the classical and mockist Test-driven Development dichotomy in his great article Mocks aren't stubs, which can be read online at http://www.martinfowler.com/articles/mocksArentStubs.html. Independent of this discussion, we are introducing mock objects as one of the available building blocks because, sometimes, using mock objects in our tests is recommended, desirable, useful, or even unavoidable. The Android SDK provides the following classes in the subpackage android.test.mock to help us: MockApplication: This is a mock implementation of the Application class. All methods are non-functional and throw UnsupportedOperationException. MockContentProvider: This is a mock implementation of ContentProvider. All methods are non-functional and throw UnsupportedOperationException. MockContentResolver: This is a mock implementation of the ContentResolver class that isolates the test code from the real content system. All methods are non-functional and throw UnsupportedOperationException. MockContext: This is a mock context class, and this can be used to inject other dependencies. All methods are non-functional and throw UnsupportedOperationException. MockCursor: This is a mock Cursor class that isolates the test code from real Cursor implementation. All methods are non-functional and throw UnsupportedOperationException. MockDialogInterface: This is a mock implementation of the DialogInterface class. All methods are non-functional and throw UnsupportedOperationException. MockPackageManager: This is a mock implementation of the PackageManager class. All methods are non-functional and throw UnsupportedOperationException. MockResources: This is a mock Resources class. All of these classes have non-functional methods that throw UnsupportedOperationException when used. If you need to use some of these methods, or if you detect that your test is failing with this Exception, you should extend one of these base classes and provide the required functionality. MockContext overview This mock can be used to inject other dependencies, mocks, or monitors into the classes under test. Extend this class to provide your desired behavior, overriding the correspondent methods. The Android SDK provides some prebuilt mock Context objects, each of which has a separate use case. The IsolatedContext class In your tests, you might find the need to isolate the Activity under test from other Android components to prevent unwanted interactions. 
This can be a complete isolation, but sometimes, this isolation avoids interacting with other components, and for your Activity to still run correctly, some connection with the system is required. For those cases, the Android SDK provides android.test.IsolatedContext, a mock Context that not only prevents interaction with most of the underlying system but also satisfies the needs of interacting with other packages or components such as Services or ContentProviders. Alternate route to file and database operations In some cases, all we need is to be able to provide an alternate route to the file and database operations. For example, if we are testing the application on a real device, we perhaps don't want to affect the existing database but use our own testing data. Such cases can take advantage of another class that is not part of the android.test.mock subpackage but is part of android.test instead, that is, RenamingDelegatingContext. This class lets us alter operations on files and databases by having a prefix that is specified in the constructor. All other operations are delegated to the delegating Context that you must specify in the constructor too. Suppose our Activity under test uses a database we want to control, probably introducing specialized content or fixture data to drive our tests, and we don't want to use the real files. In this case, we create a RenamingDelegatingContext class that specifies a prefix, and our unchanged Activity will use this prefix to create any files. For example, if our Activity tries to access a file named birthdays.txt, and we provide a RenamingDelegatingContext class that specifies the prefix test, then this same Activity will access the file testbirthdays.txt instead when it is being tested. The MockContentResolver class The MockContentResolver class implements all methods in a non-functional way and throws the exception UnsupportedOperationException if you attempt to use them. The reason for this class is to isolate tests from the real content. Let's say your application uses a ContentProvider class to feed your Activity information. You can create unit tests for this ContentProvider using ProviderTestCase2, which we will be analyzing shortly, but when we try to produce functional or integration tests for the Activity against ContentProvider, it's not so evident as to what test case to use. The most obvious choice is ActivityInstrumentationTestCase2, mainly if your functional tests simulate user experience because you might need the sendKeys() method or similar methods, which are readily available on these tests. The first problem you might encounter then is that it's unclear as to where to inject a MockContentResolver in your test to be able to use test data with your ContentProvider. There's no way to inject a MockContext either. The TestCase base class This is the base class of all other test cases in the JUnit framework. It implements the basic methods that we were analyzing in the previous examples (setUp()). The TestCase class also implements the junit.framework.Test interface, meaning it can be run as a JUnit test. Your Android test cases should always extend TestCase or one of its descendants. The default constructor All test cases require a default constructor because, sometimes, depending on the test runner used, this is the only constructor that is invoked, and is also used for serialization. According to the documentation, this method is not intended to be used by "mere mortals" without calling setName(String name). 
Therefore, to appease the Gods, a common pattern is to use a default test case name in this constructor and invoke the given name constructor afterwards:

public class MyTestCase extends TestCase {

    public MyTestCase() {
        this("MyTestCase Default Name");
    }

    public MyTestCase(String name) {
        super(name);
    }
}

The given name constructor

This constructor takes a name as an argument to label the test case. It will appear in test reports and will be of much help when you try to identify where failed tests have come from.

The setName() method

There are some classes that extend TestCase that don't provide a given name constructor. In such cases, the only alternative is to call setName(String name).

The AndroidTestCase base class

This class can be used as a base class for general-purpose Android test cases. Use it when you need access to Android resources, databases, or files in the filesystem. The Context is stored as a field in this class, conveniently named mContext, and it can be used inside the tests if needed; the getContext() method can be used too. Tests based on this class can start more than one Activity using Context.startActivity(). There are various test cases in the Android SDK that extend this base class:

ApplicationTestCase<T extends Application>
ProviderTestCase2<T extends ContentProvider>
ServiceTestCase<T extends Service>

When using the AndroidTestCase Java class, you inherit some base assertion methods that can be used; let's look at these in more detail.

The assertActivityRequiresPermission() method

The signature for this method is as follows:

public void assertActivityRequiresPermission (String packageName, String className, String permission)

Description

This assertion method checks whether the launching of a particular Activity is protected by a specific permission. It takes the following three parameters:

packageName: This is a string that indicates the package name of the activity to launch
className: This is a string that indicates the class of the activity to launch
permission: This is a string with the permission to check

The Activity is launched, and then a SecurityException is expected, which mentions that the required permission is missing in the error message. The actual instantiation of an activity is not handled by this assertion, and thus, an Instrumentation is not needed.

Example

This test checks the requirement of the android.Manifest.permission.CALL_PHONE permission, which is needed to place phone calls, in the MyContactsActivity Activity:

public void testActivityPermission() {
    String pkg = "com.blundell.tut";
    String activity = pkg + ".MyContactsActivity";
    String permission = android.Manifest.permission.CALL_PHONE;
    assertActivityRequiresPermission(pkg, activity, permission);
}

Always use the constants that describe the permissions from android.Manifest.permission, not the strings, so if the implementation changes, your code will still be valid.

The assertReadingContentUriRequiresPermission method

The signature for this method is as follows:

public void assertReadingContentUriRequiresPermission(Uri uri, String permission)

Description

This assertion method checks whether reading from a specific URI requires the permission provided as a parameter. It takes the following two parameters:

uri: This is the Uri that requires a permission to query
permission: This is a string that contains the permission to query

If a SecurityException is generated, which contains the specified permission, this assertion is validated.
Example This test tries to read contacts and verifies that the correct SecurityException is generated: public void testReadingContacts() {    Uri URI = ContactsContract.AUTHORITY_URI;    String PERMISSION = android.Manifest.permission.READ_CONTACTS;    assertReadingContentUriRequiresPermission(URI, PERMISSION); } The assertWritingContentUriRequiresPermission() method The signature for this method is as follows: public void assertWritingContentUriRequiresPermission (Uri uri,   String permission) Description This assertion method checks whether inserting into a specific Uri requires the permission provided as a parameter. It takes the following two parameters: uri: This is the Uri that requires a permission to query permission: This is a string that contains the permission to query If a SecurityException class is generated, which contains the specified permission, this assertion is validated. Example This test tries to write to Contacts and verifies that the correct SecurityException is generated: public void testWritingContacts() { Uri uri = ContactsContract.AUTHORITY_URI;    String permission = android.Manifest.permission.WRITE_CONTACTS; assertWritingContentUriRequiresPermission(uri, permission); } Instrumentation Instrumentation is instantiated by the system before any of the application code is run, thereby allowing monitoring of all the interactions between the system and the application. As with many other Android application components, instrumentation implementations are described in the AndroidManifest.xml under the tag <instrumentation>. However, with the advent of Gradle, this has now been automated for us, and we can change the properties of the instrumentation in the app's build.gradle file. The AndroidManifest file for your tests will be automatically generated: defaultConfig { testApplicationId 'com.blundell.tut.tests' testInstrumentationRunner   "android.test.InstrumentationTestRunner" } The values mentioned in the preceding code are also the defaults if you do not declare them, meaning that you don't have to have any of these parameters to start writing tests. The testApplicationId attribute defines the name of the package for your tests. As a default, it is your application under the test package name + tests. You can declare a custom test runner using testInstrumentationRunner. This is handy if you want to have tests run in a custom way, for example, parallel test execution. There are also many other parameters in development, and I would advise you to keep your eyes upon the Google Gradle plugin website (http://tools.android.com/tech-docs/new-build-system/user-guide). The ActivityMonitor inner class As mentioned earlier, the Instrumentation class is used to monitor the interaction between the system and the application or the Activities under test. The inner class Instrumentation ActivityMonitor allows the monitoring of a single Activity within an application. 
Example Let's pretend that we have a TextView in our Activity that holds a URL and has its auto link property set: <TextView        android:id="@+id/link"        android:layout_width="match_parent"    android:layout_height="wrap_content"        android:text="@string/home"    android:autoLink="web" /> If we want to verify that, when clicked, the hyperlink is correctly followed and some browser is invoked, we can create a test like this: public void testFollowLink() {        IntentFilter intentFilter = new IntentFilter(Intent.ACTION_VIEW);        intentFilter.addDataScheme("http");        intentFilter.addCategory(Intent.CATEGORY_BROWSABLE);          Instrumentation inst = getInstrumentation();        ActivityMonitor monitor = inst.addMonitor(intentFilter, null, false);        TouchUtils.clickView(this, linkTextView);        monitor.waitForActivityWithTimeout(3000);        int monitorHits = monitor.getHits();        inst.removeMonitor(monitor);          assertEquals(1, monitorHits);    } Here, we will do the following: Create an IntentFilter for intents that would open a browser. Add a monitor to our Instrumentation based on the IntentFilter class. Click on the hyperlink. Wait for the activity (hopefully the browser). Verify that the monitor hits were incremented. Remove the monitor. Using monitors, we can test even the most complex interactions with the system and other Activities. This is a very powerful tool to create integration tests. The InstrumentationTestCase class The InstrumentationTestCase class is the direct or indirect base class for various test cases that have access to Instrumentation. This is the list of the most important direct and indirect subclasses: ActivityTestCase ProviderTestCase2<T extends ContentProvider> SingleLaunchActivityTestCase<T extends Activity> SyncBaseInstrumentation ActivityInstrumentationTestCase2<T extends Activity> ActivityUnitTestCase<T extends Activity> The InstrumentationTestCase class is in the android.test package, and extends junit.framework.TestCase, which extends junit.framework.Assert. The launchActivity and launchActivityWithIntent methods These utility methods are used to launch Activities from a test. If the Intent is not specified using the second option, a default Intent is used: public final T launchActivity (String pkg, Class<T> activityCls,   Bundle extras) The template class parameter T is used in activityCls and as the return type, limiting its use to Activities of that type. If you need to specify a custom Intent, you can use the following code that also adds the intent parameter: public final T launchActivityWithIntent (String pkg, Class<T>   activityCls, Intent intent) The sendKeys and sendRepeatedKeys methods While testing Activities' UI, you will face the need to simulate interaction with qwerty-based keyboards or DPAD buttons to send keys to complete fields, select shortcuts, or navigate throughout the different components. This is what the different sendKeys and sendRepeatedKeys are used for. There is one version of sendKeys that accepts integer key values. They can be obtained from constants defined in the KeyEvent class.
For example, we can use the sendKeys method in this way:    public void testSendKeyInts() {        requestMessageInputFocus();        sendKeys(                KeyEvent.KEYCODE_H,                KeyEvent.KEYCODE_E,                KeyEvent.KEYCODE_E,                KeyEvent.KEYCODE_E,                KeyEvent.KEYCODE_Y,                KeyEvent.KEYCODE_DPAD_DOWN,                KeyEvent.KEYCODE_ENTER);        String actual = messageInput.getText().toString();          assertEquals("HEEEY", actual);    } Here, we are sending H, E, and Y letter keys and then the ENTER key using their integer representations to the Activity under test. Alternatively, we can create a string by concatenating the keys we desire to send, discarding the KEYCODE prefix, and separating them with spaces that are ultimately ignored:      public void testSendKeyString() {        requestMessageInputFocus();          sendKeys("H 3*E Y DPAD_DOWN ENTER");        String actual = messageInput.getText().toString();          assertEquals("HEEEY", actual);    } Here, we did exactly the same as in the previous test but we used a String "H 3* EY DPAD_DOWN ENTER". Note that every key in the String can be prefixed by a repeating factor followed by * and the key to be repeated. We used 3*E in our previous example, which is the same as E E E, that is, three times the letter E. If sending repeated keys is what we need in our tests, there is also another alternative that is precisely intended for these cases: public void testSendRepeatedKeys() {        requestMessageInputFocus();          sendRepeatedKeys(                1, KeyEvent.KEYCODE_H,                3, KeyEvent.KEYCODE_E,                1, KeyEvent.KEYCODE_Y,                1, KeyEvent.KEYCODE_DPAD_DOWN,                1, KeyEvent.KEYCODE_ENTER);        String actual = messageInput.getText().toString();          assertEquals("HEEEY", actual);    } This is the same test implemented in a different manner. The repetition number precedes each key. The runTestOnUiThread helper method The runTestOnUiThread method is a helper method used to run portions of a test on the UI thread. We used this inside the method requestMessageInputFocus(); so that we can set the focus on our EditText before waiting for the application to be idle, using Instrumentation.waitForIdleSync(). Also, the runTestOnUiThread method throws an exception, so we have to deal with this case: private void requestMessageInputFocus() {        try {            runTestOnUiThread(new Runnable() {                @Override                public void run() {                    messageInput.requestFocus();                }            });        } catch (Throwable throwable) {            fail("Could not request focus.");        }        instrumentation.waitForIdleSync();    } Alternatively, as we have discussed before, to run a test on the UI thread, we can annotate it with @UiThreadTest. However, sometimes, we need to run only parts of the test on the UI thread because other parts of it are not suitable to run on that thread, for example, database calls, or we are using other helper methods that provide the infrastructure themselves to use the UI thread, for example the TouchUtils methods. Summary We investigated the most relevant building blocks and reusable patterns to create our tests. 
Along this journey, we: Understood the common assertions found in JUnit tests Explained the specialized assertions found in the Android SDK Explored Android mock objects and their use in Android tests Now that we have all the building blocks, it is time to start creating more and more tests to acquire the experience needed to master the technique. Resources for Article: Further resources on this subject: Android Virtual Device Manager [article] Signing an application in Android using Maven [article] The AsyncTask and HardwareTask Classes [article]

Big Data Analysis (R and Hadoop)

Packt
26 Mar 2015
37 min read
This article is written by Yu-Wei, Chiu (David Chiu), the author of Machine Learning with R Cookbook. In this article, we will cover the following topics: Preparing the RHadoop environment Installing rmr2 Installing rhdfs Operating HDFS with rhdfs Implementing a word count problem with RHadoop Comparing the performance between an R MapReduce program and a standard R program Testing and debugging the rmr2 program Installing plyrmr Manipulating data with plyrmr Conducting machine learning with RHadoop Configuring RHadoop clusters on Amazon EMR (For more resources related to this topic, see here.) RHadoop is a collection of R packages that enables users to process and analyze big data with Hadoop. Before understanding how to set up RHadoop and put it in to practice, we have to know why we need to use machine learning to big-data scale. The emergence of Cloud technology has made real-time interaction between customers and businesses much more frequent; therefore, the focus of machine learning has now shifted to the development of accurate predictions for various customers. For example, businesses can provide real-time personal recommendations or online advertisements based on personal behavior via the use of a real-time prediction model. However, if the data (for example, behaviors of all online users) is too large to fit in the memory of a single machine, you have no choice but to use a supercomputer or some other scalable solution. The most popular scalable big-data solution is Hadoop, which is an open source framework able to store and perform parallel computations across clusters. As a result, you can use RHadoop, which allows R to leverage the scalability of Hadoop, helping to process and analyze big data. In RHadoop, there are five main packages, which are: rmr: This is an interface between R and Hadoop MapReduce, which calls the Hadoop streaming MapReduce API to perform MapReduce jobs across Hadoop clusters. To develop an R MapReduce program, you only need to focus on the design of the map and reduce functions, and the remaining scalability issues will be taken care of by Hadoop itself. rhdfs: This is an interface between R and HDFS, which calls the HDFS API to access the data stored in HDFS. The use of rhdfs is very similar to the use of the Hadoop shell, which allows users to manipulate HDFS easily from the R console. rhbase: This is an interface between R and HBase, which accesses Hbase and is distributed in clusters through a Thrift server. You can use rhbase to read/write data and manipulate tables stored within HBase. plyrmr: This is a higher-level abstraction of MapReduce, which allows users to perform common data manipulation in a plyr-like syntax. This package greatly lowers the learning curve of big-data manipulation. ravro: This allows users to read avro files in R, or write avro files. It allows R to exchange data with HDFS. In this article, we will start by preparing the Hadoop environment, so that you can install RHadoop. We then cover the installation of three main packages: rmr, rhdfs, and plyrmr. Next, we will introduce how to use rmr to perform MapReduce from R, operate an HDFS file through rhdfs, and perform a common data operation using plyrmr. Further, we will explore how to perform machine learning using RHadoop. Lastly, we will introduce how to deploy multiple RHadoop clusters on Amazon EC2. Preparing the RHadoop environment As RHadoop requires an R and Hadoop integrated environment, we must first prepare an environment with both R and Hadoop installed. 
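To give a feel for how these packages fit together before we dive into the setup, here is a minimal sketch of a typical rmr2/rhdfs session; note that this is only an illustrative preview, and the two environment paths below are assumptions that vary between installations:
> # Assumed paths; adjust HADOOP_CMD and HADOOP_STREAMING to your own installation
> Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")
> library(rhdfs)
> library(rmr2)
> hdfs.init()                                   # connect the R session to HDFS
> ints = to.dfs(1:10)                           # push a small vector to HDFS
> out = mapreduce(input = ints,
+                 map = function(k, v) keyval(v, v^2))   # square each value
> from.dfs(out)                                 # pull the key-value pairs back into R
The recipes that follow build up exactly this kind of workflow step by step, starting with the environment itself.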
Instead of building a new Hadoop system, we can use the Cloudera QuickStart VM (the VM is free), which contains a single node Apache Hadoop Cluster and R. In this recipe, we will demonstrate how to download the Cloudera QuickStart VM. Getting ready To use the Cloudera QuickStart VM, it is suggested that you should prepare a 64-bit guest OS with either VMWare or VirtualBox, or the KVM installed. If you choose to use VMWare, you should prepare a player compatible with WorkStation 8.x or higher: Player 4.x or higher, ESXi 5.x or higher, or Fusion 4.x or higher. Note, 4 GB of RAM is required to start VM, with an available disk space of at least 3 GB. How to do it... Perform the following steps to set up a Hadoop environment using the Cloudera QuickStart VM: Visit the Cloudera QuickStart VM download site (you may need to update the link as Cloudera upgrades its VMs , the current version of CDH is 5.3) at http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-2-x.html. A screenshot of the Cloudera QuickStart VM download site Depending on the virtual machine platform installed on your OS, choose the appropriate link (you may need to update the link as Cloudera upgrades its VMs) to download the VM file: To download VMWare: You can visit https://downloads.cloudera.com/demo_vm/vmware/cloudera-quickstart-vm-5.2.0-0-vmware.7z To download KVM: You can visit https://downloads.cloudera.com/demo_vm/kvm/cloudera-quickstart-vm-5.2.0-0-kvm.7z To download VirtualBox: You can visit https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.2.0-0-virtualbox.7z Next, you can start the QuickStart VM using the virtual machine platform installed on your OS. You should see the desktop of Centos 6.2 in a few minutes. The screenshot of Cloudera QuickStart VM. You can then open a terminal and type hadoop, which will display a list of functions that can operate a Hadoop cluster. The terminal screenshot after typing hadoop Open a terminal and type R. Access an R session and check whether version 3.1.1 is already installed in the Cloudera QuickStart VM. If you cannot find R installed in the VM, please use the following command to install R: $ yum install R R-core R-core-devel R-devel How it works... Instead of building a Hadoop system on your own, you can use the Hadoop VM application provided by Cloudera (the VM is free). The QuickStart VM runs on CentOS 6.2 with a single node Apache Hadoop cluster, Hadoop Ecosystem module, and R installed. This helps you to save time, instead of requiring you to learn how to install and use Hadoop. The QuickStart VM requires you to have a computer with a 64-bit guest OS, at least 4 GB of RAM, 3 GB of disk space, and either VMWare, VirtualBox, or KVM installed. As a result, you may not be able to use this version of VM on some computers. As an alternative, you could consider using Amazon's Elastic MapReduce instead. We will illustrate how to prepare a RHadoop environment in EMR in the last recipe of this article. Setting up the Cloudera QuickStart VM is simple. Download the VM from the download site and then open the built image with either VMWare, VirtualBox, or KVM. Once you can see the desktop of CentOS, you can then access the terminal and type hadoop to see whether Hadoop is working; then, type R to see whether R works in the QuickStart VM. See also Besides using the Cloudera QuickStart VM, you may consider using a Sandbox VM provided by Hontonworks or MapR. 
You can find Hontonworks Sandbox at http://hortonworks.com/products/hortonworks-sandbox/#install and mapR Sandbox at https://www.mapr.com/products/mapr-sandbox-hadoop/download. Installing rmr2 The rmr2 package allows you to perform big data processing and analysis via MapReduce on a Hadoop cluster. To perform MapReduce on a Hadoop cluster, you have to install R and rmr2 on every task node. In this recipe, we will illustrate how to install rmr2 on a single node of a Hadoop cluster. Getting ready Ensure that you have completed the previous recipe by starting the Cloudera QuickStart VM and connecting the VM to the Internet, so that you can proceed with downloading and installing the rmr2 package. How to do it... Perform the following steps to install rmr2 on the QuickStart VM: First, open the terminal within the Cloudera QuickStart VM. Use the permission of the root to enter an R session: $ sudo R You can then install dependent packages before installing rmr2: > install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava", "caTools")) Quit the R session: > q() Next, you can download rmr-3.3.0 to the QuickStart VM. You may need to update the link if Revolution Analytics upgrades the version of rmr2: $ wget --no-check-certificate https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/3.3.0/build/rmr2_3.3.0.tar.gz You can then install rmr-3.3.0 to the QuickStart VM: $ sudo R CMD INSTALL rmr2_3.3.0.tar.gz Lastly, you can enter an R session and use the library function to test whether the library has been successfully installed: $ R > library(rmr2) How it works... In order to perform MapReduce on a Hadoop cluster, you have to install R and RHadoop on every task node. Here, we illustrate how to install rmr2 on a single node of a Hadoop cluster. First, open the terminal of the Cloudera QuickStart VM. Before installing rmr2, we first access an R session with root privileges and install dependent R packages. Next, after all the dependent packages are installed, quit the R session and use the wget command in the Linux shell to download rmr-3.3.0 from GitHub to the local filesystem. You can then begin the installation of rmr2. Lastly, you can access an R session and use the library function to validate whether the package has been installed. See also To see more information and read updates about RHadoop, you can refer to the RHadoop wiki page hosted on GitHub: https://github.com/RevolutionAnalytics/RHadoop/wiki Installing rhdfs The rhdfs package is the interface between R and HDFS, which allows users to access HDFS from an R console. Similar to rmr2, one should install rhdfs on every task node, so that one can access HDFS resources through R. In this recipe, we will introduce how to install rhdfs on the Cloudera QuickStart VM. Getting ready Ensure that you have completed the previous recipe by starting the Cloudera QuickStart VM and connecting the VM to the Internet, so that you can proceed with downloading and installing the rhdfs package. How to do it... Perform the following steps to install rhdfs: First, you can download rhdfs 1.0.8 from GitHub. You may need to update the link if Revolution Analytics upgrades the version of rhdfs: $wget --no-check-certificate https://raw.github.com/ RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz Next, you can install rhdfs under the command-line mode: $ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz You can then set up JAVA_HOME. 
The configuration of JAVA_HOME depends on the installed Java version within the VM: $ sudo JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera R CMD javareconf Last, you can set up the system environment and initialize rhdfs. You may need to update the environment setup if you use a different version of QuickStart VM: $ R > Sys.setenv(HADOOP_CMD="/usr/bin/hadoop") > Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.2.0.jar") > library(rhdfs) > hdfs.init() How it works... The package, rhdfs, provides functions so that users can manage HDFS using R. Similar to rmr2, you should install rhdfs on every task node, so that one can access HDFS through the R console. To install rhdfs, you should first download rhdfs from GitHub. You can then install rhdfs in R by specifying where the HADOOP_CMD is located. You must configure R with Java support through the command, javareconf. Next, you can access R and configure where HADOOP_CMD and HADOOP_STREAMING are located. Lastly, you can initialize rhdfs via the rhdfs.init function, which allows you to begin operating HDFS through rhdfs. See also To find where HADOOP_CMD is located, you can use the which hadoop command in the Linux shell. In most Hadoop systems, HADOOP_CMD is located at /usr/bin/hadoop. As for the location of HADOOP_STREAMING, the streaming JAR file is often located in /usr/lib/hadoop-mapreduce/. However, if you cannot find the directory, /usr/lib/Hadoop-mapreduce, in your Linux system, you can search the streaming JAR by using the locate command. For example: $ sudo updatedb $ locate streaming | grep jar | more Operating HDFS with rhdfs The rhdfs package is an interface between Hadoop and R, which can call an HDFS API in the backend to operate HDFS. As a result, you can easily operate HDFS from the R console through the use of the rhdfs package. In the following recipe, we will demonstrate how to use the rhdfs function to manipulate HDFS. Getting ready To proceed with this recipe, you need to have completed the previous recipe by installing rhdfs into R, and validate that you can initial HDFS via the hdfs.init function. How to do it... 
Perform the following steps to operate files stored on HDFS: Initialize the rhdfs package: > Sys.setenv(HADOOP_CMD="/usr/bin/hadoop") > Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.2.0.jar") > library(rhdfs) > hdfs.init() You can then manipulate files stored on HDFS, as follows:     hdfs.put: Copy a file from the local filesystem to HDFS: > hdfs.put('word.txt', './')     hdfs.ls: List the files in an HDFS directory: > hdfs.ls('./')     hdfs.copy: Copy a file from one HDFS directory to another: > hdfs.copy('word.txt', 'wordcnt.txt')     hdfs.move: Move a file from one HDFS directory to another: > hdfs.move('wordcnt.txt', './data/wordcnt.txt')     hdfs.delete: Delete an HDFS directory from R: > hdfs.delete('./data/')     hdfs.rm: Delete an HDFS directory from R: > hdfs.rm('./data/')     hdfs.get: Download a file from HDFS to the local filesystem: > hdfs.get('word.txt', '/home/cloudera/word.txt')     hdfs.rename: Rename a file stored on HDFS: > hdfs.rename('./test/q1.txt', './test/test.txt')     hdfs.chmod: Change the permissions of a file or directory: > hdfs.chmod('test', permissions = '777')     hdfs.file.info: Read the meta information of the HDFS file: > hdfs.file.info('./') Also, you can write a stream to an HDFS file: > f = hdfs.file("iris.txt", "w") > data(iris) > hdfs.write(iris, f) > hdfs.close(f) Lastly, you can read a stream from an HDFS file: > f = hdfs.file("iris.txt", "r") > dfserialized = hdfs.read(f) > df = unserialize(dfserialized) > df > hdfs.close(f) How it works... In this recipe, we demonstrate how to manipulate HDFS using the rhdfs package. Normally, you can use the Hadoop shell to manipulate HDFS, but if you would like to access HDFS from R, you can use the rhdfs package. Before you start using rhdfs, you have to initialize rhdfs with hdfs.init(). After initialization, you can operate HDFS through the functions provided in the rhdfs package. Besides manipulating HDFS files, you can exchange streams with HDFS through hdfs.read and hdfs.write. We, therefore, demonstrate how to write a data frame in R to an HDFS file, iris.txt, using hdfs.write. Lastly, you can recover the written file back to the data frame using the hdfs.read function and the unserialize function. See also To initialize rhdfs, you have to set HADOOP_CMD and HADOOP_STREAMING in the system environment. Instead of setting the configuration each time you use rhdfs, you can put the configuration in the .Rprofile file. Therefore, every time you start an R session, the configuration will be automatically loaded. Implementing a word count problem with RHadoop To demonstrate how MapReduce works, we illustrate the example of a word count, which counts the number of occurrences of each word in a given input set. In this recipe, we will demonstrate how to use rmr2 to implement a word count problem. Getting ready In this recipe, we will need an input file as our word count program input. You can download the example input from https://github.com/ywchiu/ml_R_cookbook/tree/master/CH12. How to do it... Perform the following steps to implement the word count program: First, you need to configure the system environment, and then load rmr2 and rhdfs into an R session.
You may need to update the use of the JAR file if you use a different version of QuickStart VM: > Sys.setenv(HADOOP_CMD="/usr/bin/hadoop") > Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.2.0.jar") > library(rmr2) > library(rhdfs) > hdfs.init() You can then create a directory on HDFS and put the input file into the newly created directory: > hdfs.mkdir("/user/cloudera/wordcount/data") > hdfs.put("wc_input.txt", "/user/cloudera/wordcount/data") Next, you can create a map function: > map = function(., lines) { keyval( +   unlist( +     strsplit( +       x = lines, +       split = " +")), +   1)} Create a reduce function: > reduce = function(word, counts) { +   keyval(word, sum(counts)) + } Call the MapReduce program to count the words within a document: > hdfs.root = 'wordcount' > hdfs.data = file.path(hdfs.root, 'data') > hdfs.out = file.path(hdfs.root, 'out') > wordcount = function (input, output=NULL) { + mapreduce(input=input, output=output, input.format="text", map=map, + reduce=reduce) + } > out = wordcount(hdfs.data, hdfs.out) Lastly, you can retrieve the top 10 occurring words within the document: > results = from.dfs(out) > results$key[order(results$val, decreasing = TRUE)][1:10] How it works... In this recipe, we demonstrate how to implement a word count using the rmr2 package. First, we need to configure the system environment and load rhdfs and rmr2 into R. Then, we specify the input of our word count program from the local filesystem into the HDFS directory, /user/cloudera/wordcount/data, via the hdfs.put function. Next, we begin implementing the MapReduce program. Normally, we can divide the MapReduce program into the map and reduce functions. In the map function, we first use the strsplit function to split each line into words. Then, as the strsplit function returns a list of words, we use the unlist function to flatten it into a character vector. Lastly, we return key-value pairs with each word as a key and the value as one. As the reduce function receives the key-value pairs generated from the map function, it sums the counts and returns the number of occurrences of each word (or key). After we have implemented the map and reduce functions, we can submit our job via the mapreduce function. Normally, the mapreduce function requires four inputs, which are the HDFS input path, the HDFS output path, the map function, and the reduce function. In this case, we specify the input as wordcount/data, the output as wordcount/out, the map function as map, the reduce function as reduce, and wrap the mapreduce call in the function, wordcount. Lastly, we call the function, wordcount, and store the output path in the variable, out. We can use the from.dfs function to load the HDFS data into the results variable, which contains the mapping of words and number of occurrences. We can then generate the top 10 occurring words from the results variable. See also In this recipe, we demonstrate how to write an R MapReduce program to solve a word count problem. However, if you are interested in how to write a native Java MapReduce program, you can refer to http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html. Comparing the performance between an R MapReduce program and a standard R program Those not familiar with how Hadoop works may often see Hadoop as a remedy for big data processing. Some might believe that Hadoop can return the processed results for any size of data within a few milliseconds.
In this recipe, we will compare the performance between an R MapReduce program and a standard R program to demonstrate that Hadoop does not perform as quickly as some may believe. Getting ready In this recipe, you should have completed the previous recipe by installing rmr2 into the R environment. How to do it... Perform the following steps to compare the performance of a standard R program and an R MapReduce program: First, you can implement a standard R program to have all numbers squared: > a.time = proc.time() > small.ints2=1:100000 > result.normal = sapply(small.ints2, function(x) x^2) > proc.time() - a.time To compare the performance, you can implement an R MapReduce program to have all numbers squared: > b.time = proc.time() > small.ints= to.dfs(1:100000) > result = mapreduce(input = small.ints, map = function(k,v)       cbind(v,v^2)) > proc.time() - b.time How it works... In this recipe, we implement two programs to square all the numbers. In the first program, we use a standard R function, sapply, to square the sequence from 1 to 100,000. To record the program execution time, we first record the processing time before the execution in a.time, and then subtract a.time from the current processing time after the execution. Normally, the execution takes no more than 10 seconds. In the second program, we use the rmr2 package to implement a program in the R MapReduce version. In this program, we also record the execution time. Normally, this program takes a few minutes to complete a task. The performance comparison shows that a standard R program outperforms the MapReduce program when processing small amounts of data. This is because a Hadoop system often requires time to spawn daemons, job coordination between daemons, and fetching data from data nodes. Therefore, a MapReduce program often takes a few minutes to a couple of hours to finish the execution. As a result, if you can fit your data in the memory, you should write a standard R program to solve the problem. Otherwise, if the data is too large to fit in the memory, you can implement a MapReduce solution. See also In order to check whether a job will run smoothly and efficiently in Hadoop, you can run a MapReduce benchmark, MRBench, to evaluate the performance of the job: $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-test.jar mrbench -numRuns 50 Testing and debugging the rmr2 program Since running a MapReduce program will require a considerable amount of time, varying from a few minutes to several hours, testing and debugging become very important. In this recipe, we will illustrate some techniques you can use to troubleshoot an R MapReduce program. Getting ready In this recipe, you should have completed the previous recipe by installing rmr2 into an R environment. How to do it... 
Perform the following steps to test and debug an R MapReduce program: First, you can configure the backend as local in rmr.options: > rmr.options(backend = 'local') Again, you can execute the number squared MapReduce program mentioned in the previous recipe: > b.time = proc.time() > small.ints= to.dfs(1:100000) > result = mapreduce(input = small.ints, map = function(k,v)       cbind(v,v^2)) > proc.time() - b.time In addition to this, if you want to print the structure information of any variable in the MapReduce program, you can use the rmr.str function: > out = mapreduce(to.dfs(1), map = function(k, v) rmr.str(v)) Dotted pair list of 14 $ : language mapreduce(to.dfs(1), map = function(k, v) rmr.str(v)) $ : language mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, in.folder = if (is.list(input)) {     lapply(input, to.dfs.path) ...< $ : language c.keyval(do.call(c, lapply(in.folder, function(fname) {     kv = get.data(fname) ... $ : language do.call(c, lapply(in.folder, function(fname) {     kv = get.data(fname) ... $ : language lapply(in.folder, function(fname) {     kv = get.data(fname) ... $ : language FUN("/tmp/Rtmp813BFJ/file25af6e85cfde"[[1L]], ...) $ : language unname(tapply(1:lkv, ceiling((1:lkv)/(lkv/(object.size(kv)/10^6))), function(r) {     kvr = slice.keyval(kv, r) ... $ : language tapply(1:lkv, ceiling((1:lkv)/(lkv/(object.size(kv)/10^6))), function(r) {     kvr = slice.keyval(kv, r) ... $ : language lapply(X = split(X, group), FUN = FUN, ...) $ : language FUN(X[[1L]], ...) $ : language as.keyval(map(keys(kvr), values(kvr))) $ : language is.keyval(x) $ : language map(keys(kvr), values(kvr)) $ :length 2 rmr.str(v) ..- attr(*, "srcref")=Class 'srcref' atomic [1:8] 1 34 1 58 34 58 1 1 .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x3f984f0> v num 1 How it works... In this recipe, we introduced some debugging and testing techniques you can use while implementing the MapReduce program. First, we introduced the technique to test a MapReduce program in a local mode. If you would like to run the MapReduce program in a pseudo distributed or fully distributed mode, it would take you a few minutes to several hours to complete the task, which would involve a lot of wastage of time while troubleshooting your MapReduce program. Therefore, you can set the backend to the local mode in rmr.options so that the program will be executed in the local mode, which takes lesser time to execute. Another debugging technique is to list the content of the variable within the map or reduce function. In an R program, you can use the str function to display the compact structure of a single variable. In rmr2, the package also provides a function named rmr.str, which allows you to print out the content of a single variable onto the console. In this example, we use rmr.str to print the content of variables within a MapReduce program. See also For those who are interested in the option settings for the rmr2 package, you can refer to the help document of rmr.options: > help(rmr.options) Installing plyrmr The plyrmr package provides common operations (as found in plyr or reshape2) for users to easily perform data manipulation through the MapReduce framework. In this recipe, we will introduce how to install plyrmr on the Hadoop system. Getting ready Ensure that you have completed the previous recipe by starting the Cloudera QuickStart VM and connecting the VM to the Internet. Also, you need to have the rmr2 package installed beforehand. How to do it... 
Perform the following steps to install plyrmr on the Hadoop system: First, you should install libxml2-devel and curl-devel in the Linux shell: $ yum install libxml2-devel $ sudo yum install curl-devel You can then access R and install the dependent packages: $ sudo R > install.packages(c("RCurl", "httr"), dependencies = TRUE) > install.packages("devtools", dependencies = TRUE) > library(devtools) > install_github("pryr", "hadley") > install.packages(c("R.methodsS3", "hydroPSO"), dependencies = TRUE) > q() Next, you can download plyrmr 0.5.0 and install it on the Hadoop VM. You may need to update the link if Revolution Analytics upgrades the version of plyrmr: $ wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/plyrmr/master/build/plyrmr_0.5.0.tar.gz $ sudo R CMD INSTALL plyrmr_0.5.0.tar.gz Lastly, validate the installation: $ R > library(plyrmr) How it works... Besides writing an R MapReduce program using the rmr2 package, you can use the plyrmr package to manipulate data. The plyrmr package is similar to Hive and Pig in the Hadoop ecosystem, in that it is an abstraction over the MapReduce program. Therefore, we can implement an R MapReduce program in plyr style instead of implementing the map and reduce functions. To install plyrmr, first install the libxml2-devel and curl-devel packages, using the yum install command. Then, access R and install the dependent packages. Lastly, download the file from GitHub and install plyrmr in R. See also To read more information about plyrmr, you can use the help function to refer to the following document: > help(package=plyrmr) Manipulating data with plyrmr While writing a MapReduce program with rmr2 is much easier than writing a native Java version, it is still hard for nondevelopers to write a MapReduce program. Therefore, you can use plyrmr, a high-level abstraction of the MapReduce program, so that you can use plyr-like operations to manipulate big data. In this recipe, we will introduce some operations you can use to manipulate data. Getting ready In this recipe, you should have completed the previous recipes by installing plyrmr and rmr2 in R. How to do it... Perform the following steps to manipulate data with plyrmr: First, you need to load both plyrmr and rmr2 into R: > library(rmr2) > library(plyrmr) You can then set the execution mode to the local mode: > plyrmr.options(backend="local") Next, load the Titanic dataset into R: > data(Titanic) > titanic = data.frame(Titanic) Begin the operation by filtering the data: > where( +   Titanic, +   Freq >= 100) You can also use a pipe operator to filter the data: > titanic %|% where(Freq >= 100) Put the Titanic data into HDFS and load the path of the data to the variable, tidata: > tidata = to.dfs(data.frame(Titanic), output = '/tmp/titanic') > tidata Next, you can generate a summation of the frequency from the Titanic data: > input(tidata) %|% transmute(sum(Freq)) You can also group the frequency by sex: > input(tidata) %|% group(Sex) %|% transmute(sum(Freq)) You can then sample 10 records out of the population: > sample(input(tidata), n=10) In addition to this, you can use plyrmr to join two datasets: > convert_tb = data.frame(Label=c("No","Yes"), Symbol=c(0,1)) > ctb = to.dfs(convert_tb, output = 'convert') > as.data.frame(plyrmr::merge(input(tidata), input(ctb), by.x="Survived", by.y="Label")) > file.remove('convert') How it works... In this recipe, we introduce how to use plyrmr to manipulate data. First, we need to load the plyrmr package into R.
Then, similar to rmr2, you have to set the backend option of plyrmr as the local mode. Otherwise, you will have to wait anywhere between a few minutes to several hours if plyrmr is running on Hadoop mode (the default setting). Next, we can begin the data manipulation with data filtering. You can choose to call the function nested inside the other function call in step 4. On the other hand, you can use the pipe operator, %|%, to chain multiple operations. Therefore, we can filter data similar to step 4, using pipe operators in step 5. Next, you can input the dataset into either the HDFS or local filesystem, using to.dfs in accordance with the current running mode. The function will generate the path of the dataset and save it in the variable, tidata. By knowing the path, you can access the data using the input function. Next, we illustrate how to generate a summation of the frequency from the Titanic dataset with the transmute and sum functions. Also, plyrmr allows users to sum up the frequency by gender. Additionally, in order to sample data from a population, you can also use the sample function to select 10 records out of the Titanic dataset. Lastly, we demonstrate how to join two datasets using the merge function from plyrmr. See also Here we list some functions that can be used to manipulate data with plyrmr. You may refer to the help function for further details on their usage and functionalities: Data manipulation: bind.cols: This adds new columns select: This is used to select columns where: This is used to select rows transmute: This uses all of the above plus their summaries From reshape2: melt and dcast: It converts long and wide data frames Summary: count quantile sample Extract: top.k bottom.k Conducting machine learning with RHadoop At this point, some may believe that the use of RHadoop can easily solve machine learning problems of big data via numerous existing machine learning packages. However, you cannot use most of these to solve machine learning problems as they cannot be executed in the MapReduce mode. In the following recipe, we will demonstrate how to implement a MapReduce version of linear regression and compare this version with the one using the lm function. Getting ready In this recipe, you should have completed the previous recipe by installing rmr2 into the R environment. How to do it... Perform the following steps to implement a linear regression in MapReduce: First, load the cats dataset from the MASS package: > library(MASS) > data(cats) > X = matrix(cats$Bwt) > y = matrix(cats$Hwt) You can then generate a linear regression model by calling the lm function: > model = lm(y~X) > summary(model)   Call: lm(formula = y ~ X)   Residuals:    Min    1Q Median     3Q     Max -3.5694 -0.9634 -0.0921 1.0426 5.1238   Coefficients:            Estimate Std. Error t value Pr(>|t|)   (Intercept) -0.3567     0.6923 -0.515   0.607   X             4.0341     0.2503 16.119   <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1   Residual standard error: 1.452 on 142 degrees of freedom Multiple R-squared: 0.6466, Adjusted R-squared: 0.6441 F-statistic: 259.8 on 1 and 142 DF, p-value: < 2.2e-16 You can now make a regression plot with the given data points and model: > plot(y~X) > abline(model, col="red") Linear regression plot of cats dataset Load rmr2 into R: > Sys.setenv(HADOOP_CMD="/usr/bin/hadoop") > Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-> streaming-2.5.0-cdh5.2.0.jar") > library(rmr2) > rmr.options(backend="local") You can then set up X and y values: > X = matrix(cats$Bwt) > X.index = to.dfs(cbind(1:nrow(X), X)) > y = as.matrix(cats$Hwt) Make a Sum function to sum up the values: > Sum = +   function(., YY) +     keyval(1, list(Reduce('+', YY))) Compute Xtx in MapReduce, Job1: > XtX = +   values( +     from.dfs( +       mapreduce( +         input = X.index, +         map = +           function(., Xi) { +             Xi = Xi[,-1] +              keyval(1, list(t(Xi) %*% Xi))}, +         reduce = Sum, +         combine = TRUE)))[[1]] You can then compute Xty in MapReduce, Job2: Xty = +   values( +     from.dfs( +       mapreduce( +         input = X.index, +         map = function(., Xi) { +           yi = y[Xi[,1],] +           Xi = Xi[,-1] +           keyval(1, list(t(Xi) %*% yi))}, +         reduce = Sum, +         combine = TRUE)))[[1]] Lastly, you can derive the coefficient from XtX and Xty: > solve(XtX, Xty)          [,1] [1,] 3.907113 How it works... In this recipe, we demonstrate how to implement linear logistic regression in a MapReduce fashion in R. Before we start the implementation, we review how traditional linear models work. We first retrieve the cats dataset from the MASS package. We then load X as the body weight (Bwt) and y as the heart weight (Hwt). Next, we begin to fit the data into a linear regression model using the lm function. We can then compute the fitted model and obtain the summary of the model. The summary shows that the coefficient is 4.0341 and the intercept is -0.3567. Furthermore, we draw a scatter plot in accordance with the given data points and then draw a regression line on the plot. As we cannot perform linear regression using the lm function in the MapReduce form, we have to rewrite the regression model in a MapReduce fashion. Here, we would like to implement a MapReduce version of linear regression in three steps, which are: calculate the Xtx value with the MapReduce, job1, calculate the Xty value with MapReduce, job2, and then derive the coefficient value: In the first step, we pass the matrix, X, as the input to the map function. The map function then calculates the cross product of the transposed matrix, X, and, X. The reduce function then performs the sum operation defined in the previous section. In the second step, the procedure of calculating Xty is similar to calculating XtX. The procedure calculates the cross product of the transposed matrix, X, and, y. The reduce function then performs the sum operation. Lastly, we use the solve function to derive the coefficient, which is 3.907113. As the results show, the coefficients computed by lm and MapReduce differ slightly. Generally speaking, the coefficient computed by the lm model is more accurate than the one calculated by MapReduce. However, if your data is too large to fit in the memory, you have no choice but to implement linear regression in the MapReduce version. 
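As a quick sanity check that is not part of the original recipe, you can re-derive the same no-intercept coefficient in plain R with the normal equations; the two MapReduce jobs above compute exactly t(X) %*% X and t(X) %*% y, so the results should agree:
> # Sanity check (assumes the cats data from MASS is loaded as in step 1)
> X = matrix(cats$Bwt)                 # body weight, no intercept column
> y = matrix(cats$Hwt)                 # heart weight
> solve(t(X) %*% X, t(X) %*% y)        # normal equations, equivalent to solve(XtX, Xty)
         [,1]
[1,] 3.907113
Note that part of the difference from the lm slope of 4.0341 comes from the fact that the MapReduce version has no intercept column, so it fits a regression through the origin.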
See also You can access more information on machine learning algorithms at: https://github.com/RevolutionAnalytics/rmr2/tree/master/pkg/tests Configuring RHadoop clusters on Amazon EMR Until now, we have only demonstrated how to run a RHadoop program in a single Hadoop node. In order to test our RHadoop program on a multi-node cluster, the only thing you need to do is to install RHadoop on all the task nodes (nodes with either task tracker for mapreduce version 1 or node manager for map reduce version 2) of Hadoop clusters. However, the deployment and installation is time consuming. On the other hand, you can choose to deploy your RHadoop program on Amazon EMR, so that you can deploy multi-node clusters and RHadoop on every task node in only a few minutes. In the following recipe, we will demonstrate how to configure RHadoop cluster on an Amazon EMR service. Getting ready In this recipe, you must register and create an account on AWS, and you also must know how to generate a EC2 key-pair before using Amazon EMR. For those who seek more information on how to start using AWS, please refer to the tutorial provided by Amazon at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html. How to do it... Perform the following steps to configure RHadoop on Amazon EMR: First, you can access the console of the Amazon Web Service (refer to https://us-west-2.console.aws.amazon.com/console/) and find EMR in the analytics section. Then, click on EMR. Access EMR service from AWS console. You should find yourself in the cluster list of the EMR dashboard (refer to https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-list::); click on Create cluster. Cluster list of EMR Then, you should find yourself on the Create Cluster page (refer to https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#create-cluster:). Next, you should specify Cluster name and Log folder S3 location in the cluster configuration. Cluster configuration in the create cluster page You can then configure the Hadoop distribution on Software Configuration. Configure the software and applications Next, you can configure the number of nodes within the Hadoop cluster. Configure the hardware within Hadoop cluster You can then specify the EC2 key-pair for the master node login. Security and access to the master node of the EMR cluster To set up RHadoop, one has to perform bootstrap actions to install RHadoop on every task node. Please write a file named bootstrapRHadoop.sh, and insert the following lines within the file: echo 'install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava", "caTools"), repos="http://cran.us.r-project.org")' > /home/hadoop/installPackage.R sudo Rscript /home/hadoop/installPackage.R wget --no-check-certificate https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/master/build/rmr2_3.3.0.tar.gz sudo R CMD INSTALL rmr2_3.3.0.tar.gz wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz sudo HADOOP_CMD=/home/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz You should upload bootstrapRHadoop.sh to S3. You now need to add the bootstrap action with Custom action, and add s3://<location>/bootstrapRHadoop.sh within the S3 location. Set up the bootstrap action Next, you can click on Create cluster to launch the Hadoop cluster. Create the cluster Lastly, you should see the master public DNS when the cluster is ready. 
You can now access the terminal of the master node with your EC2-key pair: A screenshot of the created cluster How it works... In this recipe, we demonstrate how to set up RHadoop on Amazon EMR. The benefit of this is that you can quickly create a scalable, on demand Hadoop with just a few clicks within a few minutes. This helps save you time from building and deploying a Hadoop application. However, you have to pay for the number of running hours for each instance. Before using Amazon EMR, you should create an AWS account and know how to set up the EC2 key-pair and the S3. You can then start installing RHadoop on Amazon EMR. In the first step, access the EMR cluster list and click on Create cluster. You can see a list of configurations on the Create cluster page. You should then set up the cluster name and log folder in the S3 location in the cluster configuration. Next, you can set up the software configuration and choose the Hadoop distribution you would like to install. Amazon provides both its own distribution and the MapR distribution. Normally, you would skip this section unless you have concerns about the default Hadoop distribution. You can then configure the hardware by specifying the master, core, and task node. By default, there is only one master node, and two core nodes. You can add more core and task nodes if you like. You should then set up the key-pair to login to the master node. You should next make a file containing all the start scripts named bootstrapRHadoop.sh. After the file is created, you should save the file in the S3 storage. You can then specify custom action in Bootstrap Action with bootstrapRHadoop.sh as the Bootstrap script. Lastly, you can click on Create cluster and wait until the cluster is ready. Once the cluster is ready, one can see the master public DNS and can use the EC2 key-pair to access the terminal of the master node. Beware! Terminate the running instance if you do not want to continue using the EMR service. Otherwise, you will be charged per instance for every hour you use. See also Google also provides its own cloud solution, the Google compute engine. For those who would like to know more, please refer to https://cloud.google.com/compute/. Summary In this article, we started by preparing the Hadoop environment, so that you can install RHadoop. We then covered the installation of three main packages: rmr, rhdfs, and plyrmr. Next, we introduced how to use rmr to perform MapReduce from R, operate an HDFS file through rhdfs, and perform a common data operation using plyrmr. Further, we explored how to perform machine learning using RHadoop. Lastly, we introduced how to deploy multiple RHadoop clusters on Amazon EC2. Resources for Article: Further resources on this subject: Warming Up [article] Derivatives Pricing [article] Using R for Statistics, Research, and Graphics [article]

Subscribing to a report

Packt
26 Mar 2015
6 min read
In this article by Johan Yu, the author of Salesforce Reporting and Dashboards, we get acquainted with the components used when working with reports on the Salesforce platform. Subscribing to a report is a new feature in Salesforce introduced in the Spring 2015 release. When you subscribe to a report, you will get a notification on weekdays, daily, or weekly, whenever the report meets the criteria you have defined. You just need to subscribe to the reports that you care about most. (For more resources related to this topic, see here.) Subscribing to a report is not the same as the report's Schedule Future Run option, where scheduling a report for a future run will keep e-mailing you the report content at the specified frequency, without any conditions. When you subscribe to a report, on the other hand, you will receive notifications only when the report output meets the criteria you have defined. Subscribing to a report will not send you the report content in an e-mail, but just an alert that the report you subscribed to meets the conditions specified. To subscribe to a report, you do not need additional permissions, as your administrator can enable or disable this feature for the entire organization. By default, this feature will be turned on for customers using the Salesforce Spring 2015 release. If you are an administrator for the organization, you can check out this feature by navigating to Setup | Customize | Reports & Dashboards | Report Notification | Enable report notification subscriptions for all users. Besides receiving notifications via e-mail, you can also opt for Salesforce1 notifications, posts to Chatter feeds, and the execution of a custom action. Report Subscription To subscribe to a report, you need to define a set of conditions to trigger the notifications. Here is what you need to understand before you subscribe to a report: When: Every time conditions are met or only the first time conditions are met. Conditions: An aggregate can be a record count or a summarized field. Then define the operator and value you want the aggregate to be compared to. A summarized field is a field that the report summarizes as an average, smallest, largest, or sum value. You can add multiple conditions, but at this moment, only the AND condition is available. Schedule frequency: Every weekday, daily, or weekly, and the time the report will be run. Actions: E-mail notifications: You will get e-mail alerts when conditions are met. Posts to Chatter feeds: Alerts will be posted to your Chatter feed. Salesforce1 notifications: Alerts in your Salesforce1 app. Execute a custom action: This will trigger a call to an Apex class. You will need a developer to write the Apex code for this. Active: This is a checkbox used to activate or deactivate the subscription. You may just need to deactivate it when you want to unsubscribe temporarily; otherwise, deleting the subscription will remove all the settings defined. The following screenshot shows the conditions set in order to subscribe to a report: Monitoring a report subscription How can you know whether you have subscribed to a report? When you open the report and see the Subscribe button, it means you are not subscribed to that report:   Once you configure the report subscription, the button label will change to Edit Subscription. However, do not assume that every report showing Edit Subscription will send you alerts when the report meets the criteria; the subscription may simply not be active, so remember the Active checkbox described earlier when you subscribe to a report.
To know all the reports you subscribe to at a glance, as long as you have View Setup and Configuration permissions, navigate to Setup | Jobs | Scheduled Jobs, and look for Type as Reporting Notification, as shown in this screenshot:   Hands-on – subscribing to a report Here is our next use case: you would like to get a notification in your Salesforce1 app—an e-mail notification—and also posts on your Chatter feed once the Closed Won opportunity for the month has reached $50,000. Salesforce should check the report daily, but instead of getting this notification daily, you want to get it only once a week or month; otherwise, it will be disturbing. Creating reports Make sure you set the report with the correct filter, set Close Date as This Month, and summarize the Amount field, as shown in the following screenshot:   Subscribing Click on the Subscribe button and fill in the following details: Type as Only the first time conditions are met Conditions: Aggregate as Sum of Amount Operator as Greater Than or Equal Value as 50000 Schedule: Frequency as Every Weekday Time as 7AM In Actions, select: Send Salesforce1 Notification Post to Chatter Feed Send Email Notification In Active, select the checkbox Testing and saving The good thing of this feature is the ability to test without waiting until the scheduled date or time. Click on the Save & Run Now button. Here is the result: Salesforce1 notifications Open your Salesforce1 mobile app, look for the notification icon, and notice a new alert from the report you subscribed to, as shown in this screenshot: If you click on the notification, it will take you to the report that is shown in the following screenshot:   Chatter feed Since you selected the Post to Chatter Feed action, the same alert will go to your Chatter feed as well. Clicking on the link in the Chatter feed will open the same report in your Salesforce1 mobile app or from the web browser, as shown in this screenshot: E-mail notification The last action we've selected for this exercise is to send an e-mail notification. The following screenshot shows how the e-mail notification would look:   Limitations The following limitations are observed while subscribing to a report: You can set up to five conditions per report, and no OR logic conditions are possible You can subscribe for up to five reports, so use it wisely Summary In this article, you became familiar with components when working with reports on the Salesforce platform. We saw different report formats and the uniqueness of each format. We continued discussions on adding various types of charts to the report with point-and-click effort and no code; all of this can be done within minutes. We saw how to add filters to reports to customize our reports further, including using Filter Logic, Cross Filter, and Row Limit for tabular reports. We walked through managing and customizing custom report types, including how to hide unused report types and report type adoption analysis. In the last part of this article, we saw how easy it is to subscribe to a report and define criteria. Resources for Article: Further resources on this subject: Salesforce CRM – The Definitive Admin Handbook - Third Edition [article] Salesforce.com Customization Handbook [article] Developing Applications with Salesforce Chatter [article]
Read more
  • 0
  • 0
  • 2584

article-image-wlan-encryption-flaws
Packt
25 Mar 2015
21 min read
Save for later

WLAN Encryption Flaws

Packt
25 Mar 2015
21 min read
In this article by Cameron Buchanan, author of the book Kali Linux Wireless Penetration Testing Beginner's Guide, we will look at the flaws in WLAN encryption mechanisms and how they can be exploited. (For more resources related to this topic, see here.)

"640K is more memory than anyone will ever need."
                                                                            Bill Gates, Founder, Microsoft

Even with the best of intentions, the future is always unpredictable. The WLAN committee designed WEP and then WPA to be foolproof encryption mechanisms but, over time, both these mechanisms had flaws that have been widely publicized and exploited in the real world. WLAN encryption mechanisms have had a long history of being vulnerable to cryptographic attacks. It started with WEP in early 2000, which eventually was completely broken. In recent times, attacks are slowly targeting WPA. Even though there is no public attack available currently to break WPA in all general conditions, there are attacks that are feasible under special circumstances.

In this section, we will take a look at the following topics:

Different encryption schemas in WLANs
Cracking WEP encryption
Cracking WPA encryption

WLAN encryption

WLANs transmit data over the air and thus there is an inherent need to protect data confidentiality. This is best done using encryption. The WLAN committee (IEEE 802.11) formulated the following protocols for data encryption:

Wired Equivalent Privacy (WEP)
Wi-Fi Protected Access (WPA)
Wi-Fi Protected Access v2 (WPAv2)

In this section, we will take a look at each of these encryption protocols and demonstrate various attacks against them.

WEP encryption

The WEP protocol was known to be flawed as early as 2000 but, surprisingly, it continues to be used and access points still ship with WEP-enabled capabilities. There are many cryptographic weaknesses in WEP and they were discovered by Walker, Arbaugh, Fluhrer, Martin, Shamir, KoreK, and many others. Evaluation of WEP from a cryptographic standpoint is beyond the scope of this article, as it involves understanding complex math. In this section, we will take a look at how to break WEP encryption using readily available tools on the BackTrack platform. This includes the entire aircrack-ng suite of tools: airmon-ng, aireplay-ng, airodump-ng, aircrack-ng, and others.

The fundamental weakness in WEP is its use of RC4 and a short IV value that is recycled every 2^24 frames. While this is a large number in itself, there is a 50 percent chance of four IV reuses every 5,000 packets. To use this to our advantage, we generate a large amount of traffic so that we can increase the likelihood of IVs being reused, and thus compare two cipher texts encrypted with the same IV and key.

Let's now first set up WEP in our test lab and see how we can break it.

Time for action – cracking WEP

Follow the given instructions to get started:

Let's first connect to our access point Wireless Lab and go to the settings area that deals with wireless encryption mechanisms. On my access point, this can be done by setting the Security Mode to WEP. We will also need to set the WEP key length. As shown in the following screenshot, I have set WEP to use 128-bit keys. I have set the default key to WEP Key 1 and the value in hex to abcdefabcdefabcdefabcdef12 as the 128-bit WEP key. You can set this to whatever you choose:

Once the settings are applied, the access point should now be offering WEP as the encryption mechanism of choice. Let's now set up the attacker machine.
Let's bring up wlan0 by issuing the following command:

ifconfig wlan0 up

Then, we will run the following command:

airmon-ng start wlan0

This is done so as to create mon0, the monitor mode interface, as shown in the following screenshot. Verify that the mon0 interface has been created using the iwconfig command:

Let's run airodump-ng to locate our lab access point using the following command:

airodump-ng mon0

As you can see in the following screenshot, we are able to see the Wireless Lab access point running WEP:

For this exercise, we are only interested in the Wireless Lab, so let's enter the following command to only see packets for this network:

airodump-ng --bssid 00:21:91:D2:8E:25 --channel 11 --write WEPCrackingDemo mon0

The preceding command line is shown in the following screenshot. We are requesting airodump-ng to save the packets into a pcap file using the --write directive:

Now let's connect our wireless client to the access point and use the WEP key abcdefabcdefabcdefabcdef12. Once the client has successfully connected, airodump-ng should report it on the screen.

If you do an ls in the same directory, you will be able to see files prefixed with WEPCrackingDemo-*, as shown in the following screenshot. These are traffic dump files created by airodump-ng:

If you look at the airodump-ng screen, the number of data packets listed under the #Data column is very small (only 68). In WEP cracking, we need a large number of data packets encrypted with the same key to exploit weaknesses in the protocol. So, we will have to force the network to produce more data packets. To do this, we will use the aireplay-ng tool.

We will capture ARP packets on the wireless network using aireplay-ng and inject them back into the network to simulate ARP responses. We will be starting aireplay-ng in a separate window, as shown in the next screenshot. By replaying these packets a few thousand times, we will generate a lot of data traffic on the network. Even though aireplay-ng does not know the WEP key, it is able to identify the ARP packets by looking at the size of the packets. ARP is a fixed-header protocol; thus, the size of the ARP packets can be easily determined and can be used to identify them even within encrypted traffic.

We will run aireplay-ng with the options that are discussed next. The -3 option is for ARP replay, -b specifies the BSSID of our network, and -h specifies the client MAC address that we are spoofing. We need to do this, as replay attacks will only work for authenticated and associated client MAC addresses:

Very soon you should see that aireplay-ng was able to sniff ARP packets and started replaying them into the network. If you encounter channel-related errors as I did, append --ignore-negative-one to your command, as shown in the following screenshot:

At this point, airodump-ng will also start registering a lot of data packets. All these sniffed packets are being stored in the WEPCrackingDemo-* files that we saw previously:

Now let's start with the actual cracking part! We fire up aircrack-ng with the option WEPCrackingDemo-0*.cap in a new window. This will start the aircrack-ng software and it will begin working on cracking the WEP key using the data packets in the file. Note that it is a good idea to have airodump-ng collect the WEP packets, aireplay-ng do the replay attack, and aircrack-ng attempt to crack the WEP key based on the captured packets, all at the same time. In this experiment, all of them are open in separate windows.
Your screen should look like the following screenshot when aircrack-ng is working on the packets to crack the WEP key:

The number of data packets required to crack the key is nondeterministic, but generally in the order of a hundred thousand or more. On a fast network (or using aireplay-ng), this should take 5-10 minutes at most. If the number of data packets currently in the file is not sufficient, then aircrack-ng will pause, as shown in the following screenshot, and wait for more packets to be captured; it will then restart the cracking process:

Once enough data packets have been captured and processed, aircrack-ng should be able to break the key. Once it does, it proudly displays it in the terminal and exits, as shown in the following screenshot:

It is important to note that WEP is totally flawed and any WEP key (no matter how complex) will be cracked by aircrack-ng. The only requirement is that a large enough number of data packets, encrypted with this key, are made available to aircrack-ng.

What just happened?

We set up WEP in our lab and successfully cracked the WEP key. In order to do this, we first waited for a legitimate client of the network to connect to the access point. After this, we used the aireplay-ng tool to replay ARP packets into the network. This caused the network to send ARP reply packets, thus greatly increasing the number of data packets sent over the air. We then used the aircrack-ng tool to crack the WEP key by analyzing cryptographic weaknesses in these data packets.

Note that we can also fake an authentication to the access point using the Shared Key Authentication bypass technique. This can come in handy if the legitimate client leaves the network. It will ensure that we can spoof an authentication and association and continue to send our replayed packets into the network.

Have a go hero – fake authentication with WEP cracking

In the previous exercise, if the legitimate client had suddenly logged off the network, we would not have been able to replay the packets, as the access point will refuse to accept packets from unassociated clients. While WEP cracking is going on, log off the legitimate client from the network and verify that you are still able to inject packets into the network and that the access point accepts and responds to them.

WPA/WPA2

WPA (or WPA v1 as it is sometimes referred to) primarily uses the TKIP encryption algorithm. TKIP was aimed at improving WEP, without requiring completely new hardware to run it. WPA2, in contrast, mandatorily uses the AES-CCMP algorithm for encryption, which is much more powerful and robust than TKIP.

Both WPA and WPA2 allow either EAP-based authentication, using RADIUS servers (Enterprise), or a Pre-Shared Key (PSK) (Personal)-based authentication schema. WPA/WPA2 PSK is vulnerable to a dictionary attack. The inputs required for this attack are the four-way WPA handshake between client and access point, and a wordlist that contains common passphrases. Then, using tools such as aircrack-ng, we can try to crack the WPA/WPA2 PSK passphrase. An illustration of the four-way handshake is shown in the following screenshot:

The way WPA/WPA2 PSK works is that it derives the per-session key, called the Pairwise Transient Key (PTK), using the Pre-Shared Key and five other parameters: the SSID of the network, the Authenticator Nonce (ANonce), the Supplicant Nonce (SNonce), the Authenticator MAC address (access point MAC), and the Supplicant MAC address (Wi-Fi client MAC). This key is then used to encrypt all data between the access point and the client.

An attacker who is eavesdropping on this entire conversation by sniffing the air can get all five parameters mentioned in the previous paragraph. The only thing he does not have is the Pre-Shared Key. So, how is the Pre-Shared Key created? It is derived by using the WPA-PSK passphrase supplied by the user, along with the SSID. The combination of both of these is sent through the Password-Based Key Derivation Function (PBKDF2), which outputs the 256-bit shared key.

In a typical WPA/WPA2 PSK dictionary attack, the attacker would use a large dictionary of possible passphrases with the attack tool. The tool would derive the 256-bit Pre-Shared Key from each of the passphrases and use it with the other parameters, described earlier, to create the PTK. The PTK is used to verify the Message Integrity Check (MIC) in one of the handshake packets. If it matches, then the guessed passphrase from the dictionary was correct; if not, it was incorrect. Eventually, if the authorized network passphrase exists in the dictionary, it will be identified. This is exactly how WPA/WPA2 PSK cracking works! The following figure illustrates the steps involved:

In the next exercise, we will take a look at how to crack a WPA-PSK wireless network. The exact same steps will be involved in cracking a WPA2-PSK network using CCMP (AES) as well.

Time for action – cracking WPA-PSK weak passphrases

Follow the given instructions to get started:

Let's first connect to our access point Wireless Lab and set the access point to use WPA-PSK. We will set the WPA-PSK passphrase to abcdefgh so that it is vulnerable to a dictionary attack:

We start airodump-ng with the following command so that it starts capturing and storing all packets for our network:

airodump-ng --bssid 00:21:91:D2:8E:25 --channel 11 --write WPACrackingDemo mon0

The following screenshot shows the output:

Now we can wait for a new client to connect to the access point so that we can capture the four-way WPA handshake, or we can send a broadcast deauthentication packet to force clients to reconnect. We do the latter to speed things up. The same unknown channel error can happen again; if so, use --ignore-negative-one once more. This can also require more than one attempt:

As soon as we capture a WPA handshake, the airodump-ng tool will indicate it in the top-right corner of the screen with a WPA handshake followed by the access point's BSSID. If you are using --ignore-negative-one, the tool may replace the WPA handshake with a fixed channel message. Just keep an eye out for a quick flash of a WPA handshake.

We can stop the airodump-ng utility now. Let's open up the cap file in Wireshark and view the four-way handshake. Your Wireshark terminal should look like the following screenshot. I have selected the first packet of the four-way handshake in the trace file in the screenshot. The handshake packets are the ones whose protocol is EAPOL:

Now we will start the actual key cracking exercise! For this, we need a dictionary of common words. Kali ships with many dictionary files in the metasploit folder, located as shown in the following screenshot. It is important to note that, in WPA cracking, you are only as good as your dictionary. BackTrack ships with some dictionaries, but these may be insufficient. The passwords that people choose depend on a lot of things. This includes things such as which country users live in, common names and phrases in that region, the security awareness of the users, and a host of other things. It may be a good idea to aggregate country- and region-specific wordlists when undertaking a penetration test:

We will now invoke the aircrack-ng utility with the pcap file as the input and a link to the dictionary file, as shown in the following screenshot. I have used nmap.lst, as shown in the terminal:

aircrack-ng uses the dictionary file to try various combinations of passphrases and tries to crack the key. If the passphrase is present in the dictionary file, it will eventually crack it and your screen will look similar to the one in the screenshot:

Please note that, as this is a dictionary attack, the prerequisite is that the passphrase must be present in the dictionary file you are supplying to aircrack-ng. If the passphrase is not present in the dictionary, the attack will fail!

What just happened?

We set up WPA-PSK on our access point with a common passphrase: abcdefgh. We then used a deauthentication attack to have legitimate clients reconnect to the access point. When a client reconnects, we capture the four-way WPA handshake between the access point and the client.

As WPA-PSK is vulnerable to a dictionary attack, we feed the capture file that contains the WPA four-way handshake and a list of common passphrases (in the form of a wordlist) to aircrack-ng. As the passphrase abcdefgh is present in the wordlist, aircrack-ng is able to crack the WPA-PSK shared passphrase. It is very important to note again that, in WPA dictionary-based cracking, you are only as good as the dictionary you have. Thus, it is important to compile a large and elaborate dictionary before you begin. Though BackTrack ships with its own dictionary, it may be insufficient at times and might need more words, especially taking into account the localization factor.

Have a go hero – trying WPA-PSK cracking with Cowpatty

Cowpatty is a tool that can also crack a WPA-PSK passphrase using a dictionary attack. This tool is included with BackTrack. I leave it as an exercise for you to use Cowpatty to crack the WPA-PSK passphrase. Also, set an uncommon passphrase that is not present in the dictionary and try the attack again. You will now be unsuccessful in cracking the passphrase with both aircrack-ng and Cowpatty.

It is important to note that the same attack applies even to a WPA2-PSK network. I encourage you to verify this independently.

Speeding up WPA/WPA2 PSK cracking

As we have already seen in the previous section, if we have the correct passphrase in our dictionary, cracking WPA-Personal will work every time like a charm. So, why don't we just create a large, elaborate dictionary of millions of common passwords and phrases people use? This would help us a lot and, most of the time, we would end up cracking the passphrase. It all sounds great, but we are missing one key component here: the time taken. One of the more CPU- and time-consuming calculations is that of the Pre-Shared Key, derived from the PSK passphrase and the SSID through PBKDF2. This function hashes the combination of both over 4,096 times before outputting the 256-bit Pre-Shared Key. The next step in cracking involves using this key along with parameters in the four-way handshake and verifying against the MIC in the handshake. This step is computationally inexpensive. Also, the parameters will vary in the handshake every time and hence, this step cannot be precomputed.
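To make the cost of this derivation concrete, here is a minimal Python 3 sketch (it is not part of the original exercise) that computes the 256-bit Pre-Shared Key exactly as described above: PBKDF2 with HMAC-SHA1, the SSID used as the salt, 4,096 iterations, and a 32-byte output. The helper name wpa_pmk is ours, and the passphrase and SSID values are simply the ones used in this lab.

import hashlib
import binascii

def wpa_pmk(passphrase, ssid):
    # PBKDF2-HMAC-SHA1, SSID as the salt, 4,096 iterations,
    # 32-byte (256-bit) derived key, as described in the text.
    return hashlib.pbkdf2_hmac('sha1',
                               passphrase.encode('utf-8'),
                               ssid.encode('utf-8'),
                               4096,
                               32)

if __name__ == '__main__':
    # Values borrowed from the lab setup in this exercise.
    pmk = wpa_pmk('abcdefgh', 'Wireless Lab')
    print(binascii.hexlify(pmk).decode())

Running this once is cheap, but running it for every candidate passphrase in a dictionary of millions of words is what makes a naive WPA dictionary attack slow, which is the problem the next section addresses.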
Thus, to speed up the cracking process, we need to make the calculation of the Pre-Shared Key from the passphrase as fast as possible. We can speed this up by precalculating the Pre-Shared Key, also called the Pairwise Master Key (PMK) in 802.11 standard parlance. It is important to note that, as the SSID is also used to calculate the PMK, with the same passphrase and a different SSID, we will end up with a different PMK. Thus, the PMK depends on both the passphrase and the SSID.

In the next exercise, we will take a look at how to precalculate the PMK and use it for WPA/WPA2 PSK cracking.

Time for action – speeding up the cracking process

We can proceed with the following steps:

We can precalculate the PMK for a given SSID and wordlist using the genpmk tool with the following command:

genpmk -f <chosen wordlist> -d PMK-Wireless-Lab -s "Wireless Lab"

This creates the PMK-Wireless-Lab file containing the pregenerated PMKs:

We now create a WPA-PSK network with the passphrase abcdefgh (present in the dictionary we used) and capture a WPA handshake for that network. We then use Cowpatty to crack the WPA passphrase, as shown in the following screenshot:

It takes approximately 7.18 seconds for Cowpatty to crack the key using the precalculated PMKs. Using aircrack-ng with the same dictionary file, the cracking process takes over 22 minutes. This shows how much we are gaining because of the precalculation.

In order to use these PMKs with aircrack-ng, we need to use a tool called airolib-ng. We will give it the command airolib-ng PMK-Aircrack --import cowpatty PMK-Wireless-Lab, where PMK-Aircrack is the aircrack-ng compatible database to be created and PMK-Wireless-Lab is the genpmk compliant PMK database that we created previously.

We now feed this database to aircrack-ng and the cracking process speeds up remarkably. We use the following command:

aircrack-ng -r PMK-Aircrack WPACrackingDemo2-01.cap

There are additional tools available on BackTrack, such as Pyrit, that can leverage multi-CPU systems to speed up cracking. We give the pcap filename with the -r option and the genpmk compliant PMK file with the -i option. Even on the same system used with the previous tools, Pyrit takes around 3 seconds to crack the key, using the same PMK file created using genpmk.

What just happened?

We looked at various tools and techniques to speed up WPA/WPA2-PSK cracking. The whole idea is to precalculate the PMK for a given SSID and a list of passphrases in our dictionary.

Decrypting WEP and WPA packets

In all the exercises we have done till now, we cracked the WEP and WPA keys using various techniques. What do we do with this information? The first step is to decrypt the data packets we have captured using these keys. In the next exercise, we will decrypt the WEP and WPA packets in the same trace files that we captured over the air, using the keys we cracked.

Time for action – decrypting WEP and WPA packets

We can proceed with the following steps:

We will decrypt packets from the WEP capture file we created earlier. For this, we will use another tool in the aircrack-ng suite called airdecap-ng. We will run the following command, as shown in the following screenshot, using the WEP key we cracked previously:

airdecap-ng -w abcdefabcdefabcdefabcdef12 WEPCrackingDemo-02.cap

The decrypted packets are stored in a file named WEPCrackingDemo-02-dec.cap. We use the tshark utility to view the first ten packets in the file. Please note that you may see something different based on what you captured:

WPA/WPA2 PSK decryption works in exactly the same way as WEP, using the airdecap-ng utility with the following command, as shown in the following screenshot:

airdecap-ng -p abcdefgh WPACrackingDemo-02.cap -e "Wireless Lab"

What just happened?

We just saw how we can decrypt WEP and WPA/WPA2-PSK encrypted packets using airdecap-ng. It is interesting to note that we can do the same using Wireshark. We would encourage you to explore how this can be done by consulting the Wireshark documentation.

Connecting to WEP and WPA networks

We can also connect to the authorized network after we have cracked the network key. This can come in handy during penetration testing. Logging onto the authorized network with the cracked key is the ultimate proof you can provide to your client that his network is insecure.

Time for action – connecting to a WEP network

We can proceed with the following steps:

Use the iwconfig utility to connect to a WEP network, once you have the key. In a past exercise, we broke the WEP key abcdefabcdefabcdefabcdef12:

What just happened?

We saw how to connect to a WEP network.

Time for action – connecting to a WPA network

We can proceed with the following steps:

In the case of WPA, the matter is a bit more complicated. The iwconfig utility cannot be used with WPA/WPA2 Personal and Enterprise, as it does not support them. We will use a new tool called wpa_supplicant for this lab. To use wpa_supplicant for a network, we will need to create a configuration file, as shown in the following screenshot. We will name this file wpa-supp.conf:

We will then invoke the wpa_supplicant utility with the following options: -D wext -i wlan0 -c wpa-supp.conf to connect to the WPA network we just cracked. Once the connection is successful, wpa_supplicant will give you the message: Connection to XXXX completed.

For both the WEP and WPA networks, once you are connected, you can use the DHCP client to grab an address from the network by typing dhclient3 wlan0.

What just happened?

The default Wi-Fi utility iwconfig cannot be used to connect to WPA/WPA2 networks. The de facto tool for this is wpa_supplicant. In this lab, we saw how we can use it to connect to a WPA network.

Summary

In this section, we learnt about WLAN encryption. WEP is flawed and, no matter what the WEP key is, with enough data packet samples it is always possible to crack it. WPA/WPA2 is currently cryptographically uncrackable; however, under special circumstances, such as when a weak passphrase is chosen in WPA/WPA2-PSK, it is possible to retrieve the passphrase using dictionary attacks.

Resources for Article:

Further resources on this subject:

Veil-Evasion [article]
Wireless and Mobile Hacks [article]
Social Engineering Attacks [article]
Read more
  • 0
  • 0
  • 11009
article-image-cross-browser-tests-using-selenium-webdriver
Packt
25 Mar 2015
18 min read
Save for later

Cross-browser Tests using Selenium WebDriver

Packt
25 Mar 2015
18 min read
In this article by Prashanth Sams, author of the book Selenium Essentials, we will learn how to perform efficient compatibility tests and how to run tests on the cloud. You will cover the following topics in this article:

Selenium WebDriver compatibility tests
Selenium cross-browser tests on cloud
Selenium headless browser testing

(For more resources related to this topic, see here.)

Selenium WebDriver compatibility tests

Selenium WebDriver handles browser compatibility tests on almost every popular browser, including Chrome, Firefox, Internet Explorer, Safari, and Opera. In general, every browser's JavaScript engine differs from the others, and each browser interprets the HTML tags differently. The WebDriver API drives the web browser as a real user would drive it. By default, FirefoxDriver comes with the selenium-server-standalone.jar library added; however, for Chrome, IE, Safari, and Opera, there are libraries that need to be added or instantiated externally. Let's see how we can instantiate each of the following browsers through its own driver:

Mozilla Firefox: The selenium-server-standalone library is bundled with FirefoxDriver to initialize and run tests in a Firefox browser. FirefoxDriver is added to the Firefox profile as a file extension on starting a new instance of FirefoxDriver. Please check the Firefox versions and their suitable drivers at http://selenium.googlecode.com/git/java/CHANGELOG. The following is the code snippet to kick-start Mozilla Firefox:

WebDriver driver = new FirefoxDriver();

Google Chrome: Unlike FirefoxDriver, ChromeDriver is an external library file that makes use of WebDriver's wire protocol to run Selenium tests in a Google Chrome web browser. The following is the code snippet to kick-start Google Chrome:

System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
WebDriver driver = new ChromeDriver();

To download ChromeDriver, refer to http://chromedriver.storage.googleapis.com/index.html.

Internet Explorer: IEDriverServer is an executable file that uses the WebDriver wire protocol to control the IE browser on Windows. Currently, IEDriverServer supports the IE versions 6, 7, 8, 9, and 10. The following code snippet helps you to instantiate IEDriverServer:

System.setProperty("webdriver.ie.driver", "C:\\IEDriverServer.exe");
DesiredCapabilities dc = DesiredCapabilities.internetExplorer();
dc.setCapability(InternetExplorerDriver.INTRODUCE_FLAKINESS_BY_IGNORING_SECURITY_DOMAINS, true);
WebDriver driver = new InternetExplorerDriver(dc);

To download IEDriverServer, refer to http://selenium-release.storage.googleapis.com/index.html.

Apple Safari: Similar to FirefoxDriver, SafariDriver is internally bound with the latest Selenium servers, which starts the Apple Safari browser without any external library. SafariDriver supports the Safari browser versions 5.1.x and runs only on Mac. For more details, refer to http://elementalselenium.com/tips/69-safari. The following code snippet helps you to instantiate SafariDriver:

WebDriver driver = new SafariDriver();

Opera: OperaPrestoDriver (formerly called OperaDriver) is available only for Presto-based Opera browsers. Currently, it does not support Opera versions 12.x and above. However, the recent releases (Opera 15.x and above) of Blink-based Opera browsers are handled using OperaChromiumDriver. For more details, refer to https://github.com/operasoftware/operachromiumdriver.
The following code snippet helps you to instantiate OperaChromiumDriver:

DesiredCapabilities capabilities = new DesiredCapabilities();
capabilities.setCapability("opera.binary", "C://Program Files (x86)//Opera//opera.exe");
capabilities.setCapability("opera.log.level", "CONFIG");
WebDriver driver = new OperaDriver(capabilities);

To download OperaChromiumDriver, refer to https://github.com/operasoftware/operachromiumdriver/releases.

TestNG

TestNG (Next Generation) is one of the most widely used unit-testing frameworks implemented for Java. It runs Selenium-based browser compatibility tests with the most popular browsers. Eclipse IDE users must ensure that the TestNG plugin is integrated with the IDE manually. However, the TestNG plugin is bundled with IntelliJ IDEA by default.

The testng.xml file is a TestNG build file to control test execution; the XML file can be run through Maven tests using POM.xml with the help of the following code snippet:

<plugin>
   <groupId>org.apache.maven.plugins</groupId>
   <artifactId>maven-surefire-plugin</artifactId>
   <version>2.12.2</version>
   <configuration>
     <suiteXmlFiles>
       <suiteXmlFile>testng.xml</suiteXmlFile>
     </suiteXmlFiles>
   </configuration>
</plugin>

To create a testng.xml file, right-click on the project folder in the Eclipse IDE, navigate to TestNG | Convert to TestNG, and click on Convert to TestNG, as shown in the following screenshot:

The testng.xml file manages the entire test run; it acts as a mini data source by passing the parameters directly into the test methods. The location of the testng.xml file is shown in the following screenshot:

As an example, create a Selenium project (for example, Selenium Essentials) along with the testng.xml file, as shown in the previous screenshot. Modify the testng.xml file with the following tags:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
<suite name="Suite" verbose="3" parallel="tests" thread-count="5">
  <test name="Test on Firefox">
    <parameter name="browser" value="Firefox" />
    <classes>
      <class name="package.classname" />
    </classes>
  </test>
  <test name="Test on Chrome">
    <parameter name="browser" value="Chrome" />
    <classes>
      <class name="package.classname" />
    </classes>
  </test>
  <test name="Test on InternetExplorer">
    <parameter name="browser" value="InternetExplorer" />
    <classes>
      <class name="package.classname" />
    </classes>
  </test>
  <test name="Test on Safari">
    <parameter name="browser" value="Safari" />
    <classes>
      <class name="package.classname" />
    </classes>
  </test>
  <test name="Test on Opera">
    <parameter name="browser" value="Opera" />
    <classes>
      <class name="package.classname" />
    </classes>
  </test>
</suite> <!-- Suite -->

Download all the external drivers except FirefoxDriver and SafariDriver, extract the zipped folders, and locate the external drivers in the test script as mentioned in the preceding snippets for each browser.
The following Java snippet will explain to you how you can get parameters directly from the testng.xml file and how you can run cross-browser tests as a whole:

@BeforeTest
@Parameters({"browser"})
public void setUp(String browser) throws MalformedURLException {
  if (browser.equalsIgnoreCase("Firefox")) {
    System.out.println("Running Firefox");
    driver = new FirefoxDriver();
  } else if (browser.equalsIgnoreCase("chrome")) {
    System.out.println("Running Chrome");
    System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
    driver = new ChromeDriver();
  } else if (browser.equalsIgnoreCase("InternetExplorer")) {
    System.out.println("Running Internet Explorer");
    System.setProperty("webdriver.ie.driver", "C:\\IEDriverServer.exe");
    DesiredCapabilities dc = DesiredCapabilities.internetExplorer();
    // If IE fails to work, remove the following line and disable Protected Mode
    // for all four zones in Internet Options
    dc.setCapability(InternetExplorerDriver.INTRODUCE_FLAKINESS_BY_IGNORING_SECURITY_DOMAINS, true);
    driver = new InternetExplorerDriver(dc);
  } else if (browser.equalsIgnoreCase("safari")) {
    System.out.println("Running Safari");
    driver = new SafariDriver();
  } else if (browser.equalsIgnoreCase("opera")) {
    System.out.println("Running Opera");
    // driver = new OperaDriver(); -- use this if the Opera location is set properly --
    DesiredCapabilities capabilities = new DesiredCapabilities();
    capabilities.setCapability("opera.binary", "C://Program Files (x86)//Opera//opera.exe");
    capabilities.setCapability("opera.log.level", "CONFIG");
    driver = new OperaDriver(capabilities);
  }
}

SafariDriver is not yet stable. A few of the major issues in SafariDriver are as follows:

SafariDriver won't work properly on Windows
SafariDriver does not support modal dialog box interaction
You cannot navigate forward or backwards in browser history through SafariDriver

Selenium cross-browser tests on the cloud

The ability to automate Selenium tests on the cloud is quite interesting, with instant access to real devices. Sauce Labs, BrowserStack, and TestingBot are the leading web-based tools used for cross-browser compatibility checking. These tools contain unique test automation features, such as diagnosing failures through screenshots and video, executing parallel tests, running Appium mobile automation tests, executing tests on internal local servers, and so on.

SauceLabs

SauceLabs is the standard Selenium test automation web app for cross-browser compatibility tests on the cloud. It lets you automate tests in your favorite programming languages using test frameworks such as JUnit, TestNG, RSpec, and many more. SauceLabs cloud tests can also be executed from the Selenium Builder IDE interface. Check the available SauceLabs devices, OS, and platforms at https://saucelabs.com/platforms.

Access the website from your web browser, log in, and obtain the Sauce username and Access Key. Make use of the obtained credentials to drive tests over the SauceLabs cloud. SauceLabs creates a new instance of a virtual machine while launching the tests. Parallel automation tests are also possible using SauceLabs.
The following is a Java program to run tests over the SauceLabs cloud:

package packagename;

import java.net.URL;
import java.lang.reflect.*;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

public class saucelabs {

  private WebDriver driver;

  @Parameters({"username", "key", "browser", "browserVersion"})
  @BeforeMethod
  public void setUp(@Optional("yourusername") String username,
                    @Optional("youraccesskey") String key,
                    @Optional("iphone") String browser,
                    @Optional("5.0") String browserVersion,
                    Method method) throws Exception {
    // Choose the browser, version, and platform to test
    DesiredCapabilities capabilities = new DesiredCapabilities();
    capabilities.setBrowserName(browser);
    capabilities.setCapability("version", browserVersion);
    capabilities.setCapability("platform", Platform.MAC);
    capabilities.setCapability("name", method.getName());
    // Create the connection to SauceLabs to run the tests
    this.driver = new RemoteWebDriver(
        new URL("http://" + username + ":" + key + "@ondemand.saucelabs.com:80/wd/hub"),
        capabilities);
  }

  @Test
  public void Selenium_Essentials() throws Exception {
    // Make the browser get the page and check its title
    driver.get("http://www.google.com");
    System.out.println("Page title is: " + driver.getTitle());
    Assert.assertEquals("Google", driver.getTitle());
    WebElement element = driver.findElement(By.name("q"));
    element.sendKeys("Selenium Essentials");
    element.submit();
  }

  @AfterMethod
  public void tearDown() throws Exception {
    driver.quit();
  }
}

SauceLabs has a setup similar to BrowserStack on test execution and generates detailed logs. The breakpoints feature allows the user to manually take control over the virtual machine and pause tests, which helps the user to investigate and debug problems. By capturing JavaScript's console log, the JS errors and network requests are displayed for quick diagnosis while running tests against the Google Chrome browser.

BrowserStack

BrowserStack is a cloud-testing web app to access virtual machines instantly. It allows users to perform multi-browser testing of their applications on different platforms. It provides a setup similar to SauceLabs for cloud-based automation using Selenium.

Access the site https://www.browserstack.com from your web browser, log in, and obtain the BrowserStack username and Access Key. Make use of the obtained credentials to drive tests over the BrowserStack cloud. For example, the following generic Java program with TestNG provides a detailed overview of the process that runs on the BrowserStack cloud. Customize the browser name, version, platform, and so on, using capabilities.
Let's see the Java program we just talked about:

package packagename;

import java.net.URL;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

public class browserstack {

  public static final String USERNAME = "yourusername";
  public static final String ACCESS_KEY = "youraccesskey";
  public static final String URL = "http://" + USERNAME + ":" + ACCESS_KEY + "@hub.browserstack.com/wd/hub";

  private WebDriver driver;

  @BeforeClass
  public void setUp() throws Exception {
    DesiredCapabilities caps = new DesiredCapabilities();
    caps.setCapability("browser", "Firefox");
    caps.setCapability("browser_version", "23.0");
    caps.setCapability("os", "Windows");
    caps.setCapability("os_version", "XP");
    caps.setCapability("browserstack.debug", "true"); // This enables Visual Logs
    driver = new RemoteWebDriver(new URL(URL), caps);
  }

  @Test
  public void testOnCloud() throws Exception {
    driver.get("http://www.google.com");
    System.out.println("Page title is: " + driver.getTitle());
    Assert.assertEquals("Google", driver.getTitle());
    WebElement element = driver.findElement(By.name("q"));
    element.sendKeys("seleniumworks");
    element.submit();
  }

  @AfterClass
  public void tearDown() throws Exception {
    driver.quit();
  }
}

The app generates and stores test logs for the user to access anytime. The generated logs provide a detailed analysis with step-by-step explanations. To enhance the test speed, run parallel Selenium tests on the BrowserStack cloud; however, the automation plan has to be upgraded to increase the number of parallel test runs.

TestingBot

TestingBot also provides a setup similar to BrowserStack and SauceLabs for cloud-based cross-browser test automation using Selenium. It records a video of the running tests to analyze problems and debug. Additionally, it provides support to capture screenshots on test failure. To run local Selenium tests, it provides an SSH tunnel tool that lets you run tests against local servers or other web servers. TestingBot uses Amazon's cloud infrastructure to run Selenium scripts in various browsers.

Access the site https://testingbot.com/, log in, and obtain the Client Key and Client Secret from your TestingBot account. Make use of the obtained credentials to drive tests over the TestingBot cloud.
Let's see an example Java test program with TestNG, using the Eclipse IDE, that runs on the TestingBot cloud:

package packagename;

import java.net.URL;

import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

public class testingbot {

  private WebDriver driver;

  @BeforeClass
  public void setUp() throws Exception {
    DesiredCapabilities capabillities = DesiredCapabilities.firefox();
    capabillities.setCapability("version", "24");
    capabillities.setCapability("platform", Platform.WINDOWS);
    capabillities.setCapability("name", "testOnCloud");
    capabillities.setCapability("screenshot", true);
    capabillities.setCapability("screenrecorder", true);
    driver = new RemoteWebDriver(
        new URL("http://ClientKey:ClientSecret@hub.testingbot.com:4444/wd/hub"),
        capabillities);
  }

  @Test
  public void testOnCloud() throws Exception {
    driver.get("http://www.google.co.in/?gws_rd=cr&ei=zS_mUryqJoeMrQf-yICYCA");
    driver.findElement(By.id("gbqfq")).clear();
    WebElement element = driver.findElement(By.id("gbqfq"));
    element.sendKeys("selenium");
    Assert.assertEquals("selenium - Google Search", driver.getTitle());
  }

  @AfterClass
  public void tearDown() throws Exception {
    driver.quit();
  }
}

Click on the Tests tab to check the log results. The logs are well organized with test steps, screenshots, videos, and a summary. Screenshots are captured on each and every step to make the tests more precise, as follows:

capabillities.setCapability("screenshot", true); // screenshot
capabillities.setCapability("screenrecorder", true); // video capture

TestingBot provides a unique feature by scheduling and running tests directly from the site. The tests can be prescheduled to repeat any number of times on a daily or weekly basis. It is even precise about scheduling the test start time. You will be apprised of test failures with an alert through e-mail, an API call, an SMS, or a Prowl notification. This feature enables error handling to rerun failed tests automatically as per the user settings.

Launch Selenium IDE, record tests, and save the test case or test suite in the default format (HTML). Access the https://testingbot.com/ URL from your web browser and click on the Test Lab tab. Now, try to upload the already-saved Selenium test case, and select the OS platform and browser name and version. Finally, save the settings and execute the tests. The test results are recorded and displayed under Tests.

Selenium headless browser testing

A headless browser is a web browser without a Graphical User Interface (GUI). It accesses and renders web pages but doesn't show them to any human being. A headless browser should be able to parse JavaScript. Currently, many teams encourage tests against headless browsers due to their efficiency and time-saving properties. PhantomJS and HTMLUnit are the most commonly used headless browsers. Capybara-webkit is another efficient headless WebKit for Rails-based applications.

PhantomJS

PhantomJS is a headless WebKit scriptable with a JavaScript API. It is generally used for headless testing of web applications and comes with built-in GhostDriver. Tests on PhantomJS are obviously fast, since it has fast and native support for various web standards, such as DOM handling, CSS selectors, JSON, canvas, and SVG. In general, WebKit is a layout engine that allows web browsers to render web pages. Some browsers, such as Safari and Chrome, use WebKit.
PhantomJS itself is not a test framework; it is a headless browser that is used only to launch tests via a suitable test runner called GhostDriver. GhostDriver is a JS implementation of the WebDriver Wire Protocol for PhantomJS; the WebDriver Wire Protocol is a standard API that communicates with the browser. By default, GhostDriver is embedded with PhantomJS.

To download PhantomJS, refer to http://phantomjs.org/download.html. Download PhantomJS, extract the zipped file (for example, phantomjs-1.x.x-windows.zip for Windows), and locate the phantomjs.exe file.

Add the following imports to your test code:

import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;

Introduce PhantomJSDriver using capabilities to enable or disable JavaScript or to locate the phantomjs executable file path:

DesiredCapabilities caps = new DesiredCapabilities();
caps.setCapability("takesScreenshot", true);
caps.setJavascriptEnabled(true); // not really needed; JS is enabled by default
caps.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, "C:/phantomjs.exe");
WebDriver driver = new PhantomJSDriver(caps);

Alternatively, PhantomJSDriver can also be initialized as follows:

System.setProperty("phantomjs.binary.path", "/phantomjs.exe");
WebDriver driver = new PhantomJSDriver();

PhantomJS supports screen capture as well. Since PhantomJS is a WebKit and a real layout and rendering engine, it is feasible to capture a web page as a screenshot. It can be set as follows:

caps.setCapability("takesScreenshot", true);

The following is the test snippet to capture a screenshot on a test run:

File scrFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
FileUtils.copyFile(scrFile, new File("c:\\sample.jpeg"), true);

For example, check the following test program for more details:

package packagename;

import java.io.File;
import java.util.concurrent.TimeUnit;
import org.apache.commons.io.FileUtils;
import org.openqa.selenium.*;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

public class phantomjs {

  private WebDriver driver;
  private String baseUrl;

  @BeforeTest
  public void setUp() throws Exception {
    System.setProperty("phantomjs.binary.path", "/phantomjs.exe");
    driver = new PhantomJSDriver();
    baseUrl = "https://www.google.co.in";
    driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
  }

  @Test
  public void headlesstest() throws Exception {
    driver.get(baseUrl + "/");
    driver.findElement(By.name("q")).sendKeys("selenium essentials");
    File scrFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
    FileUtils.copyFile(scrFile, new File("c:\\screen_shot.jpeg"), true);
  }

  @AfterTest
  public void tearDown() throws Exception {
    driver.quit();
  }
}

HTMLUnitDriver

HTMLUnit is a headless (GUI-less) browser written in Java and is typically used for testing. HTMLUnitDriver, which is based on HTMLUnit, is the fastest and most lightweight implementation of WebDriver. It runs tests using plain HTTP requests, which is quicker than launching a browser, and executes tests much faster than other drivers. HTMLUnitDriver is added to the latest Selenium servers (2.35 or above). The JavaScript engine used by HTMLUnit (Rhino) is unique and different from those of other popular browsers available on the market. HTMLUnitDriver supports JavaScript and is platform independent. By default, JavaScript support for HTMLUnitDriver is disabled.
Enabling JavaScript in HTMLUnitDriver slows down the test execution; however, it is advised to enable JavaScript support because most modern sites are Ajax-based web apps. Enabling JavaScript also causes a number of JavaScript warning messages to appear in the console during test execution. The following snippet lets you enable JavaScript for HTMLUnitDriver:

HtmlUnitDriver driver = new HtmlUnitDriver();
driver.setJavascriptEnabled(true); // enable JavaScript

The following line of code is an alternate way to enable JavaScript:

HtmlUnitDriver driver = new HtmlUnitDriver(true);

The following piece of code lets you handle a transparent proxy using HTMLUnitDriver:

HtmlUnitDriver driver = new HtmlUnitDriver();
driver.setProxy("xxx.xxx.xxx.xxx", port); // set proxy for handling Transparent Proxy
driver.setJavascriptEnabled(true); // enable JavaScript [this emulates IE's JS by default]

HTMLUnitDriver can emulate a popular browser's JavaScript quite well. By default, HTMLUnitDriver emulates IE's JavaScript. For example, to handle the Firefox web browser with version 17, use the following snippet:

HtmlUnitDriver driver = new HtmlUnitDriver(BrowserVersion.FIREFOX_17);
driver.setJavascriptEnabled(true);

Here is the snippet to emulate a specific browser's JavaScript using capabilities:

DesiredCapabilities capabilities = DesiredCapabilities.htmlUnit();
driver = new HtmlUnitDriver(capabilities);

DesiredCapabilities capabilities = DesiredCapabilities.firefox();
capabilities.setBrowserName("Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0");
capabilities.setVersion("24.0");
driver = new HtmlUnitDriver(capabilities);

Summary

In this article, you learned to perform efficient compatibility tests and also learned how to run tests on the cloud.

Resources for Article:

Further resources on this subject:

Selenium Testing Tools [article]
First Steps with Selenium RC [article]
Quick Start into Selenium Tests [article]

Prerequisites

Packt
25 Mar 2015
6 min read
In this article by Deepak Vohra, author of the book Advanced Java® EE Development with WildFly®, you will see how to create a Java EE project and its prerequisites. (For more resources related to this topic, see here.)

The objective of the EJB 3.x specification is to simplify EJB development by improving the EJB architecture. This simplification is achieved by providing metadata annotations to replace XML configuration. It also provides default configuration values by making entity and session beans POJOs (Plain Old Java Objects) and by making component and home interfaces redundant. The EJB 2.x entity beans are replaced with EJB 3.x entities. EJB 3.0 also introduced the Java Persistence API (JPA) for object-relational mapping of Java objects. WildFly 8.x supports the EJB 3.2 and JPA 2.1 specifications from Java EE 7. The sample application is based on Java EE 6 and EJB 3.1. The configuration of EJB 3.x with Java EE 7 is also discussed, and the sample application can be used or modified to run on a Java EE 7 project. We have used a Hibernate 4.3 persistence provider. Unlike some of the other persistence providers, the Hibernate persistence provider supports automatic generation of relational database tables, including the joining of tables.

In this article, we will create an EJB 3.x project. This article has the following topics:

Setting up the environment
Creating a WildFly runtime
Creating a Java EE project

Setting up the environment

We need to download and install the following software:

WildFly 8.1.0.Final: Download wildfly-8.1.0.Final.zip from http://wildfly.org/downloads/.
MySQL 5.6 Database-Community Edition: Download this edition from http://dev.mysql.com/downloads/mysql/. When installing MySQL, also install Connector/J.
Eclipse IDE for Java EE Developers: Download Eclipse Luna from https://www.eclipse.org/downloads/packages/release/Luna/SR1.
JBoss Tools (Luna) 4.2.0.Final: Install this as a plug-in to Eclipse from the Eclipse Marketplace (http://tools.jboss.org/downloads/installation.html). The latest version from Eclipse Marketplace is likely to be different from 4.2.0.
Apache Maven: Download version 3.0.5 or higher from http://maven.apache.org/download.cgi.
Java 7: Download Java 7 from http://www.oracle.com/technetwork/java/javase/downloads/index.html?ssSourceSiteId=ocomcn.

Set the environment variables JAVA_HOME, JBOSS_HOME, MAVEN_HOME, and MYSQL_HOME. Add %JAVA_HOME%/bin, %MAVEN_HOME%/bin, %JBOSS_HOME%/bin, and %MYSQL_HOME%/bin to the PATH environment variable. The environment settings used are C:\wildfly-8.1.0.Final for JBOSS_HOME, C:\Program Files\MySQL\MySQL Server 5.6.21 for MYSQL_HOME, C:\maven\apache-maven-3.0.5 for MAVEN_HOME, and C:\Program Files\Java\jdk1.7.0_51 for JAVA_HOME.

Run the add-user.bat script from the %JBOSS_HOME%/bin directory to create a user for the WildFly administrator console. When prompted What type of user do you wish to add?, select a) Management User. The other option is b) Application User. A Management User is used to log in to the Administration Console, and an Application User is used to access applications. Subsequently, specify the Username and Password for the new user. When prompted with the question Is this user going to be used for one AS process to connect to another AS..?, enter the answer as no.

When installing and configuring the MySQL database, specify a password for the root user (the password mysql is used in the sample application).
Creating a WildFly runtime

As the application is run on WildFly 8.1, we need to create a runtime environment for WildFly 8.1 in Eclipse.

Select Window | Preferences in Eclipse. In Preferences, select Server | Runtime Environment. Click on the Add button to add a new runtime environment, as shown in the following screenshot:

In New Server Runtime Environment, select JBoss Community | WildFly 8.x Runtime. Click on Next:

In WildFly Application Server 8.x, which appears below New Server Runtime Environment, specify a Name for the new runtime or choose the default name, which is WildFly 8.x Runtime. Select the Home Directory for the WildFly 8.x server using the Browse button. The Home Directory is the directory where WildFly 8.1 is installed. The default path is C:\wildfly-8.1.0.Final. Select the Runtime JRE as JavaSE-1.7. If the JDK location is not added to the runtime list, first add it from the JRE preferences screen in Eclipse. In Configuration base directory, select standalone as the default setting. In Configuration file, select standalone.xml as the default setting. Click on Finish:

A new server runtime environment for WildFly 8.x Runtime gets created, as shown in the following screenshot. Click on OK:

Creating a Server Runtime Environment for WildFly 8.x is a prerequisite for creating a Java EE project in Eclipse. In the next topic, we will create a new Java EE project for an EJB 3.x application.

Creating a Java EE project

JBoss Tools provides project templates for different types of JBoss projects. In this topic, we will create a Java EE project for an EJB 3.x application.

Select File | New | Other in the Eclipse IDE. In the New wizard, select the JBoss Central | Java EE EAR Project wizard. Click on the Next button:

The Java EE EAR Project wizard gets started. By default, a Java EE 6 project is created. A Java EE EAR Project is a Maven project. The New Project Example window lists the requirements and runs a test for them. The JBoss AS runtime is required, and some plugins (including the JBoss Maven Tools plugin) are required for a Java EE project. Select Target Runtime as WildFly 8.x Runtime, which was created in the preceding topic. Then, check the Create a blank project checkbox. Click on the Next button:

Specify Project name as jboss-ejb3, Package as org.jboss.ejb3, and tick the Use default Workspace location box. Click on the Next button:

Specify Group Id as org.jboss.ejb3, Artifact Id as jboss-ejb3, Version as 1.0.0, and Package as org.jboss.ejb3.model. Click on Finish:

A Java EE project gets created, as shown in the following Project Explorer window. The jboss-ejb3 project consists of three subprojects: jboss-ejb3-ear, jboss-ejb3-ejb, and jboss-ejb3-web. Each subproject consists of a pom.xml file for Maven. The jboss-ejb3-ejb subproject contains a META-INF/persistence.xml file within the src/main/resources source folder for the JPA database persistence configuration.

Summary

In this article, we learned how to create a Java EE project and its prerequisites.

Resources for Article:

Further resources on this subject:

Common performance issues [article]
Running our first web application [article]
Various subsystem configurations [article]

Geolocating photos on the map

Packt
25 Mar 2015
7 min read
In this article by Joel Lawhead, author of the book QGIS Python Programming Cookbook, we will use EXIF tags to create locations on a map for some photos and provide links to open them. (For more resources related to this topic, see here.)

Getting ready

You will need to download some sample geotagged photos from https://github.com/GeospatialPython/qgis/blob/gh-pages/photos.zip?raw=true and place them in a directory named photos in your qgis_data directory.

How to do it...

QGIS requires the Python Imaging Library (PIL), which should already be included with your installation. PIL can parse EXIF tags. We will gather the filenames of the photos, parse the location information, convert it to decimal degrees, create the point vector layer, add the photo locations, and add an action link to the attributes. To do this, we need to perform the following steps:

In the QGIS Python Console, import the libraries that we'll need, including the PIL Image and ExifTags modules for parsing image data and the glob module for doing wildcard file searches:

import glob
import Image
from ExifTags import TAGS

Next, we'll create a function that can parse the header data:

def exif(img):
    exif_data = {}
    try:
        i = Image.open(img)
        tags = i._getexif()
        for tag, value in tags.items():
            decoded = TAGS.get(tag, tag)
            exif_data[decoded] = value
    except:
        pass
    return exif_data

Now, we'll create a function that can convert degrees-minutes-seconds to decimal degrees, which is how coordinates are stored in JPEG images:

def dms2dd(d, m, s, i):
    sec = float((m * 60) + s)
    dec = float(sec / 3600)
    deg = float(d + dec)
    if i.upper() == 'W':
        deg = deg * -1
    elif i.upper() == 'S':
        deg = deg * -1
    return float(deg)

Next, we'll define a function to parse the location data from the header data:

def gps(exif):
    lat = None
    lon = None
    if exif['GPSInfo']:
        # Lat
        coords = exif['GPSInfo']
        i = coords[1]
        d = coords[2][0][0]
        m = coords[2][1][0]
        s = coords[2][2][0]
        lat = dms2dd(d, m, s, i)
        # Lon
        i = coords[3]
        d = coords[4][0][0]
        m = coords[4][1][0]
        s = coords[4][2][0]
        lon = dms2dd(d, m, s, i)
    return lat, lon

Next, we'll loop through the photos directory, get the filenames, parse the location information, and build a simple dictionary to store the information, as follows:

photos = {}
photo_dir = "/Users/joellawhead/qgis_data/photos/"
files = glob.glob(photo_dir + "*.jpg")
for f in files:
    e = exif(f)
    lat, lon = gps(e)
    photos[f] = [lon, lat]

Now, we'll set up the vector layer for editing:

lyr_info = "Point?crs=epsg:4326&field=photo:string(75)"
vectorLyr = QgsVectorLayer(lyr_info, "Geotagged Photos", "memory")
vpr = vectorLyr.dataProvider()

We'll add the photo details to the vector layer:

features = []
for pth, p in photos.items():
    lon, lat = p
    pnt = QgsGeometry.fromPoint(QgsPoint(lon, lat))
    f = QgsFeature()
    f.setGeometry(pnt)
    f.setAttributes([pth])
    features.append(f)
vpr.addFeatures(features)
vectorLyr.updateExtents()

Now, we can add the layer to the map and make it the active layer:

QgsMapLayerRegistry.instance().addMapLayer(vectorLyr)
iface.setActiveLayer(vectorLyr)
activeLyr = iface.activeLayer()

Finally, we'll add an action that allows you to click on a point and open the photo:

actions = activeLyr.actions()
actions.addAction(QgsAction.OpenUrl, "Photos", '[% "photo" %]')

How it works...

Using the included PIL EXIF parser, getting location information and adding it to a vector layer is relatively straightforward.
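If you want to convince yourself that the dms2dd helper behaves as expected, you can run a quick check in the QGIS Python Console. This snippet is not part of the original recipe, and the coordinate values are made up purely for illustration:

# Hypothetical check values, not taken from the sample photos:
# 30 deg 15 min 30 sec North, and 89 deg 45 min 0 sec West.
lat = dms2dd(30, 15, 30, 'N')
lon = dms2dd(89, 45, 0, 'W')
print(lat)   # 30.2583..., positive because the hemisphere indicator is N
print(lon)   # -89.75, negative because the hemisphere indicator is W

The W and S indicators flip the sign of the result, which is exactly the behavior the gps function relies on when it builds latitude/longitude pairs for the vector layer.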
This action is a default option for opening a URL. However, you can also use Python expressions as actions to perform a variety of tasks. The following screenshot shows an example of the data visualization and photo popup: There's more... Another plugin called Photo2Shape is available, but it requires you to install an external EXIF tag parser. Image change detection Change detection allows you to automatically highlight the differences between two images in the same area if they are properly orthorectified. We'll do a simple difference change detection on two images, which are several years apart, to see the differences in urban development and the natural environment. Getting ready You can download the two images from https://github.com/GeospatialPython/qgis/blob/gh-pages/change-detection.zip?raw=true and put them in a directory named change-detection in the rasters directory of your qgis_data directory. Note that the file is 55 megabytes, so it may take several minutes to download. How to do it... We'll use the QGIS raster calculator to subtract the images in order to get the difference, which will highlight significant changes. We'll also add a color ramp shader to the output in order to visualize the changes. To do this, we need to perform the following steps: First, we need to import the libraries that we need into the QGIS console:

from PyQt4.QtGui import *
from PyQt4.QtCore import *
from qgis.analysis import *

Now, we'll set up the path names and raster names for our images:

before = "/Users/joellawhead/qgis_data/rasters/change-detection/before.tif"
after = "/Users/joellawhead/qgis_data/rasters/change-detection/after.tif"
beforeName = "Before"
afterName = "After"

Next, we'll establish our images as raster layers:

beforeRaster = QgsRasterLayer(before, beforeName)
afterRaster = QgsRasterLayer(after, afterName)

Then, we can build the calculator entries:

beforeEntry = QgsRasterCalculatorEntry()
afterEntry = QgsRasterCalculatorEntry()
beforeEntry.raster = beforeRaster
afterEntry.raster = afterRaster
beforeEntry.bandNumber = 1
afterEntry.bandNumber = 2
beforeEntry.ref = beforeName + "@1"
afterEntry.ref = afterName + "@2"
entries = [afterEntry, beforeEntry]

Now, we'll set up the simple expression that does the math for remote sensing:

exp = "%s - %s" % (afterEntry.ref, beforeEntry.ref)

Then, we can set up the output file path, the raster extent, and pixel width and height:

output = "/Users/joellawhead/qgis_data/rasters/change-detection/change.tif"
e = beforeRaster.extent()
w = beforeRaster.width()
h = beforeRaster.height()

Now, we perform the calculation:

change = QgsRasterCalculator(exp, output, "GTiff", e, w, h, entries)
change.processCalculation()

Finally, we'll load the output as a layer, create the color ramp shader, apply it to the layer, and add it to the map, as shown here:

lyr = QgsRasterLayer(output, "Change")
algorithm = QgsContrastEnhancement.StretchToMinimumMaximum
limits = QgsRaster.ContrastEnhancementMinMax
lyr.setContrastEnhancement(algorithm, limits)
s = QgsRasterShader()
c = QgsColorRampShader()
c.setColorRampType(QgsColorRampShader.INTERPOLATED)
i = []
qri = QgsColorRampShader.ColorRampItem
i.append(qri(0, QColor(0,0,0,0), 'NODATA'))
i.append(qri(-101, QColor(123,50,148,255), 'Significant Intensity Decrease'))
i.append(qri(-42.2395, QColor(194,165,207,255), 'Minor Intensity Decrease'))
i.append(qri(16.649, QColor(247,247,247,0), 'No Change'))
i.append(qri(75.5375, QColor(166,219,160,255), 'Minor Intensity Increase'))
i.append(qri(135, QColor(0,136,55,255), 'Significant Intensity Increase'))
c.setColorRampItemList(i)
s.setRasterShaderFunction(c)
ps = QgsSingleBandPseudoColorRenderer(lyr.dataProvider(), 1, s)
lyr.setRenderer(ps)
QgsMapLayerRegistry.instance().addMapLayer(lyr)

How it works... If a building is added in the new image, it will be brighter than its surroundings. If a building is removed, the new image will be darker in that area. The same holds true for vegetation, to some extent. Summary The concept is simple. We subtract the older image data from the new image data. Concrete and other man-made surfaces in urban areas tend to be highly reflective, which results in higher image pixel values. Resources for Article: Further resources on this subject: Prototyping Arduino Projects using Python [article] Python functions – Avoid repeating code [article] Pentesting Using Python [article]

Fun with Sprites – Sky Defense

Packt
25 Mar 2015
35 min read
This article is written by Roger Engelbert, the author of Cocos2d-x by Example: Beginner's Guide - Second Edition. Time to build our second game! This time, you will get acquainted with the power of actions in Cocos2d-x. I'll show you how an entire game could be built just by running the various action commands contained in Cocos2d-x to make your sprites move, rotate, scale, fade, blink, and so on. And you can also use actions to animate your sprites using multiple images, like in a movie. So let's get started. In this article, you will learn: How to optimize the development of your game with sprite sheets How to use bitmap fonts in your game How easy it is to implement and run actions How to scale, rotate, swing, move, and fade out a sprite How to load multiple .png files and use them to animate a sprite How to create a universal game with Cocos2d-x (For more resources related to this topic, see here.) The game – sky defense Meet our stressed-out city of...your name of choice here. It's a beautiful day when suddenly the sky begins to fall. There are meteors rushing toward the city and it is your job to keep it safe. The player in this game can tap the screen to start growing a bomb. When the bomb is big enough to be activated, the player taps the screen again to detonate it. Any nearby meteor will explode into a million pieces. The bigger the bomb, the bigger the detonation, and the more meteors can be taken out by it. But the bigger the bomb, the longer it takes to grow it. But it's not just bad news coming down. There are also health packs dropping from the sky and if you allow them to reach the ground, you'll recover some of your energy. The game settings This is a universal game. It is designed for the iPad retina screen and it will be scaled down to fit all the other screens. The game will be played in landscape mode, and it will not need to support multitouch. The start project The command line I used was: cocos new SkyDefense -p com.rengelbert.SkyDefense -l cpp -d /Users/rengelbert/Desktop/SkyDefense In Xcode you must set the Devices field in Deployment Info to Universal, and the Device Family field is set to Universal. And in RootViewController.mm, the supported interface orientation is set to Landscape. The game we are going to build requires only one class, GameLayer.cpp, and you will find that the interface for this class already contains all the information it needs. Also, some of the more trivial or old-news logic is already in place in the implementation file as well. But I'll go over this as we work on the game. Adding screen support for a universal app Now things get a bit more complicated as we add support for smaller screens in our universal game, as well as some of the most common Android screen sizes. So open AppDelegate.cpp. 
Inside the applicationDidFinishLaunching method, we now have the following code: auto screenSize = glview->getFrameSize(); auto designSize = Size(2048, 1536); glview->setDesignResolutionSize(designSize.width, designSize.height, ResolutionPolicy::EXACT_FIT); std::vector<std::string> searchPaths; if (screenSize.height > 768) {    searchPaths.push_back("ipadhd");    director->setContentScaleFactor(1536/designSize.height); } else if (screenSize.height > 320) {    searchPaths.push_back("ipad");    director->setContentScaleFactor(768/designSize.height); } else {    searchPaths.push_back("iphone");    director->setContentScaleFactor(380/designSize.height); } auto fileUtils = FileUtils::getInstance(); fileUtils->setSearchPaths(searchPaths); Once again, we tell our GLView object (our OpenGL view) that we designed the game for a certain screen size (the iPad retina screen) and once again, we want our game screen to resize to match the screen on the device (ResolutionPolicy::EXACT_FIT). Then we determine where to load our images from, based on the device's screen size. We have art for iPad retina, then for regular iPad which is shared by iPhone retina, and for the regular iPhone. We end by setting the scale factor based on the designed target. Adding background music Still inside AppDelegate.cpp, we load the sound files we'll use in the game, including a background.mp3 (courtesy of Kevin MacLeod from incompetech.com), which we load through the command: auto audioEngine = SimpleAudioEngine::getInstance(); audioEngine->preloadBackgroundMusic(fileUtils->fullPathForFilename("background.mp3").c_str()); We end by setting the effects' volume down a tad: //lower playback volume for effects audioEngine->setEffectsVolume(0.4f); For background music volume, you must use setBackgroundMusicVolume. If you create some sort of volume control in your game, these are the calls you would make to adjust the volume based on the user's preference. Initializing the game Now back to GameLayer.cpp. If you take a look inside our init method, you will see that the game initializes by calling three methods: createGameScreen, createPools, and createActions. We'll create all our screen elements inside the first method, and then create object pools so we don't instantiate any sprite inside the main loop; and we'll create all the main actions used in our game inside the createActions method. And as soon as the game initializes, we start playing the background music, with its should loop parameter set to true: SimpleAudioEngine::getInstance()-   >playBackgroundMusic("background.mp3", true); We once again store the screen size for future reference, and we'll use a _running Boolean for game states. If you run the game now, you should only see the background image: Using sprite sheets in Cocos2d-x A sprite sheet is a way to group multiple images together in one image file. In order to texture a sprite with one of these images, you must have the information of where in the sprite sheet that particular image is found (its rectangle). Sprite sheets are often organized in two files: the image one and a data file that describes where in the image you can find the individual textures. I used TexturePacker to create these files for the game. You can find them inside the ipad, ipadhd, and iphone folders inside Resources. There is a sprite_sheet.png file for the image and a sprite_sheet.plist file that describes the individual frames inside the image. 
This is what the sprite_sheet.png file looks like: Batch drawing sprites In Cocos2d-x, sprite sheets can be used in conjunction with a specialized node, called SpriteBatchNode. This node can be used whenever you wish to use multiple sprites that share the same source image inside the same node. So you could have multiple instances of a Sprite class that uses a bullet.png texture for instance. And if the source image is a sprite sheet, you can have multiple instances of sprites displaying as many different textures as you could pack inside your sprite sheet. With SpriteBatchNode, you can substantially reduce the number of calls during the rendering stage of your game, which will help when targeting less powerful systems, though not noticeably in more modern devices. Let me show you how to create a SpriteBatchNode. Time for action – creating SpriteBatchNode Let's begin implementing the createGameScreen method in GameLayer.cpp. Just below the lines that add the bg sprite, we instantiate our batch node: void GameLayer::createGameScreen() {   //add bg auto bg = Sprite::create("bg.png"); ...   SpriteFrameCache::getInstance()-> addSpriteFramesWithFile("sprite_sheet.plist"); _gameBatchNode = SpriteBatchNode::create("sprite_sheet.png"); this->addChild(_gameBatchNode); In order to create the batch node from a sprite sheet, we first load all the frame information described by the sprite_sheet.plist file into SpriteFrameCache. And then we create the batch node with the sprite_sheet.png file, which is the source texture shared by all sprites added to this batch node. (The background image is not part of the sprite sheet, so it's added separately before we add _gameBatchNode to GameLayer.) Now we can start putting stuff inside _gameBatchNode. First, the city: for (int i = 0; i < 2; i++) { auto sprite = Sprite::createWithSpriteFrameName   ("city_dark.png");    sprite->setAnchorPoint(Vec2(0.5,0)); sprite->setPosition(_screenSize.width * (0.25f + i * 0.5f),0)); _gameBatchNode->addChild(sprite, kMiddleground); sprite = Sprite::createWithSpriteFrameName ("city_light.png"); sprite->setAnchorPoint(Vec2(0.5,0)); sprite->setPosition(Vec2(_screenSize.width * (0.25f + i * 0.5f), _screenSize.height * 0.1f)); _gameBatchNode->addChild(sprite, kBackground); } Then the trees: //add trees for (int i = 0; i < 3; i++) { auto sprite = Sprite::createWithSpriteFrameName("trees.png"); sprite->setAnchorPoint(Vec2(0.5f, 0.0f)); sprite->setPosition(Vec2(_screenSize.width * (0.2f + i * 0.3f),0)); _gameBatchNode->addChild(sprite, kForeground);   } Notice that here we create sprites by passing it a sprite frame name. The IDs for these frame names were loaded into SpriteFrameCache through our sprite_sheet.plist file. The screen so far is made up of two instances of city_dark.png tiling at the bottom of the screen, and two instances of city_light.png also tiling. One needs to appear on top of the other and for that we use the enumerated values declared in GameLayer.h: enum { kBackground, kMiddleground, kForeground }; We use the addChild( Node, zOrder) method to layer our sprites on top of each other, using different values for their z order. So for example, when we later add three sprites showing the trees.png sprite frame, we add them on top of all previous sprites using the highest value for z that we find in the enumerated list, which is kForeground. Why go through the trouble of tiling the images and not using one large image instead, or combining some of them with the background image? 
Because I wanted to include the greatest number of images possible inside the one sprite sheet, and have that sprite sheet to be as small as possible, to illustrate all the clever ways you can use and optimize sprite sheets. This is not necessary in this particular game. What just happened? We began creating the initial screen for our game. We are using a SpriteBatchNode to contain all the sprites that use images from our sprite sheet. So SpriteBatchNode behaves as any node does—as a container. And we can layer individual sprites inside the batch node by manipulating their z order. Bitmap fonts in Cocos2d-x The Cocos2d-x Label class has a static create method that uses bitmap images for the characters. The bitmap image we are using here was created with the program GlyphDesigner, and in essence, it works just as a sprite sheet does. As a matter of fact, Label extends SpriteBatchNode, so it behaves just like a batch node. You have images for all individual characters you'll need packed inside a PNG file (font.png), and then a data file (font.fnt) describing where each character is. The following screenshot shows how the font sprite sheet looks like for our game: The difference between Label and a regular SpriteBatchNode class is that the data file also feeds the Label object information on how to write with this font. In other words, how to space out the characters and lines correctly. The Label objects we are using in the game are instantiated with the name of the data file and their initial string value: _scoreDisplay = Label::createWithBMFont("font.fnt", "0"); And the value for the label is changed through the setString method: _scoreDisplay->setString("1000"); Just as with every other image in the game, we also have different versions of font.fnt and font.png in our Resources folders, one for each screen definition. FileUtils will once again do the heavy lifting of finding the correct file for the correct screen. So now let's create the labels for our game. Time for action – creating bitmap font labels Creating a bitmap font is somewhat similar to creating a batch node. Continuing with our createGameScreen method, add the following lines to the score label: _scoreDisplay = Label::createWithBMFont("font.fnt", "0"); _scoreDisplay->setAnchorPoint(Vec2(1,0.5)); _scoreDisplay->setPosition(Vec2   (_screenSize.width * 0.8f, _screenSize.height * 0.94f)); this->addChild(_scoreDisplay); And then add a label to display the energy level, and set its horizontal alignment to Right: _energyDisplay = Label::createWithBMFont("font.fnt", "100%", TextHAlignment::RIGHT); _energyDisplay->setPosition(Vec2   (_screenSize.width * 0.3f, _screenSize.height * 0.94f)); this->addChild(_energyDisplay); Add the following line for an icon that appears next to the _energyDisplay label: auto icon = Sprite::createWithSpriteFrameName ("health_icon.png"); icon->setPosition( Vec2(_screenSize.   width * 0.15f, _screenSize.height * 0.94f) ); _gameBatchNode->addChild(icon, kBackground); What just happened? We just created our first bitmap font object in Cocos2d-x. Now let's finish creating our game's sprites. Time for action – adding the final screen sprites The last sprites we need to create are the clouds, the bomb and shockwave, and our game state messages. Back to the createGameScreen method, add the clouds to the screen: for (int i = 0; i < 4; i++) { float cloud_y = i % 2 == 0 ? 
_screenSize.height * 0.4f : _screenSize.height * 0.5f; auto cloud = Sprite::createWithSpriteFrameName("cloud.png"); cloud->setPosition(Vec2 (_screenSize.width * 0.1f + i * _screenSize.width * 0.3f, cloud_y)); _gameBatchNode->addChild(cloud, kBackground); _clouds.pushBack(cloud); } Create the _bomb sprite; players will grow when tapping the screen: _bomb = Sprite::createWithSpriteFrameName("bomb.png"); _bomb->getTexture()->generateMipmap(); _bomb->setVisible(false);   auto size = _bomb->getContentSize();   //add sparkle inside bomb sprite auto sparkle = Sprite::createWithSpriteFrameName("sparkle.png"); sparkle->setPosition(Vec2(size.width * 0.72f, size.height *   0.72f)); _bomb->addChild(sparkle, kMiddleground, kSpriteSparkle);   //add halo inside bomb sprite auto halo = Sprite::createWithSpriteFrameName   ("halo.png"); halo->setPosition(Vec2(size.width * 0.4f, size.height *   0.4f)); _bomb->addChild(halo, kMiddleground, kSpriteHalo); _gameBatchNode->addChild(_bomb, kForeground); Then create the _shockwave sprite that appears after the _bomb goes off: _shockWave = Sprite::createWithSpriteFrameName ("shockwave.png"); _shockWave->getTexture()->generateMipmap(); _shockWave->setVisible(false); _gameBatchNode->addChild(_shockWave); Finally, add the two messages that appear on the screen, one for our intro state and one for our gameover state: _introMessage = Sprite::createWithSpriteFrameName ("logo.png"); _introMessage->setPosition(Vec2   (_screenSize.width * 0.5f, _screenSize.height * 0.6f)); _introMessage->setVisible(true); this->addChild(_introMessage, kForeground);   _gameOverMessage = Sprite::createWithSpriteFrameName   ("gameover.png"); _gameOverMessage->setPosition(Vec2   (_screenSize.width * 0.5f, _screenSize.height * 0.65f)); _gameOverMessage->setVisible(false); this->addChild(_gameOverMessage, kForeground); What just happened? There is a lot of new information regarding sprites in the previous code. So let's go over it carefully: We started by adding the clouds. We put the sprites inside a vector so we can move the clouds later. Notice that they are also part of our batch node. Next comes the bomb sprite and our first new call: _bomb->getTexture()->generateMipmap(); With this we are telling the framework to create antialiased copies of this texture in diminishing sizes (mipmaps), since we are going to scale it down later. This is optional of course; sprites can be resized without first generating mipmaps, but if you notice loss of quality in your scaled sprites, you can fix that by creating mipmaps for their texture. The texture must have size values in so-called POT (power of 2: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, and so on). Textures in OpenGL must always be sized this way; when they are not, Cocos2d-x will do one of two things: it will either resize the texture in memory, adding transparent pixels until the image reaches a POT size, or stop the execution on an assert. With textures used for mipmaps, the framework will stop execution for non-POT textures. I add the sparkle and the halo sprites as children to the _bomb sprite. This will use the container characteristic of nodes to our advantage. When I grow the bomb, all its children will grow with it. Notice too that I use a third parameter to addChild for halo and sparkle: bomb->addChild(halo, kMiddleground, kSpriteHalo); This third parameter is an integer tag from yet another enumerated list declared in GameLayer.h. 
I can use this tag to retrieve a particular child from a sprite as follows: auto halo = (Sprite *)   bomb->getChildByTag(kSpriteHalo); We now have our game screen in place: Next come object pools. Time for action – creating our object pools The pools are just vectors of objects. And here are the steps to create them: Inside the createPools method, we first create a pool for meteors: void GameLayer::createPools() { int i; _meteorPoolIndex = 0; for (i = 0; i < 50; i++) { auto sprite = Sprite::createWithSpriteFrameName("meteor.png"); sprite->setVisible(false); _gameBatchNode->addChild(sprite, kMiddleground, kSpriteMeteor); _meteorPool.pushBack(sprite); } Then we create an object pool for health packs: _healthPoolIndex = 0; for (i = 0; i < 20; i++) { auto sprite = Sprite::createWithSpriteFrameName("health.png"); sprite->setVisible(false); sprite->setAnchorPoint(Vec2(0.5f, 0.8f)); _gameBatchNode->addChild(sprite, kMiddleground, kSpriteHealth); _healthPool.pushBack(sprite); } We'll use the corresponding pool index to retrieve objects from the vectors as the game progresses. What just happened? We now have a vector of invisible meteor sprites and a vector of invisible health sprites. We'll use their respective pool indices to retrieve these from the vector as needed as you'll see in a moment. But first we need to take care of actions and animations. With object pools, we reduce the number of instantiations during the main loop, and it allows us to never destroy anything that can be reused. But if you need to remove a child from a node, use ->removeChild or ->removeChildByTag if a tag is present. Actions in a nutshell If you remember, a node will store information about position, scale, rotation, visibility, and opacity of a node. And in Cocos2d-x, there is an Action class to change each one of these values over time, in effect animating these transformations. Actions are usually created with a static method create. The majority of these actions are time-based, so usually the first parameter you need to pass an action is the time length for the action. So for instance: auto fadeout = FadeOut::create(1.0f); This creates a fadeout action that will take one second to complete. You can run it on a sprite, or node, as follows: mySprite->runAction(fadeout); Cocos2d-x has an incredibly flexible system that allows us to create any combination of actions and transformations to achieve any effect we desire. You may, for instance, choose to create an action sequence (Sequence) that contains more than one action; or you can apply easing effects (EaseIn, EaseOut, and so on) to your actions. You can choose to repeat an action a certain number of times (Repeat) or forever (RepeatForever); and you can add callbacks to functions you want called once an action is completed (usually inside a Sequence action). Time for action – creating actions with Cocos2d-x Creating actions with Cocos2d-x is a very simple process: Inside our createActions method, we will instantiate the actions we can use repeatedly in our game. Let's create our first actions: void GameLayer::createActions() { //swing action for health drops auto easeSwing = Sequence::create( EaseInOut::create(RotateTo::create(1.2f, -10), 2), EaseInOut::create(RotateTo::create(1.2f, 10), 2), nullptr);//mark the end of a sequence with a nullptr _swingHealth = RepeatForever::create( (ActionInterval *) easeSwing ); _swingHealth->retain(); Actions can be combined in many different forms. 
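As a standalone illustration of that flexibility (a sketch with made-up values, not part of the book's GameLayer code), the snippet below spawns a fade and a move together, eases the movement, and fires a callback when everything is done:

// Hypothetical example: someSprite is any Sprite you have already created.
auto move = EaseOut::create(MoveBy::create(1.0f, Vec2(200, 0)), 2.0f);
auto fade = FadeOut::create(1.0f);
auto finish = CallFunc::create([](){ CCLOG("sequence finished"); });
auto combo = Sequence::create(Spawn::create(move, fade, nullptr), finish, nullptr);
someSprite->runAction(combo);

With that compositional pattern in mind, the retained _swingHealth action above is easier to unpack.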
Here, the retained _swingHealth action is a RepeatForever action of Sequence that will rotate the health sprite first one way, then the other, with EaseInOut wrapping the RotateTo action. RotateTo takes 1.2 seconds to rotate the sprite first to -10 degrees and then to 10. And the easing has a value of 2, which I suggest you experiment with to get a sense of what it means visually. Next we add three more actions: //action sequence for shockwave: fade out, callback when //done _shockwaveSequence = Sequence::create( FadeOut::create(1.0f), CallFunc::create(std::bind(&GameLayer::shockwaveDone, this)), nullptr); _shockwaveSequence->retain();   //action to grow bomb _growBomb = ScaleTo::create(6.0f, 1.0); _growBomb->retain();   //action to rotate sprites auto rotate = RotateBy::create(0.5f , -90); _rotateSprite = RepeatForever::create( rotate ); _rotateSprite->retain(); First, another Sequence. This will fade out the sprite and call the shockwaveDone function, which is already implemented in the class and turns the _shockwave sprite invisible when called. The last one is a RepeatForever action of a RotateBy action. In half a second, the sprite running this action will rotate -90 degrees and will do that again and again. What just happened? You just got your first glimpse of how to create actions in Cocos2d-x and how the framework allows for all sorts of combinations to accomplish any effect. It may be hard at first to read through a Sequence action and understand what's happening, but the logic is easy to follow once you break it down into its individual parts. But we are not done with the createActions method yet. Next come sprite animations. Animating a sprite in Cocos2d-x The key thing to remember is that an animation is just another type of action, one that changes the texture used by a sprite over a period of time. In order to create an animation action, you need to first create an Animation object. This object will store all the information regarding the different sprite frames you wish to use in the animation, the length of the animation in seconds, and whether it loops or not. With this Animation object, you then create a Animate action. Let's take a look. Time for action – creating animations Animations are a specialized type of action that require a few extra steps: Inside the same createActions method, add the lines for the two animations we have in the game. First, we start with the animation that shows an explosion when a meteor reaches the city. We begin by loading the frames into an Animation object: auto animation = Animation::create(); int i; for(i = 1; i <= 10; i++) { auto name = String::createWithFormat("boom%i.png", i); auto frame = SpriteFrameCache::getInstance()->getSpriteFrameByName(name->getCString()); animation->addSpriteFrame(frame); } Then we use the Animation object inside a Animate action: animation->setDelayPerUnit(1 / 10.0f); animation->setRestoreOriginalFrame(true); _groundHit = Sequence::create(    MoveBy::create(0, Vec2(0,_screenSize.height * 0.12f)),    Animate::create(animation),    CallFuncN::create(CC_CALLBACK_1(GameLayer::animationDone, this)), nullptr); _groundHit->retain(); The same steps are repeated to create the other explosion animation used when the player hits a meteor or a health pack. 
animation = Animation::create(); for(int i = 1; i <= 7; i++) { auto name = String::createWithFormat("explosion_small%i.png", i); auto frame = SpriteFrameCache::getInstance()->getSpriteFrameByName(name->getCString()); animation->addSpriteFrame(frame); }   animation->setDelayPerUnit(0.5 / 7.0f); animation->setRestoreOriginalFrame(true); _explosion = Sequence::create(      Animate::create(animation),    CallFuncN::create(CC_CALLBACK_1(GameLayer::animationDone, this)), nullptr); _explosion->retain(); What just happened? We created two instances of a very special kind of action in Cocos2d-x: Animate. Here is what we did: First, we created an Animation object. This object holds the references to all the textures used in the animation. The frames were named in such a way that they could easily be concatenated inside a loop (boom1, boom2, boom3, and so on). There are 10 frames for the first animation and seven for the second. The textures (or frames) are SpriteFrame objects we grab from SpriteFrameCache, which as you remember, contains all the information from the sprite_sheet.plist data file. So the frames are in our sprite sheet. Then when all frames are in place, we determine the delay of each frame by dividing the total amount of seconds we want the animation to last by the total number of frames. The setRestoreOriginalFrame method is important here. If we set setRestoreOriginalFrame to true, then the sprite will revert to its original appearance once the animation is over. For example, if I have an explosion animation that will run on a meteor sprite, then by the end of the explosion animation, the sprite will revert to displaying the meteor texture. Time for the actual action. Animate receives the Animation object as its parameter. (In the first animation, we shift the position of the sprite just before the explosion appears, so there is an extra MoveBy method.) And in both instances, I make a call to an animationDone callback already implemented in the class. It makes the calling sprite invisible: void GameLayer::animationDone (Node* pSender) { pSender->setVisible(false); } We could have used the same method for both callbacks (animationDone and shockwaveDone) as they accomplish the same thing. But I wanted to show you a callback that receives as an argument, the node that made the call and one that did not. Respectively, these are CallFuncN and CallFunc, and were used inside the action sequences we just created. Time to make our game tick! Okay, we have our main elements in place and are ready to add the final bit of logic to run the game. But how will everything work? We will use a system of countdowns to add new meteors and new health packs, as well as a countdown that will incrementally make the game harder to play. On touch, the player will start the game if the game is not running, and also add bombs and explode them during gameplay. An explosion creates a shockwave. On update, we will check against collision between our _shockwave sprite (if visible) and all our falling objects. And that's it. Cocos2d-x will take care of all the rest through our created actions and callbacks! So let's implement our touch events first. Time for action – handling touches Time to bring the player to our party: Time to implement our onTouchBegan method. 
We'll begin by handling the two game states, intro and game over: bool GameLayer::onTouchBegan (Touch * touch, Event * event){   //if game not running, we are seeing either intro or //gameover if (!_running) {    //if intro, hide intro message    if (_introMessage->isVisible()) {      _introMessage->setVisible(false);        //if game over, hide game over message    } else if (_gameOverMessage->isVisible()) {      SimpleAudioEngine::getInstance()->stopAllEffects();      _gameOverMessage->setVisible(false);         }       this->resetGame();    return true; } Here we check to see if the game is not running. If not, we check to see if any of our messages are visible. If _introMessage is visible, we hide it. If _gameOverMessage is visible, we stop all current sound effects and hide the message as well. Then we call a method called resetGame, which will reset all the game data (energy, score, and countdowns) to their initial values, and set _running to true. Next we handle the touches. But we only need to handle one each time so we use ->anyObject() on Set: auto touch = (Touch *)pTouches->anyObject();   if (touch) { //if bomb already growing... if (_bomb->isVisible()) {    //stop all actions on bomb, halo and sparkle    _bomb->stopAllActions();    auto child = (Sprite *) _bomb->getChildByTag(kSpriteHalo);    child->stopAllActions();    child = (Sprite *) _bomb->getChildByTag(kSpriteSparkle);    child->stopAllActions();       //if bomb is the right size, then create shockwave    if (_bomb->getScale() > 0.3f) {      _shockWave->setScale(0.1f);      _shockWave->setPosition(_bomb->getPosition());      _shockWave->setVisible(true);      _shockWave->runAction(ScaleTo::create(0.5f, _bomb->getScale() * 2.0f));      _shockWave->runAction(_shockwaveSequence->clone());      SimpleAudioEngine::getInstance()->playEffect("bombRelease.wav");      } else {      SimpleAudioEngine::getInstance()->playEffect("bombFail.wav");    }    _bomb->setVisible(false);    //reset hits with shockwave, so we can count combo hits    _shockwaveHits = 0; //if no bomb currently on screen, create one } else {    Point tap = touch->getLocation();    _bomb->stopAllActions();    _bomb->setScale(0.1f);    _bomb->setPosition(tap);    _bomb->setVisible(true);    _bomb->setOpacity(50);    _bomb->runAction(_growBomb->clone());         auto child = (Sprite *) _bomb->getChildByTag(kSpriteHalo);      child->runAction(_rotateSprite->clone());      child = (Sprite *) _bomb->getChildByTag(kSpriteSparkle);      child->runAction(_rotateSprite->clone()); } } If _bomb is visible, it means it's already growing on the screen. So on touch, we use the stopAllActions() method on the bomb and we use the stopAllActions() method on its children that we retrieve through our tags: child = (Sprite *) _bomb->getChildByTag(kSpriteHalo); child->stopAllActions(); child = (Sprite *) _bomb->getChildByTag(kSpriteSparkle); child->stopAllActions(); If _bomb is the right size, we start our _shockwave. If it isn't, we play a bomb failure sound effect; there is no explosion and _shockwave is not made visible. If we have an explosion, then the _shockwave sprite is set to 10 percent of the scale. It's placed at the same spot as the bomb, and we run a couple of actions on it: we grow the _shockwave sprite to twice the scale the bomb was when it went off and we run a copy of _shockwaveSequence that we created earlier. Finally, if no _bomb is currently visible on screen, we create one. And we run clones of previously created actions on the _bomb sprite and its children. 
When _bomb grows, its children grow. But when the children rotate, the bomb does not: a parent changes its children, but the children do not change their parent. What just happened? We just added part of the core logic of the game. It is with touches that the player creates and explodes bombs to stop meteors from reaching the city. Now we need to create our falling objects. But first, let's set up our countdowns and our game data. Time for action – starting and restarting the game Let's add the logic to start and restart the game. Let's write the implementation for resetGame: void GameLayer::resetGame(void) {    _score = 0;    _energy = 100;       //reset timers and "speeds"    _meteorInterval = 2.5;    _meteorTimer = _meteorInterval * 0.99f;    _meteorSpeed = 10;//in seconds to reach ground    _healthInterval = 20;    _healthTimer = 0;    _healthSpeed = 15;//in seconds to reach ground       _difficultyInterval = 60;    _difficultyTimer = 0;       _running = true;       //reset labels    _energyDisplay->setString(std::to_string((int) _energy) + "%");    _scoreDisplay->setString(std::to_string((int) _score)); } Next, add the implementation of stopGame: void GameLayer::stopGame() {       _running = false;       //stop all actions currently running    int i;    int count = (int) _fallingObjects.size();       for (i = count-1; i >= 0; i--) {        auto sprite = _fallingObjects.at(i);        sprite->stopAllActions();        sprite->setVisible(false);        _fallingObjects.erase(i);    }    if (_bomb->isVisible()) {        _bomb->stopAllActions();        _bomb->setVisible(false);        auto child = _bomb->getChildByTag(kSpriteHalo);        child->stopAllActions();        child = _bomb->getChildByTag(kSpriteSparkle);        child->stopAllActions();    }    if (_shockWave->isVisible()) {        _shockWave->stopAllActions();        _shockWave->setVisible(false);    }    if (_ufo->isVisible()) {        _ufo->stopAllActions();        _ufo->setVisible(false);        auto ray = _ufo->getChildByTag(kSpriteRay);       ray->stopAllActions();        ray->setVisible(false);    } } What just happened? With these methods we control gameplay. We start the game with default values through resetGame(), and we stop all actions with stopGame(). Already implemented in the class is the method that makes the game more difficult as time progresses. If you take a look at the method (increaseDifficulty) you will see that it reduces the interval between meteors and reduces the time it takes for meteors to reach the ground. All we need now is the update method to run the countdowns and check for collisions. Time for action – updating the game We already have the code that updates the countdowns inside the update. If it's time to add a meteor or a health pack we do it. If it's time to make the game more difficult to play, we do that too. It is possible to use an action for these timers: a Sequence action with a Delay action object and a callback. But there are advantages to using these countdowns. It's easier to reset them and to change them, and we can take them right into our main loop. So it's time to add our main loop: What we need to do is check for collisions. 
So add the following code: if (_shockWave->isVisible()) { count = (int) _fallingObjects.size(); for (i = count-1; i >= 0; i--) {    auto sprite = _fallingObjects.at(i);    diffx = _shockWave->getPositionX() - sprite->getPositionX();    diffy = _shockWave->getPositionY() - sprite->getPositionY();    if (pow(diffx, 2) + pow(diffy, 2) <= pow(_shockWave->getBoundingBox().size.width * 0.5f, 2)) {    sprite->stopAllActions();    sprite->runAction( _explosion->clone());    SimpleAudioEngine::getInstance()->playEffect("boom.wav");    if (sprite->getTag() == kSpriteMeteor) {      _shockwaveHits++;      _score += _shockwaveHits * 13 + _shockwaveHits * 2;    }    //play sound    _fallingObjects.erase(i); } } _scoreDisplay->setString(std::to_string(_score)); } If _shockwave is visible, we check the distance between it and each sprite in _fallingObjects vector. If we hit any meteors, we increase the value of the _shockwaveHits property so we can award the player for multiple hits. Next we move the clouds: //move clouds for (auto sprite : _clouds) { sprite->setPositionX(sprite->getPositionX() + dt * 20); if (sprite->getPositionX() > _screenSize.width + sprite->getBoundingBox().size.width * 0.5f)    sprite->setPositionX(-sprite->getBoundingBox().size.width * 0.5f); } I chose not to use a MoveTo action for the clouds to show you the amount of code that can be replaced by a simple action. If not for Cocos2d-x actions, we would have to implement logic to move, rotate, swing, scale, and explode all our sprites! And finally: if (_bomb->isVisible()) {    if (_bomb->getScale() > 0.3f) {      if (_bomb->getOpacity() != 255)        _bomb->setOpacity(255);    } } We give the player an extra visual cue to when a bomb is ready to explode by changing its opacity. What just happened? The main loop is pretty straightforward when you don't have to worry about updating individual sprites, as our actions take care of that for us. We pretty much only need to run collision checks between our sprites, and to determine when it's time to throw something new at the player. So now the only thing left to do is grab the meteors and health packs from the pools when their timers are up. So let's get right to it. 
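Before looking at the book's resetMeteor method in the next section, here is the bare retrieval pattern in isolation. This is a sketch with a hypothetical helper name; the book keeps this logic inline in resetMeteor and resetHealth:

// Return the next sprite from a pool vector and wrap the index around.
static Sprite* nextFromPool(Vector<Sprite*> &pool, int &poolIndex) {
    auto sprite = pool.at(poolIndex);
    poolIndex++;
    if (poolIndex == (int)pool.size())
        poolIndex = 0;
    return sprite;
}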
Time for action – retrieving objects from the pool We just need to use the correct index to retrieve the objects from their respective vector: To retrieve meteor sprites, we'll use the resetMeteor method: void GameLayer::resetMeteor(void) {    //if too many objects on screen, return    if (_fallingObjects.size() > 30) return;       auto meteor = _meteorPool.at(_meteorPoolIndex);      _meteorPoolIndex++;    if (_meteorPoolIndex == _meteorPool.size())      _meteorPoolIndex = 0;      int meteor_x = rand() % (int) (_screenSize.width * 0.8f) + _screenSize.width * 0.1f;    int meteor_target_x = rand() % (int) (_screenSize.width * 0.8f) + _screenSize.width * 0.1f;       meteor->stopAllActions();    meteor->setPosition(Vec2(meteor_x, _screenSize.height + meteor->getBoundingBox().size.height * 0.5));    //create action    auto rotate = RotateBy::create(0.5f , -90);    auto repeatRotate = RepeatForever::create( rotate );    auto sequence = Sequence::create (                MoveTo::create(_meteorSpeed, Vec2(meteor_target_x, _screenSize.height * 0.15f)),                CallFunc::create(std::bind(&GameLayer::fallingObjectDone, this, meteor) ), nullptr);   meteor->setVisible ( true ); meteor->runAction(repeatRotate); meteor->runAction(sequence); _fallingObjects.pushBack(meteor); } We grab the next available meteor from the pool, then we pick a random start and end x value for its MoveTo action. The meteor starts at the top of the screen and will move to the bottom towards the city, but the x value is randomly picked each time. We rotate the meteor inside a RepeatForever action, and we use Sequence to move the sprite to its target position and then call back fallingObjectDone when the meteor has reached its target. We finish by adding the new meteor we retrieved from the pool to the _fallingObjects vector so we can check collisions with it. The method to retrieve the health (resetHealth) sprites is pretty much the same, except that swingHealth action is used instead of rotate. You'll find that method already implemented in GameLayer.cpp. What just happened? So in resetGame we set the timers, and we update them in the update method. We use these timers to add meteors and health packs to the screen by grabbing the next available one from their respective pool, and then we proceed to run collisions between an exploding bomb and these falling objects. Notice that in both resetMeteor and resetHealth we don't add new sprites if too many are on screen already: if (_fallingObjects->size() > 30) return; This way the game does not get ridiculously hard, and we never run out of unused objects in our pools. And the very last bit of logic in our game is our fallingObjectDone callback, called when either a meteor or a health pack has reached the ground, at which point it awards or punishes the player for letting sprites through. When you take a look at that method inside GameLayer.cpp, you will notice how we use ->getTag() to quickly ascertain which type of sprite we are dealing with (the one calling the method): if (pSender->getTag() == kSpriteMeteor) { If it's a meteor, we decrease energy from the player, play a sound effect, and run the explosion animation; an autorelease copy of the _groundHit action we retained earlier, so we don't need to repeat all that logic every time we need to run this action. If the item is a health pack, we increase the energy or give the player some points, play a nice sound effect, and hide the sprite. Play the game! We've been coding like mad, and it's finally time to run the game. 
But first, don't forget to release all the items we retained. In GameLayer.cpp, add our destructor method: GameLayer::~GameLayer () {       //release all retained actions    CC_SAFE_RELEASE(_growBomb);    CC_SAFE_RELEASE(_rotateSprite);    CC_SAFE_RELEASE(_shockwaveSequence);    CC_SAFE_RELEASE(_swingHealth);    CC_SAFE_RELEASE(_groundHit);    CC_SAFE_RELEASE(_explosion);    CC_SAFE_RELEASE(_ufoAnimation);    CC_SAFE_RELEASE(_blinkRay);       _clouds.clear();    _meteorPool.clear();    _healthPool.clear();    _fallingObjects.clear(); } The actual game screen will now look something like this: Now, let's take this to Android. Time for action – running the game in Android Follow these steps to deploy the game to Android: This time, there is no need to alter the manifest because the default settings are the ones we want. So, navigate to proj.android and then to the jni folder and open the Android.mk file in a text editor. Edit the lines in LOCAL_SRC_FILES to read as follows: LOCAL_SRC_FILES := hellocpp/main.cpp \                    ../../Classes/AppDelegate.cpp \                    ../../Classes/GameLayer.cpp Follow the instructions from the HelloWorld and AirHockey examples to import the game into Eclipse. Save it and run your application. This time, you can try out different size screens if you have the devices. What just happened? You just ran a universal app in Android. And nothing could have been simpler. Summary In my opinion, after nodes and all their derived objects, actions are the second best thing about Cocos2d-x. They are time savers and can quickly spice things up in any project with professional-looking animations. And I hope with the examples found in this article, you will be able to create any action you need with Cocos2d-x. Resources for Article: Further resources on this subject: Animations in Cocos2d-x [article] Moving the Space Pod Using Touch [article] Cocos2d-x: Installation [article]

Classifying with Real-world Examples

Packt
24 Mar 2015
32 min read
This article by the authors, Luis Pedro Coelho and Willi Richert, of the book, Building Machine Learning Systems with Python - Second Edition, focuses on the topic of classification. (For more resources related to this topic, see here.) You have probably already used this form of machine learning as a consumer, even if you were not aware of it. If you have any modern e-mail system, it will likely have the ability to automatically detect spam. That is, the system will analyze all incoming e-mails and mark them as either spam or not-spam. Often, you, the end user, will be able to manually tag e-mails as spam or not, in order to improve its spam detection ability. This is a form of machine learning where the system is taking examples of two types of messages: spam and ham (the typical term for "non spam e-mails") and using these examples to automatically classify incoming e-mails. The general method of classification is to use a set of examples of each class to learn rules that can be applied to new examples. This is one of the most important machine learning modes and is the topic of this article. Working with text such as e-mails requires a specific set of techniques and skills. For the moment, we will work with a smaller, easier-to-handle dataset. The example question for this article is, "Can a machine distinguish between flower species based on images?" We will use two datasets where measurements of flower morphology are recorded along with the species for several specimens. We will explore these small datasets using a few simple algorithms. At first, we will write classification code ourselves in order to understand the concepts, but we will quickly switch to using scikit-learn whenever possible. The goal is to first understand the basic principles of classification and then progress to using a state-of-the-art implementation. The Iris dataset The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The dataset is a collection of morphological measurements of several Iris flowers. These measurements will enable us to distinguish multiple species of the flowers. Today, species are identified by their DNA fingerprints, but in the 1930s, DNA's role in genetics had not yet been discovered. The following four attributes of each plant were measured: sepal length sepal width petal length petal width In general, we will call the individual numeric measurements we use to describe our data features. These features can be directly measured or computed from intermediate data. This dataset has four features. Additionally, for each plant, the species was recorded. The problem we want to solve is, "Given these examples, if we see a new flower out in the field, could we make a good prediction about its species from its measurements?" This is the supervised learning or classification problem: given labeled examples, can we design a rule to be later applied to other examples? A more familiar example to modern readers who are not botanists is spam filtering, where the user can mark e-mails as spam, and systems use these as well as the non-spam e-mails to determine whether a new, incoming message is spam or not. For the moment, the Iris dataset serves our purposes well. It is small (150 examples, four features each) and can be easily visualized and manipulated. Visualization is a good first step Datasets will grow to thousands of features. 
With only four in our starting example, we can easily plot all two-dimensional projections on a single page. We will build intuitions on this small example, which can then be extended to large datasets with many more features. Visualizations are excellent at the initial exploratory phase of the analysis as they allow you to learn the general features of your problem as well as catch problems that occurred with data collection early. Each subplot in the following plot shows all points projected into two of the dimensions. The outlying group (triangles) are the Iris Setosa plants, while Iris Versicolor plants are in the center (circle) and Iris Virginica are plotted with x marks. We can see that there are two large groups: one is of Iris Setosa and another is a mixture of Iris Versicolor and Iris Virginica.   In the following code snippet, we present the code to load the data and generate the plot: >>> from matplotlib import pyplot as plt >>> import numpy as np   >>> # We load the data with load_iris from sklearn >>> from sklearn.datasets import load_iris >>> data = load_iris()   >>> # load_iris returns an object with several fields >>> features = data.data >>> feature_names = data.feature_names >>> target = data.target >>> target_names = data.target_names   >>> for t in range(3): ...   if t == 0: ...       c = 'r' ...       marker = '>' ...   elif t == 1: ...       c = 'g' ...       marker = 'o' ...   elif t == 2: ...       c = 'b' ...       marker = 'x' ...   plt.scatter(features[target == t,0], ...               features[target == t,1], ...               marker=marker, ...               c=c) Building our first classification model If the goal is to separate the three types of flowers, we can immediately make a few suggestions just by looking at the data. For example, petal length seems to be able to separate Iris Setosa from the other two flower species on its own. We can write a little bit of code to discover where the cut-off is: >>> # We use NumPy fancy indexing to get an array of strings: >>> labels = target_names[target]   >>> # The petal length is the feature at position 2 >>> plength = features[:, 2]   >>> # Build an array of booleans: >>> is_setosa = (labels == 'setosa')   >>> # This is the important step: >>> max_setosa =plength[is_setosa].max() >>> min_non_setosa = plength[~is_setosa].min() >>> print('Maximum of setosa: {0}.'.format(max_setosa)) Maximum of setosa: 1.9.   >>> print('Minimum of others: {0}.'.format(min_non_setosa)) Minimum of others: 3.0. Therefore, we can build a simple model: if the petal length is smaller than 2, then this is an Iris Setosa flower; otherwise it is either Iris Virginica or Iris Versicolor. This is our first model and it works very well in that it separates Iris Setosa flowers from the other two species without making any mistakes. In this case, we did not actually do any machine learning. Instead, we looked at the data ourselves, looking for a separation between the classes. Machine learning happens when we write code to look for this separation automatically. The problem of recognizing Iris Setosa apart from the other two species was very easy. However, we cannot immediately see what the best threshold is for distinguishing Iris Virginica from Iris Versicolor. We can even see that we will never achieve perfect separation with these features. We could, however, look for the best possible separation, the separation that makes the fewest mistakes. For this, we will perform a little computation. 
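The rule we just found by eye can be written down as a tiny predicate. This is only a sketch: the value 2 is the cut-off derived above, and position 2 is the petal length feature:

>>> def is_setosa_test(example):
...     "Petal length smaller than 2 means Iris Setosa"
...     return example[2] < 2

With Iris Setosa handled by this rule, the computation below searches for the best threshold to tell the two remaining species apart.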
We first select only the non-Setosa features and labels: >>> # ~ is the boolean negation operator >>> features = features[~is_setosa] >>> labels = labels[~is_setosa] >>> # Build a new target variable, is_virginica >>> is_virginica = (labels == 'virginica') Here we are heavily using NumPy operations on arrays. The is_setosa array is a Boolean array and we use it to select a subset of the other two arrays, features and labels. Finally, we build a new boolean array, virginica, by using an equality comparison on labels. Now, we run a loop over all possible features and thresholds to see which one results in better accuracy. Accuracy is simply the fraction of examples that the model classifies correctly. >>> # Initialize best_acc to impossibly low value >>> best_acc = -1.0 >>> for fi in range(features.shape[1]): ... # We are going to test all possible thresholds ... thresh = features[:,fi] ... for t in thresh: ...   # Get the vector for feature `fi` ...   feature_i = features[:, fi] ...   # apply threshold `t` ...   pred = (feature_i > t) ...   acc = (pred == is_virginica).mean() ...   rev_acc = (pred == ~is_virginica).mean() ...   if rev_acc > acc: ...       reverse = True ...       acc = rev_acc ...   else: ...       reverse = False ... ...   if acc > best_acc: ...     best_acc = acc ...     best_fi = fi ...     best_t = t ...     best_reverse = reverse We need to test two types of thresholds for each feature and value: we test a greater than threshold and the reverse comparison. This is why we need the rev_acc variable in the preceding code; it holds the accuracy of reversing the comparison. The last few lines select the best model. First, we compare the predictions, pred, with the actual labels, is_virginica. The little trick of computing the mean of the comparisons gives us the fraction of correct results, the accuracy. At the end of the for loop, all the possible thresholds for all the possible features have been tested, and the variables best_fi, best_t, and best_reverse hold our model. This is all the information we need to be able to classify a new, unknown object, that is, to assign a class to it. The following code implements exactly this method: def is_virginica_test(fi, t, reverse, example):    "Apply threshold model to a new example"    test = example[fi] > t    if reverse:        test = not test    return test What does this model look like? If we run the code on the whole data, the model that is identified as the best makes decisions by splitting on the petal width. One way to gain intuition about how this works is to visualize the decision boundary. That is, we can see which feature values will result in one decision versus the other and exactly where the boundary is. In the following screenshot, we see two regions: one is white and the other is shaded in grey. Any datapoint that falls on the white region will be classified as Iris Virginica, while any point that falls on the shaded side will be classified as Iris Versicolor. In a threshold model, the decision boundary will always be a line that is parallel to one of the axes. The plot in the preceding screenshot shows the decision boundary and the two regions where points are classified as either white or grey. It also shows (as a dashed line) an alternative threshold, which will achieve exactly the same accuracy. Our method chose the first threshold it saw, but that was an arbitrary choice. 
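Applying the best model found by the loop to a new measurement is then a one-line call. Here is a usage sketch; the example flower's numbers are invented:

>>> example = np.array([6.3, 2.8, 5.1, 1.5])   # hypothetical measurements
>>> is_virginica_test(best_fi, best_t, best_reverse, example)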
Evaluation – holding out data and cross-validation The model discussed in the previous section is a simple model; it achieves 94 percent accuracy of the whole data. However, this evaluation may be overly optimistic. We used the data to define what the threshold will be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we tried on this dataset. The reasoning is circular. What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance in instances that the algorithm has not seen at training. Therefore, we are going to do a more rigorous evaluation and use held-out data. For this, we are going to break up the data into two groups: on one group, we'll train the model, and on the other, we'll test the one we held out of training. The full code, which is an adaptation of the code presented earlier, is available on the online support repository. Its output is as follows: Training accuracy was 96.0%. Testing accuracy was 90.0% (N = 50). The result on the training data (which is a subset of the whole data) is apparently even better than before. However, what is important to note is that the result in the testing data is lower than that of the training error. While this may surprise an inexperienced machine learner, it is expected that testing accuracy will be lower than the training accuracy. To see why, look back at the plot that showed the decision boundary. Consider what would have happened if some of the examples close to the boundary were not there or that one of them between the two lines was missing. It is easy to imagine that the boundary will then move a little bit to the right or to the left so as to put them on the wrong side of the border. The accuracy on the training data, the training accuracy, is almost always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing accuracy, which is the accuracy on a collection of examples that were not used for training. These concepts will become more and more important as the models become more complex. In this example, the difference between the accuracy measured on training data and on testing data is not very large. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing! One possible problem with what we did previously, which was to hold out data from training, is that we only used half the data for training. Perhaps it would have been better to use more training data. On the other hand, if we then leave too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well, which is impossible. We can achieve a good approximation of this impossible ideal by a method called cross-validation. One simple form of cross-validation is leave-one-out cross-validation. We will take an example out of the training data, learn a model without this example, and then test whether the model classifies this example correctly. This process is then repeated for all the elements in the dataset. 
The following code implements leave-one-out cross-validation:
>>> correct = 0.0
>>> for ei in range(len(features)):
...     # select all but the one at position `ei`:
...     training = np.ones(len(features), bool)
...     training[ei] = False
...     testing = ~training
...     model = fit_model(features[training], is_virginica[training])
...     predictions = predict(model, features[testing])
...     correct += np.sum(predictions == is_virginica[testing])
>>> acc = correct/float(len(features))
>>> print('Accuracy: {0:.1%}'.format(acc))
Accuracy: 87.0%
At the end of this loop, we will have tested a series of models on all the examples and obtained a final average result. When using cross-validation, there is no circularity problem, because each example was tested on a model that was built without taking that datapoint into account. Therefore, the cross-validated estimate is a reliable estimate of how well the models would generalize to new data.
The major problem with leave-one-out cross-validation is that we are now forced to perform many times more work. In fact, we must learn a whole new model for each and every example, and this cost will increase as the dataset grows.
We can get most of the benefits of leave-one-out at a fraction of the cost by using x-fold cross-validation, where x stands for a small number. For example, to perform five-fold cross-validation, we break up the data into five groups, the so-called folds. Then we learn five models: each time, we leave one fold out of the training data. The resulting code will be similar to the code given earlier in this section, but here we leave 20 percent of the data out instead of just one element. We test each of these models on the left-out fold and average the results.
The preceding figure illustrates this process for five blocks: the dataset is split into five pieces. For each fold, we hold out one of the blocks for testing and train on the other four.
We can use any number of folds we wish. There is a trade-off between computational efficiency (the more folds, the more computation is necessary) and accurate results (the more folds, the closer we are to using the whole of the data for training). Five folds is often a good compromise. This corresponds to training with 80 percent of the data, which should already be close to what we would get from using all of it. If we have little data, we can even consider using 10 or 20 folds. In the extreme case, with as many folds as datapoints, we are simply performing leave-one-out cross-validation. On the other hand, if computation time is an issue and we have more data, 2 or 3 folds may be the more appropriate choice.
When generating the folds, we need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, the results will not be representative. We will not go into the details of how to do this, because the machine learning package scikit-learn will handle it for us.
We have now generated several models instead of just one, so which final model do we return and use for new data? The simplest solution is to train a single overall model on all the training data. The cross-validation loop gives us an estimate of how well this model should generalize.
A cross-validation schedule allows us to use all the data to estimate whether our methods are doing well. At the end of the cross-validation loop, we can then use all the data to train a final model.
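To make the description of five-fold cross-validation concrete, here is a sketch of what it might look like by hand, again using the hypothetical fit_model and predict helpers introduced earlier. The scikit-learn helpers shown later in this article automate all of this bookkeeping.
>>> # Shuffle the example indices and split them into five folds
>>> indices = np.arange(len(features))
>>> np.random.shuffle(indices)
>>> folds = np.array_split(indices, 5)
>>> accuracies = []
>>> for fold in folds:
...     # Hold out one fold for testing, train on the other four
...     testing = np.zeros(len(features), bool)
...     testing[fold] = True
...     training = ~testing
...     model = fit_model(features[training], is_virginica[training])
...     pred = predict(model, features[testing])
...     accuracies.append(np.mean(pred == is_virginica[testing]))
>>> print('Mean accuracy: {0:.1%}'.format(np.mean(accuracies)))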
Although it was not properly recognized when machine learning was starting out as a field, nowadays it is seen as a very bad sign to even discuss the training accuracy of a classification system. This is because the results can be very misleading, and even just presenting them marks you as a newbie in machine learning. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation scheme.
Building more complex classifiers
In the previous section, we used a very simple model: a threshold on a single feature. Are there other types of systems? Yes, of course! Many others. To think of the problem at a higher abstraction level, what makes up a classification model? We can break it up into three parts:
The structure of the model: How exactly will a model make decisions? In this case, the decision depended solely on whether a given feature was above or below a certain threshold value. This is too simplistic for all but the simplest problems.
The search procedure: How do we find the model we need to use? In our case, we tried every possible combination of feature and threshold. You can easily imagine that as models get more complex and datasets get larger, it rapidly becomes impossible to attempt all combinations and we are forced to use approximate solutions. In other cases, we need to use advanced optimization methods to find a good solution (fortunately, scikit-learn already implements these for you, so using them is easy even if the code behind them is very advanced).
The gain or loss function: How do we decide which of the possibilities tested should be returned? Rarely do we find the perfect solution, the model that never makes any mistakes, so we need to decide which one to use. We used accuracy, but sometimes it is better to optimize so that the model makes fewer errors of a specific kind. For example, in spam filtering, it may be worse to delete a good e-mail than to erroneously let a bad e-mail through. In that case, we may want to choose a model that is conservative in throwing out e-mails rather than one that simply makes the fewest mistakes overall. We can discuss these issues in terms of gain (which we want to maximize) or loss (which we want to minimize). They are equivalent, but sometimes one is more convenient than the other.
We can play around with these three aspects of classifiers and get different systems. A simple threshold is one of the simplest models available in machine learning libraries and only works well when the problem is very simple, such as with the Iris dataset. In the next section, we will tackle a more difficult classification task that requires a more complex structure.
In our case, we optimized the threshold to minimize the number of errors. Alternatively, we might have different loss functions. It might be that one type of error is much costlier than the other. In a medical setting, false negatives and false positives are not equivalent. A false negative (when a test comes back negative even though the disease is present) might lead to the patient not receiving treatment for a serious disease. A false positive (when the test comes back positive even though the patient does not actually have the disease) might lead to additional confirmatory tests or to unnecessary treatment (which still has costs, including side effects from the treatment, although these are often less serious than missing a diagnosis). Therefore, depending on the exact setting, different trade-offs can make sense.
At one extreme, if the disease is fatal and the treatment is cheap with very few negative side effects, then you want to minimize false negatives as much as you can. What the gain/cost function should be is always dependent on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes, that is, achieving the highest accuracy. However, if some mistakes are costlier than others, it might be better to accept a lower overall accuracy to minimize the overall costs.
A more complex dataset and a more complex classifier
We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas.
Learning about the Seeds dataset
We now look at another agricultural dataset, which is still small, but already too large to plot exhaustively on a page as we did with Iris. This dataset consists of measurements of wheat seeds. There are seven features, which are as follows:
area A
perimeter P
compactness C = 4πA/P²
length of kernel
width of kernel
asymmetry coefficient
length of kernel groove
There are three classes, corresponding to three wheat varieties: Canadian, Kama, and Rosa. As earlier, the goal is to be able to classify the variety based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset and its features were automatically computed from digital images.
This is how image pattern recognition can be implemented: you can take images in digital form, compute a few relevant features from them, and use a generic classification system. For the moment, we will work with the features that are given to us.
UCI Machine Learning Dataset Repository
The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, they list 233 datasets). Both the Iris and Seeds datasets used in this article were taken from there. The repository is available online at http://archive.ics.uci.edu/ml/.
Features and feature engineering
One interesting aspect of these features is that the compactness feature is not actually a new measurement, but a function of the previous two features, area and perimeter. It is often very useful to derive new combined features. Trying to create new features is generally called feature engineering. It is sometimes seen as less glamorous than algorithms, but it often matters more for performance (a simple algorithm on well-chosen features will perform better than a fancy algorithm on not-so-good features).
In this case, the original researchers computed the compactness, which is a typical feature for shapes. It is also sometimes called roundness. This feature will have the same value for two kernels, one of which is twice as big as the other, but with the same shape. However, it will have different values for kernels that are very round (when the feature is close to one) compared to kernels that are elongated (when the feature is closer to zero).
The goals of a good feature are to simultaneously vary with what matters (the desired output) and be invariant with what does not. For example, compactness does not vary with size, but varies with the shape. In practice, it might be hard to achieve both objectives perfectly, but we want to approximate this ideal. You will need to use background knowledge to design good features.
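As a small illustration of such a derived feature, compactness can be computed directly from the area and perimeter columns. This is only a sketch: the array name seeds_features and the column positions used here are assumptions made for illustration, not part of any data loader shown in this article.
>>> # Assume `seeds_features` is an (n_samples, 7) array with area in
>>> # column 0 and perimeter in column 1 (hypothetical layout)
>>> area = seeds_features[:, 0]
>>> perimeter = seeds_features[:, 1]
>>> compactness = 4 * np.pi * area / perimeter ** 2
>>> # Values close to 1 indicate round kernels; smaller values, elongated ones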
Fortunately, for many problem domains, there is already a vast literature of possible features and feature types that you can build upon. For images, all of the previously mentioned features are typical, and computer vision libraries will compute them for you. In text-based problems too, there are standard solutions that you can mix and match. When possible, you should use your knowledge of the problem to design a specific feature or to select the ones from the literature that are most applicable to the data at hand. Even before you have data, you must decide which data is worthwhile to collect. Then, you hand all your features to the machine to evaluate and compute the best classifier.
A natural question is whether we can select good features automatically. This problem is known as feature selection. Many methods have been proposed for this problem, but in practice very simple ideas work best. For the small problems we are currently exploring, it does not make sense to use feature selection, but if you had thousands of features, throwing out most of them might make the rest of the process much faster.
Nearest neighbor classification
For use with this dataset, we will introduce a new classifier: the nearest neighbor classifier. The nearest neighbor classifier is very simple. When classifying a new element, it looks in the training data for the object that is closest to it, its nearest neighbor. Then, it returns that object's label as the answer. Notice that this model performs perfectly on its training data! For each point, its closest neighbor is itself, so its label matches perfectly (unless two examples with different labels have exactly the same feature values, which indicates that the features you are using are not very descriptive). Therefore, it is essential to test the classification using a cross-validation protocol.
The nearest neighbor method can be generalized to look not at a single neighbor, but at multiple ones, and take a vote amongst the neighbors. This makes the method more robust to outliers or mislabeled data.
Classifying with scikit-learn
We have been using handwritten classification code, but Python is a very appropriate language for machine learning because of its excellent libraries. In particular, scikit-learn has become the standard library for many machine learning tasks, including classification. We are going to use its implementation of nearest neighbor classification in this section.
The scikit-learn classification API is organized around classifier objects. These objects have the following two essential methods:
fit(features, labels): This is the learning step that fits the parameters of the model
predict(features): This method can only be called after fit and returns a prediction for one or more inputs
Here is how we could use its implementation of k-nearest neighbors for our data. We start by importing the KNeighborsClassifier object from the sklearn.neighbors submodule:
>>> from sklearn.neighbors import KNeighborsClassifier
The scikit-learn module is imported as sklearn (sometimes you will also find that scikit-learn is referred to by this short name instead of the full name). All of the sklearn functionality is in submodules, such as sklearn.neighbors.
We can now instantiate a classifier object. In the constructor, we specify the number of neighbors to consider, as follows:
>>> classifier = KNeighborsClassifier(n_neighbors=1)
If we do not specify the number of neighbors, it defaults to 5, which is often a good choice for classification.
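As a quick usage sketch (assuming features and labels arrays for the Seeds data have already been loaded, which is not shown in this excerpt), training and prediction are just two method calls. Keep in mind that predicting on the training data itself is the overly optimistic setting discussed earlier:
>>> classifier.fit(features, labels)
>>> # Predict the class of the first two examples; this returns an array of labels
>>> classifier.predict(features[:2])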
We will want to use cross-validation (of course) to look at our data. The scikit-learn module also makes this easy:
>>> from sklearn.cross_validation import KFold
>>> kf = KFold(len(features), n_folds=5, shuffle=True)
>>> # `means` will be a list of mean accuracies (one entry per fold)
>>> means = []
>>> for training,testing in kf:
...     # We fit a model for this fold, then apply it to the
...     # testing data with `predict`:
...     classifier.fit(features[training], labels[training])
...     prediction = classifier.predict(features[testing])
...
...     # np.mean on an array of booleans returns the fraction
...     # of correct decisions for this fold:
...     curmean = np.mean(prediction == labels[testing])
...     means.append(curmean)
>>> print("Mean accuracy: {:.1%}".format(np.mean(means)))
Mean accuracy: 90.5%
Using five folds for cross-validation, for this dataset, with this algorithm, we obtain 90.5 percent accuracy. As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but it is a more credible estimate of the performance of the model.
Looking at the decision boundaries
We will now examine the decision boundary. In order to plot it on paper, we will simplify and look at only two dimensions. Take a look at the following plot:
Canadian examples are shown as diamonds, Kama seeds as circles, and Rosa seeds as triangles. Their respective regions are shown in white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22, while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x is actually much larger than a small change in y. So, when we compute the distance between points, we are, for the most part, only taking the x axis into account. This is also a good example of why it is a good idea to visualize our data and look for red flags or surprises.
If you studied physics (and you remember your lessons), you might have already noticed that we had been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to z-scores. The z-score of a value is how far away from the mean it is, in units of standard deviation. It comes down to this operation:
f' = (f - μ) / σ
In this formula, f is the old feature value, f' is the normalized feature value, μ is the mean of the feature, and σ is the standard deviation. Both μ and σ are estimated from training data. Independent of what the original values were, after z-scoring, a value of zero corresponds to the training mean, positive values are above the mean, and negative values are below it.
The scikit-learn module makes it very easy to use this normalization as a preprocessing step. We are going to use a pipeline of transformations: the first element will do the transformation and the second element will do the classification. We start by importing both the pipeline and the feature scaling classes as follows:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
Now, we can combine them:
>>> classifier = KNeighborsClassifier(n_neighbors=1)
>>> classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)])
The Pipeline constructor takes a list of pairs (str, clf).
Each pair corresponds to a step in the pipeline: the first element is a string naming the step, while the second element is the object that performs the transformation. Advanced usage of the object uses these names to refer to the different steps.
After normalization, every feature is in the same units (technically, every feature is now dimensionless; it has no units) and we can more confidently mix dimensions. In fact, if we now run our nearest neighbor classifier, we obtain 93 percent accuracy, estimated with the same five-fold cross-validation code shown previously! Look at the decision space again in two dimensions:
The boundaries are now different and you can see that both dimensions make a difference to the outcome. In the full dataset, everything is happening in a seven-dimensional space, which is very hard to visualize, but the same principle applies; while a few dimensions were dominant in the original data, after normalization, they are all given the same importance.
Binary and multiclass classification
The first classifier we used, the threshold classifier, was a simple binary classifier. Its result is either one class or the other, as a point is either above the threshold value or it is not. The second classifier we used, the nearest neighbor classifier, is a natural multiclass classifier: its output can be any one of several classes.
It is often simpler to define a binary method than one that works on multiclass problems. However, we can reduce any multiclass problem to a series of binary decisions. This is what we did earlier with the Iris dataset, in a haphazard way: we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions:
Is it an Iris Setosa (yes or no)?
If not, check whether it is an Iris Virginica (yes or no).
Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction. The simplest is to use a series of one versus the rest classifiers. For each possible label ℓ, we build a classifier that asks, "Is this ℓ or something else?" When applying the rule, exactly one of the classifiers will say yes and we will have our solution. Unfortunately, this does not always happen, so we have to decide how to deal with either multiple positive answers or no positive answers.
Alternatively, we can build a classification tree. Split the possible labels into two and build a classifier that asks, "Should this example go in the left or the right bin?" We can perform this splitting recursively until we obtain a single label. The preceding diagram depicts the tree of reasoning for the Iris dataset. Each diamond is a single binary classifier. It is easy to imagine that we could make this tree larger and encompass more decisions. This means that any classifier that can be used for binary classification can also be adapted to handle any number of classes in a simple way.
There are many other possible ways of turning a binary method into a multiclass one. There is no single method that is clearly better in all cases. The scikit-learn module implements several of these methods in the sklearn.multiclass submodule.
Some classifiers are binary systems, while many real-life problems are naturally multiclass. Several simple protocols reduce a multiclass problem to a series of binary decisions and allow us to apply the binary models to our multiclass problem.
This means that methods that are apparently only for binary data can be applied to multiclass data with little extra effort.
Summary
Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine learning.
In a sense, this was a very theoretical article, as we introduced generic concepts with simple examples. We went over a few operations with the Iris dataset. This is a small dataset. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples. The intuitions we gained here will all still be valid.
You also learned that the training error is a misleading, overly optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that has not been used for training. In order not to waste too many examples in testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation).
We also had a look at the problem of feature engineering. Features are not predefined for you; choosing and designing features is an integral part of designing a machine learning pipeline. In fact, it is often the area where you can get the most improvements in accuracy, as better data beats fancier methods.
Resources for Article:
Further resources on this subject:
Ridge Regression [article]
The Spark programming model [article]
Using cross-validation [article]