
Metrics in vRealize Operations

Packt
26 Dec 2014
25 min read
In this article by Iwan 'e1' Rahabok, author of VMware vRealize Operations Performance and Capacity Management, we will learn that vSphere 5.5 comes with many more counters than a physical server provides. There are new counters that have no physical equivalent, such as memory ballooning, CPU latency, and vSphere replication. In addition, some counters share a name with their physical-world counterparts but behave differently in vSphere; memory usage is a common one, and a frequent source of confusion among system administrators. For counters that are similar to their physical-world counterparts, vSphere may use different units, such as milliseconds.

As a result, experienced IT administrators find it hard to master vSphere counters by building on their existing knowledge. Instead of trying to relate each counter to its physical equivalent, I find it useful to group them according to their purpose.

Virtualization formalizes the relationship between the infrastructure team and the application team. The infrastructure team changes from system builder to service provider. The application team no longer owns the physical infrastructure; it becomes a consumer of a shared service, the virtual platform. Depending on the Service Level Agreement (SLA), the application team can be served as if it had dedicated access to the infrastructure, or it can take a performance hit in exchange for a lower price. For SLAs where performance matters, a VM running in the cluster should not be impacted by any other VM. Its performance must be as good as if it were the only VM running on the ESXi host.

Because there are two different users of the counters, there are two different purposes. The application team (developers and the VM owner) only cares about its own VM.
The infrastructure team has to care about both the VMs and the infrastructure, especially when it needs to show that the shared infrastructure is not a bottleneck. One set of counters monitors the VM; the other set monitors the infrastructure. The following diagram shows the two different purposes and what we should check at each layer. By knowing what matters at each layer, we can better manage the virtual environment.

The two-tier IT organization

At the VM layer, we care whether the VM is being served well by the platform. Other VMs are irrelevant from the VM owner's point of view. A VM owner only wants to make sure his or her VM is not contending for a resource. So the key counter here is contention. Only when we are satisfied that there is no contention can we proceed to check whether the VM is sized correctly. Most people check utilization first, because that is what they are used to monitoring in a physical infrastructure. In a virtual environment, we should check for contention first.

At the infrastructure layer, we care whether it serves everyone well. Make sure that there is no contention for resources among the VMs on the platform. Only when the infrastructure is clear of contention can we troubleshoot a particular VM. If the infrastructure is having a hard time serving the majority of the VMs, there is no point troubleshooting a particular VM.

This two-layer concept is also implemented by vSphere in its compute and storage architectures. For example, there are two distinct layers of memory in vSphere: the individual VM memory provided by the hypervisor, and the physical memory at the host level. For an individual VM, we care whether the VM is getting enough memory. At the host level, we care whether the host has enough memory for everyone. Because of the difference in goals, we look at a different set of counters.
In the previous diagram, two numbers are shown in a large font, indicating the two main steps in monitoring. Each step applies to each layer (the VM layer and the infrastructure layer), so there are two numbers for each step. Step 1 is used for performance management. It is useful during troubleshooting or when checking whether we are meeting performance SLAs. Step 2 is used for capacity management. It is useful as part of long-term capacity planning. The time period for step 2 is typically 3 months, as we are checking for overall utilization and not a one-off spike. With the preceding concept in mind, we are ready to dive into more detail. Let's cover compute, network, and storage.

Compute

The following diagram shows how a VM gets its resources from ESXi. It is a pretty complex diagram, so let me walk you through it. The tall rectangular area represents a VM. Say this VM is given 8 GB of virtual RAM; the bottom line represents 0 GB and the top line represents 8 GB. The VM is configured with 8 GB of RAM. We call this Provisioned. This is what the Guest OS sees, so if it is running Windows, you will see 8 GB of RAM when you log in to Windows.

Unlike a physical server, a VM can be configured with a Limit and a Reservation. This is done outside the Guest OS, so Windows or Linux does not know about it. You should minimize the use of Limit and Reservation, as they make operations more complex.

Entitlement means what the VM is entitled to. In this example, the hypervisor entitles the VM to a certain amount of memory. I did not show a solid line, and used an italic font style, to mark that Entitlement is not a fixed value but a dynamic value determined by the hypervisor. It varies every minute, determined by the Limit, Shares, and Reservation of the VM itself and by the demands of the other VMs running on the same host. Obviously, a VM can only use what it is entitled to at any given point in time, so the Usage counter does not go higher than the Entitlement counter.
The green line shows that Usage ranges from 0 to the Entitlement value. In a healthy environment, the ESXi host has enough resources to meet the demands of all the VMs on it, with sufficient overhead. In this case, you will see that the Entitlement, Usage, and Demand counters are similar to one another when the VM is highly utilized. This is shown by the green line, where Demand stops at Usage, and Usage stops at Entitlement. The numerical values may not be identical, because vCenter reports Usage in percent as an average over the sample period, reports Entitlement in MHz as the latest value in the sample period, and reports Demand in MHz as an average over the sample period. This also explains why you may see Usage slightly higher than Entitlement on a highly utilized vCPU. If the VM has low utilization, you will see that the Entitlement counter is much higher than Usage.

An environment in which the ESXi host is resource constrained is unhealthy. The host cannot give every VM the resources it asks for. The VMs demand more than they are entitled to use, so the Usage and Entitlement counters will be lower than the Demand counter. The Demand counter can naturally go higher than Limit. For example, if a VM is limited to 2 GB of RAM and it wants to use 14 GB, Demand will exceed Limit. Obviously, Demand cannot exceed Provisioned, which is why the red line stops at Provisioned; that is as high as it can go.

The difference between what the VM demands and what it gets to use is the Contention counter: Contention is Demand minus Usage. If Contention is 0, the VM can use everything it demands. This is the ultimate goal, as performance will then match the physical world. The Contention value is useful to demonstrate that the infrastructure provides a good service to the application team.
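To make the counter relationships concrete, here is a minimal sketch. All values are illustrative placeholders chosen by me, not real vSphere output, and the variable names simply follow the discussion above:

```python
# Illustrative sketch of the counter relationships described above.
# All values are hypothetical, in MB of RAM for a single VM.

provisioned = 8192   # configured vRAM; Demand can never exceed this
limit       = 2048   # optional cap set outside the Guest OS
entitlement = 2048   # dynamic value decided by the hypervisor (here capped by the limit)

demand = 6144                       # what the VM wants to use
usage  = min(demand, entitlement)   # a VM can only use what it is entitled to
contention = demand - usage         # Contention = Demand - Usage

print(usage)        # 2048
print(contention)   # 4096 -> the VM is being short-changed
```

With Contention at 4096 MB, this hypothetical VM performs poorly even though host utilization may look unremarkable, which is exactly why contention is checked before utilization.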
If a VM owner comes to see you and says that your shared infrastructure is unable to serve his or her VM well, both of you can check the Contention counter. The Contention counter should become part of your SLA or Key Performance Indicator (KPI). It is not sufficient to track utilization alone. When there is contention, it is possible that both your VM and your ESXi host have low utilization, and yet your customers (the VMs running on that host) perform poorly. This typically happens when the VMs are relatively large compared to the ESXi host.

Let me give you a simple example to illustrate this. The ESXi host has two sockets and 20 cores, with hyper-threading disabled to keep the example simple. You run just 2 VMs, but each VM has 11 vCPUs. As a result, they cannot run concurrently: the hypervisor has to schedule them sequentially, as there are only 20 physical cores to serve 22 vCPUs. Here, both VMs will experience high contention.

Hold on! You might say, "There is no Contention counter in vSphere, and no memory Demand counter either." This is where vRealize Operations comes in. It does not just regurgitate the values in vCenter. It has implicit knowledge of vSphere and a set of derived counters with formulae that leverage that knowledge.

You need to understand how the vSphere CPU scheduler works. The following diagram shows the various states that a VM can be in.

The preceding diagram is taken from The CPU Scheduler in VMware vSphere 5.1: Performance Study (you can find it at http://www.vmware.com/resources/techresources/10345). This whitepaper documents the CPU scheduler in a good amount of depth for VMware administrators. I highly recommend you read it, as it will help you explain to your customers (the application team) how your shared infrastructure juggles all those VMs at the same time. It will also help you pick the right counters when you create your custom dashboards in vRealize Operations.
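The arithmetic of the 20-core example above can be sketched as follows. This is a deliberate simplification: the real scheduler is far more sophisticated than this all-or-nothing check, so treat it only as an illustration of why the two VMs cannot run at the same time:

```python
# Simplified view of the 2-socket, 20-core example above (hyper-threading off).
physical_cores = 20
vms = {"vm1": 11, "vm2": 11}   # vCPUs per VM

total_vcpus = sum(vms.values())
can_run_concurrently = total_vcpus <= physical_cores

print(total_vcpus)            # 22
print(can_run_concurrently)   # False -> the hypervisor schedules the VMs
                              # in turns, so both accumulate contention even
                              # though host utilization never looks saturated
```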
Storage

If you look at the ESXi and VM metric groups for storage in the vCenter performance chart, it is not clear at first glance how they relate to one another. You have storage network, storage adapter, storage path, datastore, and disk metric groups to check. How do they affect one another? I have created the following diagram to explain the relationships. The beige boxes are what you are likely to be familiar with: your ESXi host, which can have NFS Datastore, VMFS Datastore, or RDM objects. The blue boxes represent the metric groups.

From ESXi to disk

NFS and VMFS datastores differ drastically in terms of counters, as NFS is file-based while VMFS is block-based. NFS uses a vmnic, so the adapter type (FC, FCoE, or iSCSI) is not applicable, and multipathing is handled by the network, so you don't see it in the storage layer. For VMFS or RDM, you have more detailed visibility into the storage. To start off, each ESXi storage adapter is visible, and you can check the counters for each of them. In terms of relationships, one adapter can have many devices (disks or CD-ROMs). One device is typically accessed via two storage adapters (for availability and load balancing), and via two paths per adapter, with the paths diverging at the storage switch. A single path, coming from a specific adapter, connects that one adapter to one device. The following diagram shows the four paths:

Paths from ESXi to storage

A storage path takes data from ESXi to the LUN (the term used by vSphere is Disk), not to the datastore. So if a datastore has multiple extents, there are four paths per extent. This is one reason why I do not use more than one extent, as each extent adds four paths.
If you are not familiar with extents, Cormac Hogan explains them well in this blog post: http://blogs.vmware.com/vsphere/2012/02/vmfs-extents-are-they-bad-or-simply-misunderstood.html

For VMFS, you can see the same counters at both the Datastore level and the Disk level. Their values will be identical if you follow the recommended configuration of a 1:1 relationship between a datastore and a LUN, that is, if you present an entire LUN to a datastore and use all of its capacity.

The following screenshot shows how we manage ESXi storage. Click on the ESXi host you need to manage, select the Manage tab, and then the Storage subtab. Here we can see the adapters, devices, and the host cache. The screen shows an ESXi host with the list of its adapters. I have selected vmhba2, which is an FC HBA. Notice that it is connected to 5 devices. Each device has 4 paths, so I have 20 paths in total.

ESXi adapter

Let's move on to the Storage Devices tab. The following screenshot shows the list of devices. Because NFS is not a disk, it does not appear here. I have selected one of the devices to show its properties.

ESXi device

If you click on the Paths tab, you will be presented with the information shown in the next screenshot, including whether a path is active. Note that not all paths carry I/O; that depends on your configuration and multipathing software. Because each LUN typically has four paths, path management can be complicated if you have many LUNs.

ESXi paths

The story is quite different at the VM layer. A VM does not see the underlying shared storage; it sees local disks only. Regardless of whether the underlying storage is NFS, VMFS, or RDM, the VM sees virtual disks. You lose visibility into the physical adapter (for example, you cannot tell how many IOPS on vmhba2 come from a particular VM) and into the physical paths (for example, how many disk commands travelling on a given path come from a particular VM).
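The path arithmetic described above is worth making explicit. The sketch below just restates the numbers from the text (two adapters per LUN, two paths per adapter, and the vmhba2 example with 5 devices); the 3-extent datastore at the end is a hypothetical of mine, not from the screenshots:

```python
# Path arithmetic from the text: each LUN (Disk) is reached via two
# storage adapters, with two paths per adapter, giving 4 paths per LUN.
adapters_per_lun = 2
paths_per_adapter = 2
paths_per_lun = adapters_per_lun * paths_per_adapter   # 4

# The vmhba2 example above: 5 devices, each with 4 paths.
devices = 5
total_paths = devices * paths_per_lun                  # 20

# A multi-extent VMFS datastore multiplies this again (4 paths per extent).
# A hypothetical 3-extent datastore would need:
extents = 3
datastore_paths = extents * paths_per_lun              # 12
print(paths_per_lun, total_paths, datastore_paths)
```

This is why each additional extent makes path management noticeably heavier.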
You can, however, see the impact at the Datastore level and the physical Disk level. The Datastore counter is especially useful. For example, if you notice that your IOPS is higher at the Datastore level than at the virtual Disk level, it means you have a snapshot: the snapshot I/O is not visible at the virtual Disk level, as the snapshot is stored on a different virtual disk.

From VM to disk

Counters in vCenter and vRealize Operations

We compared the metric groups between vCenter and vRealize Operations. We know that vRealize Operations provides a lot more detail, especially for larger objects such as vCenter, data center, and cluster. It also provides information about the distributed switch, which is not displayed in vCenter at all. This makes it useful for big-picture analysis. We will now look at individual counters. To give us a two-dimensional analysis, I will not approach this from the vSphere objects' point of view. Instead, we will examine the four key types of metrics (CPU, RAM, network, and storage). For each type, I will give my personal take on what I think is good guidance for its value. For example, I will give guidance on a good value for CPU contention based on what I have seen in the field. This is not an official VMware recommendation; I will state the official or popular recommendation where I am aware of one.

You should spend time understanding vCenter counters and esxtop counters. This section of the article is not meant to replace the manuals. I encourage you to read the vSphere documentation on this topic, as it gives you the required foundation for working with vRealize Operations. The following are the links to this topic:

The link for vSphere 5.5 is http://pubs.vmware.com/vsphere-55/index.jsp#com.vmware.vsphere.monitoring.doc/GUID-12B1493A-5657-4BB3-8935-44B6B8E8B67C.html.
If this link does not work, visit https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html and then navigate to ESXi and vCenter Server 5.5 Documentation | vSphere Monitoring and Performance | Monitoring Inventory Objects with Performance Charts.

The counters are documented in the vSphere API. You can find them at http://pubs.vmware.com/vsphere-55/index.jsp#com.vmware.wssdk.apiref.doc/vim.PerformanceManager.html. If this link has changed and no longer works, open the vSphere online documentation and navigate to vSphere API/SDK Documentation | vSphere Management SDK | vSphere Web Services SDK Documentation | VMware vSphere API Reference | Managed Object Types | P, then choose Performance Manager from the list under the letter P.

The esxtop manual provides good information on the counters. You can find it at https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html. You should also be familiar with the architecture of ESXi, especially how the scheduler works.

vCenter has a different collection interval (sampling period) depending on the timeline you are looking at. Most of the time you will look at the real-time chart, as the other timelines do not have enough counters; you will notice right away that most counters become unavailable once you choose a longer timeline. In the real-time chart, each data point holds 20 seconds' worth of data. That is as granular as it gets in vCenter. Because all other performance management tools (including vRealize Operations) get their data from vCenter, they do not get anything more granular than this. As mentioned previously, esxtop lets you sample down to a minimum of 2 seconds.

Speaking of esxtop, you should be aware that not all counters are exposed in vCenter. For example, if you turn on 3D graphics, a separate SVGA thread is created for that VM. This thread can consume CPU, and it will not show up in vCenter.
The Mouse, Keyboard, Screen (MKS) threads, which give you the console, also do not show up in vCenter.

The next screenshot shows how you lose most of your counters if you choose a timespan other than real time. In the case of CPU, you are basically left with two counters, as Usage and Usage in MHz cover the same thing. You also lose the ability to monitor per core, as the target objects now list only the host and not the individual cores.

Counters are lost beyond 1 hour

Because the real-time timespan only lasts for 1 hour, performance troubleshooting has to be done at the present moment. If a performance issue cannot be recreated, there is no way to troubleshoot it in vCenter. This is where vRealize Operations comes in, as it keeps your data for a much longer period. I was able to troubleshoot a problem for a client that had occurred more than a month earlier!

vRealize Operations takes data every 5 minutes. This means it is not suitable for troubleshooting a performance problem that does not last 5 minutes. In fact, if a performance issue lasts only 5 minutes, you may not get any alert, because the collection may happen exactly in the middle of it. For example, let's assume the CPU is idle from 08:00:00 to 08:02:30, spikes from 08:02:30 to 08:07:30, and is idle again from 08:07:30 to 08:10:00. If vRealize Operations collects at exactly 08:00, 08:05, and 08:10, you will not see the full spike, as it is spread over two data points. For vRealize Operations to pick up a spike in its entirety, without any idle data averaged in, the spike has to last 10 minutes or more.

For some metrics, the underlying unit is actually a 20-second interval: vRealize Operations averages a set of 20-second data points into a single 5-minute data point. The Rollups column is important. Average means the average over 5 minutes in the case of vRealize Operations. The Summation rollup is actually an average, used for counters where accumulation makes more sense. An example is CPU Ready time.
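The averaging effect in the spike example above can be reproduced with a short sketch. The timings come straight from the text; the 100 percent load figure is my own illustrative assumption:

```python
# The spike example above: CPU at 100% from 08:02:30 to 08:07:30 (t = 150 s
# to 450 s after 08:00:00), idle otherwise. With one averaged data point per
# 5-minute window, the spike is split across two windows and never shows
# at full strength.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between intervals [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

spike = (150, 450)             # busy interval, seconds after 08:00:00
for window_start in (0, 300):  # the 08:00-08:05 and 08:05-08:10 windows
    busy_seconds = overlap(window_start, window_start + 300, *spike)
    avg_percent = 100 * busy_seconds / 300
    print(avg_percent)         # 50.0 for each window, never 100
```

Each stored data point reports only 50 percent, so a genuine 5-minute saturation event looks like moderate load in both windows.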
CPU Ready time accumulates over the sampling period. Over a period of 20 seconds, a VM may accumulate 200 milliseconds of CPU Ready time. This translates into 1 percent, which is why I said the Summation rollup is similar to an average: you lose the peak. Latest, on the other hand, is different. It takes the last value of the sampling period; for example, over a 20-second sample it takes the value between the 19th and 20th seconds. This value can be lower or higher than the average of the entire 20-second period. So what is missing in all cases is the peak within the sampling period. For each 5-minute period, vRealize Operations does not collect the low, average, and high values from vCenter; it takes the average only.

Let's talk about the Units column now. Some common units are milliseconds, MHz, percent, KBps, and KB. Some counters are shown in MHz, which means you need to know the frequency of your ESXi host's physical CPUs. This can be difficult because of CPU power-saving features, which lower the CPU frequency when demand is low. In large environments, this can be operationally difficult, as you have ESXi hosts from different generations (which are therefore likely to run at different frequencies). This is also why I state that the cluster is the smallest logical building block: if your cluster has ESXi hosts with different frequencies, these MHz-based counters can be difficult to use, as the VMs get vMotion-ed between hosts by DRS.

vRealize Operations versus vCenter

I mentioned earlier that vRealize Operations does not simply regurgitate what vCenter has. Some vSphere-specific characteristics are not properly understood by traditional management tools, and partial understanding can lead to misunderstanding. vRealize Operations starts by fully understanding the unique behavior of vSphere, then simplifies it by consolidating and standardizing the counters. For example, vRealize Operations creates derived counters such as Contention and Workload, and applies them to CPU, RAM, disk, and network.
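The CPU Ready conversion mentioned above (200 ms accumulated over a 20-second sample equals 1 percent) is simple arithmetic, sketched here for reference:

```python
# Convert an accumulated CPU Ready value (milliseconds) into a percentage
# of the sampling period, as in the 200 ms over 20 s example above.
def ready_percent(ready_ms, interval_s=20):
    return 100.0 * ready_ms / (interval_s * 1000)

print(ready_percent(200))   # 1.0 -> 200 ms of Ready in a 20 s sample = 1 percent
```

The same formula shows why the rollup loses the peak: a burst of 2000 ms of Ready in one 20-second sample, averaged with fourteen quiet samples into a 5-minute data point, reports well under 1 percent.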
Let's take a look at one example of how partial information can be misleading in a troubleshooting scenario. It is common for customers to invest in ESXi hosts with plenty of RAM; I've seen hosts with 256 to 512 GB of RAM. One reason behind this is the way vCenter displays information. In the following screenshot, vCenter is giving me an alert: the host is running at high memory utilization. I'm not showing the other host, but you can see that it has a warning, as its utilization is high too. The screenshots are all from vCenter 5.0 and vCenter Operations 5.7, but the behavior is still the same in vCenter 5.5 Update 2 and vRealize Operations 6.0.

vSphere 5.0 – Memory alarm

I'm using vSphere 5.0 and vCenter Operations 5.x for the screenshots because I want to illustrate the point I stated earlier about the rapid change of vCloud Suite. The first step is to check whether someone has modified the alarm by reducing the threshold. The next screenshot shows that utilization above 95 percent triggers an alert, while utilization above 90 percent triggers a warning, and the threshold has to be breached for at least 5 minutes. The alarm is set to a suitably high configuration, so we will assume the alert genuinely indicates high utilization on the host.

vSphere 5.0 – Alarm settings

Let's verify the memory utilization. I'm checking both hosts, as there are two of them in the cluster. Both are indeed high. The utilization of vmsgesxi006 went down in the time it took to review the Alarm Settings tab and move to this view, so both hosts are now in the Warning status.

vSphere 5.0 – Hosts tab

Now we will look at the vmsgesxi006 specification. From the following screenshot, we can see it has 32 GB of physical RAM, and its RAM usage is 30747 MB. That is 93.8 percent utilization.
vSphere – Host's summary page

Since all the numbers shown in the preceding screenshot are refreshed within minutes, we need to check a longer timeline to make sure this is not a one-time spike. So let's check the last 24 hours. The next screenshot shows that the utilization was indeed consistently high: for the entire 24-hour period it stayed above 92.5 percent, and it hit 95 percent several times. So this ESXi host does seem to need more RAM.

Deciding whether to add more RAM is complex; there are many factors to be considered. There will be downtime on the host, and you need to do it for every host in the cluster, since you need to maintain a consistent build cluster-wide. Because the ESXi host is highly utilized, I should increase the RAM significantly so that I can support more VMs or larger VMs. Buying bigger DIMMs may mean throwing away the existing DIMMs, as there are rules restricting the mixing of DIMMs, and mixing DIMMs also increases management complexity. A new DIMM may require a BIOS update, which may trigger a change request. Alternatively, the larger DIMM may not be compatible with the existing host, in which case I have to buy a new box. So a RAM upgrade may trigger a host upgrade, which is a larger project.

Before jumping into a procurement cycle to buy more RAM, let's double-check our findings. It is important to ask "What is the host used for?" and "Who is using it?". In this example scenario, we examined a lab environment, the VMware ASEAN lab. Let's check the memory utilization again, this time with that context in mind. The preceding graph shows high memory utilization over a 24-hour period, yet no one was using the lab in the early hours of the morning! I know this because I am the lab administrator.

We will now turn to vCenter Operations for an alternative view. The following screenshot from vCenter Operations 5 tells a different story: CPU, RAM, disk, and network are all in the healthy range.
Specifically for RAM, the host shows 97 percent utilization but only 32 percent demand. Note that the Memory chart is divided into two parts. The upper part is at the ESXi level, while the lower part shows the individual VMs on that host. The upper part is in turn split into two: a green rectangle (Demand) sits on top of a grey rectangle (Usage). The green rectangle shows a healthy figure, at around 10 GB; the grey rectangle is much longer, almost filling the entire area. The lower part shows the hypervisor's and the VMs' memory utilization; each little green box represents one VM.

On the bottom left, note the KEY METRICS section. vCenter Operations 5 shows that Memory | Contention is 0 percent. This means none of the VMs running on the host is contending for memory. They are all being served well!

vCenter Operations 5 – Host's details page

I mentioned earlier that the behavior remains the same in vCenter 5.5, so let's take a look at how memory utilization is shown there. The next screenshot shows the counters provided by vCenter 5.5. This is from a different ESXi host, as I want to provide you with a second example. Notice that ballooning is 0, so there is no memory pressure on this host. The host has 48 GB of RAM. About 26 GB has been mapped to VMs or the VMkernel, which is shown by the Consumed counter (the highest line in the chart; notice that its value is almost constant). The Usage counter shows 52 percent because it is taken from Consumed. The active memory is a lot lower, as you can see from the line at the bottom; notice that it is not a simple straight line, as its value goes up and down. This shows that the Usage counter is based on the Consumed counter, not the Active counter.

vCenter 5.5 Update 1 memory counters

At this point, some readers might wonder whether this is a bug in vCenter. No, it is not. There are situations in which you want to use consumed memory rather than active memory. In fact, some applications may not run properly if you size based on active memory.
Also, technically it is not a bug, as the data it gives is correct. It is just that additional data would give a more complete picture, since we are at the ESXi level and not at the VM level. vRealize Operations distinguishes between active memory and consumed memory and provides both types of data. vCenter uses the Consumed counter for the utilization of an ESXi host; as you will see later in this article, it uses the Active counter for the utilization of a VM. So the Usage counter has a different formula in vCenter depending on the object. This makes sense, as they are at different levels. vRealize Operations uses the Active counter for utilization.

Just because a physical DIMM on the motherboard is mapped to a virtual DIMM in a VM, that does not mean it is actively used (read or written). You can use that DIMM for other VMs and you will not incur (for practical purposes) any performance degradation. It is common for Microsoft Windows to initialize pages with zeroes upon boot but never use them subsequently.

For further information on this topic, I recommend Kit Colbert's VMworld 2012 presentation on memory in vSphere; the content is still relevant for vSphere 5.x. The title is Understanding Virtualized Memory Performance Management and the session ID is INF-VSP1729. You can find it at http://www.vmworld.com/docs/DOC-6292. If the link has changed, the full list of VMworld 2012 sessions is at http://www.vmworld.com/community/sessions/2012/.

Not all performance management tools understand this vCenter-specific characteristic; they would simply recommend that you buy more RAM.

Summary

In this article, we covered the world of counters in vCenter and vRealize Operations. The counters were analyzed based on their four main groupings (CPU, RAM, disk, and network). We also covered each of the metric groups, which map to the corresponding objects in vCenter. For the counters, we also discussed how they are related and how they differ.
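The two utilization formulas discussed above can be put side by side with a quick sketch. The totals come from the 48 GB host example; the active-memory figure is a hypothetical value of mine, chosen only to show the gap:

```python
# Host-level Usage in vCenter is based on Consumed; vRealize Operations
# bases utilization on Active. Figures approximate the 48 GB host above;
# active_mb is an assumed value for illustration.
total_mb    = 48 * 1024
consumed_mb = 26 * 1024   # memory mapped to VMs or the VMkernel
active_mb   = 5 * 1024    # hypothetical: memory actually read/written recently

usage_from_consumed = 100 * consumed_mb / total_mb
usage_from_active   = 100 * active_mb / total_mb

print(round(usage_from_consumed))   # 54 -> in the ballpark of the 52 percent vCenter shows
print(round(usage_from_active))     # 10 -> the much lower "different story"
```

The same host can therefore look nearly full or mostly idle depending on which counter the tool picks, which is the whole point of the troubleshooting scenario above.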


Using PhpStorm in a Team

Packt
26 Dec 2014
11 min read
In this article by Mukund Chaudhary and Ankur Kumar, authors of the book PhpStorm Cookbook, we will cover the following recipes:

Getting a VCS server
Creating a VCS repository
Connecting PhpStorm to a VCS repository
Storing a PhpStorm project in a VCS repository

Getting a VCS server

The first action you have to undertake is to decide which version control system (VCS) you are going to use. There are a number of systems available, such as Git and Subversion (commonly known as SVN). Subversion is free and open source software that you can download and install on your development server. There is an older system named Concurrent Versions System (CVS); both are meant to provide a code versioning service. SVN is the newer of the two and supposedly faster than CVS, so in order to give you the latest information, this text will concentrate on the features of Subversion only.

Getting ready

So, finally that moment has arrived when you will start working in a team by getting a VCS for you and your team. The installation of SVN on the development system can be done in two ways: easy and difficult. The difficult way can be skipped without consideration, because it is for developers who want to contribute to the Subversion project itself. Since you are dealing with PhpStorm, you should remember the easier way, because you have a lot more to do.

How to do it...

The installation is very easy. There is the aptitude utility available on Debian-based systems, and the Yum utility available on Red Hat-based systems. Perform the following steps:

You just need to issue the command apt-get install subversion. The operating system's package manager will do the remaining work for you. In a very short time, after flooding the command-line console with messages, you will have the Subversion system installed.
To check whether the installation was successful, issue the command whereis svn. If there is a message, it means that you installed Subversion successfully. If you do not want to bear the load of installing Subversion on your development system, you can use a commercial third-party server. But that is more of a layman's approach to solving the problem, and no PhpStorm cookbook author would recommend it. You are a software engineer; you should not let go so easily.

How it works...

When you install the version control system, you actually install a server that provides the version control service to version control clients. The Subversion service listens for incoming connections from remote clients on port 3690 by default.

There's more...

If you want to install the older companion, CVS, you can do that in a similar way, as shown in the following steps:

You need to download the archive for the CVS server software.
You need to unpack it from the archive using your favorite unpacking software. You can move it to another convenient location, since you will not need to disturb this folder in the future.
You then need to move into the directory, where your compilation process will start. You need to run # ./configure to create the make targets.
Having made the targets, you need to enter # make install to complete the installation procedure.

Due to it being older software, you might have to compile from source as the only alternative.

Creating a VCS repository

More often than not, a PHP programmer is expected to know some system concepts, because it is often necessary to change settings for the PHP interpreter. The changes could be in the form of, say, changing the execution time or adding/removing modules, and so on. In order to start working in a team, you are going to get your hands dirty with system actions.
Getting ready

You will have to create a new repository on the development server so that PhpStorm can act as a client and connect to it. Here, it is important to note the difference between an SVN client and an SVN server: an SVN client can be a standalone client or a client embedded in an IDE. The SVN server, on the other hand, is a single item—a continuously running process on a server of your choice.

How to do it...

You need to be careful while performing this activity, as a single mistake can ruin your efforts. Perform the following steps:

1. There is a command, svnadmin, that you need to know. Using this command, you can create a new directory on the server that will contain the code base. Be careful when selecting this directory, as it will appear in your SVN URLs for the rest of your life. The command should be executed as:

svnadmin create /path/to/your/repo/

2. Having created a new repository on the server, you need to configure it—every server requires a configuration. The SVN server configuration is located under /path/to/your/repo/conf/ in a file named svnserve.conf. Inside the file, you need to make three changes; add these lines at the bottom of the file:

anon-access = none
auth-access = write
password-db = passwd

3. There has to be a password file authorizing the list of users who are allowed to use the repository. The password file in this case is named passwd (the default filename). The file contains a number of lines, each holding a username and the corresponding password in the form username = password. Since this file is parsed by the server according to a particular algorithm, you don't have the freedom to leave stray spaces in it—error messages will be displayed in those cases.
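The two configuration files from steps 2 and 3 can be generated in one go. The sketch below writes them into a scratch directory; the repository path and the alice/s3cret credentials are hypothetical, and on a real server the conf/ directory (with a stock svnserve.conf) already exists inside the repository that svnadmin created.

```shell
# Hypothetical repository location; svnadmin would normally create conf/ for us.
REPO=/tmp/demo-repo
mkdir -p "$REPO/conf"

# svnserve.conf: forbid anonymous access, let authenticated users write,
# and point the server at the password database.
cat >> "$REPO/conf/svnserve.conf" <<'EOF'
anon-access = none
auth-access = write
password-db = passwd
EOF

# passwd: one "username = password" pair per line (sample credentials only).
cat > "$REPO/conf/passwd" <<'EOF'
alice = s3cret
EOF

grep -c '=' "$REPO/conf/svnserve.conf"   # counts the three directives added above
```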
4. Having made the appropriate settings, you can now start the SVN service so that an SVN client can access it. Issue the command svnserve -d to do so.

5. It is always good practice to check that what you have done is correct. To validate the installation, issue the command svn ls svn://user@host/path/to/subversion/repo/. The output will be as shown in the following screenshot:

How it works...

The svnadmin command is used to perform admin tasks on the Subversion server. The create option creates a new folder on the server that acts as the repository, ready for access from Subversion clients. The configuration file is created by default at the time of server installation; the lines added to it are configuration directives that control the behavior of the Subversion server. Thus, the settings mentioned above prevent anonymous access and restrict write operations to the users whose access details are listed in the password file.

The svnserve command, again run on the server side, starts an instance of the server. The -d switch tells the server to run as a daemon (system process). This also means that your server will continue running until you stop it manually or the entire system goes down.

Again, you can skip this section if you have opted for a third-party version control service provider.

Connecting PhpStorm to a VCS repository

The real utility of software is when you use it. So, having installed the version control system, you need to be prepared to use it.

Getting ready

SVN being client-server software, having installed the server, you now need a client. Don't worry about having to search for a good SVN client: one has been factory-provided to you inside PhpStorm. The PhpStorm SVN client provides features that accelerate your development tasks by giving you detailed information about the changes made to the code.
So, go ahead and connect PhpStorm to the Subversion repository you created.

How to do it...

In order to connect PhpStorm to the Subversion repository, you first need to activate the Subversion view, which is available at View | Tool Windows | Svn Repositories. Then perform the following steps:

1. Having activated the Subversion view, you need to add the repository location to PhpStorm. To do that, use the + symbol in the top-left corner of the view you have opened, as shown in the following screenshot:
2. Upon selecting the Add option, PhpStorm asks for the location of the repository. You need to provide the full location of the repository.
3. Once you provide the location, you will be able to see the repository in the same Subversion view in which you pressed the Add button.

Here, you should always keep in mind the correct protocol to use. This depends on the way you installed the Subversion system on the development machine:

- If you used the default installation (installing from the installer utility, apt-get or aptitude), specify svn://.
- If you have configured SVN to be accessible via SSH, specify svn+ssh://.
- If you have explicitly configured SVN to be used with the Apache web server, specify http://.
- If you configured SVN with Apache over the secure protocol, specify https://.

Storing a PhpStorm project in a VCS repository

Here comes the actual start of the teamwork. Even if you and your other team members have connected to the repository, what advantage does that serve on its own? Correct—the actual thing is the code that you work on; it is the code that earns you your bread.

Getting ready

You should now store a project in the Subversion repository so that the other team members can work on it and add more features to your code. It is time to add a project to version control.
It is not that you need to start a new project from scratch to add to the repository; any project, any work that you have done and wish to have the team work on, can be added to the repository. Since the most relevant project in the current context is the cooking project, you can try adding that. There you go.

How to do it...

In order to add a project to the repository, perform the following steps:

1. Use the menu item provided at VCS | Import into version control | Share project (subversion). PhpStorm will ask you a question, as shown in the following screenshot:
2. Select the correct hierarchy to define the share target—the location where your project will be saved. If you wish to create tags and branches in the code base, select the corresponding checkbox. It is good practice to provide comments on the commits that you make; the reason becomes apparent when you sit down to create a release document, and comments also make the changes more understandable to the other team members.
3. PhpStorm then asks you which format you want the working copy to be in. This is related to the version of the version control software. You just need to smile, select the latest version number, and proceed, as shown in the following screenshot:
4. Having done that, PhpStorm will ask you to enter your credentials. Enter the same credentials that you saved in the configuration file (see the Creating a VCS repository recipe) or the credentials that your service provider gave you. You can ask PhpStorm to save the credentials for you, as shown in the following screenshot:

How it works...

Here it is worth understanding what is going on behind the curtains. When you perform any Subversion-related task in PhpStorm, an inbuilt SVN client executes the commands for you. Thus, when you add a project to version control, the code base is given a version number. This makes the version control system remember the state of the code base.
In other words, when you add the code base to version control, you add a checkpoint that you can revisit at any point in the future, for as long as the code base remains under the same version control system. An interesting phenomenon, isn't it?

There's more...

If you installed the version control software yourself and did not configure it to store passwords in encrypted form, PhpStorm will warn you about it, as shown in the following screenshot:

Summary

We got to know about version control systems, the step-by-step process of creating a VCS repository, and connecting PhpStorm to a VCS repository.

Resources for Article:

Further resources on this subject:
- FuelPHP [article]
- A look into the high-level programming operations for the PHP language [article]
- PHP Web 2.0 Mashup Projects: Your Own Video Jukebox: Part 1 [article]

Packt
24 Dec 2014
3 min read

Introduction to Veeam® ONE™ Business View

In this article by Kevin L. Sapp, author of the book Managing Virtual Infrastructure with Veeam® ONE™, we will have a look at how Veeam® ONE™ Business View allows you to group and manage your virtual infrastructure in business containers. This is helpful for splitting machines by function, priority, or any other descriptive category you would like. Veeam® ONE™ Business View displays the categorized information about VMs, clusters, and hosts in business terms. This perspective allows you to plan, control, and analyze the changes in the virtual environment. We will also have a look at data collection.

(For more resources related to this topic, see here.)

Data collection

The data required to create the business topology is periodically collected from the connected virtual infrastructure servers. Data collection usually runs at a scheduled interval; however, you can also run it manually. By default, after a virtual infrastructure server is connected to Veeam® ONE™, collection is scheduled to run on a weekday at 2 a.m. If required, you can adjust the data collection schedule or switch to the manual collection mode to start each data collection session yourself.

Scheduling the data collection

The best way to automate the collection of data is to create a schedule for a specific VM server. To change the collection mode to Scheduled and to specify the time settings, use the following steps:

1. Open the Veeam® ONE™ Business View web application, either by double-clicking on the desktop icon or by connecting to the Veeam® ONE™ server in a browser using the URL http://servername:1340 (the default).
2. Click on the Configuration link located in the upper-right corner of the screen.
3. Click on the VI Management Servers menu option located on the left-hand side of the screen.
4. Select the Run mode option for the server whose schedule you would like to change.
While scheduling the data collection for the VM server, perform the following steps:

1. Select the Periodically every option if you plan to run the data collection at a desired interval.
2. Select the Daily at this time option if you plan to run the data collection at a specific time of the day or week.
3. Once the schedule has been created, click on OK.

Collecting data manually

The following steps are needed to perform a manual collection of the virtual environment data:

1. Click on the Session History menu item on the left-hand side of the screen.
2. Click on the Run Now button for the server on which you wish to run the data collection manually. The data collection normally takes a few minutes to run; however, this can vary based on the size and complexity of your infrastructure.
3. View the details of the session data by clicking on the server in the list shown in Session History.

Summary

In this article, we explained Veeam® ONE™ Business View. We discussed the steps needed to plan, control, and analyze the changes in the virtual environment.

Resources for Article:

Further resources on this subject:
- Configuring vShield App [article]
- Backups in the VMware View Infrastructure [article]
- Introduction to Veeam® Backup & Replication for VMware [article]

Packt
24 Dec 2014
5 min read

Cassandra High Availability: Replication

This article by Robbie Strickland, the author of Cassandra High Availability, describes the data replication architecture used in Cassandra. Replication is perhaps the most critical feature of a distributed data store, as it would otherwise be impossible to make any sort of availability guarantee in the face of a node failure. As you already know, Cassandra employs a sophisticated replication system that allows fine-grained control over replica placement and consistency guarantees. In this article, we'll explore Cassandra's replication mechanism in depth.

Let's start with the basics: how Cassandra determines the number of replicas to be created and where to locate them in the cluster. We'll begin the discussion with a feature that you'll encounter the very first time you create a keyspace: the replication factor.

(For more resources related to this topic, see here.)

The replication factor

On the surface, setting the replication factor seems to be a fundamentally straightforward idea. You configure Cassandra with the number of replicas you want to maintain (during keyspace creation), and the system dutifully performs the replication for you, thus protecting you when something goes wrong. So by defining a replication factor of three, you will end up with a total of three copies of the data. There are a number of variables in this equation, though. Let's start with the basic mechanics of setting the replication factor.

Replication strategies

One thing you'll quickly notice is that the semantics of setting the replication factor depend on the replication strategy you choose. The replication strategy tells Cassandra exactly how you want replicas to be placed in the cluster. There are two strategies available:

- SimpleStrategy: This strategy is used for single data center deployments.
It is fine to use this for testing, development, or simple clusters, but it is discouraged if you ever intend to expand to multiple data centers (including virtual data centers such as those used to separate analysis workloads).

- NetworkTopologyStrategy: This strategy is used when you have multiple data centers, or if you think you might have multiple data centers in the future. In other words, you should use this strategy for your production cluster.

SimpleStrategy

As a way of introducing this concept, we'll start with an example using SimpleStrategy. The following Cassandra Query Language (CQL) block will allow us to create a keyspace called AddressBook with three replicas:

CREATE KEYSPACE AddressBook
WITH REPLICATION = {
  'class' : 'SimpleStrategy',
  'replication_factor' : 3
};

The data is assigned to a node via a hash algorithm, resulting in each node owning a range of data. Let's take another look at the placement of our example data on the cluster. Remember that the keys are first names, and we determined the hash using the Murmur3 hash algorithm. The primary replica for each key is assigned to a node based on its hashed value. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive).

While using SimpleStrategy, Cassandra will locate the first replica on the owner node (the one determined by the hash algorithm), then walk the ring in a clockwise direction to place each additional replica, as follows:

Additional replicas are placed in adjacent nodes when using manually assigned tokens

In the preceding diagram, the keys in bold represent the primary replicas (the ones placed on the owner nodes), with subsequent replicas placed in adjacent nodes, moving clockwise from the primary. Although each node owns a set of keys based on its token range(s), there is no concept of a master replica. In Cassandra, unlike many other database designs, every replica is equal.
This means reads and writes can be made to any node that holds a replica of the requested key.

If you have a small cluster where all nodes reside in a single rack inside one data center, SimpleStrategy will do the job. This makes it the right choice for local installations, development clusters, and other similar simple environments where expansion is unlikely, because there is no need to configure a snitch (which will be covered later in this section). For production clusters, however, it is highly recommended that you use NetworkTopologyStrategy instead. This strategy provides a number of important features for more complex installations where availability and performance are paramount.

NetworkTopologyStrategy

When it's time to deploy your live cluster, NetworkTopologyStrategy offers two additional properties that make it more suitable for this purpose:

- Rack awareness: Unlike SimpleStrategy, which places replicas naively, this feature attempts to ensure that replicas are placed in different racks, thus preventing service interruption or data loss due to failures of switches, power, cooling, and other similar events that tend to affect single racks of machines.
- Configurable snitches: A snitch helps Cassandra to understand the topology of the cluster. There are a number of snitch options for any type of network configuration.

Here's a basic example of a keyspace using NetworkTopologyStrategy:

CREATE KEYSPACE AddressBook
WITH REPLICATION = {
  'class' : 'NetworkTopologyStrategy',
  'dc1' : 3,
  'dc2' : 2
};

In this example, we're telling Cassandra to place three replicas in a data center called dc1 and two replicas in a second data center called dc2.

Summary

In this article, we introduced the foundational concepts of replication and consistency. In our discussion, we outlined the importance of the relationship between replication factor and consistency level, and their impact on performance, data consistency, and availability.
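The clockwise walk that SimpleStrategy performs can be sketched in a few lines of shell. The five node labels and the owner position below are made up purely for illustration; in a real cluster the walk starts from whichever node the partitioner's hash assigns as the token owner.

```shell
# Toy ring of five nodes, listed in token order (hypothetical labels).
NODES="A B C D E"
RF=3        # replication factor, as in the keyspace definition above
OWNER=2     # 1-based ring position of the node that owns the key's hash

N=$(echo $NODES | wc -w)
replicas=""
i=0
while [ $i -lt $RF ]; do
  pos=$(( (OWNER - 1 + i) % N + 1 ))          # step clockwise, wrapping the ring
  replicas="$replicas $(echo $NODES | cut -d' ' -f$pos)"
  i=$((i + 1))
done
echo "replicas:$replicas"    # the owner node plus its two clockwise neighbors
```

Because the modulo wraps around, a key owned by node E would place its remaining replicas on A and B, which is exactly the adjacent-node behavior shown in the diagram.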
By now, you should be able to make sound decisions specific to your use cases. This article might also serve as a handy reference in the future, as it can be challenging to keep all of these details in mind.

Resources for Article:

Further resources on this subject:
- An overview of architecture and modeling in Cassandra [Article]
- Basic Concepts and Architecture of Cassandra [Article]
- About Cassandra [Article]

Packt
24 Dec 2014
24 min read

Using Frameworks

In this article by Alex Libby, author of the book Responsive Media in HTML5, we will cover the following topics:

- Adding responsive media to a CMS
- Implementing responsive media in frameworks such as Twitter Bootstrap
- Using the Less CSS preprocessor to create CSS media queries

Ready? Let's make a start!

(For more resources related to this topic, see here.)

Introducing our three examples

Throughout this article, we've covered a number of simple, practical techniques to make media responsive within our sites—these are good, but nothing beats seeing these principles used in a real-world context, right? Absolutely! To prove this, we're going to look at three examples throughout this article, using technologies that you are likely to be familiar with: WordPress, Bootstrap, and Less CSS. Each demo assumes a certain level of prior knowledge, so it may be worth reading up a little first. In all three cases, we should see that, with little effort, we can easily add responsive media to each of these technologies. Let's kick off with a look at working with WordPress.

Adding responsive media to a CMS

We will begin the first of our three examples with a look at using the ever-popular WordPress system. Created back in 2003, WordPress has been used to host sites by everyone from small independent traders up to Fortune 500 companies—this includes some of the biggest names in business, such as eBay, UPS, and Ford. WordPress comes in two flavors; the one we're interested in is the self-install version available at http://www.wordpress.org. This example assumes you have a local installation of WordPress installed and working; if not, head over to http://codex.wordpress.org/Installing_WordPress and follow the tutorial to get started. We will also need a DOM inspector such as Firebug installed; it can be downloaded from http://www.getfirebug.com if you need it.
If you only have access to WordPress.com (the other flavor of WordPress), some of the tips in this section may not work, due to limitations in that version of WordPress. Okay, assuming we have WordPress set up and running, let's make a start on making uploaded media responsive.

Adding responsive media manually

It's at this point that you're probably thinking we have to do something complex when working in WordPress, right? Wrong! As long as you use the Twenty Fourteen core theme, the work has already been done for you. For this exercise and the following sections, I will assume you have installed and/or activated WordPress' Twenty Fourteen theme.

Don't believe me? It's easy to verify: try uploading an image to a post or page in WordPress. Resize the browser—you should see the image shrink or grow as the browser window changes size. If we take a look at the code elsewhere using Firebug, we can also see height: auto set against a number of the img tags; this is frequently done for responsive images to ensure they maintain the correct proportions. The responsive style works well in the Twenty Fourteen theme; if you are using an older theme, we can easily apply the same style rule to images stored in WordPress under that theme.

Fixing a responsive issue

So far, so good. We have the Twenty Fourteen theme in place, we've uploaded images of various sizes, and we try resizing the browser window... only to find that the images don't seem to grow in size above a certain point—at least not well. What gives? Well, it's a classic trap: we've talked about using percentage values to dynamically resize images, only to find that we've shot ourselves in the foot (proverbially speaking, of course!). The reason? Let's dive in and find out using the following steps:

1. Browse to your WordPress installation and activate Firebug using F12.
2. Switch to the HTML tab and select your preferred image.
3. In Firebug, look for the <header class="entry-header"> line, then look for the following rule in the rendered styles on the right-hand side of the window:

.site-content .entry-header,
.site-content .entry-content,
.site-content .entry-summary,
.site-content .entry-meta,
.page-content {
  margin: 0 auto;
  max-width: 474px;
}

The keen-eyed amongst you should hopefully spot the issue straightaway—we're using percentages to make the sizes dynamic for each image, yet we're constraining the parent container! To fix this, change the highlighted line as indicated:

.site-content .entry-header,
.site-content .entry-content,
.site-content .entry-summary,
.site-content .entry-meta,
.page-content {
  margin: 0 auto;
  max-width: 100%;
}

4. To balance the content, we need to make the same change to the comments area, so go ahead and change max-width to 100% here as well:

.comments-area {
  margin: 48px auto;
  max-width: 100%;
  padding: 0 10px;
}

If we try resizing the browser window now, we should see the image size adjust automatically. At this stage, the change is not permanent. To fix this, log in to WordPress' admin area, go to Appearance | Editor, and add the adjusted styles at the foot of the Stylesheet (style.css) file. Let's move on.

Did anyone notice two rather critical issues with the approach used here? Hopefully, you spotted that if a large image is used and then resized to a smaller size, we're still working with large files, and that even though our alteration is small, it has a big impact on the theme. Even though it proves that we can make images truly responsive, it is the kind of change that we would not want to make without careful consideration and plenty of testing. We can improve on this. Making changes directly to the CSS style sheet is not ideal, either; they could be lost when upgrading to a newer version of the theme.
We can improve on this by either using a custom CSS plugin to manage these changes or (better) using a plugin that tells WordPress to automatically swap an existing image for a smaller one when we resize the window.

Using plugins to add responsive images

A drawback of using a theme such as Twenty Fourteen is the resizing of images. While we can grow or shrink an image when resizing the browser window, we are still technically altering the size of what could be an unnecessarily large image! This is considered bad practice (and also bad manners!)—browsing on a desktop with a fast Internet connection might hide the impact, but the same cannot be said for mobile devices, where we have less bandwidth to spare. To overcome this, we need to take a different approach: get WordPress to automatically swap in smaller images when we reach a particular size or breakpoint. Instead of doing this manually in code, we can take advantage of one of the many plugins available that offer responsive capabilities in some format. I feel a demo coming on—now's a good time to take a look at one such plugin in action:

1. Let's start by downloading our plugin. For this exercise, we'll use the PictureFill.WP plugin by Kyle Ricks, which is available at https://wordpress.org/plugins/picturefillwp/. We're going to use the version that uses Picturefill.js version 2, which is available to download from https://github.com/kylereicks/picturefill.js.wp/tree/master. Click on Download ZIP to get the latest version.
2. Log in to the admin area of your WordPress installation and click on Settings and then Media. Make sure your image settings for Thumbnail, Medium, and Large sizes are set to values that work with useful breakpoints in your design.
3. We then need to install the plugin. In the admin area, go to Plugins | Add New to install the plugin, and activate it in the normal manner.
At this point, we will have installed responsive capabilities in WordPress—everything is managed automatically by the plugin; there is no need to change any settings (except perhaps the image sizes we talked about in step 2).

4. Switch back to your WordPress frontend and try resizing the screen to a smaller size.
5. Press F12 to activate Firebug and switch to the HTML tab.
6. Press Ctrl + Shift + C (or Cmd + Shift + C for Mac users) to toggle the element inspector, then move your mouse over your resized image. If we've set the right image sizes in WordPress' admin area and the window is resized correctly, we can expect to see something like the following screenshot:
7. To confirm we are indeed using a smaller image, right-click on the image and select View Image Info; it will display something akin to the following screenshot:

We should now have a fully functioning plugin within our WordPress installation. A good tip is to test this thoroughly, if only to ensure we've set the right sizes for our breakpoints in WordPress!

What happens if WordPress doesn't refresh my thumbnail images properly? This can happen. If you find it happening to you, install the Regenerate Thumbnails plugin to resolve the issue; it's available at https://wordpress.org/plugins/regenerate-thumbnails/.

Adding responsive videos using plugins

Now that we can add responsive images to WordPress, let's turn our attention to videos. The process of adding them is a little more complex; we need to use code to achieve the best effect. Let's examine our options.

If you are hosting your own videos, the simplest way is to add some additional CSS style rules. Although this removes any reliance on JavaScript or jQuery, the result isn't perfect and will need additional styles to handle the repositioning of the play button overlay. Although we are working locally, we should remember the note from earlier in this article: changes to the CSS style sheet may be lost when upgrading.
A custom CSS plugin should be used, if possible, to retain any changes. The CSS-only solution requires only a couple of steps:

1. Browse to your WordPress theme folder and open a copy of styles.css in your text editor of choice.
2. Add the following lines at the end of the file and save it:

video {
  width: 100%;
  height: 100%;
  max-width: 100%;
}

.wp-video {
  width: 100% !important;
}

.wp-video-shortcode {
  width: 100% !important;
}

3. Close the file. You now have the basics in place for responsive videos.

At this stage, you're probably thinking, "Great, my videos are now responsive. I can handle the repositioning of the play button overlay myself, no problem"; sounds about right? Thought so—and therein lies the main drawback of this method! Repositioning the overlay shouldn't be too difficult. The real problem is the high cost of the hardware and bandwidth needed to host videos of any reasonable quality; even if we were to spend time repositioning the overlay, the high costs would outweigh any benefit of using a CSS-only solution.

A far better option is to let a service such as YouTube do all the hard work for you and simply embed your chosen video directly from YouTube into your pages. The main benefit of this is that YouTube's servers do all the hard work: you can take advantage of an increased audience, and YouTube will automatically optimize the video for the best resolution possible for the Internet connection used by each visitor. Although aimed at beginners, wpbeginner.com has a useful article, located at http://www.wpbeginner.com/beginners-guide/why-you-should-never-upload-a-video-to-wordpress/, on the pros and cons of self-hosting videos and why using an external service is preferable.

Using plugins to embed videos

Embedding videos from an external service into WordPress is, ironically, far simpler than using the CSS method.
There are dozens of plugins available to achieve this, but one of the simplest to use (and my personal favorite) is FluidVids, by Todd Motto, available at http://github.com/toddmotto/fluidvids/. To get it working in WordPress, we need to follow these steps, using a video from YouTube as the basis for our example:

1. Browse to your WordPress theme folder and open a copy of functions.php in your usual text editor.
2. At the bottom, add the following lines:

add_action( 'wp_enqueue_scripts', 'add_fluidvid' );

function add_fluidvid() {
  wp_enqueue_script(
    'fluidvids',
    get_stylesheet_directory_uri() . '/lib/js/fluidvids.js',
    array(),
    false,
    true
  );
}

3. Save the file, then log in to the admin area of your WordPress installation. Navigate to Posts | Add New to add a post, switch to the Text tab of your post editor, then add http://www.youtube.com/watch?v=Vpg9yizPP_g&hd=1 to the editor on the page.
4. Click on Update to save your post, then click on View post to see the video in action.

There is no need to configure WordPress any further—any video added from services such as YouTube or Vimeo will automatically be made responsive by the FluidVids plugin. At this point, try resizing the browser window. If all is well, we should see the video shrink or grow in size, depending on how the browser window has been resized. To prove that the code is working, we can take a peek at the compiled results within Firebug; we will see something akin to the following screenshot:

For those of us who are not feeling quite so brave (!), there is fortunately a WordPress plugin available that will achieve the same results without configuration. It's available at https://wordpress.org/plugins/fluidvids/ and can be downloaded and installed using the normal process for WordPress plugins.

Let's change track and move on to our next demo. I feel the need to get stuck into some coding, so let's take a look at how we can implement responsive images in frameworks such as Bootstrap.
Implementing responsive media in Bootstrap

A question—as developers, hands up if you have not heard of Bootstrap? Good—not too many hands going up, I suspect! Why have I asked this question, I hear you say? Easy—it's to illustrate that in popular frameworks (such as Bootstrap), it is easy to add basic responsive capabilities to media, such as images or video. The exact process may differ from framework to framework, but the result is likely to be very similar. To see what I mean, let's take a look at using Bootstrap for our second demo, where we'll see just how easy it is to add images and video to our Bootstrap-enabled site. If you would like to explore using some of the free Bootstrap templates that are available, then http://www.startbootstrap.com/ is well worth a visit!

Using Bootstrap's CSS classes

Making images and videos responsive in Bootstrap uses a slightly different approach from what we've examined so far; this is only because we don't have to define each style property explicitly, but instead simply add the appropriate class to the media HTML for it to render responsively. For the purposes of this demo, we'll use an edited version of the Blog Page example, available at http://www.getbootstrap.com/getting-started/#examples; a copy of the edited version is available in the code download that accompanies this article. Before we begin, go ahead and download a copy of the Bootstrap Example folder that is in the code download. Inside, you'll find the CSS, image, and JavaScript files needed, along with our HTML markup file. Now that we have our files, the following is a screenshot of what we're going to achieve over the course of our demo: Let's make a start on our example. Open up bootstrap.html and look for the following lines (around lines 34 to 35):

   <p class="blog-post-meta">January 1, 2014 by <a href="#">Mark</a></p>
   <p>This blog post shows a few different types of content that's supported and styled with Bootstrap.
Basic typography, images, and code are all supported.</p>

Immediately below, add the following code—this contains the markup for our embedded video, using Bootstrap's responsive CSS styling:

   <div class="bs-example">
     <div class="embed-responsive embed-responsive-16by9">
       <iframe allowfullscreen="" src="http://www.youtube.com/embed/zpOULjyy-n8?rel=0" class="embed-responsive-item"></iframe>
     </div>
   </div>

With the video now styled, let's go ahead and add in an image—this will go in the About section on the right. Look for these lines, on or around lines 74 and 75:

   <h4>About</h4>
   <p>Etiam porta <em>sem malesuada magna</em> mollis euismod. Cras mattis consectetur purus sit amet fermentum. Aenean lacinia bibendum nulla sed consectetur.</p>

Immediately below, add in the following markup for our image:

   <a href="#" class="thumbnail">
     <img src="http://placehold.it/350x150" class="img-responsive">
   </a>

Save the file and preview the results in a browser. If all is well, we can see our video and image appear, as shown at the start of our demo. At this point, try resizing the browser—you should see the video and placeholder image shrink or grow as the window is resized. The great thing about Bootstrap is that the right styles have already been set for each class. All we need to do is apply the correct class to the appropriate media file—.embed-responsive embed-responsive-16by9 for videos or .img-responsive for images—for that image or video to behave responsively within our site. In this example, we used Bootstrap's .img-responsive class in the code; if we have a lot of images, we could consider using img { max-width: 100%; height: auto; } instead. So far, we've worked with two popular examples of frameworks in the form of WordPress and Bootstrap. This is great, but it can mean getting stuck into a lot of CSS styling, particularly if we're working with media queries, as we saw earlier in the article! Can we do anything about this? Absolutely!
It's time for a brief look at CSS preprocessing and how this can help with adding responsive media to our pages.

Using Less CSS to create responsive content

Working with frameworks often means getting stuck into a lot of CSS styling; this can become awkward to manage if we're not careful! To help with this, and for our third scenario, we're going back to basics to work on an alternative way of rendering CSS using the Less CSS preprocessing language. Why? Well, as a superset (or extension) of CSS, Less allows us to write our styles more efficiently; it then compiles them into valid CSS. The aim of this example is to show that if you're already using Less, then we can still apply the same principles that we've covered throughout this article to make our content responsive. It should be noted that this exercise does assume a certain level of prior experience using Less; if this is the first time, you may like to peruse my article, Learning Less, by Packt Publishing. There will be a few steps involved in making the changes, so the following screenshot gives a heads-up on what it will look like once we've finished: You might be thinking that it looks no different from before—and you would be right. If we play our cards right, there should indeed be no change in appearance; working with Less is all about writing CSS more efficiently. Let's see what is involved: We'll start by extracting copies of the Less CSS example from the code download that accompanies this article—inside it, we'll find our HTML markup, reset style sheet, images, and video needed for our demo. Save the folder locally to your PC.
Next, add the following styles in a new file, saving it as responsive.less in the css subfolder—we'll start with some of the styling for the base elements, such as the video and banner:

   #wrapper { width: 96%; max-width: 45rem; margin: auto; padding: 2% }
   #main { width: 60%; margin-right: 5%; float: left }
   #video-wrapper video { max-width: 100%; }
   #banner { background-image: url('../img/abstract-banner-large.jpg'); height: 15.31rem; width: 45.5rem; max-width: 100%; float: left; margin-bottom: 15px; }
   #skipTo { display: none; li { background: #197a8a }; }
   p { font-family: "Droid Sans", sans-serif; }
   aside { width: 35%; float: right; }
   footer { border-top: 1px solid #ccc; clear: both; height: 30px; padding-top: 5px; }

We need to add some basic formatting styles for images and links, so go ahead and add the following, immediately below the #skipTo rule:

   a { text-decoration: none; text-transform: uppercase }
   a, img { border: medium none; color: #000; font-weight: bold; outline: medium none; }

Next up comes the navigation for our page. These styles control the main navigation and the Skip To… link that appears when viewed on smaller devices. Go ahead and add these style rules immediately below the rules for a and img:

   header {
     font-family: 'Droid Sans', sans-serif;
     h1 { height: 70px; float: left; display: block; font-weight: 700; font-size: 2rem; }
     nav {
       float: right; margin-top: 40px; height: 22px; border-radius: 4px;
       li { display: inline; margin-left: 15px; }
       ul { font-weight: 400; font-size: 1.1rem; }
       a {
         padding: 5px 5px 5px 5px;
         &:hover { background-color: #27a7bd; color: #fff; border-radius: 4px; }
       }
     }
   }

We need to add the media query that controls the display for smaller devices, so go ahead and add the following to a new file and save it as media.less in the css subfolder.
We'll start by setting the screen size for our media query:

   @smallscreen: ~"screen and (max-width: 30rem)";

   @media @smallscreen {
     p { font-family: "Droid Sans", sans-serif; }
     #main, aside { margin: 0 0 10px; width: 100%; }
     #banner { margin-top: 150px; height: 4.85rem; max-width: 100%; background-image: url('../img/abstract-banner-medium.jpg'); width: 45.5rem; }

Next up comes the media query rule that will handle the Skip To… link at the top of our resized window:

     #skipTo {
       display: block; height: 18px;
       a {
         display: block; text-align: center; color: #fff; font-size: 0.8rem;
         &:hover { background-color: #27a7bd; border-radius: 0; height: 20px }
       }
     }

We can't forget the main navigation, so go ahead and add the following lines of code immediately below the block for #skipTo:

     header {
       h1 { margin-top: 20px }
       nav {
         float: left; clear: left; margin: 0 0 10px; width: 100%;
         li { margin: 0; background: #efefef; display: block; margin-bottom: 3px; height: 40px; }
         a {
           display: block; padding: 10px; text-align: center; color: #000;
           &:hover { background-color: #27a7bd; border-radius: 0; padding: 10px; height: 20px; }
         }
       }
     }
   }

At this point, we should compile the Less style sheet before previewing the results of our work. If we launch responsive.html in a browser, we'll see our mocked-up portfolio page appear as we saw at the beginning of the exercise. If we resize the screen to its minimum width, its responsive design kicks in to reorder and resize elements on screen, as we would expect to see. Okay, so we now have a responsive page that uses Less CSS for styling; it still seems like a lot of code, right?

Working through the code in detail

Although this seems like a lot of code for a simple page, the principles we've used are in fact very simple and are the ones we already used earlier in the article. Not convinced?
Well, let's look at it in more detail—the focus of this article is on responsive images and video, so we'll start with video. Open the responsive.css style sheet and look for the #video-wrapper video class:

   #video-wrapper video { max-width: 100%; }

Notice how it's set to a max-width value of 100%? Granted, we don't want to resize a large video to a really small size—we would use a media query to replace it with a smaller version. But, for most purposes, max-width should be sufficient. Now, for the image, this is a little more complicated. Let's start with the code from responsive.less:

   #banner { background-image: url('../img/abstract-banner-large.jpg'); height: 15.31rem; width: 45.5rem; max-width: 100%; float: left; margin-bottom: 15px; }

Here, we used the max-width value again. In both instances, we can style the element directly, unlike videos, where we have to add a container in order to style it. The theme continues in the media query setup in media.less:

   @smallscreen: ~"screen and (max-width: 30rem)";
   @media @smallscreen {
     ...
     #banner { margin-top: 150px; background-image: url('../img/abstract-banner-medium.jpg'); height: 4.85rem; width: 45.5rem; max-width: 100%; }
     ...
   }

In this instance, we're styling the element to cover the width of the viewport. A small point of note: you might ask why we are using rem values instead of percentage values when styling our image? This is a good question—the key to it is that pixel values do not scale well in responsive designs, whereas rem values scale beautifully. We could use percentage values if we're so inclined, although they are best suited to instances where we need to fill a container that only covers part of the screen (as we did with the video for this demo). An interesting article extolling the virtues of why we should use rem units is available at http://techtime.getharvest.com/blog/in-defense-of-rem-units - it's worth a read.
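When translating pixel measurements from a mock-up into rem values, the arithmetic is simply pixels divided by the root font size. A small helper, assuming the browser default root font size of 16px (the function name is invented for illustration):

```javascript
// Convert a pixel measurement to a rem string, assuming a root font
// size of 16px unless told otherwise. Rounding to four decimal
// places keeps the resulting style sheet readable.
function pxToRem(px, rootFontSize) {
  rootFontSize = rootFontSize || 16;
  return Math.round((px / rootFontSize) * 10000) / 10000 + 'rem';
}
```

For example, the banner height of roughly 245px works out to 15.3125rem, which matches the 15.31rem used in the demo's style sheet (rounded to two places).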
Of particular note is a known bug with using rem values in Mobile Safari, which should be considered when developing for mobile platforms; given the number of iPhones in use, Mobile Safari's usage could be said to be higher than Firefox's! For more details, head over to http://wtfhtmlcss.com/#rems-mobile-safari.

Transferring to production use

Throughout this exercise, we used Less to compile our styles on the fly each time. This is okay for development purposes, but is not recommended for production use. Once we've worked out the requisite styles needed for our site, we should always look to precompile them into valid CSS before uploading the results to our site. There are a number of options available for this purpose; two of my personal favorites are Crunch! available at http://www.crunchapp.net and the Less2CSS plugin for Sublime Text available at https://github.com/timdouglas/sublime-less2css. You can learn more about precompiling Less code from my new article, Learning Less.js, by Packt Publishing.

Summary

Wow! We've certainly covered a lot; it shows that adding basic responsive capabilities to media need not be difficult. Let's take a moment to recap what you learned. We kicked off this article with an introduction to three real-world scenarios that we would then cover. Our first scenario looked at using WordPress. We covered how, although we can add simple CSS styling to make images and videos responsive, the preferred method is to use one of the several plugins available to achieve the same result. Our next scenario visited the all-too-familiar framework known as Twitter Bootstrap. In comparison, we saw that this is a much easier framework to work with, in that styles have been predefined and all we needed to do was add the right class to the right selector. Our third and final scenario went completely the opposite way, with a look at using the Less CSS preprocessor to handle the styles that we would otherwise have manually created.
We saw how easy it was to rework the styles we originally created earlier in the article to produce a more concise and efficient version that compiled into valid CSS with no apparent change in design. Well, we've now reached the end of the book; all good things must come to an end at some point! Nonetheless, I hope you've enjoyed reading the book as much as I have writing it. Hopefully, I've shown that adding responsive media to your sites need not be as complicated as it might first look and that it gives you a good grounding to develop something more complex using responsive media. Resources for Article: Further resources on this subject: Styling the Forms [article] CSS3 Animation [article] Responsive image sliders [article]
Analyzing Data

Packt
24 Dec 2014
13 min read
In this article by Amarpreet Singh Bassan and Debarchan Sarkar, authors of Mastering SQL Server 2014 Data Mining, we will begin our discussion with an introduction to the data mining life cycle, and this article will focus on its first three stages. You are expected to have a basic understanding of the Microsoft business intelligence stack and familiarity with terms such as extract, transform, and load (ETL), data warehouse, and so on. (For more resources related to this topic, see here.)

Data mining life cycle

Before going into further details, it is important to understand the various stages of the data mining life cycle, which can be broadly classified into the following steps:

1. Understanding the business requirement.
2. Understanding the data.
3. Preparing the data for the analysis.
4. Preparing the data mining models.
5. Evaluating the results of the analysis prepared with the models.
6. Deploying the models to the SQL Server Analysis Services server.
7. Repeating steps 1 to 6 in case the business requirement changes.

Let's look at each of these stages in detail. The first and foremost task that needs to be well defined even before beginning the mining process is to identify the goals. This is a crucial part of the data mining exercise, and you need to understand the following questions:

- What and whom are we targeting?
- What is the outcome we are targeting?
- What is the time frame for which we have the data, and what is the target time period that our data is going to forecast?
- What would the success measures look like?

Let's define a classic problem and understand more about the preceding questions. We can use them to discuss how to extract the information rather than spending our time on defining the schema. Consider an instance where you are a salesman for the AdventureWorks Cycle company, and you need to make predictions that could be used in marketing the products.
The problem sounds simple and straightforward, but any serious data miner would immediately come up with many questions. Why? The answer lies in the exactness of the information being searched for. Let's discuss this in detail. The problem statement comprises the words predictions and marketing. When we talk about predictions, there are several insights that we seek, namely:

- What is it that we are predicting? (for example: customers, product sales, and so on)
- What is the time period of the data that we are selecting for prediction?
- What time period are we going to have the prediction for?
- What is the expected outcome of the prediction exercise?

From the marketing point of view, several follow-up questions that must be answered are as follows:

- What is our target for marketing, a new product or an older product?
- Is our marketing strategy product centric or customer centric?
- Are we going to market our product irrespective of the customer classification, or are we marketing our product according to customer classification?
- On what timeline in the past is our marketing going to be based?

We might observe that there are many questions that overlap the two categories and therefore, there is an opportunity to consolidate the questions and classify them as follows:

- What is the population that we are targeting?
- What are the factors that we will actually be looking at?
- What is the time period of the past data that we will be looking at?
- What is the time period in the future that we will be considering the data mining results for?

Let's throw some light on these aspects based on the AdventureWorks example. We will get answers to the preceding questions and arrive at a more refined problem statement.

What is the population that we are targeting? The target population might be classified according to the following aspects:

- Age
- Salary
- Number of kids

What are the factors that we are actually looking at?
They might be classified as follows:

- Geographical location: People living in hilly areas would prefer All Terrain Bikes (ATB), while the population on the plains would prefer daily commute bikes.
- Household: People living in posh areas would look for bikes with the latest gears and accessories that are state of the art, whereas people in suburban areas would mostly look for budgetary bikes.
- Affinity of components: People who tend to buy bikes would also buy some accessories.

What is the time period of the past data that we would be looking at? Usually, the data that we get is quite huge and often consists of information that we might very adequately label as noise. In order to sieve effective information, we will have to determine exactly how far into the past we should look; for example, we can look at the data for the past year, past two years, or past five years. We also need to decide the future time period that we will consider the data mining results for. We might be looking at predicting our market strategy for an upcoming festive season or throughout the year. We need to be aware that market trends change, and so do people's needs and requirements. So we need to keep a time frame to refresh our findings to an optimal; for example, the predictions from the past 5 years' data can be valid for the upcoming 2 or 3 years, depending upon the results that we get. Now that we have taken a closer look at the problem, let's redefine it more accurately. AdventureWorks has several stores in various locations, and based on the location, we would like to get an insight into the following:

- Which products should be stocked where?
- Which products should be stocked together?
- How much of each product should be stocked?
- What is the trend of sales for a new product in an area?
It is not necessary that we will get answers to all the detailed questions, but even if we keep looking for the answers to these questions, there will be several insights that we gain, which will help us make better business decisions.

Staging data

In this phase, we collect data from all the sources and dump them into a common repository, which can be any database system such as SQL Server, Oracle, and so on. Usually, an organization might have various applications to keep track of the data from various departments, and it is quite possible that all these applications might use a different database system to store the data. Thus, the staging phase is characterized by dumping the data from all the other data storage systems into a centralized repository.

Extract, transform, and load

This term is most common when we talk about a data warehouse. As the name makes clear, ETL has the following three parts:

- Extract: The data is extracted from different source databases that might contain the information that we seek
- Transform: Some transformation is applied to the data to fit the operational needs, such as cleaning, calculation, removing duplicates, reformatting, and so on
- Load: The transformed data is loaded into the destination data store database

We usually believe that ETL is only required until we load the data into the data warehouse, but this is not true. ETL can be used anywhere that we feel the need to do some transformation of data, as shown in the following figure:

Data warehouse

As evident from the preceding figure, the next stage is the data warehouse. The AdventureWorksDW database is the outcome of the ETL applied to the staging database, which is AdventureWorks. We will now discuss the concepts of data warehousing and some best practices, and then relate these concepts to the AdventureWorksDW database.

Measures and dimensions

There are a few common terminologies you will encounter as you enter the world of data warehousing.
They are as follows:

- Measure: Any business entity that can be aggregated, or whose values can be ascertained in a numerical value, is termed a measure, for example, sales, number of products, and so on
- Dimension: Any business entity that lends some meaning to the measures; for example, in an organization, the quantity of goods sold is a measure, but the month is a dimension

Schema

A schema, basically, determines the relationship of the various entities with each other. There are essentially two types of schema, namely:

- Star schema: This is a relationship where the measures have a direct relationship with the dimensions. Let's look at an instance wherein a seller has several stores that sell several products. The relationship of the tables based on the star schema will be as shown in the following screenshot:
- Snowflake schema: This is a relationship wherein the measures may have a direct and indirect relationship with the dimensions. We will design a snowflake schema if we want a more detailed drill down of the data. A snowflake schema usually involves hierarchies, as shown in the following screenshot:

Data mart

While a data warehouse is an organization-wide repository of data, extracting data from such a huge repository might well be an uphill task. We segregate the data according to the department or the specialty that the data belongs to, so that we have much smaller sections of the data to work with and extract information from. We call these smaller data warehouses data marts. Let's consider the sales for AdventureWorks Cycles. To make any predictions on the sales of AdventureWorks, we will have to group all the tables associated with the sales together in a data mart. Based on the AdventureWorks database, we have the following table in the AdventureWorks sales data mart.
The Internet sales facts table has the following columns:

[ProductKey], [OrderDateKey], [DueDateKey], [ShipDateKey], [CustomerKey], [PromotionKey], [CurrencyKey], [SalesTerritoryKey], [SalesOrderNumber], [SalesOrderLineNumber], [RevisionNumber], [OrderQuantity], [UnitPrice], [ExtendedAmount], [UnitPriceDiscountPct], [DiscountAmount], [ProductStandardCost], [TotalProductCost], [SalesAmount], [TaxAmt], [Freight], [CarrierTrackingNumber], [CustomerPONumber], [OrderDate], [DueDate], [ShipDate]

From the preceding columns, we can easily identify that if we need to separate the tables to perform the sales analysis alone, we can safely include the following:

- Product: This provides the [ProductKey] and [ListPrice] data
- Date: This provides the [DateKey] data
- Customer: This provides the [CustomerKey] data
- Currency: This provides the [CurrencyKey] data
- Sales territory: This provides the [SalesTerritoryKey] data

The preceding data provides the relevant dimensions, and the facts are already contained in the FactInternetSales table; hence, we can easily perform all the analysis pertaining to the sales of the organization.

Refreshing data

Based on the nature of the business and the requirements of the analysis, data can be refreshed either in parts, wherein new or incremental data is added to the tables, or entirely, wherein the tables are cleaned and filled with new data consisting of both the old and new data. Let's discuss the preceding points in the context of the AdventureWorks database. We will take the employee table to begin with.
The following is the list of columns in the employee table:

[BusinessEntityID], [NationalIDNumber], [LoginID], [OrganizationNode], [OrganizationLevel], [JobTitle], [BirthDate], [MaritalStatus], [Gender], [HireDate], [SalariedFlag], [VacationHours], [SickLeaveHours], [CurrentFlag], [rowguid], [ModifiedDate]

Considering an organization in the real world, we do not have a large number of employees leaving and joining the organization, so it will not really make sense to have a procedure in place to reload the dimensions, prior to SQL 2008. When it comes to managing changes in the dimension tables, Slowly Changing Dimensions (SCD) are worth a mention. We will briefly look at SCD here. There are three types of SCD, namely:

- Type 1: The older values are overwritten by new values
- Type 2: A new row specifying the present value for the dimension is inserted
- Type 3: The column specifying the timestamp from which the new value is effective is updated

Let's take the example of HireDate as a method of keeping track of incremental loading. We will also have to maintain a small table that keeps track of the data that is loaded from the employee table, so we create a table as follows:

   Create table employee_load_status(
     HireDate DateTime,
     LoadStatus varchar(10)
   );

The following script will load the employee table from the AdventureWorks database to the DimEmployee table in the AdventureWorksDW database:

   With employee_loaded_date(HireDate) as
   (
     select ISNULL(Max(HireDate), CONVERT(datetime, '1900-01-01'))
     from employee_load_status where LoadStatus = 'success'
     Union All
     select ISNULL(Min(HireDate), CONVERT(datetime, '1900-01-01'))
     from employee_load_status where LoadStatus = 'failed'
   )
   Insert into DimEmployee
   select * from employee
   where HireDate >= (select Min(HireDate) from employee_loaded_date);

This will reload all the data from the date of the first failure until the present day. A similar procedure can be followed to load the fact table, but there is a catch.
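To make the control flow of the preceding script easier to follow, here is the same incremental-load idea sketched in plain JavaScript: find the earliest date that still needs (re)loading, then take every employee hired on or after it. The data shapes are invented for the illustration, and the sketch resumes from the last success when there are no failed loads, which is the intent of the SQL above:

```javascript
// Decide where the next load should start: redo from the first
// failure if one exists, otherwise resume from the last success,
// otherwise fall back to the epoch date used in the SQL script.
function nextLoadFrom(loadStatus) {
  const ok = loadStatus
    .filter(s => s.status === 'success')
    .map(s => s.hireDate)
    .sort();
  const failed = loadStatus
    .filter(s => s.status === 'failed')
    .map(s => s.hireDate)
    .sort();
  if (failed.length) return failed[0];       // redo from first failure
  if (ok.length) return ok[ok.length - 1];   // resume from last success
  return '1900-01-01';                       // nothing loaded yet
}

// Select only the employee rows that the next load should pick up.
function incrementalLoad(employees, loadStatus) {
  const from = nextLoadFrom(loadStatus);
  return employees.filter(e => e.hireDate >= from);
}
```

ISO-formatted date strings compare correctly as plain strings, which keeps the sketch free of date-parsing details.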
If we look at the sales table in the AdventureWorks database, we see the following columns:

[BusinessEntityID], [TerritoryID], [SalesQuota], [Bonus], [CommissionPct], [SalesYTD], [SalesLastYear], [rowguid], [ModifiedDate]

The SalesYTD column might change with every passing day, so do we perform a full load every day, or do we perform an incremental load based on date? This will depend upon the procedure used to load the data in the sales table and the ModifiedDate column. Assuming the ModifiedDate column reflects the date on which the load was performed, we also see that there is no table in AdventureWorksDW that will use the SalesYTD field directly. We will have to apply some transformation to get the values of OrderQuantity, DateOfShipment, and so on. Let's look at this with a simpler example. Consider we have the following sales table:

   Name     SalesAmount   Date
   Rama     1000          11-02-2014
   Shyama   2000          11-02-2014

Consider we have the following fact table:

   id   SalesAmount   Datekey

We will have to think about whether to apply an incremental load or a complete reload of the table, based on our end needs. The entries for the incremental load will look like this:

   id   SalesAmount   Datekey
   Ra   1000          11-02-2014
   Sh   2000          11-02-2014
   Ra   4000          12-02-2014
   Sh   5000          13-02-2014

A complete reload will appear as shown here:

   id   TotalSalesAmount   Datekey
   Ra   5000               12-02-2014
   Sh   7000               13-02-2014

Notice how the SalesAmount column changes to TotalSalesAmount depending on the load criteria.

Summary

In this article, we've covered the first three steps of any data mining process. We've considered the reasons why we would want to undertake a data mining activity and identified the goal we have in mind. We then looked to stage the data and cleanse it.

Resources for Article: Further resources on this subject: Hadoop and SQL [Article] SQL Server Analysis Services – Administering and Monitoring Analysis Services [Article] SQL Server Integration Services (SSIS) [Article]
Building the Middle-Tier

Packt
23 Dec 2014
34 min read
In this article by Kerri Shotts, the author of the book PhoneGap for Enterprise, we cover how to build a web server that bridges the gap between our database backend and our mobile application. If you browse any Cordova/PhoneGap forum, you'll often come across posts asking how to connect to and query a backend database. In this article, we will look at the reasons why it is necessary to interact with your backend database using an intermediary service. If the business logic resides within the database, the middle-tier might be a very simple layer wrapping the data store, but it can also implement a significant portion of business logic as well. The middle-tier also usually handles session authentication logic. Although many enterprise projects will already have a middle-tier in place, it's useful to understand how a middle-tier works, and how to implement one if you ever need to build a solution from the ground up. In this article, we'll focus heavily on these topics:

- Typical middle-tier architecture
- Designing a RESTful-like API
- Implementing a RESTful-like hypermedia API using Node.js
- Connecting to the backend database
- Executing queries
- Handling authentication using Passport
- Building API handlers

You are welcome to implement your middle-tier using any technology with which you are comfortable. The topics that we will cover in this article can be applied to any middle-tier platform.

Middle-tier architecture

It's tempting, especially for simple applications, to want to connect your mobile app directly to your data store. This is an incredibly bad idea: it leaves your data store vulnerable and exposed to attacks from the outside world (unless you require the user to log in to a VPN). It also means that your mobile app has a lot of code dedicated solely to querying your data store, which makes for a tightly coupled environment.
If you ever want to change your database platform or modify the table structures, you will need to update the app, and any app that wasn't updated will stop working. Furthermore, if you want another system to access the data, for example, a reporting solution, you will need to repeat the same queries and logic already implemented in your app in order to ensure consistency. For these reasons alone, it's a bad idea to directly connect your mobile app to your backend database. However, there's one more good reason: Cordova has no nonlocal database drivers whatsoever. Although it's not unusual for a desktop application to make a direct connection to your database on an internal network, Cordova has no facility to load a database driver to interface directly with an Oracle or MySQL database. This means that you must build an intermediary service to bridge the gap from your database backend to your mobile app. No middle-tier is exactly the same, but for web and mobile apps, this intermediary service—also called an application server—is typically a relatively simple web server. This server accepts incoming requests from a client (our mobile app or a website), processes them, and returns the appropriate results. In order to do so, the web server parses these requests using a variety of middleware (security, session handling, cookie handling, request parsing, and so on) and then executes the appropriate request handler for the request. This handler then needs to pass this request on to the business logic handler, which, in our case, lives on the database server. The business logic will determine how to react to the request and returns the appropriate data to the request handler. The request handler transforms this data into something usable by the client, for example, JSON or XML, and returns it to the client. The middle-tier provides an Application Programming Interface (API). 
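The request flow just described can be sketched in a few lines of plain JavaScript. This is a deliberately stripped-down illustration, not production Node.js code: the route key, handler shape, and business-logic interface are all invented for the example, and a real server would use proper middleware for parsing, sessions, and security:

```javascript
// Map of "METHOD route" keys to request handlers. Each handler
// checks authentication, delegates to the business-logic tier, and
// shapes the result as a JSON-friendly response object.
const handlers = {
  'GET /task/comments': function (req, logic) {
    if (!req.authenticated) {
      return { status: 401, body: { error: 'not authorized' } };
    }
    // Delegate to the lower tier (the database, when business logic
    // lives there) and transform the result for the client.
    const comments = logic.getTaskComments(req.params.taskId);
    return { status: 200, body: { taskId: req.params.taskId, comments: comments } };
  }
};

// Find and run the handler for a parsed request, or report an
// unknown resource.
function dispatch(req, logic) {
  const handler = handlers[req.method + ' ' + req.route];
  if (!handler) return { status: 404, body: { error: 'unknown resource' } };
  return handler(req, logic);
}

// A mock business-logic tier standing in for the database server.
const mockLogic = {
  getTaskComments: function (taskId) {
    return ['Looks good to me', 'Please add a test'];
  }
};
```

The key point is the separation of duties: the handler knows nothing about how comments are stored, and the business-logic tier knows nothing about HTTP.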
Beyond authentication and session handling, the middle-tier provides a set of reusable components that perform specific tasks by delegating these tasks to lower tiers. As an example, one of the components of our Tasker app is named get-task-comments. Provided the user is properly authenticated, the component will request a specific task from the business logic and return the attached comments. Our mobile app (or any other consumer) only needs to know how to call get-task-comments. This decouples the client from the database and ensures that we aren't unnecessarily repeating code. The flow of request and response looks a lot like the following figure:

Designing a RESTful-like API

A mobile app interfaces with your business logic and data store via an API provided by the application server middle-tier. Exactly how this API is implemented and how the client uses it is up to the developers of the system. In the past, this has often meant using web services (over HTTP) with information interchange via Simple Object Access Protocol (SOAP). Recently, RESTful APIs have become the norm when working with web and mobile applications. These APIs conform to the following constraints:

- Client/Server: Clients are not concerned with how data is stored (that's the server's job), and servers are not concerned with state (that's the client's job). They should be able to be developed and/or replaced completely independently of each other (low coupling) as long as the API remains the same.
- Stateless: Each request should have the necessary information contained within it so that the server can properly handle the request. The server isn't concerned about session state; this is the sole domain of the client.
- Cacheable: Responses must specify if they can be cached or not. Proper management of this can greatly improve performance and scalability.
- Layered: The client shouldn't be able to tell if there are any intermediary servers between it and the server. This ensures that additional servers can be inserted into the chain to provide caching, security, load balancing, and so on.
- Code-on-demand: This is an optional constraint. The server can send the necessary code to handle the response to the client. For a mobile PhoneGap app, this might involve sending a small snippet of JavaScript, for example, to handle how to display and interact with a Facebook post.
- Uniform Interface: Resources are identified by a Uniform Resource Identifier (URI); for example, https://pge-as.example.com/task/21 refers to the task with an identifier of 21. These resources can be expressed in any number of formats to facilitate data interchange. Furthermore, when the client has the resource (in whatever representation it is provided), the client should also have enough information to manipulate the resource. Finally, the representation should indicate valid state transitions by providing links that the client can use to navigate the state tree of the system.

There are many good web APIs in production, but often they fail to address the last constraint very well. They might represent resources using URIs, but typically the client is expected to know all the endpoints of the API and how to transition between them without the server telling the client how to do so. This means that the client is tightly coupled to the API. If the URIs or the API change, then the client breaks. RESTful APIs should instead provide all the valid state transitions with each response. This lets the client reduce its coupling by looking for specific actions rather than assuming that a specific URI request will work. Properly implemented, the underlying URIs could change and the client app would be unaffected. The only thing that needs to be constant is the entry URI to the API. There are many good examples of these kinds of APIs; PayPal's is quite good, as are many others.
The responses from these APIs always contain enough information for the client to advance to the next state in the chain. So, in the case of PayPal, a response will always contain enough information to advance to the next step of the monetary transaction. Because the response contains this information, the client only needs to look at the response rather than having the URI of the next step hardcoded. RESTful APIs aren't standardized; one API might provide links to the next state in one format, while another API might use a different format. That said, there are several attempts to create a standard response format; Collection+JSON is just one example. The lack of standardization in the response format isn't as bad as it sounds; the more important issue is that, as long as your app understands the response format, it can be decoupled from the URI structure of your API and its resources. The API becomes a list of methods with explicit transitions rather than a list of URIs alone. As long as the action names remain the same, the underlying URIs can be changed without affecting the client. This works well for most APIs where authorization is provided using an API key or an encoded token. For example, an API will often require authorization via OAuth 2.0. Your code asks for the proper authorization first, and upon each subsequent request, it passes an appropriate token that enables access to the requested resource. Where things become problematic, and why we're calling our API RESTful-like, is when it comes to end-user authentication. Whether the user of our mobile app recognizes it or not, they are an immediate consumer of our API. Because the data itself is protected based upon the roles and access of each particular user, users must authenticate themselves prior to accessing any data. When an end user is involved with authentication, the idea of sessions is inevitably required, largely for the end user's convenience.
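To make this concrete, here is a hypothetical sketch of what a hypermedia-style response for a single task might look like, and how a client could follow it by action name instead of by hardcoded URI. The response shape and the linkFor helper are assumptions for illustration, not the actual Tasker response format.

```javascript
// A hypothetical hypermedia response: the client looks up transitions by
// action name in _links instead of hardcoding URIs.
var taskResponse = {
  taskId: 21,
  title: "Write API docs",
  status: "in-progress",
  _links: {
    "self":              { href: "/task/21", verb: "GET" },
    "modify-task":       { href: "/task/21", verb: "PUT" },
    "get-task-comments": { href: "/task/21/comment", verb: "GET" }
  }
};

// The client finds the transition it wants by name; if the server later
// changes the underlying URI, this code keeps working unchanged.
function linkFor(response, action) {
  return response._links[action] ? response._links[action].href : null;
}

console.log(linkFor(taskResponse, "get-task-comments")); // "/task/21/comment"
```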
Some sessions can be incredibly short-lived (for example, many banks will terminate a session if no activity is seen for 10 minutes), while others can be long-lived, and some might even be effectively eternal until explicitly revoked by the user. Regardless of the session length, the fact that a session is present indicates that the server must often store some information about state. Even if this information applies only to the user's authentication and session validity, it still violates the second rule of RESTful APIs. Tasker's web API, then, is a RESTful-like API. In everything except session handling and authentication, our API is like any other RESTful API. However, when it comes to authentication, the server maintains some state in order to ensure that users are properly authenticated. In the case of Tasker, the maintained state is limited. Once a user authenticates, a unique single-use token and a Hash Message Authentication Code (HMAC) secret are generated and returned to the client. This token is expected to be sent with the next API request, and this request is expected to be signed with the HMAC secret. Upon completion of this API request, a new token is generated. Each token expires after a specified amount of time, or can be expired immediately by an explicit logout. Each token is stored in the backend, which means we violate the stateless rule. Our tokens are just a cryptographically random series of bytes, and because of this, there's nothing in the token that can be used to identify the user. This means we need to maintain the valid tokens and their user associations in the database. If the token contained user-identifiable information, we could technically avoid maintaining state, but this also means that the token could be forged if the attacker knew how tokens were constructed. A random token, on the other hand, means that there's no method of construction that can fool the server; the attacker will have to be very lucky to guess it right.
Since Tasker's tokens are continually expiring after a short period of time and are continually regenerated upon each request, guessing a token is that much more difficult. Of course, it's not impossible for an attacker to get lucky and guess the right token on the first try, but considering the amount of entropy in most usernames and passwords, it's more likely that the attacker could guess the user's password than they could guess the correct token. Because these tokens are managed by the backend, Tasker's API isn't truly stateless, and so it's not truly RESTful; hence the term RESTful-like. If you want to implement your API as a pure RESTful API, feel free. If your API is like that of many other APIs (such as Twitter, PayPal, Facebook, and so on), you'll probably want to do so. All this sounds well and good, but how should we go about designing and defining our API? Here's how I suggest going about it:

1. Identify the resources. In Tasker, the resources are people, tasks, and task comments. Essentially, these are the data models. (If you take security into account, Tasker also has user and role resources in addition to sessions.)
2. Define how the URI should represent the resource. For example, Bob Smith might be represented by /person/bob-smith or /person/29481. Query parameters are also acceptable: /person?administeredBy=john-doe will refer to the set of all individuals who have John Doe as their administrator. If it helps, think of each instance of a resource and each collection of these resources as web pages, each having their own URL.
3. Identify the actions that can be performed for each resource. For example, a task can be created and modified by the owner of the task. This task can be assigned to another user. A task's status and progress can be updated by both the owner and the assignee. With RESTful APIs, these actions are typically handled by using the HTTP verbs (also known as methods) GET, POST, PUT, and DELETE. Others can also be used, such as OPTIONS, PATCH, and so on. We'll cover in a moment how these usually line up against typical Create, Read, Update, Delete (CRUD) operations.
4. Identify the state transitions that are valid for resources. As an example, a client's first step might be to request a list of all tasks assigned to a particular user. As part of the response, it should be given URIs that indicate how the app should retrieve information about a particular task. Furthermore, within this single task's response, there should be information that tells the client how to modify the task.

Most APIs generally mirror the typical CRUD operations. The following is how the HTTP verbs line up against the familiar CRUD counterparts for a collection of items:

HTTP verb | CRUD operation | Description
GET       | READ           | This returns the collection of items in the desired format. Often can be filtered and sorted via query parameters.
POST      | CREATE         | This creates an item within the collection. The return result includes the URI for the new resource.
DELETE    | N/A            | This is not typically used at the collection level, unless one wants to remove the entire collection.
PUT       | N/A            | This is not typically used at the collection level, though it can be used to update/replace each item in the collection.

The same verbs are used for items within a collection:

HTTP verb | CRUD operation | Description
GET       | READ           | This returns a specific item, given the ID.
POST      | N/A            | This is not typically used at the item level.
DELETE    | DELETE         | This deletes a specific item, given the ID.
PUT       | UPDATE         | This updates an existing item. Sometimes PATCH is used to update only specific properties of the item.

Here's an example of a state transition diagram for a portion of the Tasker API along with the corresponding HTTP verbs:

Now that we've determined the states and the valid transitions, we're ready to start modeling the API and the responses it should generate.
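The collection-level mapping above can be sketched as a simple dispatch table. This is purely illustrative (a real server would register such handlers with a router rather than dispatching by hand), and the handler bodies are invented for the sketch:

```javascript
// An in-memory "collection" standing in for the real data store
var tasks = [ { title: "First task" } ];

// Verb-to-CRUD dispatch for the collection, as described in the table above
var collectionHandlers = {
  // GET on the collection: READ (return the whole list)
  GET: function () { return tasks; },
  // POST on the collection: CREATE (add an item, return the new item's URI)
  POST: function (item) {
    tasks.push(item);
    return "/task/" + tasks.length;
  }
};

console.log(collectionHandlers.GET().length);                // 1
console.log(collectionHandlers.POST({ title: "New task" })); // "/task/2"
```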
This is particularly useful before you start coding, as one will often notice issues with the API during this phase, and it's far easier to fix them now rather than after a lot of code has been written (or worse, after the API is in production). How you model your API is up to you. If you want to create a simple text document that describes the various requests and expected responses, that's fine. You can also use any number of tools that aid in modeling your API; some even allow you to provide mock responses for testing. Some of these are identified as follows:

- RAML (http://raml.org): This is a markup language to model RESTful-like APIs. You can build API models using any text editor, but there is also an API designer online.
- Apiary (http://apiary.io): Apiary uses a markdown-like language (API Blueprint) to model APIs. If you're familiar with markdown, you shouldn't have much trouble using this service. API mocking and automated testing are also provided.
- Swagger (http://swagger.io): This is similar to RAML, where it uses YAML as the modeling language. Documentation and client code can be generated directly from the API model.

Building our API using Node.js

In this section, we'll cover connecting our web service to our Oracle database, handling user authentication and session management using Passport, and defining handlers for state transitions. You'll definitely want to take a look at the /tasker-srv directory in the code package for this book, which contains the full web server for Tasker. In the following sections, we've only highlighted some snippets of the code.

Connecting to the backend database

Node.js's community has provided a large number of database drivers, so chances are good that whatever your backend, Node.js has a driver available for it. In our example app, we're using an Oracle database as the backend, which means we'll be using the oracle driver (https://www.npmjs.org/package/oracle).
Connecting to the database is actually pretty easy; the following code shows how:

var oracle = require("oracle");
oracle.connect(
  {
    hostname: "localhost",
    port: 1521,
    database: "xe",
    user: "tasker",
    password: "password"
  },
  function (err, client) {
    if (err) { /* error; return or next(err) */ }
    /* query the database; when done, call client.close() */
  }
);

In the real world, a development version of our server will be using a test database, and a production version of our server will use the production database. To facilitate this, we made the connection information configurable. The /config/development.json and /config/production.json files contain connection information, and the main code simply requests the configuration information when making a connection. The following line of code is used to get the configuration information:

oracle.connect( config.get( "oracle" ), … );

Since we're talking about the real world, we also need to recognize that database connections are slow, and they need to be pooled in order to improve performance as well as permit parallel execution. To do this, we added the generic-pool NPM module (https://www.npmjs.org/package/generic-pool) and added the following code to app.js:

var clientPool = pool.Pool( {
  name: "oracle",
  create: function ( cb ) {
    return new oracle.connect( config.get( "oracle" ),
      function ( err, client ) {
        cb( err, client );
      }
    );
  },
  destroy: function ( client ) {
    try {
      client.close();
    } catch ( err ) {
      // do nothing, but if we don't catch the error,
      // the server crashes
    }
  },
  max: 5,
  min: 1,
  idleTimeoutMillis: 30000
} );

Because our pool will always contain at least one connection, we need to ensure that when the process exits, the pool is properly drained, as follows:

process.on( "exit", function () {
  clientPool.drain( function () {
    clientPool.destroyAllNow();
  } );
} );

On its own, this doesn't do much yet.
We need to ensure that the pool is available to the entire app:

app.set( "client-pool", clientPool );

Executing queries

We've built our business logic in the Oracle database using PL/SQL stored procedures and functions. In PL/SQL, functions can return table-like structures. While this is similar in concept to a view, writing a function using PL/SQL provides us more flexibility. As such, our queries won't actually be talking to the base tables; they'll be talking to functions that return results based on the user's authorization. This means that we don't need additional conditions in a WHERE clause to filter based on the user's authorization, which helps eliminate code duplication. Regardless of the previous statement, executing queries and stored procedures is done using the same method, that is, execute. Before we can execute anything, we need to first acquire a client connection from the pool. To this end, we added a small set of database utility methods; you can see the code in the /db-utils directory.
The query utility method is shown in the following code snippet (note that on an error, we release the client back to the pool and return early so that the callback isn't invoked twice):

DBUtils.prototype.query = function ( sql, bindParameters, cb ) {
  var self = this,
      clientPool = self._clientPool,
      deferred = Q.defer();
  clientPool.acquire( function ( err, client ) {
    if ( err ) {
      winston.error( "Failed to acquire connection." );
      if ( cb ) { cb( new Error( err ) ); }
      else { deferred.reject( err ); }
      return;
    }
    try {
      client.execute( sql, bindParameters,
        function ( err, results ) {
          clientPool.release( client );
          if ( err ) {
            if ( cb ) { cb( new Error( err ) ); }
            else { deferred.reject( err ); }
            return;
          }
          if ( cb ) { cb( err, results ); }
          else { deferred.resolve( results ); }
        } );
    }
    catch ( err2 ) {
      try {
        clientPool.release( client );
      }
      catch ( err3 ) {
        // can't do anything...
      }
      if ( cb ) { cb( err2 ); }
      else { deferred.reject( err2 ); }
    }
  } );
  if ( !cb ) {
    return deferred.promise;
  }
};

It's then possible to retrieve the results of an arbitrary query using the preceding method, as shown in the following code snippet:

dbUtil.query( "SELECT * FROM " +
              "table(tasker.task_mgmt.get_task(:1,:2))",
              [ taskId, req.user.userId ] )
  .then( function ( results ) {
    // if no results, return 404 not found
    if ( results.length === 0 ) {
      return next( Errors.HTTP_NotFound() );
    }
    // create a new task with the database results
    // (will be in first row)
    req.task = new Task( results[ 0 ] );
    return next();
  } )
  .catch( function ( err ) {
    return next( new Error( err ) );
  } )
  .done();

The query used in the preceding code is an example of calling a stored function that returns a table structure.
The results of the SELECT statement will depend on the parameters (taskId and username), and get_task will decide what data can be returned based on the user's authorization.

Using Passport to handle authentication and sessions

Although we've implemented our own authentication protocol here, it's generally better to use one that has already been well vetted and is well understood, as well as one that suits your particular needs. In our case, we needed the demo to stand on its own without a lot of additional services, and as such, we built our own protocol. Even so, we chose a well-known cryptographic method (PBKDF2), and are using a large number of iterations and large key lengths. In order to implement authentication easily in Node.js, you'll probably want to use Passport (https://www.npmjs.org/package/passport). It has a large community and supports a large number of authentication schemes. If at all possible, try to use third-party authentication systems as often as possible (for example, LDAP, AD, Kerberos, and so on). In our case, because our authentication method is custom, we chose to use the passport-req strategy (https://www.npmjs.org/package/passport-req). Since Tasker's authentication is token-based, we will use this strategy to inspect a custom header that the client uses to pass us the authentication token. The following is a simplified diagram of how Tasker's authentication process works:

Please don't use our authentication strategy for anything that requires high levels of security. It's just an example, and isn't guaranteed to be secure in any way.

Before we can actually use Passport, we need to define how our authentication strategy actually works.
We do this by calling passport.use in our app.js file:

var passport = require( "passport" );
var ReqStrategy = require( "passport-req" ).Strategy;
var Session = require( "./models/session" );

passport.use( new ReqStrategy(
  function ( req, done ) {
    var clientAuthToken = req.headers[ "x-auth-token" ];
    var session = new Session( new DBUtils( clientPool ) );
    session.findSession( clientAuthToken )
      .then( function ( results ) {
        if ( !results ) { return done( null, false ); }
        done( null, results );
      } )
      .catch( function ( err ) {
        return done( err );
      } )
      .done();
  }
) );

In the preceding code, we've given Passport a new authentication strategy. Now, whenever Passport needs to authenticate a request, it will call this small section of code. You might be wondering what's going on in findSession. Here's the code:

Session.prototype.findSession = function ( clientAuthToken, cb ) {
  var self = this,
      deferred = Q.defer();
  // if no token, no sense in continuing
  if ( typeof clientAuthToken === "undefined" ) {
    if ( cb ) { return cb( null, false ); }
    else { deferred.reject(); }
  }
  // an auth token is of the form 1234.ABCDEF10284128401ABC13...
  var clientAuthTokenParts = clientAuthToken.split( "." );
  if ( !clientAuthTokenParts ) {
    if ( cb ) { return cb( null, false ); }
    else { deferred.reject(); }
  }
  // no auth token, no session.
  // get the parts
  var sessionId = clientAuthTokenParts[ 0 ],
      authToken = clientAuthTokenParts[ 1 ];
  // ask the database via dbutils if the token is recognized
  self._dbUtils.execute(
    "CALL tasker.security.verify_token (:1, :2, :3, :4, :5 ) INTO :6",
    [ sessionId,
      authToken,                                   // authorization token
      self._dbUtils.outVarchar2( { size: 32 } ),
      self._dbUtils.outVarchar2( { size: 4000 } ),
      self._dbUtils.outVarchar2( { size: 4000 } ),
      self._dbUtils.outVarchar2( { size: 1 } ) ] )
  .then( function ( results ) {
    // returnParam3 has a Y or N; Y is good auth
    if ( results.returnParam3 === "Y" ) {
      // notify callback of successful auth
      var user = {
        userId:     results.returnParam,
        sessionId:  sessionId,
        nextToken:  results.returnParam1,
        hmacSecret: results.returnParam2
      };
      if ( cb ) { cb( null, user ); }
      else { deferred.resolve( user ); }
    } else {
      // auth failed
      if ( cb ) { cb( null, false ); }
      else { deferred.reject(); }
    }
  } )
  .catch( function ( err ) {
    if ( cb ) { return cb( err, false ); }
    else { deferred.reject(); }
  } )
  .done();
  if ( !cb ) { return deferred.promise; }
};

The dbUtils.execute() method is a wrapper method around the Oracle query method we covered in the Executing queries section. Once a session has been retrieved from the database, Passport will want to serialize the user. This is usually just the user's ID, but we serialize a little more (which, from the preceding code, is the user's ID, session ID, and the HMAC secret):

passport.serializeUser( function ( user, done ) {
  done( null, user );
} );

The serializeUser method is called after a successful authentication, and it must be present or an error will occur. There's also a deserializeUser method if you're using typical Passport sessions: this method is designed to restore the user information from the Passport session.
Before any of this will work, we also need to tell Express to use the Passport middleware:

app.use( passport.initialize() );

Passport makes handling authentication simple, but it provides session support as well. While we don't use it for Tasker, you can use it to support a typical session-based username/password authentication system quite easily with a single line of code:

app.use( passport.session() );

If you're intending to use sessions with Passport, make sure you also provide a deserializeUser method. Next, we need to implement the code to authenticate a user with their username and password. Remember, we initially require the user to log in using their username and password, and once authenticated, we handle all further requests using tokens. To do this, we need to write a portion of our API code.

Building API handlers

We won't cover the entire API in this section, but we will cover a couple of small pieces, especially as they pertain to authentication and retrieving data. First, we've codified our API in /tasker-srv/api-def in the code package for this book. You'll also want to take a look at /tasker-srv/api-utils to see how we parse out this data structure into useable routes for the Express router. Basically, we codify our API by building a simple structure:

[
  { "route": "/auth", "actions": [ … ] },
  { "route": "/task", "actions": [ … ] },
  { "route": "/task/{:taskId}", "params": [ … ],
    "actions": [ … ] },
  …
]

Each route can have any number of actions and parameters. Parameters are equivalent to the Express Router's parameters. In the preceding example, {:taskId} is a parameter that will take on the value of whatever is in that particular location in the URI. For example, /task/21 will result in taskId with the value of 21. This is useful for our actions because each action can then assume that the parameters have already been parsed, so any actions on the /task/{:taskId} route will already have task information at hand.
The parameters are defined as follows:

{
  "name": "taskId",
  "type": "number",
  "description": "…",
  "returns": [ … ],
  "securedBy": "tasker-auth",
  "handler": function ( req, res, next, taskId ) { … }
}

Actions are defined as follows:

{
  "title": "Task",
  "action": "get-task",
  "verb": "get",
  "description": { … },   // hypermedia description
  "returns": [ … ],       // http status codes that are returned
  "example": { … },       // example response
  "href": "/task/{taskId}",
  "template": true,
  "accepts": [ "application/json", … ],
  "sends": [ "application/json", … ],
  "securedBy": "tasker-auth",
  "hmac": "tasker-256",
  "store": { … },
  "query-parameters": { … },
  "handler": function ( req, res, next ) { … }
}

Each handler is called whenever that particular route is accessed by a client using the correct HTTP verb (identified by verb in the prior code). This allows us to write a handler for each specific state transition in our API, which is nicer than having to write a large method that's responsible for the entire route. It also makes describing the API using hypermedia that much simpler, since we can require a portion of the API and call a simple utility method (/tasker-srv/api-utils/index.js) to generate the description for the client.
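As a hypothetical illustration of how a templated route such as /task/{:taskId} relates to concrete URIs, consider these two small helpers. Neither is part of the actual api-utils module; they are invented to show the two directions of the mapping.

```javascript
// Convert a templated route to an Express-style path:
// "/task/{:taskId}" -> "/task/:taskId"
function toExpressPath(route) {
  return route.replace(/\{(:[A-Za-z]+)\}/g, "$1");
}

// Expand a templated route with concrete parameter values:
// "/task/{:taskId}" + { taskId: 21 } -> "/task/21"
function expand(route, params) {
  return route.replace(/\{:([A-Za-z]+)\}/g, function (match, name) {
    return params[name];
  });
}

console.log(toExpressPath("/task/{:taskId}")); // "/task/:taskId"
console.log(expand("/task/{:taskId}", { taskId: 21 })); // "/task/21"
```

The first direction is what a router needs when registering the route; the second is what a hypermedia response needs when emitting a concrete link for a specific resource.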
Since we're still working on how to handle authentication, here's how the API definition for the POST /auth route looks (the complete version is located at /tasker-srv/api-def/auth/login.js):

action = {
  "title": "Authenticate User",
  "action": "login",
  "description": [ … ],
  "example": { … },
  "returns": {
    200: "User authenticated; see information in body.",
    401: "Incorrect username or password.",
    …
  },
  "verb": "post",
  "href": "/auth",
  "accepts": [ "application/json", … ],
  "sends": [ "application/json", … ],
  "csrf": "tasker-csrf",
  "store": {
    "body": [ { name: "session-id", key: "sessionId" },
              { name: "hmac-secret", key: "hmacSecret" },
              { name: "user-id", key: "userId" },
              { name: "next-token", key: "nextToken" } ]
  },
  "template": {
    "user-id": {
      "title": "User Name", "key": "userId",
      "type": "string", "required": true,
      "maxLength": 32, "minLength": 1
    },
    "candidate-password": {
      "title": "Password", "key": "candidatePassword",
      "type": "string", "required": true,
      "maxLength": 255, "minLength": 1
    }
  },

The earlier code is largely documentation (but it is returned to the client when they request this resource). The following handler is what actually performs the authentication:

  "handler": function ( req, res, next ) {
    var session = new Session( new DBUtils( req.app.get( "client-pool" ) ) ),
        username,
        password;
    // does our input validate?
    var validationResults = objUtils.validate( req.body, action.template );
    if ( !validationResults.validates ) {
      return next( Errors.HTTP_Bad_Request( validationResults.message ) );
    }
    // got here -- good; copy the values out
    username = req.body.userId;
    password = req.body.candidatePassword;
    // create a session with the username and password
    session.createSession( username, password )
      .then( function ( results ) {
        // no session? bad username or password
        if ( !results ) {
          return next( Errors.HTTP_Unauthorized() );
        }
        // return the session information to the client
        var o = {
          sessionId:  results.sessionId,
          hmacSecret: results.hmacSecret,
          userId:     results.userId,
          nextToken:  results.nextToken,
          _links:     {},
          _embedded:  {}
        };
        // generate hypermedia
        apiUtils.generateHypermediaForAction( action, o._links, security, "self" );
        [ require( "../task/getTaskList" ),
          require( "../task/getTask" ),
          …
          require( "../auth/logout" )
        ].forEach( function ( apiAction ) {
          apiUtils.generateHypermediaForAction( apiAction, o._links, security );
        } );
        resUtils.json( res, 200, o );
      } )
      .catch( function ( err ) {
        return next( err );
      } )
      .done();
  }
};

The session.createSession method looks very similar to session.findSession, as shown in the following code:

Session.prototype.createSession = function ( userName, candidatePassword, cb ) {
  var self = this,
      deferred = Q.defer();
  if ( typeof userName === "undefined" ||
       typeof candidatePassword === "undefined" ) {
    if ( cb ) { return cb( null, false ); }
    else { deferred.reject(); }
  }
  // attempt to authenticate
  self._dbUtils.execute(
    "CALL tasker.security.authenticate_user( :1, :2, :3," +
    " :4, :5 ) INTO :6",
    [ userName,
      candidatePassword,
      self._dbUtils.outVarchar2( { size: 4000 } ),
      self._dbUtils.outVarchar2( { size: 4000 } ),
      self._dbUtils.outVarchar2( { size: 4000 } ),
      self._dbUtils.outVarchar2( { size: 1 } ) ] )
  .then( function ( results ) {
    // returnParam3 has Y or N; Y is good auth
    if ( results.returnParam3 === "Y" ) {
      // notify callback of auth info
      var user = {
        userId:     userName,
        sessionId:  results.returnParam,
        nextToken:  results.returnParam1,
        hmacSecret: results.returnParam2
      };
      if ( cb ) { cb( null, user ); }
      else { deferred.resolve( user ); }
    } else {
      // auth failed
      if ( cb ) { cb( null, false ); }
      else { deferred.reject(); }
    }
  } )
  .catch( function ( err ) {
    if ( cb ) { return cb( err, false ); }
    else { deferred.reject(); }
  } )
  .done();
  if ( !cb ) { return deferred.promise; }
};

Once the API is fully codified, we need to go back to app.js and tell Express that it should use the API's routes:

app.use( "/", apiUtils.createRouterForApi( apiDef, checkAuth ) );

We also add a global variable so that whenever an API section needs to return the entire API as a hypermedia structure, it can do so without traversing the entire API again:

app.set( "x-api-root", apiUtils.generateHypermediaForApi( apiDef, securityDef ) );

The checkAuth method shown previously is pretty simple; all it does is ensure that we don't authenticate more than once in a single request:

function checkAuth ( req, res, next ) {
  if ( req.isAuthenticated() ) {
    return next();
  }
  passport.authenticate( "req" )( req, res, next );
}

You might be wondering where we're actually forcing our handlers to use authentication. There's actually a bit of magic in /tasker-srv/api-utils. I've highlighted the relevant portions:

createRouterForApi: function ( api, checkAuthFn ) {
  var router = express.Router();
  // process each route in the api; a route consists of the
  // uri (route) and a series of verbs (get, post, etc.)
  api.forEach( function ( apiRoute ) {
    // add params
    if ( typeof apiRoute.params !== "undefined" ) {
      apiRoute.params.forEach( function ( param ) {
        if ( typeof param.securedBy !== "undefined" ) {
          router.param( param.name, function ( req, res, next, v ) {
            return checkAuthFn( req, res,
              param.handler.bind( this, req, res, next, v ) );
          } );
        } else {
          router.param( param.name, param.handler );
        }
      } );
    }
    var uri = apiRoute.route;
    // create a new route with the uri
    var route = router.route( uri );
    // process through each action
    apiRoute.actions.forEach( function ( action ) {
      // just in case we have more than one verb, split them out
      var verbs = action.verb.split( "," );
      // and add the handler specified to the route
      // (if it's a valid verb)
      verbs.forEach( function ( verb ) {
        if ( typeof route[verb] === "function" ) {
          if ( typeof action.securedBy !== "undefined" ) {
            route[verb]( checkAuthFn, action.handler );
          } else {
            route[verb]( action.handler );
          }
        }
      } );
    } );
  } );
  return router;
};

Once you've finished writing even a few handlers, you should be able to verify that the system works by posting requests to your API. First, make sure your server has started; we use the following command to start the server:

export NODE_ENV=development; npm start

For some of the routes, you could just load up a browser and point it at your server. If you type https://localhost:4443/ in your browser, you should see a response that looks a lot like this:

If you're thinking this looks styled, you're right. The Tasker API generates responses based on the client's requested format. The browser requests data in HTML, and so our API generates a styled HTML page as a response. For an app, the response is JSON because the app requests that the response be in JSON.
If you want to see how this works, see /tasker-srv/res-utils/index.js. If you want to actually send and receive data, though, you'll want to get a REST client rather than using the browser. There are many good free clients: Firefox has a couple of good clients, as does Chrome, or you can find a native client for your operating system. Although you can do everything with curl on the command prompt, RESTful clients are much easier to use; they often offer useful features such as dynamic variables and built-in support for various authentication methods, and many can act as simple automated testers.

Summary

In this article, we've covered how to build a web server that bridges the gap between our database backend and our mobile application. We've provided an overview of RESTful-like APIs, and we've also quickly shown how to implement such a web API using Node.js. We've also covered authentication and session handling using Passport.

Resources for Article:

Further resources on this subject:

- Building Mobile Apps [article]
- Adding a Geolocation Trigger to the Salesforce Account Object [article]
- Introducing SproutCore [article]
Best Practices

Packt
23 Dec 2014
29 min read
This article by Prabath Siriwardena, author of Mastering Apache Maven 3, focuses on best practices associated with all the core concepts. The following best practices are essential ingredients in creating a successful, productive build environment. The following criteria will help you evaluate the efficiency of your Maven project if you are mostly dealing with a large-scale, multi-module project:

- The time it takes for a developer to get started with a new project and add it to the build system
- The effort it requires to upgrade a version of a dependency across all the project modules
- The time it takes to build the complete project with a fresh local Maven repository
- The time it takes to do a complete offline build
- The time it takes to update the versions of Maven artifacts produced by the project, for example, from 1.0.0-SNAPSHOT to 1.0.0
- The effort it requires for a completely new developer to understand what your Maven build does
- The effort it requires to introduce a new Maven repository
- The time it takes to execute unit tests and integration tests

Dependency management

In the following example, you will notice that the dependency versions are added to each and every dependency defined in the application POM file:

<dependencies>
  <dependency>
    <groupId>com.nimbusds</groupId>
    <artifactId>nimbus-jose-jwt</artifactId>
    <version>2.26</version>
  </dependency>
  <dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.2</version>
  </dependency>
</dependencies>

Imagine that you have a set of application POM files in a multi-module Maven project that has the same set of dependencies. If you have duplicated the artifact version with each and every dependency, then to upgrade to the latest dependency, you need to update all the POM files, which could easily lead to a mess.
Not just that, if you have different versions of the same dependency used in different modules of the same project, then it's going to be a debugging nightmare in case of an issue. With proper dependency management, we can overcome both of the previous issues. If it's a multi-module Maven project, you need to introduce the dependencyManagement configuration element in the parent POM so that it will be inherited by all the other child modules:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.nimbusds</groupId>
      <artifactId>nimbus-jose-jwt</artifactId>
      <version>2.26</version>
    </dependency>
    <dependency>
      <groupId>commons-codec</groupId>
      <artifactId>commons-codec</artifactId>
      <version>1.2</version>
    </dependency>
  </dependencies>
</dependencyManagement>

Once you define dependencies under the dependencyManagement section as shown in the previous code, you only need to refer to a dependency by its groupId and artifactId tags. The version tag is picked from the appropriate dependencyManagement section:

<dependencies>
  <dependency>
    <groupId>com.nimbusds</groupId>
    <artifactId>nimbus-jose-jwt</artifactId>
  </dependency>
  <dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
  </dependency>
</dependencies>

With the previous code snippet, if you want to upgrade or downgrade a dependency, you only need to change the version of the dependency under the dependencyManagement section. The same principle applies to plugins as well. If you have a set of plugins which are used across multiple modules, you should define them under the pluginManagement section of the parent module.
In this way, you can downgrade or upgrade plugin versions seamlessly just by changing the pluginManagement section of the parent POM, as shown in the following code:

<pluginManagement>
  <plugins>
    <plugin>
      <artifactId>maven-resources-plugin</artifactId>
      <version>2.4.2</version>
    </plugin>
    <plugin>
      <artifactId>maven-site-plugin</artifactId>
      <version>2.0-beta-6</version>
    </plugin>
    <plugin>
      <artifactId>maven-source-plugin</artifactId>
      <version>2.0.4</version>
    </plugin>
    <plugin>
      <artifactId>maven-surefire-plugin</artifactId>
      <version>2.13</version>
    </plugin>
  </plugins>
</pluginManagement>

Once you define the plugins in the pluginManagement section, as shown in the previous code, you only need to refer to a plugin by its groupId (optional) and artifactId tags. The version tag is picked from the appropriate pluginManagement section:

<plugins>
  <plugin>
    <artifactId>maven-resources-plugin</artifactId>
    <executions>……</executions>
  </plugin>
  <plugin>
    <artifactId>maven-site-plugin</artifactId>
    <executions>……</executions>
  </plugin>
  <plugin>
    <artifactId>maven-source-plugin</artifactId>
    <executions>……</executions>
  </plugin>
  <plugin>
    <artifactId>maven-surefire-plugin</artifactId>
    <executions>……</executions>
  </plugin>
</plugins>

Defining a parent module

In most of the multi-module Maven projects, there are many things that are shared across multiple modules. Dependency versions, plugin versions, properties, and repositories are only some of them. It is a common as well as a best practice to create a separate module called parent, and in its POM file, define everything in common. The packaging type of this POM file is pom. The artifact generated by the pom packaging type is itself a POM file.
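As a minimal sketch of such a parent module (the coordinates and property name below are hypothetical placeholders, not from any real project), its POM might look like this:

```xml
<!-- parent/pom.xml: a hypothetical parent module holding only shared configuration -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>example-parent</artifactId>
  <version>1.0.0-SNAPSHOT</version>
  <!-- pom packaging: the artifact produced is the POM file itself -->
  <packaging>pom</packaging>

  <properties>
    <!-- custom property shared by all child modules -->
    <commons.codec.version>1.2</commons.codec.version>
  </properties>

  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>commons-codec</groupId>
        <artifactId>commons-codec</artifactId>
        <version>${commons.codec.version}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</project>
```

A separate aggregator POM at the project root would then list the modules to build (including this parent module) under a modules element, keeping shared configuration and build aggregation apart.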
The following are a few examples:

- The Apache Axis2 project, available at http://svn.apache.org/repos/asf/axis/axis2/java/core/trunk/modules/parent/
- The WSO2 Carbon project, available at https://svn.wso2.org/repos/wso2/carbon/platform/trunk/parent/

Not all the projects follow this approach. Some just keep the parent POM file under the root directory (not under the parent module). The following are a couple of examples:

- The Apache Synapse project, available at http://svn.apache.org/repos/asf/synapse/trunk/java/pom.xml
- The Apache HBase project, available at http://svn.apache.org/repos/asf/hbase/trunk/pom.xml

Both approaches deliver the same results. However, the first one is much preferred. With the first approach, the parent POM file only defines the shared resources across different Maven modules in the project, while there is another POM file at the root of the project, which defines all the modules to be included in the project build. With the second approach, you define all the shared resources as well as all the modules to be included in the project build in the same POM file, which is under the project's root directory. The first approach is better than the second one, based on the separation of concerns principle.

POM properties

There are six types of properties that you can use within a Maven application POM file:

- Built-in properties
- Project properties
- Local settings
- Environment variables
- Java system properties
- Custom properties

It is always recommended that you use properties instead of hardcoding values in application POM files. Let's look at a few examples. Let's take the application POM file inside the Apache Axis2 distribution module, available at http://svn.apache.org/repos/asf/axis/axis2/java/core/trunk/modules/distribution/pom.xml. This defines all the artifacts created in the Axis2 project that need to be included in the final distribution. All the artifacts share the same groupId tag as well as the version tag of the distribution module.
This is a common scenario in most of the multi-module Maven projects. Most of the modules (if not all) share the same groupId tag and the version tag:

<dependencies>
  <dependency>
    <groupId>org.apache.axis2</groupId>
    <artifactId>axis2-java2wsdl</artifactId>
    <version>${project.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.axis2</groupId>
    <artifactId>axis2-kernel</artifactId>
    <version>${project.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.axis2</groupId>
    <artifactId>axis2-adb</artifactId>
    <version>${project.version}</version>
  </dependency>
</dependencies>

In the previous configuration, instead of duplicating the version element, Axis2 uses the project property ${project.version}. When Maven finds this project property, it reads the value from the project POM version element. If the project POM file does not have a version element, then Maven will try to read it from the immediate parent POM file. The benefit here is, when you upgrade your project version some day, you only need to upgrade the version element of the distribution POM file (or its parent). The previous configuration is not perfect; it can be further improved in the following manner:

<dependencies>
  <dependency>
    <groupId>${project.groupId}</groupId>
    <artifactId>axis2-java2wsdl</artifactId>
    <version>${project.version}</version>
  </dependency>
  <dependency>
    <groupId>${project.groupId}</groupId>
    <artifactId>axis2-kernel</artifactId>
    <version>${project.version}</version>
  </dependency>
  <dependency>
    <groupId>${project.groupId}</groupId>
    <artifactId>axis2-adb</artifactId>
    <version>${project.version}</version>
  </dependency>
</dependencies>

Here, we also replace the hardcoded value of groupId in all the dependencies with the project property ${project.groupId}. When Maven finds this project property, it reads the value from the project POM groupId element.
If the project POM file does not have a groupId element, then Maven will try to read it from the immediate parent POM file. The following lists out some of the Maven built-in properties and project properties:

- project.version: This refers to the value of the version element of the project POM file
- project.groupId: This refers to the value of the groupId element of the project POM file
- project.artifactId: This refers to the value of the artifactId element of the project POM file
- project.name: This refers to the value of the name element of the project POM file
- project.description: This refers to the value of the description element of the project POM file
- project.basedir: This refers to the path of the project's base directory

The following is an example that shows the usage of this project property. Here, we have a system dependency that needs to be referred to from a filesystem path:

<dependency>
  <groupId>org.apache.axis2.wso2</groupId>
  <artifactId>axis2</artifactId>
  <version>1.6.0.wso2v2</version>
  <scope>system</scope>
  <systemPath>${project.basedir}/lib/axis2-1.6.jar</systemPath>
</dependency>

In addition to the project properties, you can also read properties from the USER_HOME/.m2/settings.xml file. For example, if you want to read the path to the local Maven repository, you can use the ${settings.localRepository} property. In the same way, with the same pattern, you can read any of the configuration elements that are defined in the settings.xml file. The environment variables defined in the system can be read using the env prefix within an application POM file. The ${env.M2_HOME} property will return the path to the Maven home, while ${env.JAVA_HOME} returns the path to the Java home directory. These properties will be quite useful within certain Maven plugins. Maven also lets you define your own set of custom properties. Custom properties are mostly used when defining dependency versions. You should not scatter custom properties all over the place.
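Pulling the property types discussed above together, the following hypothetical fragment (the app.config.dir property name and the values passed to the tests are made up for illustration) shows project, settings, environment, and custom properties used side by side:

```xml
<!-- A made-up fragment mixing the different property types -->
<properties>
  <!-- custom property, defined by us -->
  <app.config.dir>${project.basedir}/src/main/config</app.config.dir>
</properties>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <version>2.13</version>
      <configuration>
        <systemPropertyVariables>
          <!-- project property -->
          <build.version>${project.version}</build.version>
          <!-- read from USER_HOME/.m2/settings.xml -->
          <local.repo>${settings.localRepository}</local.repo>
          <!-- environment variable -->
          <maven.home>${env.M2_HOME}</maven.home>
          <!-- custom property defined above -->
          <config.dir>${app.config.dir}</config.dir>
        </systemPropertyVariables>
      </configuration>
    </plugin>
  </plugins>
</build>
```

Each placeholder is resolved at build time, so the tests run by the Surefire plugin would see the resolved values as ordinary Java system properties.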
The ideal place to define them is the parent POM file in a multi-module Maven project, which will then be inherited by all the other child modules. If you look at the parent POM file of the WSO2 Carbon project, you will find a large set of custom properties, which are defined in https://svn.wso2.org/repos/wso2/carbon/platform/branches/turing/parent/pom.xml. The following lists out some of them:

<properties>
  <rampart.version>1.6.1-wso2v10</rampart.version>
  <rampart.mar.version>1.6.1-wso2v10</rampart.mar.version>
  <rampart.osgi.version>1.6.1.wso2v10</rampart.osgi.version>
</properties>

When you add a dependency to the Rampart JAR, you do not need to specify the version there. Just refer to it by the ${rampart.version} property name. Also, keep in mind that all the custom-defined properties are inherited and can be overridden in any child POM file:

<dependency>
  <groupId>org.apache.rampart.wso2</groupId>
  <artifactId>rampart-core</artifactId>
  <version>${rampart.version}</version>
</dependency>

Avoiding repetitive groupId and version tags by inheriting from the parent POM

In a multi-module Maven project, most of the modules (if not all) share the same groupId and version elements. In this case, you can avoid adding the version and groupId elements to your application POM file. These will be automatically inherited from the corresponding parent POM. If you look at axis2-kernel (which is a module of the Apache Axis2 project), you will find that no groupId or version is defined at http://svn.apache.org/repos/asf/axis/axis2/java/core/trunk/modules/kernel/pom.xml.
Maven reads them from the parent POM file:

<project>
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.axis2</groupId>
    <artifactId>axis2-parent</artifactId>
    <version>1.7.0-SNAPSHOT</version>
    <relativePath>../parent/pom.xml</relativePath>
  </parent>
  <artifactId>axis2-kernel</artifactId>
  <name>Apache Axis2 - Kernel</name>
</project>

Following naming conventions

When defining coordinates for your Maven project, you must always follow the naming conventions. The value of the groupId element should follow the same naming convention you use for Java package names: the reverse of a domain name that you own, or at least one that your project is developed under. The following lists out some of the naming conventions related to groupId:

- The name of the groupId element has to be in lowercase.
- Use the reverse of a domain name that can be used to uniquely identify your project. This will also help to avoid collisions between artifacts produced by different projects.
- Avoid using digits or special characters (for example, org.wso2.carbon.identity-core).
- Do not try to group two words into a single word by camel casing (for example, org.wso2.carbon.identityCore).
- Make sure that all the subprojects developed under different teams in the same company finally inherit from the same groupId element and extend the name of the parent groupId element rather than defining their own.

Let's go through some examples.
You will notice that all the open source projects developed under the Apache Software Foundation (ASF) use the same parent groupId (org.apache) and define their own groupId elements, which extend from the parent:

- The Apache Axis2 project uses org.apache.axis2, which inherits from the org.apache parent groupId
- The Apache Synapse project uses org.apache.synapse, which inherits from the org.apache parent groupId
- The Apache ServiceMix project uses org.apache.servicemix, which inherits from the org.apache parent groupId
- The WSO2 Carbon project uses org.wso2.carbon

Apart from the groupId element, you should also follow the naming conventions while defining artifactIds. The following lists out some of the naming conventions related to artifactId:

- The name of the artifactId has to be in lowercase.
- Avoid duplicating the value of groupId inside the artifactId element. If you find a need to start your artifactId with the value of the groupId element and add something to the end, then you need to revisit the structure of your project. You might need to add more module groups.
- Avoid using special characters (for example, #, $, &, %, and so on).
- Do not try to group two words into a single word by camel casing (for example, identityCore).

Following naming conventions for version is also equally important. The version of a given Maven artifact can be divided into four categories:

<Major version>.<Minor version>.<Incremental version>-<Build number or qualifier>

The major version reflects the introduction of a new major feature. A change in the major version of a given artifact can also mean that the new changes are not necessarily backward compatible with the previously released artifact. The minor version reflects the introduction of a new feature to the previously released version, in a backward-compatible manner. The incremental version reflects a bug-fix release of the artifact. The build number can be the revision number from the source code repository.
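To make the scheme concrete, the following hypothetical version values (the build number 1423 is made up) illustrate each component in turn:

```xml
<!-- Made-up version values illustrating the four components -->
<version>1.0.0</version>      <!-- first stable release -->
<version>1.1.0</version>      <!-- minor: a new, backward-compatible feature -->
<version>1.1.1</version>      <!-- incremental: a bug-fix release on 1.1.0 -->
<version>2.0.0-1423</version> <!-- major, possibly incompatible; 1423 is a build number -->
```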
This versioning convention is not just for Maven artifacts. Apple did a major release of its iOS mobile operating system in September 2014: iOS 8.0.0. Soon after the release, they discovered a critical bug in it that had an impact on cellular network connectivity and Touch ID on iPhone. Then, they released iOS 8.0.1 as a patch release to fix the issues. Let's go through some of the examples:

- The Apache Axis2 1.6.0 release, available at http://svn.apache.org/repos/asf/axis/axis2/java/core/tags/v1.6.0/pom.xml.
- The Apache Axis2 1.6.2 release, available at http://svn.apache.org/repos/asf/axis/axis2/java/core/tags/v1.6.2/pom.xml.
- The Apache Axis2 1.7.0-SNAPSHOT release, available at http://svn.apache.org/repos/asf/axis/axis2/java/core/trunk/pom.xml. SNAPSHOT releases are done from the trunk of the source repository with the latest available code.
- The Apache Synapse 2.1.0-wso2v5 release, available at http://svn.wso2.org/repos/wso2/tags/carbon/3.2.3/dependencies/synapse/2.1.0-wso2v5/pom.xml. Here, the Synapse code is maintained under the WSO2 source repository and not under the Apache repository. In this case, we use the wso2v5 qualifier to make it different from the same artifact produced by Apache Synapse.

Maven profiles

When do we need Maven profiles, and why are they a best practice? Think about a large-scale multi-module Maven project. One of the best examples I am aware of is the WSO2 Carbon project. If you look at the application POM file available at http://svn.wso2.org/repos/wso2/tags/carbon/3.2.3/components/pom.xml, you will notice that there are more than a hundred modules. Also, if you go deeper into each module, you will further notice that there are more modules within them: http://svn.wso2.org/repos/wso2/tags/carbon/3.2.3/components/identity/pom.xml. As a developer of the WSO2 Carbon project, you do not need to build all these modules. In this specific example, different groups of the modules are later aggregated to build multiple products.
However, a given product does not need to build all the modules defined in the parent POM file. If you are a developer in a product team, you only need to worry about building the set of modules related to your product; if not, it's an utter waste of productive time. Maven profiles help you to do this. With Maven profiles, you can activate a subset of the configurations defined in your application POM file, based on some criteria. If we take the same example as before, you will find that multiple profiles are defined under the <profiles> element: http://svn.wso2.org/repos/wso2/tags/carbon/3.2.3/components/pom.xml. Each profile element defines the set of modules that is relevant to it and is identified by a unique ID. Also, for each profile, you need to define a criterion to activate it, under the activation element. By setting the value of the activeByDefault element to true, we make sure that the corresponding profile gets activated when no other profile is picked. In this particular example, if we just execute mvn clean install, the profile with the default ID will get executed. Keep in mind that the magic here does not lie in the name of the profile ID, default, but in the value of the activeByDefault element, which is set to true for the default profile.
The value of the id element can be of any name:

<profiles>
  <profile>
    <id>product-esb</id>
    <activation>
      <property>
        <name>product</name>
        <value>esb</value>
      </property>
    </activation>
    <modules></modules>
  </profile>
  <profile>
    <id>product-greg</id>
    <activation>
      <property>
        <name>product</name>
        <value>greg</value>
      </property>
    </activation>
    <modules></modules>
  </profile>
  <profile>
    <id>product-is</id>
    <activation>
      <property>
        <name>product</name>
        <value>is</value>
      </property>
    </activation>
    <modules></modules>
  </profile>
  <profile>
    <id>default</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <modules></modules>
  </profile>
</profiles>

If I am a member of the WSO2 Identity Server (IS) team, then I will execute the build in the following manner:

$ mvn clean install -Dproduct=is

Here, we pass the system property product with the value is. If you look at the activation criteria for all the profiles, all are based on the system property product. If the value of the system property is is, then Maven will pick the build profile corresponding to the Identity Server:

<activation>
  <property>
    <name>product</name>
    <value>is</value>
  </property>
</activation>

You can also define an activation criterion to execute a profile in the absence of a property. For example, the following configuration shows how to activate a profile if the product property is missing:

<activation>
  <property>
    <name>!product</name>
  </property>
</activation>

The profile activation criteria can be based on a system property, the JDK version, or an operating system parameter where you run the build.
The following sample configuration shows how to activate a build profile for JDK 1.6:

<activation>
  <jdk>1.6</jdk>
</activation>

The following sample configuration shows how to activate a build profile based on operating system parameters:

<activation>
  <os>
    <name>mac os x</name>
    <family>mac</family>
    <arch>x86_64</arch>
    <version>10.8.5</version>
  </os>
</activation>

The following sample configuration shows how to activate a build profile based on the presence or absence of a file:

<activation>
  <file>
    <exists>……</exists>
    <missing>……</missing>
  </file>
</activation>

In addition to the activation configuration, you can also execute a Maven profile just by its ID, which is defined within the id element. In this case, you need to prefix the profile ID with -P, as shown in the following command:

$ mvn clean install -Pproduct-is

Think twice before you write your own plugin

Maven is all about plugins! There is a plugin out there for almost everything. If you find a need to write a plugin, spend some time researching on the Web to see whether you can find something similar; the chances are very high. You can also find a list of available Maven plugins at http://maven.apache.org/plugins.

The Maven release plugin

Releasing a project requires a lot of repetitive tasks. The objective of the Maven release plugin is to automate them.
The release plugin defines the following eight goals, which are executed in two stages: preparing the release and performing the release:

- release:clean: This goal cleans up after a release preparation
- release:prepare: This goal prepares for a release in Software Configuration Management (SCM)
- release:prepare-with-pom: This goal prepares for a release in SCM and generates release POMs by fully resolving the dependencies
- release:rollback: This goal rolls back to a previous release
- release:perform: This goal performs a release from SCM
- release:stage: This goal performs a release from SCM into a staging folder/repository
- release:branch: This goal creates a branch of the current project with all versions updated
- release:update-versions: This goal updates versions in the POM(s)

The preparation stage will complete the following tasks with the release:prepare goal:

- Verify that all the changes in the source code are committed.
- Make sure that there are no SNAPSHOT dependencies. During the project development phase, we use SNAPSHOT dependencies, but at the time of the release, all dependencies should be changed to a released version.
- Change the version of the project POM files from SNAPSHOT to a concrete version number.
- Change the SCM information in the project POM to include the final destination of the tag.
- Execute all the tests against the modified POM files.
- Commit the modified POM files to the SCM and tag the code with the version name.
- Change the version of the POM files in the trunk to a new SNAPSHOT version, and then commit the modified POM files to the trunk.

Finally, the release will be performed with the release:perform goal. This will check out the code from the release tag in the SCM and run a set of predefined goals: deploy and site-deploy. The maven-release-plugin is not defined in the super POM and should be explicitly defined in your project POM file.
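Note that release:prepare also needs to know where the project lives in source control; it reads this from the project's scm element. A sketch with placeholder repository URLs (these are not real repositories) might look like the following:

```xml
<!-- Hypothetical <scm> section; release:prepare uses these URLs to commit and tag -->
<scm>
  <connection>scm:git:https://example.com/repos/tasker.git</connection>
  <developerConnection>scm:git:https://example.com/repos/tasker.git</developerConnection>
  <url>https://example.com/repos/tasker</url>
</scm>
```

Without a valid scm section, the preparation stage fails early, since it cannot commit the modified POM files or create the release tag.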
The releaseProfiles configuration element defines the profiles to be released, and the goals configuration element defines the plugin goals to be executed during the release:perform goal. In the following configuration, the deploy goal of the maven-deploy-plugin and the single goal of the maven-assembly-plugin will get executed:

<plugin>
  <artifactId>maven-release-plugin</artifactId>
  <version>2.5</version>
  <configuration>
    <releaseProfiles>release</releaseProfiles>
    <goals>deploy assembly:single</goals>
  </configuration>
</plugin>

More details about the Maven release plugin are available at http://maven.apache.org/maven-release/maven-release-plugin/.

The Maven enforcer plugin

The Maven enforcer plugin lets you control or enforce constraints in your build environment. These could be the Maven version, Java version, operating system parameters, and even user-defined rules. The plugin defines two goals: enforce and displayInfo. The enforcer:enforce goal will execute all the defined rules against all the modules in a multi-module Maven project, while enforcer:displayInfo will display the project compliance details with respect to the standard rule set.
The maven-enforcer-plugin is not defined in the super POM and should be explicitly defined in your project POM file:

<plugins>
  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-enforcer-plugin</artifactId>
    <version>1.3.1</version>
    <executions>
      <execution>
        <id>enforce-versions</id>
        <goals>
          <goal>enforce</goal>
        </goals>
        <configuration>
          <rules>
            <requireMavenVersion>
              <version>3.2.1</version>
            </requireMavenVersion>
            <requireJavaVersion>
              <version>1.6</version>
            </requireJavaVersion>
            <requireOS>
              <family>mac</family>
            </requireOS>
          </rules>
        </configuration>
      </execution>
    </executions>
  </plugin>
</plugins>

The previous plugin configuration enforces the Maven version to be 3.2.1, the Java version to be 1.6, and the operating system to be in the Mac family. The Apache Axis2 project uses the enforcer plugin to make sure that no application POM file defines Maven repositories. All the artifacts required by Axis2 are expected to be in the Maven central repository. The following configuration element is extracted from http://svn.apache.org/repos/asf/axis/axis2/java/core/trunk/modules/parent/pom.xml.
Here, it bans all the repositories and plugin repositories, except snapshot repositories:

<plugin>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.1</version>
  <executions>
    <execution>
      <phase>validate</phase>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <requireNoRepositories>
            <banRepositories>true</banRepositories>
            <banPluginRepositories>true</banPluginRepositories>
            <allowSnapshotRepositories>true</allowSnapshotRepositories>
            <allowSnapshotPluginRepositories>true</allowSnapshotPluginRepositories>
          </requireNoRepositories>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>

In addition to the standard rule set that ships with the enforcer plugin, you can also define your own rules. More details about how to write custom rules are available at http://maven.apache.org/enforcer/enforcer-api/writing-a-custom-rule.html.

Avoid using un-versioned plugins

If you have associated a plugin with your application POM without a version, then Maven will download the corresponding maven-metadata.xml file and store it locally. Only the latest released version of the plugin will be downloaded and used in the project. This can easily create certain uncertainties. Your project might work fine with the current version of a plugin, but later, if there is a new release of the same plugin, your Maven project will start to use the latest one automatically. This can result in unpredictable behavior and lead to a debugging mess. It is always recommended that you specify the plugin version along with the plugin configuration.
You can enforce this as a rule with the Maven enforcer plugin, as shown in the following code:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.3.1</version>
  <executions>
    <execution>
      <id>enforce-plugin-versions</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <requirePluginVersions>
            <message>…………</message>
            <banLatest>true</banLatest>
            <banRelease>true</banRelease>
            <banSnapshots>true</banSnapshots>
            <phases>clean,deploy,site</phases>
            <additionalPlugins>
              <additionalPlugin>
                org.apache.maven.plugins:maven-eclipse-plugin
              </additionalPlugin>
              <additionalPlugin>
                org.apache.maven.plugins:maven-reactor-plugin
              </additionalPlugin>
            </additionalPlugins>
            <unCheckedPluginList>
              org.apache.maven.plugins:maven-enforcer-plugin,org.apache.maven.plugins:maven-idea-plugin
            </unCheckedPluginList>
          </requirePluginVersions>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>

The following points explain each of the key configuration elements defined in the previous code:

message: Use this to define an optional message to the user if the rule execution fails.
banLatest: Use this to restrict the use of LATEST as the version for any plugin.
banRelease: Use this to restrict the use of RELEASE as the version for any plugin.
banSnapshots: Use this to restrict the use of SNAPSHOT plugins.
banTimestamps: Use this to restrict the use of SNAPSHOT plugins with the timestamp version.
phases: This is a comma-separated list of phases that should be used to find lifecycle plugin bindings. The default value is clean,deploy,site.
additionalPlugins: This is a list of additional plugins to enforce to have versions.
These plugins might not be defined in application POM files, but are used anyway, such as help and eclipse. The plugins should be specified in the groupId:artifactId form.

unCheckedPluginList: This is a comma-separated list of plugins to skip during version checking.

You can read more details about the requirePluginVersions rule at http://maven.apache.org/enforcer/enforcer-rules/requirePluginVersions.html.

Using exclusive and inclusive routes

When Maven asks for an artifact from a Nexus proxy repository, Nexus knows exactly where to look. For example, say we have a proxy repository that runs at http://localhost:8081/nexus/content/repositories/central/, which internally points to the remote repository running at https://repo1.maven.org/maven2/. As there is a one-to-one mapping between the proxy repository and the corresponding remote repository, Nexus can route requests without much trouble. However, if Maven looks for an artifact via a Nexus group repository, Nexus has to iterate through all the repositories in that group to find the exact artifact. There can be cases where a single group repository contains more than 20 repositories, which can easily introduce delays on the client side. To optimize artifact discovery in group repositories, we need to set correct inclusive/exclusive routing rules.

Avoid having both release and snapshot repositories in the same group repository

With the Nexus repository manager, you can group both release repositories and snapshot repositories into a single group repository. This is considered an extremely bad practice. Ideally, you should be able to define distinct update policies for release repositories and snapshot repositories.

Summary

In this article, we looked at and highlighted some of the best practices to be followed in a large-scale development project with Maven.
It is always recommended to follow these best practices, as doing so drastically improves developer productivity and reduces maintenance headaches.
Packt
23 Dec 2014

Hadoop and SQL

This article is by Garry Turkington and Gabriele Modena, the authors of the book Learning Hadoop 2. MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. However, it does require a different mindset and some training and experience in the model of breaking processing analytics into a series of map and reduce steps. There are several products built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This article will explore the other most common abstraction implemented atop Hadoop: SQL.

In this article, we will cover the following topics:

- What the use cases for SQL on Hadoop are and why it is so popular
- HiveQL, the SQL dialect introduced by Apache Hive
- Using HiveQL to perform SQL-like analysis of the Twitter dataset
- How HiveQL can approximate common features of relational databases such as joins and views
- How HiveQL allows the incorporation of user-defined functions into its queries
- How SQL on Hadoop complements Pig
- Other SQL-on-Hadoop products such as Impala and how they differ from Hive

Why SQL on Hadoop

Until now, we saw how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL. Back in 2008, Facebook released Hive, the first widely used implementation of SQL on Hadoop. Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs.
It then executes the overall MapReduce program and returns the results to the user. This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop: instead of requiring software development skills, anyone who's familiar with SQL can use Hive. The combination of these attributes means that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of user-defined functions, enabling the base SQL dialect to be customized with business-specific functionality.

Other SQL-on-Hadoop solutions

Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. Later in this article, we will also discuss Impala, released in 2013 and already a very popular tool, particularly for low-latency queries. There are others, but we will mostly discuss Hive and Impala as they have been the most successful. While introducing the core features and capabilities of SQL on Hadoop, however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. We'll generally be looking at aspects of the feature set that are common to both, but if you use both products, it's important to read the latest release notes to understand the differences.

Prerequisites

Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this article. We'll create a modified version of an earlier Pig script as the main functionality for this.
The script in this article assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows:

-- load JSON data
tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- Tweets
tweets_tsv = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str',
        (chararray)$0#'text' as text,
        (chararray)$0#'in_reply_to',
        (boolean)$0#'retweeted' as is_retweeted,
        (chararray)$0#'user'#'id_str' as user_id,
        (chararray)$0#'place'#'id' as place_id;
}
store tweets_tsv into '$outputDir/tweets' using PigStorage('\u0001');

-- Places
needed_fields = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'place' as place;
}
place_fields = foreach needed_fields {
    generate
        (chararray)place#'id' as place_id,
        (chararray)place#'country_code' as co,
        (chararray)place#'country' as country,
        (chararray)place#'name' as place_name,
        (chararray)place#'full_name' as place_full_name,
        (chararray)place#'place_type' as place_type;
}
filtered_places = filter place_fields by co != '';
unique_places = distinct filtered_places;
store unique_places into '$outputDir/places' using PigStorage('\u0001');

-- Users
users = foreach tweets {
    generate
        (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)$0#'id_str' as id_str,
        $0#'user' as user;
}
user_fields = foreach users {
    generate
        (chararray)CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray)user#'id_str' as user_id,
        (chararray)user#'location' as user_location,
        (chararray)user#'name' as user_name,
        (chararray)user#'description' as user_description,
        (int)user#'followers_count' as followers_count,
        (int)user#'friends_count' as friends_count,
        (int)user#'favourites_count' as favourites_count,
        (chararray)user#'screen_name' as screen_name,
        (int)user#'listed_count' as listed_count;
}
unique_users = distinct user_fields;
store unique_users into '$outputDir/users' using PigStorage('\u0001');

Have a look at the following command:

$ pig -f extract_for_hive.pig -param inputDir=<json input> -param outputDir=<output path>

The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to the unicode value U0001 (the \u0001 character, also produced by typing Ctrl + A). This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields.

Overview of Hive

We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. In this example, and in the remainder of the article, we will assume that queries are typed into the shell that can be invoked by executing the hive command. Even though the classic CLI tool for Hive was the tool with the same name, it is specifically called hive (all lowercase); recently a client called Beeline also became available and will likely be the preferred CLI client in the near future.

When importing any new data into Hive, there is generally a three-stage process, as follows:

1. Create the specification of the table into which the data is to be imported
2. Import the data into the created table
3. Execute HiveQL queries against the table

Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this article, but if you need a refresher, there are numerous good online learning resources.
Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the names and types of its columns, and some metadata about how the table is stored:

CREATE table tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

The statement creates a new table called tweets, defined by a list of names for the columns in the dataset and their data types. We specify that fields are delimited by the unicode character \u0001 and that the format used to store data is TEXTFILE. Data can be imported from a location in HDFS tweets/ into Hive using the LOAD DATA statement:

LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets;

By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later.

Once data has been imported into Hive, we can run queries against it. For instance:

SELECT COUNT(*) FROM tweets;

The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase.
If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times.

The nature of Hive tables

Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point. Neither the CREATE TABLE nor the LOAD DATA statement truly creates concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored. This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and will then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.

Hive architecture

Until version 2, Hadoop was primarily a batch system. MapReduce jobs tend to have high latency and overhead derived from submission and scheduling. Internally, Hive compiles HiveQL statements into MapReduce jobs.
Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later. Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2. Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password. HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2. HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2. Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively. In the examples we saw before and in the remainder of this article, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. Because Beeline is relatively new, both tools are available on Cloudera and most other major distributions for reasons of compatibility and maturity. The Beeline client is part of the core Apache Hive distribution and so is also fully open source.
Beeline can be executed in embedded mode with the following command:

$ beeline -u jdbc:hive2://

Data types

HiveQL supports many of the common data types provided by standard database systems. These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can distinguish Hive data types into the following five broad categories:

- Numeric: tinyint, smallint, int, bigint, float, double, and decimal
- Date and time: timestamp and date
- String: string, varchar, and char
- Collections: array, map, struct, and uniontype
- Misc: boolean, binary, and NULL

DDL statements

HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. The SHOW [DATABASES, TABLES, VIEWS] statement displays the databases currently available within the data warehouse and which table and view metadata is present within the database currently in use:

CREATE DATABASE twitter;
SHOW databases;
USE twitter;
SHOW TABLES;

The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS, as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception. Both table and column names are case insensitive.
In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally. The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code:

CREATE EXTERNAL TABLE tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/tweets';

This table will be created in the metastore, but the data will not be copied into the /user/hive/warehouse directory. Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse. The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we might want to create a view to isolate retweets from other messages, as follows:

CREATE VIEW retweets
COMMENT 'Tweets that have been retweeted'
AS SELECT * FROM tweets WHERE retweeted = true;

Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views. The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected.
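The asymmetry between managed and EXTERNAL tables on DROP can be pictured with a toy Python model (the table names, flags, and paths here are invented; this is not how Hive implements it):

```python
# Toy catalog illustrating managed versus EXTERNAL drop semantics.
metastore = {
    "tweets":   {"external": True,  "location": "/data/tweets"},
    "retweets": {"external": False, "location": "/user/hive/warehouse/retweets"},
}
hdfs = {"/data/tweets", "/user/hive/warehouse/retweets"}  # directories that "exist"

def drop_table(name):
    """DROP TABLE: always remove metadata; delete data only for managed tables."""
    entry = metastore.pop(name)
    if not entry["external"]:
        hdfs.discard(entry["location"])

drop_table("tweets")     # EXTERNAL: the data files survive
drop_table("retweets")   # managed: the data is removed too
print(sorted(hdfs))      # ['/data/tweets']
```

In both cases the metastore entry is gone, but only the managed table loses its underlying files, which is why EXTERNAL tables are the safe choice over data shared with other applications.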
Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment, or to add and replace columns. When adding columns, it is important to remember that only the metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table which didn't exist in older files, then, while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view.

File formats and storage

The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH. Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat, to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause.
Hive currently uses the following FileFormat classes to read and write HDFS files:

- TextInputFormat and HiveIgnoreKeyTextOutputFormat: These will read/write data in plain text file format
- SequenceFileInputFormat and SequenceFileOutputFormat: These classes read/write data in the Hadoop SequenceFile format

Additionally, the following SerDe classes can be used to serialize and deserialize data:

- MetadataTypedColumnsetSerDe: This will read/write delimited records such as CSV or tab-separated records
- ThriftSerDe and DynamicSerDe: These will read/write Thrift objects

JSON

As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe JSON SerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules. We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde. As with any third-party module, we load the SerDe JAR into Hive with the following code:

ADD JAR json-serde-1.3-jar-with-dependencies.jar;

Then, we issue the usual create statement, as follows:

CREATE EXTERNAL TABLE tweets (
    contributors string,
    coordinates struct <
        coordinates: array <float>,
        type: string>,
    created_at string,
    entities struct <
        hashtags: array <struct <
            indices: array <tinyint>,
            text: string>>,
…
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'tweets';

With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe').
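The struct mapping is easiest to picture outside Hive: once a JSON document is deserialized, Hive's dot notation corresponds to chained lookups into the nested structure. A small Python illustration (the sample tweet below is invented):

```python
import json

tweet = json.loads("""
{
  "created_at": "Mon Jan 06 00:00:00 +0000 2014",
  "text": "hello",
  "user": {"screen_name": "example_user", "description": "a sample account"},
  "entities": {"hashtags": [{"indices": [0, 5], "text": "hello"}]}
}
""")

# Hive's struct dot notation (user.screen_name) is the analogue of
# chained dictionary lookups on the deserialized document.
print(tweet["user"]["screen_name"])              # example_user
print(tweet["entities"]["hashtags"][0]["text"])  # hello
```

Arrays inside the document (such as hashtags) map to Hive arrays in the same way, indexed positionally.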
In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'. Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary. In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code:

SELECT user.screen_name, user.description FROM tweets_json LIMIT 10;

Avro

AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in the Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose. This dataset was created using Pig's AvroStorage class, which generated the following schema:

{
  "type": "record",
  "name": "record",
  "fields": [
    {"name": "topic", "type": ["null", "int"]},
    {"name": "source", "type": ["null", "int"]},
    {"name": "rank", "type": ["null", "float"]}
  ]
}

The structure is quite self-explanatory. The table structure is captured in an Avro record, which contains header information (a name and an optional namespace to qualify the name) and an array of fields. Each field is specified with its name and type as well as an optional documentation string. For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and it is the idiomatic way of handling columns that might have a null value.
Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema. With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:

CREATE EXTERNAL TABLE tweets_pagerank
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{
    "type":"record",
    "name":"record",
    "fields": [
        {"name":"topic","type":["null","int"]},
        {"name":"source","type":["null","int"]},
        {"name":"rank","type":["null","float"]}
    ]
}')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '${data}/ch5-pagerank';

Then, look at the resulting table definition from within Hive:

DESCRIBE tweets_pagerank;
OK
topic    int    from deserializer
source   int    from deserializer
rank     float  from deserializer

In the DDL, we told Hive that data is stored in the Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal. Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. Create the preceding schema in a file called pagerank.avsc; this is the standard file extension for Avro schemas. Then place it on HDFS; we want to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc').
If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, for example, on the Cloudera CDH5 VM:

ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar;

We can also use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank:

SELECT source, topic from tweets_pagerank WHERE rank >= 0.9;

We will see how Avro and avro.schema.url play an instrumental role in enabling schema migrations.

Columnar stores

Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats. If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile, each full row and all its columns will be read from the disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest. Traditional relational databases also store data on a row basis, and a type of database called columnar changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries.

Queries

Tables can be queried using the familiar SELECT … FROM statement. The WHERE statement allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records.
For instance, the following code returns the top 10 most prolific users in the dataset:

SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt DESC LIMIT 10

The following are the top 10 most prolific users in the dataset:

NULL        7091
1332188053  4
959468857   3
1367752118  3
362562944   3
58646041    3
2375296688  3
1468188529  3
37114209    3
2385040940  3

This allows us to identify the number of tweets, 7,091, with no user object. We can improve the readability of the hive output by setting the following:

SET hive.cli.print.header=true;

This will instruct hive, though not beeline, to print column names as part of the output. You can add the command to the .hiverc file, usually found in the root of the executing user's home directory, to have it apply to all hive CLI sessions. HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into Hive using external tables.
We first create a user table to store user data, as follows:

CREATE EXTERNAL TABLE user (
    created_at string,
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/users';

We then create a place table to store location data, as follows:

CREATE EXTERNAL TABLE place (
    place_id string,
    country_code string,
    country string,
    `name` string,
    full_name string,
    place_type string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/places';

We can use the JOIN operator to display the names of the 10 most prolific users, as follows:

SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt
FROM tweets
JOIN user ON user.user_id = tweets.user_id
GROUP BY tweets.user_id, user.user_id, user.name
ORDER BY cnt DESC LIMIT 10;

Only equality, outer, and left (semi) joins are supported in Hive. Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets table. Alternatively, we can rewrite the previous query as follows:

SELECT tweets.user_id, u.name, COUNT(*) AS cnt
FROM tweets
JOIN (SELECT user_id, name FROM user GROUP BY user_id, name) u
ON u.user_id = tweets.user_id
GROUP BY tweets.user_id, u.name
ORDER BY cnt DESC LIMIT 10;

Instead of directly joining the user table, we execute a subquery, as follows:

SELECT user_id, name FROM user GROUP BY user_id, name;

The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries; historically, it only permitted a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause as well.
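The aggregate-then-join pattern above has a simple in-memory analogue. The following Python sketch mirrors the GROUP BY, COUNT(*), and equality join on toy data (the IDs and names are invented):

```python
from collections import Counter

# Toy stand-ins for the tweets and user tables.
tweets = [("u1",), ("u1",), ("u2",), ("u1",), ("u3",)]
users = {"u1": "Alice", "u2": "Bob", "u3": "Carol"}

# GROUP BY user_id with COUNT(*), then an equality join against users,
# keeping only the single most prolific user (LIMIT 1).
counts = Counter(user_id for (user_id,) in tweets)
top = [(uid, users[uid], n) for uid, n in counts.most_common(1)]
print(top)  # [('u1', 'Alice', 3)]
```

Conceptually, Hive distributes exactly this kind of hash-based counting and key lookup across mappers and reducers rather than doing it in one process.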
HiveQL is an ever-evolving, rich language, a full exposition of which is beyond the scope of this article. A description of its query and DDL capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

Structuring Hive tables for given workloads

Often, Hive isn't used in isolation; instead, tables are created with particular workloads in mind, or invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios.

Partitioning a table

With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning. When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS. It's important to understand the cardinality of the partition column. With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered.
Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows:

CREATE TABLE partitioned_user (
    created_at string,
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
)
PARTITIONED BY (created_at_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

To load data into a partition, we can explicitly give a value for the partition in which to insert the data, as follows:

INSERT INTO TABLE partitioned_user
PARTITION( created_at_date = '2014-01-01')
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count
FROM user;

This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=5000;

The first two statements enable all partitions (the nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node.
We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:

INSERT INTO TABLE partitioned_user
PARTITION( created_at_date )
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) AS created_at_date
FROM user;

Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause. Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table, as in the preceding code. We use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string. Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01. If data is added directly to the filesystem, for instance, by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:

ALTER TABLE <table_name> ADD PARTITION (<partition_key>=<value>) LOCATION '<location>';

Using the MSCK REPAIR TABLE <table_name>; statement, all metadata for all partitions not currently present in the metastore will be added. On EMR, this is equivalent to executing the following code:

ALTER TABLE <table_name> RECOVER PARTITIONS;

Notice that both statements will also work with EXTERNAL tables. In the following article, we will see how this pattern can be exploited to create flexible and interoperable pipelines.

Overwriting and updating data

Partitioning is also useful when we need to update a portion of a table.
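Dynamic partitioning can be pictured as routing each row into a subdirectory derived from its key value. The following Python sketch is a toy model of that on-disk layout, not Hive code; the warehouse path and sample rows are assumptions made for illustration:

```python
import os
from collections import defaultdict

# Assumed warehouse location, mirroring the layout described above
warehouse = "/user/hive/warehouse/default/partitioned_user"

# Illustrative user rows; created_at begins with a YYYY-MM-DD date
rows = [
    {"user_id": "42", "created_at": "2014-04-01 10:00:00"},
    {"user_id": "43", "created_at": "2014-04-01 11:30:00"},
    {"user_id": "44", "created_at": "2014-04-02 09:15:00"},
]

# Stand-in for to_date(created_at) as the dynamic partition key:
# group rows by their date prefix
partitions = defaultdict(list)
for row in rows:
    key = row["created_at"][:10]  # YYYY-MM-DD
    partitions[key].append(row)

# One subdirectory per distinct key value, as Hive lays it out on HDFS
for key, part_rows in sorted(partitions.items()):
    path = os.path.join(warehouse, "created_at_date=" + key)
    print(path, len(part_rows))
```

Every distinct key value becomes its own key=value directory, which is also why partition-column cardinality matters: each new value creates another directory of files.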
Normally, a statement of the following form will replace all the data in the destination table:

INSERT OVERWRITE TABLE <table>…

If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched. If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement. We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month:

INSERT OVERWRITE TABLE partitioned_user
PARTITION (created_at_date)
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) AS created_at_date
FROM user
WHERE to_date(created_at) BETWEEN '2014-03-01' AND '2014-03-31';

Bucketing and sorting

Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the table. There is another mechanism, called bucketing, that can further segment how a table is stored, and it does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure.
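The partition-level overwrite semantics can be modeled as replacing a single entry in a mapping while leaving its siblings untouched. This Python sketch is a conceptual analogy, not Hive internals; the partition keys and row values are invented for illustration:

```python
# Table state as {partition_key: rows}; a partitioned INSERT OVERWRITE
# replaces only the targeted partition, leaving the rest intact.
table = {
    "2014-02-28": ["old_feb_row"],
    "2014-03-01": ["old_mar_row"],
}

# Incoming data covers only one partition
incoming = {"2014-03-01": ["new_mar_row_1", "new_mar_row_2"]}

# Equivalent of INSERT OVERWRITE ... PARTITION (created_at_date):
# per-key replacement rather than whole-table replacement
for key, rows in incoming.items():
    table[key] = rows  # overwrite this partition only

print(table)
```

February's partition survives untouched while March is fully replaced, which is exactly the "update a subset, keep the rest" behavior described above.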
Let's create bucketed versions of our tweets and user tables; note the additional CLUSTERED BY and SORTED BY clauses in the following CREATE TABLE statements:

CREATE TABLE bucketed_tweets (
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
)
PARTITIONED BY (created_at string)
CLUSTERED BY(user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

CREATE TABLE bucketed_user (
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
)
PARTITIONED BY (created_at string)
CLUSTERED BY(user_id) SORTED BY(name) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned. Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:

SET hive.enforce.bucketing=true;

Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table. When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause. One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example.
So, for example, any query of the following form would be vastly improved:

SET hive.optimize.bucketmapjoin=true;

SELECT …
FROM bucketed_user u
JOIN bucketed_tweets t
ON u.user_id = t.user_id;

With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id values in both tables. While determining which rows to match, only those in the corresponding bucket need to be compared, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.

Sampling data

Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size. For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:

SELECT max(friends_count)
FROM user
TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);

In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column.
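The bucket-compatibility claim for the 32- and 64-bucket case can be checked numerically. The sketch below uses a simple hash-mod allocation as a stand-in for Hive's bucketing function (an assumption for illustration, not Hive's actual hash implementation):

```python
# Allocate a key to a bucket the way Hive conceptually does: hash mod n.
def bucket_of(key_hash, num_buckets):
    return key_hash % num_buckets

# Every hash value landing in bucket 3 of a 32-bucket table lands in
# bucket 3 or 35 of a 64-bucket table, so a bucket map join only needs
# to compare those buckets rather than scanning the whole table.
targets = {bucket_of(h, 64) for h in range(10_000) if bucket_of(h, 32) == 3}
print(targets)  # {3, 35}
```

This is why the bucket counts must be identical or multiples of each other: only then does each bucket in one table map onto a small, predictable set of buckets in the other.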
It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function. Though successful, this is highly inefficient, as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. The following code is representative of this case:

SELECT MAX(friends_count)
FROM bucketed_user
TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);

In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition. A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:

TABLESAMPLE(0.5 PERCENT)
TABLESAMPLE(1G)
TABLESAMPLE(100 ROWS)

If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input format and file formats that are supported.

Writing scripts

We can place Hive commands in a file and run them with the -f option in the hive CLI utility:

$ cat show_tables.hql
show tables;
$ hive -f show_tables.hql

We can parameterize HiveQL statements by means of the hiveconf mechanism.
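Which physical buckets a TABLESAMPLE clause touches on a bucketed table can be worked out with a little modular arithmetic. The helper below is an illustrative model (1-based bucket labels and modulo allocation are simplifying assumptions), not Hive code:

```python
# Physical buckets (1-based) read by TABLESAMPLE(BUCKET x OUT OF y)
# against a table stored in table_buckets buckets on the same column.
def sampled_buckets(x, y, table_buckets):
    return [b for b in range(1, table_buckets + 1) if (b - 1) % y == (x - 1)]

# Sampling BUCKET 2 OUT OF 32 on a 64-bucket table reads two buckets
print(sampled_buckets(2, 32, 64))  # [2, 34]
```

With the sample fraction matching the table's bucket count exactly (2 out of 64 on a 64-bucket table), only a single bucket would be read, which is where the efficiency gain over full-table sampling comes from.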
This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example:

$ cat show_tables2.hql
show tables like '${hiveconf:TABLENAME}';
$ hive -hiveconf TABLENAME=user -f show_tables2.hql

The variable can also be set within the Hive script or an interactive session:

SET TABLENAME=user;

The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:

$ cat show_tables3.hql
show tables like '${hivevar:TABLENAME}';
$ hive -hivevar TABLENAME=user -f show_tables3.hql

Or we can set the variable interactively:

SET hivevar:TABLENAME=user;

Hive and Amazon Web Services

With ElasticMapReduce as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, be it within EMR or your own local cluster.

Hive and S3

It is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. But it doesn't have to be an all-or-nothing thing; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster to be processed, and any resulting data can either be written to a different S3 location (the same table cannot be the source and destination of a single query) or onto HDFS. We can take a file of our tweet data and place it onto a location in S3 with a command such as the following:

$ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/

We first need to specify the access key and secret access key that can access the bucket.
This can be done in three ways:

- Set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey to the appropriate values in the Hive CLI
- Set the same values in hive-site.xml, though note that this limits use of S3 to a single set of credentials
- Specify the table location explicitly in the table URL, that is, s3n://<access key>:<secret access key>@<bucket>/<path>

Then we can create a table referencing this data, as follows:

CREATE TABLE remote_tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
)
CLUSTERED BY(user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3n://<bucket-name>/tweets';

This can be an incredibly effective way of pulling S3 data into a local Hadoop cluster for processing. In order to use AWS credentials in the URI of an S3 location, regardless of how the parameters are passed, the secret and access keys must not contain /, +, or = characters. If necessary, a new set of credentials can be generated from the IAM console at https://console.aws.amazon.com/iam/. In theory, you can just leave the data in the external table and refer to it when needed to avoid WAN data transfer latencies (and costs), even though it often makes sense to pull the data into a local table and do future processing from there. If the table is partitioned, then you might find yourself retrieving a new partition each day, for example.

Hive on ElasticMapReduce

On one level, using Hive within Amazon ElasticMapReduce is just the same as everything discussed in this article. You can create a persistent cluster, log in to the master node, and use the Hive CLI to create tables and submit queries. Doing all this will use the local storage on the EC2 instances for the table data. Not surprisingly, jobs on EMR clusters can also refer to tables whose data is stored on S3 (or DynamoDB). And, not surprisingly, Amazon has made extensions to its version of Hive to make all this very seamless.
It is quite simple from within an EMR job to pull data from a table stored in S3, process it, write any intermediate data to the EMR local storage, and then write the output results into S3, DynamoDB, or one of a growing list of other AWS services. The pattern mentioned earlier, where new data is added to a new partition directory for a table each day, has proved very effective in S3; it is often the storage location of choice for large and incrementally growing datasets. There is a syntax difference when using EMR; instead of the MSCK command mentioned earlier, the command to update a Hive table with new data added to a partition directory is as follows:

ALTER TABLE <table-name> RECOVER PARTITIONS;

Consult the EMR documentation for the latest enhancements at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html. Also, consult the broader EMR documentation. In particular, the integration points with other AWS services is an area of rapid growth.

Extending HiveQL

The HiveQL language can be extended by means of plugins and third-party functions. In Hive, there are three types of functions, characterized by the number of rows they take as input and produce as output:

- User Defined Functions (UDFs): These are simpler functions that act on one row at a time.
- User Defined Aggregate Functions (UDAFs): These functions take multiple rows as input and generate a single row as output. These are aggregate functions to be used in conjunction with a GROUP BY statement (similar to COUNT(), AVG(), MIN(), MAX(), and so on).
- User Defined Table Functions (UDTFs): These take multiple rows as input and generate a logical table comprised of multiple rows that can be used in join expressions.

These APIs are provided only in Java. For other languages, it is possible to stream data through a user-defined script using the TRANSFORM, MAP, and REDUCE clauses that act as a frontend to Hadoop's streaming capabilities.
Two APIs are available to write UDFs. A simple API, org.apache.hadoop.hive.ql.exec.UDF, can be used for functions that read and return basic writable types. A richer API, which provides support for data types other than writable, is available in the org.apache.hadoop.hive.ql.udf.generic.GenericUDF package. We'll now illustrate how org.apache.hadoop.hive.ql.exec.UDF can be used to implement a string to ID function similar to the one we used in Iterative Computation with Spark to map hashtags to integers in Pig. Building a UDF with this API only requires extending the UDF class and writing an evaluate() method, as follows:

public class StringToInt extends UDF {
    public Integer evaluate(Text input) {
        if (input == null)
            return null;
        String str = input.toString();
        return str.hashCode();
    }
}

The function takes a Text object as input and maps it to an integer value with the hashCode() method. The source code of this function can be found at https://github.com/learninghadoop2/book-examples/ch7/udf/com/learninghadoop2/hive/udf/StringToInt.java. A more robust hash function should be used in production. We compile the class and archive it into a JAR file, as follows:

$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/udf/StringToInt.java
$ jar cvf myudfs-hive.jar com/learninghadoop2/hive/udf/StringToInt.class

Before being able to use it, a UDF must be registered in Hive with the following commands:

ADD JAR myudfs-hive.jar;
CREATE TEMPORARY FUNCTION string_to_int AS 'com.learninghadoop2.hive.udf.StringToInt';

The ADD JAR statement adds a JAR file to the distributed cache. The CREATE TEMPORARY FUNCTION <function> AS <class> statement registers a function in Hive that implements a given Java class. The function will be dropped once the Hive session is closed.
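Because evaluate() delegates to Java's String.hashCode(), the IDs produced by string_to_int can be reproduced outside the JVM, which is handy for sanity-checking a lookup table. The following Python sketch reimplements that hash for basic (BMP) strings; it is an illustration, not part of the book's example code:

```python
def java_string_hashcode(s):
    """Replicate Java's String.hashCode(): h = 31*h + char, in 32-bit signed math."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the low 32 bits
    # Reinterpret as a signed 32-bit integer, as Java would
    return h - 0x100000000 if h >= 0x80000000 else h

print(java_string_hashcode("hello"))  # 99162322, matching Java
```

Note that ord() matches Java's char values only for characters in the Basic Multilingual Plane; supplementary characters would need surrogate-pair handling. Like the UDF itself, this hash is fine for a demonstration but too collision-prone for production IDs.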
As of Hive 0.13, it is possible to create permanent functions whose definition is kept in the metastore using CREATE FUNCTION … . Once registered, StringToInt can be used in a Hive query just as any other function. In the following example, we first extract a list of hashtags from the tweet's text by applying the regexp_extract function. Then, we use string_to_int to map each tag to a numerical ID:

SELECT unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag) AS tag_id
FROM (
    SELECT regexp_extract(text, '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
    FROM tweets
    GROUP BY regexp_extract(text, '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag);

We can use the preceding query to create a lookup table, as follows:

CREATE TABLE lookuptable (tag string, tag_id bigint);

INSERT OVERWRITE TABLE lookuptable
SELECT unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag) AS tag_id
FROM (
    SELECT regexp_extract(text, '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
    FROM tweets
    GROUP BY regexp_extract(text, '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag);

Programmatic interfaces

In addition to the hive and beeline command-line tools, it is possible to submit HiveQL queries to the system via the JDBC and Thrift programmatic interfaces. Support for ODBC was bundled in older versions of Hive, but as of Hive 0.12, it needs to be built from scratch. More information on this process can be found at https://cwiki.apache.org/confluence/display/Hive/HiveODBC.

JDBC

A Hive client written using JDBC APIs looks exactly the same as a client program written for other database systems (for example, MySQL). The following is a sample Hive client program using JDBC APIs.
The source code for this example can be found at https://github.com/learninghadoop2/book-examples/ch7/clients/com/learninghadoop2/hive/client/HiveJdbcClient.java.

public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    // connection string
    public static String URL = "jdbc:hive2://localhost:10000";

    // Show all tables in the default database
    public static String QUERY = "show tables";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection(URL);
        Statement stmt = con.createStatement();
        ResultSet resultSet = stmt.executeQuery(QUERY);
        while (resultSet.next()) {
            System.out.println(resultSet.getString(1));
        }
    }
}

The URL part is the JDBC URI that describes the connection end point. The format for establishing a remote connection is jdbc:hive2://<host>:<port>/<database>. Connections in embedded mode can be established by not specifying a host or port, as in jdbc:hive2://. The hive and hive2 parts are the drivers to be used when connecting to HiveServer and HiveServer2, respectively. The QUERY statement contains the HiveQL query to be executed. Hive's JDBC interface exposes only the default database. In order to access other databases, you need to reference them explicitly in the underlying queries using the <database>.<table> notation. First, we load the HiveServer2 JDBC driver org.apache.hive.jdbc.HiveDriver. Use org.apache.hadoop.hive.jdbc.HiveDriver to connect to HiveServer. Then, as with any other JDBC program, we establish a connection to URL and use it to instantiate a Statement class. We execute QUERY, with no authentication, and store the output dataset into the ResultSet object.
Finally, we scan resultSet and print its content to the command line. Compile and execute the example with the following commands:

$ javac -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/client/HiveJdbcClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar:. com.learninghadoop2.hive.client.HiveJdbcClient

Thrift

Thrift provides lower-level access to Hive and has a number of advantages over the JDBC implementation of HiveServer. Primarily, it allows multiple connections from the same client, and it allows programming languages other than Java to be used with ease. With HiveServer2, it is a less commonly used option, but still worth mentioning for compatibility. A sample Thrift client implemented using the Java API can be found at https://github.com/learninghadoop2/book-examples/ch7/clients/com/learninghadoop2/hive/client/HiveThriftClient.java. This client can be used to connect to HiveServer, but due to protocol differences, the client won't work with HiveServer2. In the example, we define a getClient() method that takes as input the host and port of a HiveServer service and returns an instance of org.apache.hadoop.hive.service.ThriftHive.Client.
A client is obtained by first instantiating a socket connection, org.apache.thrift.transport.TSocket, to the HiveServer service, and by specifying a protocol, org.apache.thrift.protocol.TBinaryProtocol, to serialize and transmit data, as follows:

TSocket transport = new TSocket(host, port);
transport.setTimeout(TIMEOUT);
transport.open();
TBinaryProtocol protocol = new TBinaryProtocol(transport);
client = new ThriftHive.Client(protocol);

Finally, we call getClient() from the main method and use the client to execute a query against an instance of HiveServer running on localhost on port 11111, as follows:

public static void main(String[] args) throws Exception {
    Client client = getClient("localhost", 11111);
    client.execute("show tables");
    List<String> results = client.fetchAll();
    for (String result : results) {
        System.out.println(result);
    }
}

Make sure that HiveServer is running on port 11111, and if not, start an instance with the following command:

$ sudo hive --service hiveserver -p 11111

Compile and execute the HiveThriftClient.java example with the following commands:

$ javac -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/client/HiveThriftClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:. com.learninghadoop2.hive.client.HiveThriftClient

Stinger initiative

Hive has remained very successful and capable since its earliest releases, particularly in its ability to provide SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation of being relatively slow, particularly in regard to lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries.
These perceived limitations were less due to Hive itself and more a consequence of how the translation of SQL queries into the MapReduce model has much built-in inefficiency when compared to other ways of implementing a SQL query. Particularly in regard to very large datasets, MapReduce saw lots of I/O (and consequently time) spent writing out the results of one MapReduce job just to have them read by another. As discussed in Processing - MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule tasks on a Hadoop cluster as a graph of tasks that does not require inefficient writes and reads between tasks in the graph. The following is a query to be compared on the MapReduce framework versus Tez:

SELECT a.country, COUNT(b.place_id)
FROM place a
JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;

The following figure contrasts the execution plan for the preceding query on the MapReduce framework versus Tez:

Hive on MapReduce versus Tez

In plain MapReduce, two jobs are created for the GROUP BY and JOIN clauses. The first job is composed of a set of MapReduce tasks that read data from the disk to carry out grouping. The reducers write intermediate results to the disk so that output can be synchronized. The mappers in the second job read the intermediate results from the disk as well as data from table b. The combined dataset is then passed to the reducer where shared keys are joined. Were we to execute an ORDER BY statement, this would have resulted in a third job and further MapReduce passes. The same query is executed on Tez as a single job by a single set of Map tasks that read data from the disk; grouping and joining are pipelined across reducers, avoiding the intermediate disk I/O. Alongside these architectural limitations, there were quite a few areas around SQL language support that could also provide better efficiency, and in early 2013, the Stinger initiative was launched with an explicit goal of making Hive over 100 times as fast and with much richer SQL support.
Hive 0.13 has all the features of the three phases of Stinger, resulting in a much more complete SQL dialect. Also, Tez is offered as an execution framework in addition to a more efficient MapReduce-based implementation atop YARN. With Tez as the execution engine, Hive is no longer limited to a series of linear MapReduce jobs and can instead build a processing graph where any given step can, for example, stream results to multiple sub-steps. To take advantage of the Tez framework, there is a new Hive variable setting, as follows:

SET hive.execution.engine=tez;

This setting relies on Tez being installed on the cluster; it is available in source form from http://tez.incubator.apache.org or in several distributions, though at the time of writing, not Cloudera, due to its support of Impala. The alternative value is mr, which uses the classic MapReduce model (atop YARN), so it is possible in a single installation to compare the performance of Hive with and without Tez.

Impala

Hive is not the only product providing the SQL-on-Hadoop capability. The second most widely used is likely Impala, announced in late 2012 and released in spring 2013. Though originally developed internally within Cloudera, its source code is periodically pushed to an open source Git repository (https://github.com/cloudera/impala). Impala was created out of the same perception of Hive's weaknesses that led to the Stinger initiative. Impala also took some inspiration from Google Dremel (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf), which was first openly described in a paper published in 2009. Dremel was built at Google to address the gap between the need for very fast queries on very large datasets and the high latency inherent in the existing MapReduce model underpinning Hive at the time.
Dremel was a sophisticated approach to this problem that, rather than building mitigations atop MapReduce such as those implemented by Hive, instead created a new service that accessed the same data stored in HDFS. Dremel also benefited from significant work to optimize the storage format of the data in a way that made it more amenable to very fast analytic queries.

The architecture of Impala

The basic architecture has three main components: the Impala daemons, the state store, and the clients. Recent versions have added additional components that improve the service, but we'll focus on the high-level architecture. The Impala daemon (impalad) should be run on each host where a DataNode process is managing HDFS data. Note that impalad does not access the filesystem blocks through the full HDFS FileSystem API; instead, it uses a feature called short-circuit reads to make data access more efficient. When a client submits a query, it can do so to any of the running impalad processes, and this one will become the coordinator for the execution of that query. The key aspect of Impala's performance is that for each query, it generates custom native code, which is then pushed to and executed by all the impalad processes on the system. This highly optimized code performs the query on the local data, and each impalad then returns its subset of the result set to the coordinator node, which performs the final data consolidation to produce the final result. This type of architecture should be familiar to anyone who has worked with any of the (usually commercial and expensive) Massively Parallel Processing (MPP) data warehouse solutions available today, MPP being the term used for this type of shared scale-out architecture. As the cluster runs, the state store ensures that each impalad process is aware of all the others and provides a view of the overall cluster health.
Co-existing with Hive

Impala, as a newer product, tends to have a more restricted set of SQL data types and supports a more constrained dialect of SQL than Hive. It is, however, expanding this support with each new release. Refer to the Impala documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/impala.html) to get an overview of the current level of support. Impala supports the metastore mechanism used by Hive as a store of the metadata surrounding its table structure and storage. This means that on a cluster with an existing Hive setup, it should be immediately possible to use Impala, as it will access the same metastore and therefore provide access to the same tables available in Hive. But be warned that the differences in SQL dialect and data types might cause unexpected results when working in a combined Hive and Impala environment. Some queries might work on one but not the other, show very different performance characteristics (more on this later), or actually give different results. This last point might become apparent when using data types such as float and double that are simply treated differently in the underlying systems (Hive is implemented in Java while Impala is written in C++). As of version 1.2, Impala supports UDFs written in both C++ and Java, although C++ is strongly recommended as a much faster solution. Keep this in mind if you are looking to share custom functions between Hive and Impala.

A different philosophy

When Impala was first released, its greatest benefit was in how it truly enabled what is often called speed-of-thought analysis. Queries could be returned sufficiently fast that an analyst could explore a thread of analysis in a completely interactive fashion, without having to wait for minutes at a time for each query to complete.
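The shared-metastore co-existence described above can be sketched as follows. This is a hedged illustration (the table definition is invented), showing a table created through Hive and then queried from impala-shell after asking Impala to reload the shared metastore via its INVALIDATE METADATA statement:

```sql
-- In Hive: create a table; its definition is recorded in the shared metastore.
CREATE TABLE page_views (user_id BIGINT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- In impala-shell: reload the metastore so the new table becomes visible,
-- then query the same underlying HDFS data that Hive sees.
INVALIDATE METADATA;
SELECT COUNT(*) FROM page_views;
```

Remember the caveat above: a query that parses in one engine might fail or behave differently in the other, so test shared tables from both sides.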
It's fair to say that most adopters of Impala were at times stunned by its performance, especially when compared to the version of Hive shipping at the time. The Impala focus has remained mostly on these shorter queries, and this does impose some limitations on the system. Impala tends to be quite memory-heavy as it relies on in-memory processing to achieve much of its performance. If a query requires more data to be held in memory than is available on the executing node, then that query will simply fail in versions of Impala before 2.0. Comparing the work on Stinger to Impala, it could be argued that Impala has a much stronger focus on excelling in the shorter (and arguably more common) queries that support interactive data analysis. Many business intelligence tools and services are now certified to run directly on Impala. The Stinger initiative has put less effort into making Hive just as fast in the area where Impala excels but has instead improved Hive (to varying degrees) for all workloads. Impala is still developing at a fast pace and Stinger has put additional momentum into Hive, so it is most likely wise to consider both products and determine which best meets the performance and functionality requirements of your projects and workflows. It should also be kept in mind that there are competitive commercial pressures shaping the direction of Impala and Hive. Impala was created and is still driven by Cloudera, the most popular vendor of Hadoop distributions. The Stinger initiative, though contributed to by many companies as diverse as Microsoft (yes, really!) and Intel, was led by Hortonworks, probably the second largest vendor of Hadoop distributions. The fact is that if you are using the Cloudera distribution of Hadoop, then some of the core features of Hive might be slower to arrive, whereas Impala will always be up-to-date.
Conversely, if you use another distribution, you might get the latest Hive release, but that might either have an older Impala or, as is currently the case, require you to download and install Impala yourself. A similar situation has arisen with the Parquet and ORC file formats mentioned earlier. Parquet is preferred by Impala and developed by a group of companies led by Cloudera, while ORC is preferred by Hive and is championed by Hortonworks. Unfortunately, the reality is that Parquet support is often very quick to arrive in the Cloudera distribution but less so in, say, the Hortonworks distribution, where the ORC file format is preferred. These themes are a little concerning since, although competition in this space is a good thing, and arguably the announcement of Impala helped energize the Hive community, there is a greater risk than in the past that your choice of distribution will have a large impact on the tools and file formats that are fully supported. Hopefully, the current situation is just an artifact of where we are in the development cycles of all these new and improved technologies, but do consider your choice of distribution carefully in relation to your SQL-on-Hadoop needs.

Drill, Tajo, and beyond

You should also consider that SQL on Hadoop no longer only refers to Hive or Impala. Apache Drill (http://drill.incubator.apache.org) is a fuller implementation of the Dremel model first described by Google. Although Impala implements the Dremel architecture across HDFS data, Drill looks to provide similar functionality across multiple data sources. It is still in its early stages, but if your needs are broader than what Hive or Impala provides, it might be worth considering. Tajo (http://tajo.apache.org) is another Apache project that seeks to be a full data warehouse system on Hadoop data.
With an architecture similar to that of Impala, it offers a much richer system with components such as multiple optimizers and ETL tools that are commonplace in traditional data warehouses but less frequently bundled in the Hadoop world. It has a much smaller user base but has been used by certain companies very successfully for a significant length of time, and might be worth considering if you need a fuller data warehousing solution. Other products are also emerging in this space, and it's a good idea to do some research. Hive and Impala are awesome tools, but if you find that they don't meet your needs, then look around—something else might.

Summary

In its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world. HiveQL is an implementation of SQL on Hadoop and was the primary focus of this article.
In regard to HiveQL and its implementations, we covered the following topics:

- How HiveQL provides a logical model atop data stored in HDFS, in contrast to relational databases, where the table structure is enforced in advance
- How HiveQL supports many standard SQL data types and commands, including joins and views
- The ETL-like features offered by HiveQL, including the ability to import data into tables and optimize the table structure through partitioning and similar mechanisms
- How HiveQL offers the ability to extend its core set of operators with user-defined code, and how this contrasts with the Pig UDF mechanism
- The recent history of Hive developments, such as the Stinger initiative, which have seen Hive transition to an updated implementation that uses Tez
- The broader ecosystem around HiveQL that now includes products such as Impala, Tajo, and Drill, and how each of these focuses on specific areas in which to excel

With Pig and Hive, we've introduced alternative models to process MapReduce data, but so far we've not looked at another question: what approaches and tools are required to actually allow this massive dataset being collected in Hadoop to remain useful and manageable over time? In the next article, we'll take a slight step up the abstraction hierarchy and look at how to manage the life cycle of this enormous data asset.

Resources for Article:

Further resources on this subject:

- Big Data Analysis [Article]
- Understanding MapReduce [Article]
- Amazon DynamoDB - Modelling relationships, Error handling [Article]
Playing with Swift

Packt
23 Dec 2014
23 min read
Xcode ships with both a command line interpreter and a graphical interface called playground that can be used to prototype and test Swift code snippets. Code typed into the playground is compiled and executed interactively, which permits a fluid style of development. In addition, the user interface can present a graphical view of variables as well as a timeline, which can show how loops are executed. Finally, playgrounds can mix and match code and documentation, leading to the possibility of providing example code as playgrounds and using playgrounds to learn how to use existing APIs and frameworks. This article by Alex Blewitt, the author of Swift Essentials, will present the following topics:

- How to create a playground
- Displaying values in the timeline
- Presenting objects with Quick Look
- Running asynchronous code
- Using playground live documentation
- Generating playgrounds with Markdown and AsciiDoc
- Limitations of playgrounds

(For more resources related to this topic, see here.)

Getting started with playgrounds

When Xcode is started, a welcome screen is shown with various options, including the ability to create a playground. Playgrounds can also be created from the File | New | Playground menu.

Creating a playground

Using either the Xcode welcome screen (which can be opened by navigating to Window | Welcome to Xcode) or navigating to File | New | Playground, create MyPlayground in a suitable location targeting iOS. Creating the playground on the Desktop will allow easy access to test Swift code, but it can be located anywhere on the filesystem. Playgrounds can be targeted either towards OS X applications or towards iOS applications. This can be configured when the playground is created, or by switching to the Utilities view by navigating to View | Utilities | Show File Inspector or pressing Command + Option + 1 and changing the dropdown from OS X to iOS or vice versa.
When initially created, the playground will have a code snippet that looks as follows:

// Playground - noun: a place where people can play

import UIKit

var str = "Hello, playground"

Playgrounds targeting OS X will read import Cocoa instead. On the right-hand side, a column will show the value of the code when each line is executed. In the previous example, the word Hello, playgr... is seen, which is the result of the string assignment. By grabbing the vertical divider between the Swift code and the output, the output can be resized to show the full text message. Alternatively, by moving the mouse over the right-hand side of the playground, the Quick Look icon (the eye symbol) will appear; if clicked on, a pop-up box will show the full details.

Viewing the console output

The console output can be viewed on the right-hand side by opening the Assistant Editor. This can be opened by pressing Command + Option + Enter or by navigating to View | Assistant Editor | Show Assistant Editor. This will show the result of any println statements executed in the code. Add a simple for loop to the playground and show the Assistant Editor:

for i in 1...12 {
    println("I is \(i)")
}

The output is shown on the right-hand side. The assistant editor can be configured to be displayed in different locations, such as at the bottom, or stacked horizontally or vertically, by navigating to the View | Assistant Editor menu.

Viewing the timeline

The timeline shows what other values are displayed as a result of executing the code. In the case of the print loop shown previously, the output was displayed as Console Output in the timeline. However, it is possible to use the playground to inspect the value of an expression on a line, without having to display it directly. In addition, results can be graphed to show how the values change over time.
Add another line above the println statement to calculate the result of executing an expression, (i-7) * (i-6), and store it in a variable, j:

for i in 1...12 {
    var j = (i-7) * (i-6)
    println("I is \(i)")
}

On the line next to the variable definition, click on the add variable history symbol (+), which is in the right-hand column (visible when the mouse moves over that area). After it is clicked on, it will change to a (o) symbol and display the graph on the right-hand side. The same can be done for the println statement as well. The slider at the bottom, indicated by the red tick mark, can be used to slide the vertical bar to see the exact value at certain points. To show several values at once, use additional variables to hold the values and display them in the timeline as well:

for i in 1...12 {
    var j = (i-7) * (i-6)
    var k = i
    println("I is \(i)")
}

When the slider is dragged, both values will be shown at the same time.

Displaying objects with QuickLook

The playground timeline can display objects as well as numbers and simple strings. It is possible to load and view images in a playground using classes such as UIImage (or NSImage on OS X). These are known as QuickLook supported objects, and by default include:

- Strings (attributed and unattributed)
- Views
- Class and struct types (members are shown)
- Colors

It is possible to build support for custom types in Swift by implementing a debugQuickLookObject method that returns a graphical view of the data.

Showing colored labels

To show a colored label, a color needs to be obtained first. When building against iOS, this will be UIColor; but when building against OS X, it will be NSColor. The methods and types are largely equivalent between the two, but this article will focus on the iOS types.
A color can be acquired with an initializer or by using one of the predefined colors that are exposed in Swift using methods:

import UIKit // AppKit for OS X
let blue = UIColor.blueColor() // NSColor.blueColor() for OS X

The color can be used in a UILabel, which displays a text string in a particular size and color. The UILabel needs a size, which is represented by a CGRect, and can be defined with an x and y position along with a width and height. The x and y positions are not relevant for playgrounds and so can be left as zero:

let size = CGRect(x:0,y:0,width:200,height:100)
let label = UILabel(frame:size) // NSTextField for OS X

Finally, the text needs to be displayed in blue and with a larger font size:

label.text = str // from the first line of the code
label.textColor = blue
label.font = UIFont.systemFontOfSize(24) // NSFont for OS X

When the playground is run, the color and font are shown in the timeline and available for quick view. Even though the same UILabel instance is being shown, the timeline and the QuickLook values show a snapshot of the state of the object at each point, making it easy to see what has happened between changes.

Showing images

Images can be created and loaded into a playground using the UIImage constructor (or NSImage on OS X). Both take a named argument, which is used to find an image with the given name from the playground's Resources folder. To download a logo, open Terminal.app and run the following commands:

$ mkdir MyPlayground.playground/Resources
$ curl http://alblue.bandlem.com/images/AlexHeadshotLeft.png > MyPlayground.playground/Resources/logo.png

An image can now be loaded in Swift with:

let logo = UIImage(named:"logo")

The location of the Resources associated with a playground can be seen in the File Inspector utilities view, which can be opened by pressing Command + Option + 1.
The loaded image can be displayed using QuickLook or by adding it to the value history. It is possible to use a URL to acquire an image by creating an NSURL with NSURL(string:"http://..."), then loading the contents of the URL with NSData(contentsOfURL:), and finally using UIImage(data:) to convert it to an image. However, as Swift will keep re-executing the code over and over again, the URL will be hit multiple times in a single debugging session without caching. It is recommended that NSData(contentsOfURL:) and similar networking classes be avoided in playgrounds.

Advanced techniques

The playground has its own framework, XCPlayground, which can be used to perform certain tasks. For example, individual values can be captured during loops for later analysis. It also permits asynchronous code to continue to execute once the playground has finished running.

Capturing values explicitly

It is possible to explicitly add values to the timeline by importing the XCPlayground framework and calling XCPCaptureValue with a value that should be displayed in the timeline. This takes an identifier, which is used both as the title and for grouping related data values in the same series. When the value history button is selected, it essentially inserts a call to XCPCaptureValue with the value of the expression as the identifier. For example, to add the logo to the timeline automatically:

import XCPlayground

XCPCaptureValue("logo",logo)

It is possible to use an identifier to group the data that is being shown in a loop, with the identifier representing categories of the values.
For example, to display a list of all even and odd numbers between 1 and 6, the following code could be used:

for n in 1...6 {
  if n % 2 == 0 {
    XCPCaptureValue("even",n)
    XCPCaptureValue("odd",0)
  } else {
    XCPCaptureValue("odd",n)
    XCPCaptureValue("even",0)
  }
}

The result, when executed, will show the two series in the timeline.

Running asynchronous code

By default, when the execution hits the bottom of the playground, the execution stops. In most cases, this is desirable, but when asynchronous code is involved, execution might need to run even if the main code has finished executing. This might be the case if networking data is involved or if there are multiple tasks whose results need to be synchronized. For example, wrapping the previous even/odd split in an asynchronous call will result in no data being displayed:

dispatch_async(dispatch_get_main_queue()) {
  for n in 1...6 {
    // as before
  }
}

This uses one of Swift's language features: the dispatch_async method is actually a two-argument method that takes a queue and a block type. However, if the last argument is a block type, then it can be represented as a trailing closure rather than an argument. To allow the playground to continue executing after reaching the bottom, add the following call:

XCPSetExecutionShouldContinueIndefinitely()

Although this suggests that the execution will run forever, it is limited to 30 seconds of runtime, or whatever is the value displayed at the bottom-right corner of the screen. This timeout can be changed by typing in a new value or using the + and – buttons to increase/decrease time by one second.

Playgrounds and documentation

Playgrounds can contain a mix of code and documentation. This allows a set of code samples and explanations to be mixed in with the playground itself. Although there is no way of using Xcode to add sections in the UI at present, the playground bundle contains an XML table of contents that can be edited using an external text editor such as TextEdit.app.
Learning with playgrounds

As playgrounds can contain a mixture of code and documentation, they make an ideal format for viewing annotated code snippets. In fact, Apple's Swift Tour book can be opened as a playground file. Xcode documentation can be searched by navigating to the Help | Documentation and API Reference menu, or by pressing Command + Shift + 0. In the search field that is presented, type Swift Tour and then select the first result. The Swift Tour book should be presented in Xcode's help system. A link to download and open the documentation as a playground is given in the first section; if this is downloaded, it can be opened in Xcode as a standalone playground. This provides the same information, but allows the code examples to be dynamic and show the results in the window. A key advantage of learning through playground-based documentation is that the code can be experimented with. In the Simple Values section of the documentation, where myVariable is assigned, the right-hand side of the playground shows the values. If the literal numbers are changed, the new values will be recalculated and shown on the right-hand side. Some examples are presented solely in playground form; for example, the Balloons demo, which was used in the introduction of Swift in the WWDC 2014 keynote, is downloadable as a playground from https://developer.apple.com/swift/resources/. Note that the Balloons playground requires OS X 10.10 and Xcode 6.1 to run.

Understanding the playground format

A playground is an OS X bundle, which means that it is a directory that looks like a single file.
If a playground is selected either in TextEdit.app or in Finder, then it looks like a regular file. Under the covers, it is actually a directory:

$ ls -F
MyPlayground.playground/

Inside the directory, there are a number of files:

$ ls -1 MyPlayground.playground/*
MyPlayground.playground/Resources
MyPlayground.playground/contents.xcplayground
MyPlayground.playground/section-1.swift
MyPlayground.playground/timeline.xctimeline

The files are as follows:

- The Resources directory, which was created earlier to hold the logo image
- The contents.xcplayground file, which is an XML table of contents of the files that make up the playground
- The section-1.swift file, which is the Swift file created by default when a new playground is created, and contains the code that is typed in for any new playground content
- The timeline.xctimeline file, which is an automatically generated file containing timestamps of execution, which the runtime generates when executing a Swift file and the timeline is open

The table of contents file defines which runtime environment is being targeted (for example, iOS or OS X), a list of sections, and a reference to the timeline file:

<playground version='3.0' sdk='iphonesimulator'>
 <sections>
    <code source-file-name='section-1.swift'/>
 </sections>
 <timeline fileName='timeline.xctimeline'/>
</playground>

This file can be edited to add new sections, provided that it is not open in Xcode at the same time. An Xcode playground directory is deleted and recreated whenever changes are made in Xcode. Any Terminal.app windows open in that directory will no longer show any files. As a result, using external tools and editing the files in place might result in changes being lost. In addition, if you are using older version control systems, such as SVN and CVS, you might find your version control metadata being wiped out between saves. Xcode ships with the industry standard Git version control system, which should be preferred instead.
Adding a new documentation section

To add a new documentation section, ensure that the playground is not open in Xcode and then edit the contents.xcplayground file. The file itself can be opened by right-clicking on the playground in Finder and choosing Show Package Contents. This will open up a new Finder window, with the contents displayed as a top-level set of elements. The individual files can then be opened for editing by right-clicking on the contents.xcplayground file, choosing Open With | Other..., and selecting an application, such as TextEdit.app. Alternatively, the file can be edited from the command line using an editor such as pico, vi, or emacs. Although there are few technology debates more contentious than whether vi or emacs is better, the recommended advice is to learn how to be productive in at least one of them. Like learning to touch-type, being productive in a command-line editor is something that will pay dividends in the future if the initial learning challenge can be overcome. For those who don't have time, pico (also known as nano) can be a useful tool in command-line situations, and the on-screen help makes it easier to learn to use. Note that the caret symbol (^) means control, so ^X means Control + X. To add a new documentation section, create a directory called Documentation, and inside it, create a file called hello.html. The HTML file is an HTML5 document, with a declaration and a body.
A minimal file looks like:

<!DOCTYPE html>
<html>
<body>
   <h1>Welcome to Swift Playground</h1>
</body>
</html>

The content needs to be added to the table of contents (contents.xcplayground) in order to display it in the playground itself, by adding a documentation element under the sections element:

<playground version='3.0' sdk='iphonesimulator'>
 <sections>
    <code source-file-name='section-1.swift'/>
    <documentation relative-path='hello.html'/>
 </sections>
 <timeline fileName='timeline.xctimeline'/>
</playground>

The relative-path attribute is relative to the Documentation directory. All content in the Documentation directory is copied between saves in the timeline and can be used to store other text content such as CSS files. Binary content, including images, should be stored in the Resources directory. When viewed as a playground, the content will be shown in the same window as the documentation. If the content is truncated in the window, then a horizontal rule can be added at the bottom with <hr/>, or the documentation can be styled, as shown in the next section.

Styling the documentation

As the documentation is written in HTML, it is possible to style it using CSS. For example, the background of the documentation is transparent, which results in the text overlapping both the margins as well as the output. To add a style sheet to the documentation, create a file called stylesheet.css in the Documentation directory and add the following content:

body { background-color: white }

To add the style sheet to the HTML file, add a style sheet link reference to the head element in hello.html:

<head>
 <link rel="stylesheet" type="text/css" href="stylesheet.css"/>
</head>

Now when the playground is opened, the text will have a solid white background and will not be obscured by the margins.

Adding resources to a playground

Images and other resources can also be added to a playground.
Resources need to be added to a directory called Resources, which is copied as is between different versions of the playground. To add an image to the document, create a Resources folder and then insert an image. For example, earlier in this article, an image was downloaded by using the following commands:

$ mkdir MyPlayground.playground/Resources
$ curl http://alblue.bandlem.com/images/AlexHeadshotLeft.png > MyPlayground.playground/Resources/logo.png

The image can then be referred to in the documentation using an img tag and a relative path from the Documentation directory:

<img src="../Resources/logo.png" alt="Logo"/>

Other supported resources (such as JPEG and GIF) can be added to the Resources folder as well. It is also possible to add other content (such as a ZIP file of examples) to the Resources folder and provide hyperlinks from the documentation to the resource files:

<a href="../Resources/AlexBlewitt.vcf">Download contact card</a>

Additional entries in the header

The previous example showed the minimum amount of content required for playground documentation. However, there are other meta elements that can be added to the document that have specific purposes and which might be found in other playground examples on the internet. Here is a more comprehensive example of using meta elements:

<!DOCTYPE html>
<html lang="en">
<head>
   <meta charset="utf-8"/>
   <link rel="stylesheet" type="text/css" href="stylesheet.css"/>
   <title>Welcome to Swift Playground</title>
   <meta name="xcode-display" content="render"/>
   <meta name="apple-mobile-web-app-capable" content="yes"/>
   <meta name="viewport" content="width=device-width,maximum-scale=1.0"/>
</head>
<body>...</body>
</html>

In this example, the document is declared as being written in English (lang="en" on the html element) and in the UTF-8 character set.
The <meta charset="utf-8"/> should always be the first element in the HTML head section, and the UTF-8 encoding should always be preferred for writing documents. If this is missed, it will default to a different encoding, such as ISO-8859-1, which can lead to strange characters appearing. Always use UTF-8 for writing HTML documents. The link and title are standard HTML elements that associate the style sheet (from before) and the title of the document. The title is not displayed in Xcode, but it can be shown if the HTML document is opened in a browser instead. As the documentation is reusable between playgrounds and the web, it makes sense to give it a sensible title. The link should be the second element after the charset definition. In fact, all externally linked resources—such as style sheets and scripts—should occur near the top of the document. This allows the HTML parser to initiate the download of external resources as soon as possible. This also includes the HTML5 prefetch link type, which is not supported in Safari or playground at the time of writing. The meta tags are instructions to Safari to render it in different ways (Safari is the web engine that is used to present the documentation content in playground).
Safari-specific meta tags are described at https://developer.apple.com/library/safari/documentation/AppleApplications/Reference/SafariHTMLRef/Articles/MetaTags.html and include the following:

- The xcode-display=render meta tag, which indicates that Xcode should show the content of the document instead of the HTML source code when opening in Xcode
- The apple-mobile-web-app-capable=yes meta tag, which indicates that Safari should show this fullscreen if necessary when running on a mobile device
- The viewport=width=device-width,maximum-scale=1.0 meta tag, which allows the document body to be resized to fit the user's viewable area without scaling

Generating playgrounds automatically

The format of the playground files is well known, and several utilities have been created to generate playgrounds from documentation formats, such as Markdown or AsciiDoc. These are text-based documentation formats that provide a standard means to generate output documents, particularly HTML-based ones.

Markdown

Markdown (a word play on markup) was created to provide a standard syntax to generate web page documentation with links and references in a plain text format. More information about Markdown can be found at the home page (http://daringfireball.net/projects/markdown/), and more about the standardization of Markdown into CommonMark (used by StackOverflow, GitHub, Reddit, and others) can be found at http://commonmark.org. Embedding code in documentation is fairly common in Markdown. The file is treated as a top-level document, with sections to separate out the documentation and the code blocks. In CommonMark, these are separated with back ticks (```), often with the name of the language to add different script rendering types:

## Markdown Example ##

This is an example CommonMark document. Blank lines separate paragraphs.
Code blocks are introduced with three back-ticks and closed with back-ticks:

```swift
println("Welcome to Swift")
```

Other text and other blocks can follow below.
The most popular tool for converting Markdown/CommonMark documents into playgrounds (at the time of writing) is Jason Sandmeyer's swift-playground-builder at https://github.com/jas/swift-playground-builder/. The tool uses Node to execute JavaScript and can be installed using the npm install -g swift-playground-builder command. Both Node and npm can be installed from http://nodejs.org. Once installed, documents can be translated using playground --platform ios --destination outdir --stylesheet stylesheet.css. If code samples should not be editable, then the --no-refresh argument should be added.

AsciiDoc

AsciiDoc is similar in intent to Markdown, except that it can render to more backends than just HTML5. AsciiDoc is growing in popularity for documenting code, primarily because the standard is much better defined than Markdown's. The de facto standard translation tool for AsciiDoc is written in Ruby and can be installed using the sudo gem install asciidoctor command. Code blocks in AsciiDoc are represented by a [source] block. For Swift, this will be [source, swift]. The block starts and ends with two hyphens (--):

.AsciiDoc Example
This is an example AsciiDoc document. Blank lines separate paragraphs.
Code blocks are introduced with a source block and two hyphens:

[source, swift]
--
println("Welcome to Swift")
--

Other text and other code blocks can follow below.

AsciiDoc files typically use the ad extension, and the ad2play tool can be installed from James Carlson's repository at https://github.com/jxxcarlson/ad2play. Saving the preceding example as example.ad and running ad2play example.ad will result in the generation of the example.playground file. More information about AsciiDoc, including the syntax and backends, can be found at the AsciiDoc home page at http://www.methods.co.nz/asciidoc/ or on the Asciidoctor home page at http://asciidoctor.org.
Limitations of playgrounds

Although playgrounds can be very powerful for interacting with code, there are some limitations that are worth being aware of. There is no debugging support in the playground: it is not possible to add a breakpoint and use the debugger to find out what the values are. Given that the UI allows tracking values—and that it's very easy to add new lines with just the value to be tracked—this is not much of a hardship. Other limitations of playgrounds include:

Only the simulator can be used for the execution of iOS-based playgrounds. This prevents the use of hardware-specific features that might only be present on a device.
The performance of playground scripts is mainly driven by how many lines are executed and how much output is saved by the debugger. Playgrounds should not be used to test performance-sensitive code.
Although the playground is well suited to presenting user interface components, it cannot be used for user input.
Anything requiring entitlements (such as in-app purchases or access to iCloud) is not possible in playgrounds at the time of writing.

Note that while earlier releases of playgrounds did not support custom frameworks, Xcode 6.1 permits frameworks to be loaded into a playground, provided that the framework is built and marked as public and that it is in the same workspace as the playground.

Summary

This article presented playgrounds, an innovative way of running Swift code with graphical representations of values and introspection of running code. Both expressions and the timeline were presented as a way of showing the state of the program at any time, as well as graphically inspecting objects using QuickLook. The XCPlayground framework can also be used to record specific values and allow asynchronous code to be executed. Being able to mix code and documentation in the same playground is also a great way of showing what functions exist, and how to create self-documenting playgrounds was presented.
In addition, tools for the creation of such playgrounds using either AsciiDoc or Markdown (CommonMark) were introduced.

Resources for Article:

Further resources on this subject:
Using OpenStack Swift [article]
Sparrow iOS Game Framework - The Basics of Our Game [article]
Adding Real-time Functionality Using Socket.io [article]
Sneak peek into iOS Touch ID

Packt
23 Dec 2014
4 min read
This article, created by Mayank Birani, the author of Learning iOS 8 for Enterprise, covers Touch ID authentication, a feature Apple introduced in iOS 7. Previously, there was only four-digit passcode security in iPhones; now, Apple has extended security and introduced a new security pattern in iPhones: in Touch ID authentication, our fingerprint acts as a password. After launching the Touch ID fingerprint-recognition technology in the iPhone 5S last year, Apple is now providing it for developers with iOS 8. Now, third-party apps will be able to use Touch ID for authentication in the new iPhone and iPad OSes. Accounting apps, and other apps that contain personal and important data, will be protected with Touch ID. Now, you can protect all your apps with your fingerprint password. (For more resources related to this topic, see here.)

There are two ways to use Touch ID as an authentication mechanism in our iOS 8 applications. They are explained in the following sections.

Touch ID through touch authentication

The Local Authentication API returns a Boolean value to accept or decline the fingerprint. If there is an error, then an error code is returned that tells us what the issue is. Certain conditions have to be met when using Local Authentication. They are as follows:

The application must be in the foreground (this doesn't work with background processes)
If you're using the straight Local Authentication method, you will be responsible for handling all the errors and properly responding with your UI to ensure that there is an alternative method to log in to your apps

Touch ID through Keychain Access

Keychain Access includes the new Touch ID integration in iOS 8. With Keychain Access, we don't have to work on implementation details; it automatically handles the fallback to the user's passcode.
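Keychain Access's automatic passcode handling amounts to a simple decision: accept the fingerprint when Touch ID can be used, and fall back to the user's device passcode otherwise. The Python sketch below is purely illustrative of that logic; the real decision is made inside iOS, and every name here is invented:

```python
def keychain_unlock(fingerprint_matches, touch_id_locked_out, passcode_correct):
    # Illustrative only: try Touch ID first, and fall back to the
    # device passcode once Touch ID has been locked out
    if not touch_id_locked_out:
        return fingerprint_matches
    return passcode_correct

print(keychain_unlock(True, False, False))   # fingerprint accepted
print(keychain_unlock(False, True, True))    # locked out, passcode accepted
```

The point of the sketch is that app code never sees the fingerprint or the passcode; it only ever sees the final Boolean outcome.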
Several keychain items can be chosen to use Touch ID to unlock the item when requested in code through the use of the new Access Control Lists (ACLs). ACL is a feature of iOS 8. If Touch ID has been locked out, then it will allow the user to enter the device's passcode to proceed without any interruption. There are some features of Keychain Access that make it the best option for us. They are listed here:

Keychain Access uses Touch ID, and its attributes won't be synced by any cloud services. These features make it very safe to use.
If users overlay more than one query, then the system gets confused about the correct user, and it will pop up a dialog box reporting multiple touch issues.

How to use the Local Authentication framework

Apple provides a framework called Local Authentication to use Touch ID in our apps. This framework was introduced in iOS 8. To make an app that includes Touch ID authentication, we need to import this framework in our code. It is present in Apple's framework library. Let's see how to use the Local Authentication framework:

Import the Local Authentication framework as follows:

#import <LocalAuthentication/LocalAuthentication.h>

This framework will work on Xcode 6 and above.
To use this API, we have to create a Local Authentication context, as follows:

LAContext *passcode = [[LAContext alloc] init];

Now, check whether Touch ID is available and whether it can be used for authentication:

- (BOOL)canEvaluatePolicy:(LAPolicy)policy error:(NSError * __autoreleasing *)error;

To display Touch ID, use the following code:

- (void)evaluatePolicy:(LAPolicy)policy localizedReason:(NSString *)localizedReason reply:(void(^)(BOOL success, NSError *error))reply;

Take a look at the following example of Touch ID:

LAContext *passcode = [[LAContext alloc] init];
NSError *error = nil;
NSString *reason = <#String explaining why our app needs authentication#>;
if ([passcode canEvaluatePolicy:LAPolicyDeviceOwnerAuthenticationWithBiometrics error:&error]) {
    [passcode evaluatePolicy:LAPolicyDeviceOwnerAuthenticationWithBiometrics
             localizedReason:reason
                       reply:^(BOOL success, NSError *error) {
        if (success) {
            // User authenticated successfully
        } else {
            // User did not authenticate successfully; go through the error
        }
    }];
} else {
    // Could not evaluate the policy; look at the error and show an appropriate message to the user
}

Summary

In this article, we focused on the Touch ID API, which was introduced in iOS 8. We also discussed how Apple has improved its security feature using this API.

Resources for Article:

Further resources on this subject:
Sparrow iOS Game Framework - The Basics of Our Game [article]
Updating data in the background [article]
Physics with UIKit Dynamics [article]
Getting started with Selenium WebDriver and Python

Packt
23 Dec 2014
19 min read
In this article by Unmesh Gundecha, author of the book Learning Selenium Testing Tools with Python, we will introduce you to the Selenium WebDriver client library for Python by demonstrating its installation, basic features, and overall structure. Selenium automates browsers. It automates the interaction we do in a browser window such as navigating to a website, clicking on links, filling out forms, submitting forms, navigating through pages, and so on. It works on every major browser available out there. In order to use Selenium WebDriver, we need a programming language to write automation scripts. The language that we select should also have a Selenium client library available. Python is a widely used general-purpose, high-level programming language. It's easy and its syntax allows us to express concepts in fewer lines of code. It emphasizes code readability and provides constructs that enable us to write programs on both the small and large scale. It also provides a number of built-in and user-written libraries to achieve complex tasks quite easily. The Selenium WebDriver client library for Python provides access to all the Selenium WebDriver features and the Selenium standalone server for remote and distributed testing of browser-based applications. The Selenium Python language bindings are developed and maintained by David Burns, Adam Goucher, Maik Röder, Jason Huggins, Luke Semerau, Miki Tebeka, and Eric Allenin. The Selenium WebDriver client library is supported on Python versions 2.6, 2.7, 3.2, and 3.3. In this article, we will cover the following topics:

Installing Python and the Selenium package
Selecting and setting up a Python editor
Implementing a sample script using the Selenium WebDriver Python client library
Implementing cross-browser support with Internet Explorer and Google Chrome

(For more resources related to this topic, see here.)
Preparing your machine

As a first step of using Selenium with Python, we'll need to install it on our computer with the minimum requirements possible. Let's set up the basic environment with the steps explained in the following sections.

Installing Python

You will find Python installed by default on most Linux distributions, Mac OS X, and other Unix machines. On Windows, you will need to install it separately. Installers for different platforms can be found at http://python.org/download/.

Installing the Selenium package

The Selenium WebDriver Python client library is available in the Selenium package. To install the Selenium package in a simple way, use the pip installer tool available at https://pip.pypa.io/en/latest/. With pip, you can simply install or upgrade the Selenium package using the following command:

pip install -U selenium

This is a fairly simple process. This command will set up the Selenium WebDriver client library on your machine with all the modules and classes that we will need to create automated scripts using Python. The pip tool will download the latest version of the Selenium package and install it on your machine. The optional -U flag will upgrade the existing version of the installed package to the latest version. You can also download the latest version of the Selenium package source from https://pypi.python.org/pypi/selenium. Just click on the Download button on the upper-right-hand side of the page, unarchive the downloaded file, and install it with the following command:

python setup.py install

Browsing the Selenium WebDriver Python documentation

The Selenium WebDriver Python client library documentation is available at http://selenium.googlecode.com/git/docs/api/py/api.html. It offers detailed information on all core classes and functions of Selenium WebDriver.
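A quick way to confirm that the installation succeeded, without starting a browser, is to ask Python whether the package can be found on the import path. This is a generic standard-library check, not part of Selenium itself:

```python
import importlib.util

def package_installed(name):
    # find_spec returns None when the interpreter cannot locate the module
    return importlib.util.find_spec(name) is not None

# After `pip install -U selenium`, this should report True
print(package_installed("selenium"))
```

If this prints False, the package was installed for a different Python interpreter than the one you are running.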
Also note the following links for Selenium documentation:

The official documentation at http://docs.seleniumhq.org/docs/ offers documentation for all the Selenium components with examples in supported languages
The Selenium Wiki at https://code.google.com/p/selenium/w/list lists some useful topics

Selecting an IDE

Now that we have Python and Selenium WebDriver set up, we will need an editor or an Integrated Development Environment (IDE) to write automation scripts. A good editor or IDE increases productivity and helps in doing a lot of other things that make the coding experience simple and easy. While we can write Python code in simple editors such as Emacs, Vim, or Notepad, using an IDE will make life a lot easier. There are many IDEs to choose from. Generally, an IDE provides the following features to accelerate your development and coding time:

A graphical code editor with code completion and IntelliSense
A code explorer for functions and classes
Syntax highlighting
Project management
Code templates
Tools for unit testing and debugging
Source control support

If you're new to Python, or you're a tester working for the first time in Python, your development team will help you to set up the right IDE. However, if you're starting with Python for the first time and don't know which IDE to select, here are a few choices that you might want to consider.

PyCharm

PyCharm is developed by JetBrains, a leading vendor of professional development tools and IDEs such as IntelliJ IDEA, RubyMine, PhpStorm, and TeamCity. PyCharm is a polished, powerful, and versatile IDE that works pretty well. It brings the best of the JetBrains experience in building powerful IDEs with lots of other features for a highly productive experience. PyCharm is supported on Windows, Linux, and Mac. To know more about PyCharm and its features, visit http://www.jetbrains.com/pycharm/. PyCharm comes in two versions—a community edition and a professional edition.
The community edition is free, whereas you have to pay for the professional edition. The community edition is great for building and running Selenium scripts with its fantastic debugging support. We will use PyCharm in the rest of this article. Later in this article, we will set up PyCharm and create our first Selenium script. All the examples in this article are built using PyCharm; however, you can easily use these examples in your choice of editor or IDE.

The PyDev Eclipse plugin

The PyDev Eclipse plugin is another widely used editor among Python developers. Eclipse is a famous open source IDE primarily built for Java; however, it also offers support for various other programming languages and tools through its powerful plugin architecture. Eclipse is a cross-platform IDE supported on Windows, Linux, and Mac. You can get the latest edition of Eclipse at http://www.eclipse.org/downloads/. You need to install the PyDev plugin separately after setting up Eclipse. You can use the tutorial from Lars Vogel at http://www.vogella.com/tutorials/Python/article.html to install PyDev. Installation instructions are also available at http://pydev.org/.

PyScripter

For Windows users, PyScripter can also be a great choice. It is open source, lightweight, and provides all the features that modern IDEs offer, such as IntelliSense and code completion, and testing and debugging support. You can find more about PyScripter along with its download information at https://code.google.com/p/pyscripter/.

Setting up PyCharm

Now that we have seen the IDE choices, let's set up PyCharm. All examples in this article are created with PyCharm.
However, you can set up any other IDE of your choice and use the examples as they are. We will set up PyCharm with the following steps to get started with Selenium Python:

Download and install the PyCharm Community Edition from the JetBrains site http://www.jetbrains.com/pycharm/download/index.html.
Launch the PyCharm Community Edition and click on the Create New Project option on the PyCharm Community Edition dialog box.
On the Create New Project dialog box, specify the name of your project in the Project name field. In this example, setests is used as the project name.
We need to configure the interpreter for the first time. Click on the button to set up the interpreter.
On the Python Interpreter dialog box, click on the plus icon. PyCharm will suggest the installed interpreter. Select the interpreter from Select Interpreter Path.
PyCharm will configure the selected interpreter and show a list of packages that are installed along with Python. Click on the Apply button and then on the OK button.
On the Create New Project dialog box, click on the OK button to create the project.

Taking your first steps with Selenium and Python

We are now ready to start creating and running automated scripts in Python. Let's begin with Selenium WebDriver and create a Python script that uses Selenium WebDriver classes and functions to automate browser interaction. We will use a sample web application for most of the examples in this article. This sample application is built on a famous e-commerce framework—Magento. You can find the application at http://demo.magentocommerce.com/.
In this sample script, we will navigate to the demo version of the application, search for products, and list the names of products from the search result page with the following steps:

Let's use the project that we created earlier while setting up PyCharm. Create a simple Python script that will use the Selenium WebDriver client library. In Project Explorer, right-click on setests and navigate to New | Python File from the pop-up menu.
On the New Python file dialog box, enter searchproducts in the Name field and click on the OK button.
PyCharm will add a new tab, searchproducts.py, in the code editor area. Copy the following code into the searchproducts.py tab:

from selenium import webdriver

# create a new Firefox session
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.maximize_window()

# navigate to the application home page
driver.get("http://demo.magentocommerce.com/")

# get the search textbox
search_field = driver.find_element_by_name("q")
search_field.clear()

# enter search keyword and submit
search_field.send_keys("phones")
search_field.submit()

# get all the anchor elements which have product names displayed
# currently on result page using find_elements_by_xpath method
products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

# get the number of anchor elements found
print "Found " + str(len(products)) + " products:"

# iterate through each anchor element and print the text that is
# name of the product
for product in products:
    print product.text

# close the browser window
driver.quit()

If you're using any other IDE or editor of your choice, create a new file, copy the code to the new file, and save the file as searchproducts.py.
To run the script, press the Ctrl + Shift + F10 combination in the PyCharm code window or select Run 'searchproducts' from the Run menu.
This will start the execution and you will see a new Firefox window navigating to the demo site and the Selenium commands getting executed in the Firefox window. If all goes well, at the end, the script will close the Firefox window. The script will print the list of products in the PyCharm console. We can also run this script through the command line. Open the command line, go to the setests directory, and run the following command:

python searchproducts.py

We will use the command line as the preferred method in the rest of the article to execute the tests. We'll spend some time looking into the script that we created just now. We will go through each statement and understand Selenium WebDriver in brief. The selenium.webdriver module implements the browser driver classes that are supported by Selenium, including Firefox, Chrome, Internet Explorer, Safari, and various other browsers, and RemoteWebDriver to test on browsers that are hosted on remote machines. We need to import webdriver from the Selenium package to use the Selenium WebDriver methods:

from selenium import webdriver

Next, we need an instance of the browser that we want to use. This will provide a programmatic interface to interact with the browser using the Selenium commands. In this example, we are using Firefox. We can create an instance of Firefox as shown in the following code:

driver = webdriver.Firefox()

During the run, this will launch a new Firefox window. We also set a few options on the driver:

driver.implicitly_wait(30)
driver.maximize_window()

We configured a timeout for Selenium to execute steps using an implicit wait of 30 seconds for the driver, and maximized the Firefox window through the Selenium API. Next, we will navigate to the demo version of the application using its URL by calling the driver.get() method.
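The implicit wait configured above tells WebDriver to keep retrying element lookups for up to 30 seconds before failing. The retry loop below is only a conceptual sketch of that behavior; the real wait is implemented inside the browser drivers, not in Python code like this:

```python
import time

def find_with_implicit_wait(find, timeout=30.0, poll=0.5):
    # Keep calling `find` until it returns something or the timeout elapses
    deadline = time.monotonic() + timeout
    while True:
        result = find()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise LookupError("element not found within %.1f seconds" % timeout)
        time.sleep(poll)

# A fake finder that succeeds on its third call, standing in for a
# page element that appears after a short delay
calls = {"count": 0}
def fake_find():
    calls["count"] += 1
    return "search box" if calls["count"] >= 3 else None

print(find_with_implicit_wait(fake_find, timeout=5.0, poll=0.01))
```

This is why an implicit wait makes scripts robust against pages that render asynchronously: the lookup succeeds as soon as the element appears, rather than failing immediately.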
After the get() method is called, WebDriver waits until the page is fully loaded in the Firefox window and returns the control to the script. After loading the page, Selenium will interact with various elements on the page, like a human user. For example, on the Home page of the application, we need to enter a search term in a textbox and click on the Search button. These elements are implemented as HTML input elements and Selenium needs to find these elements to simulate the user action. Selenium WebDriver provides a number of methods to find these elements and interact with them to perform operations such as sending values, clicking buttons, selecting items in dropdowns, and so on. In this example, we are finding the Search textbox using the find_element_by_name method. This will return the first element matching the name attribute specified in the find method. HTML elements are defined with tags and attributes, and we can use this information to find an element. In this example, the Search textbox has the name attribute defined as q and we can use this attribute as shown in the following code example:

search_field = driver.find_element_by_name("q")

Once the Search textbox is found, we will interact with this element by clearing the previous value (if entered) using the clear() method and entering the specified new value using the send_keys() method. Next, we will submit the search request by calling the submit() method:

search_field.clear()
search_field.send_keys("phones")
search_field.submit()

After submission of the search request, Firefox will load the result page returned by the application. The result page has a list of products that match the search term, which is phones. We can read the list of results, and specifically the names of all the products that are rendered in the anchor <a> elements, using the find_elements_by_xpath() method.
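The XPath query just described can be exercised offline against a small, well-formed fragment using Python's standard library, which supports a limited XPath subset. The markup below is invented for illustration; the real page's structure may differ:

```python
import xml.etree.ElementTree as ET

# An invented, well-formed stand-in for the search results markup
page = """
<div>
  <h2 class="product-name"><a href="/p1">Phone One</a></h2>
  <h2 class="category-name"><a href="/c1">Not a product</a></h2>
  <h2 class="product-name"><a href="/p2">Phone Two</a></h2>
</div>
"""

root = ET.fromstring(page)
# Same shape as the Selenium XPath: anchors inside h2 elements
# whose class attribute is product-name
products = root.findall(".//h2[@class='product-name']/a")
print("Found " + str(len(products)) + " products:")
for product in products:
    print(product.text)
```

Only the two anchors under h2 elements with the product-name class are matched, which is exactly the filtering the Selenium script relies on in the browser.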
This will return more than one matching element as a list:

products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

Next, we will print the number of products (that is, the number of anchor <a> elements) that are found on the page, and the names of the products using the .text property of all the anchor <a> elements:

print "Found " + str(len(products)) + " products:"
for product in products:
    print product.text

At the end of the script, we will close the Firefox browser using the driver.quit() method:

driver.quit()

This example script gives us a concise example of using Selenium WebDriver and Python together to create a simple automation script. We are not testing anything in this script yet. We will extend this simple script into a set of tests and use various other libraries and features of Python.

Cross-browser support

So far we have built and run our script with Firefox. Selenium has extensive support for cross-browser testing where you can automate all the major browsers including Internet Explorer, Google Chrome, Safari, Opera, and headless browsers such as PhantomJS. In this section, we will set up and run the script that we created in the previous section with Internet Explorer and Google Chrome to see the cross-browser capabilities of Selenium WebDriver.

Setting up Internet Explorer

There is a little more work involved in running scripts on Internet Explorer. To run tests on Internet Explorer, we need to download and set up the InternetExplorerDriver server. The InternetExplorerDriver server is a standalone server executable that implements WebDriver's wire protocol to work as glue between the test script and Internet Explorer. It supports major IE versions on Windows XP, Vista, Windows 7, and Windows 8 operating systems. Let's set up the InternetExplorerDriver server with the following steps:

Download the InternetExplorerDriver server from http://www.seleniumhq.org/download/.
You can download the 32- or 64-bit version based on the system configuration that you are using.
After downloading the InternetExplorerDriver server, unzip and copy the file to the same directory where the scripts are stored.
On IE 7 or higher, the Protected Mode settings for each zone must have the same value. Protected Mode can either be on or off, as long as it is the same for all the zones. To set the Protected Mode settings:
Choose Internet Options from the Tools menu.
On the Internet Options dialog box, click on the Security tab.
Select each zone listed in Select a zone to view or change security settings and make sure Enable Protected Mode (requires restarting Internet Explorer) is either checked or unchecked for all the zones. All the zones should have the same settings.
While using the InternetExplorerDriver server, it is also important to keep the browser zoom level set to 100 percent so that the native mouse events can be set to the correct coordinates.
Finally, modify the script to use Internet Explorer.
Instead of creating an instance of the Firefox class, we will use the IE class in the following way:

import os
from selenium import webdriver

# get the path of IEDriverServer
dir = os.path.dirname(__file__)
ie_driver_path = os.path.join(dir, "IEDriverServer.exe")

# create a new Internet Explorer session
driver = webdriver.Ie(ie_driver_path)
driver.implicitly_wait(30)
driver.maximize_window()

# navigate to the application home page
driver.get("http://demo.magentocommerce.com/")

# get the search textbox
search_field = driver.find_element_by_name("q")
search_field.clear()

# enter search keyword and submit
search_field.send_keys("phones")
search_field.submit()

# get all the anchor elements which have product names displayed
# currently on result page using find_elements_by_xpath method
products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

# get the number of anchor elements found
print "Found " + str(len(products)) + " products:"

# iterate through each anchor element and print the text that is
# name of the product
for product in products:
    print product.text

# close the browser window
driver.quit()

In this script, we passed the path of the InternetExplorerDriver server while creating the instance of the IE browser class. Run the script and Selenium will first launch the InternetExplorerDriver server, which launches the browser, and execute the steps. The InternetExplorerDriver server acts as an intermediary between the Selenium script and the browser. Execution of the actual steps is very similar to what we observed with Firefox. Read more about the important configuration options for Internet Explorer at https://code.google.com/p/selenium/wiki/InternetExplorerDriver and the DesiredCapabilities article at https://code.google.com/p/selenium/wiki/DesiredCapabilities.

Setting up Google Chrome

Setting up and running Selenium scripts on Google Chrome is similar to Internet Explorer.
We need to download the ChromeDriver server, similar to the InternetExplorerDriver. The ChromeDriver server is a standalone server developed and maintained by the Chromium team. It implements WebDriver's wire protocol for automating Google Chrome. It is supported on Windows, Linux, and Mac operating systems. Set up the ChromeDriver server using the following steps:

Download the ChromeDriver server from http://chromedriver.storage.googleapis.com/index.html.
After downloading the ChromeDriver server, unzip and copy the file to the same directory where the scripts are stored.
Finally, modify the sample script to use Chrome. Instead of creating an instance of the Firefox class, we will use the Chrome class in the following way:

import os
from selenium import webdriver

# get the path of chromedriver
dir = os.path.dirname(__file__)
# remove the .exe extension on Linux or Mac platforms
chrome_driver_path = os.path.join(dir, "chromedriver.exe")

# create a new Chrome session
driver = webdriver.Chrome(chrome_driver_path)
driver.implicitly_wait(30)
driver.maximize_window()

# navigate to the application home page
driver.get("http://demo.magentocommerce.com/")

# get the search textbox
search_field = driver.find_element_by_name("q")
search_field.clear()

# enter search keyword and submit
search_field.send_keys("phones")
search_field.submit()

# get all the anchor elements which have product names displayed
# currently on result page using find_elements_by_xpath method
products = driver.find_elements_by_xpath("//h2[@class='product-name']/a")

# get the number of anchor elements found
print "Found " + str(len(products)) + " products:"

# iterate through each anchor element and print the text that is
# name of the product
for product in products:
    print product.text

# close the browser window
driver.quit()

In this script, we passed the path of the ChromeDriver server while creating an instance of the Chrome browser class. Run the script.
Selenium will first launch the ChromeDriver server, which launches the Chrome browser, and execute the steps. Execution of the actual steps is very similar to what we observed with Firefox. Read more about ChromeDriver at https://code.google.com/p/selenium/wiki/ChromeDriver and https://sites.google.com/a/chromium.org/chromedriver/home.

Summary

In this article, we introduced you to Selenium and its components. We installed the Selenium package using the pip tool. Then we looked at various editors and IDEs to ease our coding experience with Selenium and Python, and set up PyCharm. Then we built a simple script on a sample application covering some of the high-level concepts of the Selenium WebDriver Python client library using Firefox. We ran the script and analyzed the outcome. Finally, we explored the cross-browser testing support of Selenium WebDriver by configuring and running the script with Internet Explorer and Google Chrome.

Resources for Article:

Further resources on this subject:
Quick Start into Selenium Tests [article]
Exploring Advanced Interactions of WebDriver [article]
Mobile Devices [article]
The importance of Hyper-V Security

Packt
23 Dec 2014
19 min read
In this article, by Eric Siron and Andy Syrewicze, authors of the book Hyper-V Security, we will be introduced to one of the most difficult tribulations in the entire realm of computing—security. Computers are tools, and just like any tool, they are designed to be used. Unfortunately, not every usage is proper, and not every computer should be accessed by just anyone. A computer really has no way to classify proper usage against improper usage, or differentiate between a valid user and an unauthorized user, any more than a hammer would. The act of securing them is quite literally an endeavor to turn them against their purpose. Hyper-V adds new dimensions to the security problem. Virtual machines have protection options that mirror their physical counterparts, but present unique challenges. The hypervisor presents challenges of its own, both in its role as the host for those virtual machines and through the management operating system that manifests it. In this article, we'll cover:

The importance of Hyper-V security
Basic security concerns
A starting point to security
The terminology of Hyper-V
Acquiring Hyper-V

(For more resources related to this topic, see here.)

For many, security seems like a blatantly obvious necessity. For others, the need isn't as clear. Many decision-makers don't believe that their organization's product requires in-depth protection. Many administrators believe that the default protections are sufficient. There are certainly some institutions whose needs don't require an elaborate regimen of protections, but no one can skip due diligence.

Your clients expect it

The exact definition of a "client" varies from organization to organization, but every organization type provides some sort of service to someone.
Whether you are a retail outlet or a non-profit organization that provides intangible services to individuals who cannot pay for them, your institution has an implicit agreement to protect the information of those who depend on you. They most likely won't have any idea what Hyper-V is or what you use it for, but they will know enough to be displeased if it is revealed that any of your computer systems are not secure. Your organization could be vulnerable to litigation if clients believe their data is not being treated with sufficient importance.

Your stakeholders expect it

As with clients, "stakeholders" can mean many things. Simplistically, a stakeholder is anyone who has a stake in the well-being of your organization. This could be members of the board of directors who aren't privy to day-to-day operations. It could be external investors. It could even include the previously mentioned clients. Even if they have no way to understand what is necessary or unnecessary to secure, they expect that it's being handled. Furthermore, they may disagree with you on what data is important to protect. If it's later discovered that something they assumed was being treated as highly confidential wasn't fully guarded, the response could have extremely negative consequences.

Your employees and volunteers expect it

Almost all organizations have digitized some vital information about their employees and volunteers. They expect that this data is held in the highest confidentiality and is well guarded against theft and espionage. Even if the rest of your institution's data requires no particular protection, personnel data must always be safeguarded. In many jurisdictions, this is a legal requirement. Even where it isn't required by law, civil litigation is always a possibility.
Experience has taught us that security is important

In the past, it was believed that attackers came from outside the institution and were simply after quick and easy money sources, such as credit card numbers. However, reality has shown that breaches occur for a wide variety of reasons, and many aren't obvious until it's too late to do anything about them. The next section, Basic security concerns, will highlight a number of both common and unexpected attack types.

Weak points aren't always obvious

You know that you need to protect access to sensitive backend data with frontend passwords. You know that information traveling between the two needs to be encrypted. However, are you aware of every single point that the data will travel through? Is the storage location unprotected? Has there been a recent audit of individuals with access? Is there another application on one of the component systems that allows unencrypted communications or remote access? Treating any system as though it doesn't need to be secured could allow it to become a gateway to others.

The costs of repair exceed the costs of prevention

The summary of this section's message is that failing to enact security measures is not an acceptable option. It's not unusual to find people who understand that security is important, but believe that it's simply too expensive and that the systems to be protected are just not worth the effort. In reality, the costs of a breach can be catastrophic. Just adding up the previous points can lead you to that conclusion. Between lawyer bills, court costs, and any awards, litigation costs can be unbearably high. Of course, a breach might directly result in a financial loss of some kind. Beyond that, a loss of trust inevitably follows the compromise of systems, and this can have a greater long-term impact than anything else.
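The repair-versus-prevention argument can be made concrete with back-of-the-envelope arithmetic. Every figure in the sketch below is hypothetical, invented purely to show the calculation; the point is the shape of the comparison, not the numbers.

```python
# Back-of-the-envelope illustration of "repair costs exceed prevention costs".
# All dollar figures and the breach probability are hypothetical.
prevention_per_year = 25_000          # hardening, audits, training

breach_costs = {
    "legal_and_litigation": 150_000,
    "regulatory_fines": 50_000,
    "system_cleanup": 40_000,
    "lost_business_from_lost_trust": 200_000,
}
breach_total = sum(breach_costs.values())

annual_breach_probability = 0.10      # assumed 1-in-10 chance per year
expected_breach_cost = annual_breach_probability * breach_total

print(f"Cost of one breach: ${breach_total:,}")
print(f"Expected yearly breach cost: ${expected_breach_cost:,.0f}")
print(f"Prevention cheaper than expected breach cost: "
      f"{prevention_per_year < expected_breach_cost}")
```

Even with a modest assumed breach probability, the expected yearly cost of doing nothing can dwarf a prevention budget, and this simple model ignores the hardest-to-quantify item of all: long-term loss of trust.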
Even when all those problems are taken care of, it's still necessary to clean up any damage to the systems and close the exploited breach points.

Basic security concerns

With a topic as large as computer security, it's always tough to know where to start. The best place is generally to begin by getting an idea of where and what your largest risk factors are. Every organization will have its own specific areas of concern, but there are a number of common elements that everyone needs to worry about.

Attack motivations

To understand what risks you face, it helps to know the reasons for which you might find yourself under attack. For many malware generators, there isn't a lot of reason involved. They write destructive code because they like destruction; they might be working from a place of genuine malice or a simple disregard for the well-being of others. For many others, their work comes from a need for vengeance over a real or perceived slight. The trespasses they seek revenge for could be relatively petty things, but some attacks are carried out over much more serious events, even major political affairs. Some authors seek a degree of notoriety, perhaps not among the public at large as much as within a small group or subculture.

Financial motivation can be the source of both the most benign and the most dangerous security compromises. For instance, someone may want to prove eligibility for a job by showing that they possess the necessary skills to secure a system. One possible way is by demonstrating an ability to compromise that system. Such breaches generally require a deep understanding of the relevant technology, so they can effectively illustrate thorough knowledge. As long as these examples are never released "into the wild" and are instead disclosed to the system manufacturer so that a fix can be engineered, they are ultimately harmless.
Unfortunately, a great many attackers seek a shorter-term gain through methods such as extortion from the manufacturers or owners of compromised systems, or theft of sensitive data. Data theft is often thought of in terms of financial information, such as credit card data. However, intellectual property should also be kept heavily guarded. Data that seems relatively benign might also be a target: if an attacker discovers that your company uses a specific e-mail template and can also obtain a list of customer e-mail accounts, they have enough information to launch a very convincing phishing campaign.

Untargeted attacks

The untargeted attack is likely the most common of all attacks, and can be the most disruptive. These generally manifest as viruses and worms. In the earlier days of computing, the most common distribution method was, surprisingly, media that had been created by software makers for the distribution of applications. Someone would modify the image data during the duplication process and ship malware to customers. As the Internet rose in popularity, it introduced new ways for malware to make the rounds. First came e-mail. Next, websites became pick-up locations for all types of malicious software. New technologies that allowed for enhanced interactivity and the embedding of rich media, such as JavaScript and Adobe's (originally Shockwave's) Flash, were also used as vehicles for destructive software.

Most of the early malware was simply destructive. It wreaked havoc on data, corrupted systems, and locked users out of their own hardware. Later, malware became a money-making avenue for the unscrupulous. An example is key loggers, which capture key presses and sometimes mouse movements and clicks in an attempt to compromise logins and other sensitive data, such as credit card numbers. Another much more recent introduction is ransomware, which encrypts or deletes information with a promise to restore the data on payment.
Some of the most surreptitious untargeted attacks are relatively low-tech. One such attack is called phishing. This involves using some form of convincing technique, usually through e-mail, to lure users into volunteering sensitive information. An attack vector related to phishing is spam e-mail. Most people just consider spam to be annoying, untargeted e-mail advertisements, but the results of an experiment conducted in 2008 by McAfee, Inc., called Spammed Persistently All Month (SPAM), would seem to indicate that most spam also qualifies as a scam in some form or another.

Another untargeted attack vector is any connection that a computer system makes to a public network. In the modern era, this is generally through a system's entry point into the Internet. With a limited number of Internet-accessible IP addresses available, attackers can simply scan large ranges of them, seeking systems that respond. Using automated tools, they can attempt to break through any security barriers that are in place. Untargeted attacks pose few risks that are specific to Hyper-V, so this book won't spend a great deal of time on that topic. While no defense can be perfect, untargeted attacks are generally mitigated effectively through standard practices.

Targeted attacks

The most common attacks are untargeted, but targeted attacks can be the most dangerous. These come in a variety of forms but often use techniques similar to untargeted attacks. One example would be a phishing e-mail that appears to have been sent from your internal IT department, asking you to confirm your user name and password. Another would be a website that looks like an internal corporate site, such as a payroll page, which captures your login information instead of displaying your latest pay stub. Some targeted attacks work against an organization's exposed faces. An immediately recognizable example is online banking.
Most banks provide some method for their customers to access their accounts online, and they almost invariably include powerful tools such as money transfer systems. Of course, theft isn't necessarily the goal of a targeted attack. One well-known activity is the denial-of-service attack, in which an immense number of bogus requests are sent to a target system in a short amount of time, causing its services to be unavailable to legitimate users.

The computing device

Most of the compromises you are likely to deal with occur at the level of the computing device. Some of the most complex software in use today is the operating system. With thousands of programmers working on millions of lines of code, much of it left over from previous versions and programmers, it's just an unavoidable fact that all major operating systems contain security flaws. With millions of people working to locate these holes, regardless of their intentions, it's equally inevitable that these faults will be discovered and compromised. The advent and rising popularity of smartphones and tablets have increased the number of potential attack sources. As more and more devices become "smart," such as common environmental controls and food storage equipment, they too introduce new entry points from which an entire network can be compromised.

The network

The true risk of a single compromised device is the network that it's attached to. By breaching the network itself, an attacker potentially gains the ability to eavesdrop on all communications or launch a direct attack against specific computers or groups of systems. Since many organizations consider some areas to be secured because they sit behind measures such as firewalls, breaching the protecting devices exposes everything that they are intended to protect.

Data-processing points

Raw data is rarely useful to end users.
There are many systems in place whose job is to sort, process, retrieve, and organize information, and they often use well-known techniques to do this. Anything that's well known is open to assault. Common examples are SQL database servers, e-mail systems, content management applications, and customer relationship management software. When these systems are broken into, the data they work with is ripe for the taking.

Data storage

A lot of effort is poured into securing endpoints, processing systems, and networks, but a disturbingly high number of data storage locations are left relatively unprotected. Many administrators simply believe that all paths to the storage are well protected, so the storage location itself is of little concern. What this often means is that a breach farther up the line results in an easily compromised storage system. For the best resistance against attack, care must be taken at all levels.

People

By and large, the most vulnerable aspect of any computer system is its users. This includes not just the users who don't understand technology, but also the administrators who have grown lax. Passwords are written down; convincing requests for sensitive information are erroneously granted; inappropriate shortcuts are taken. One of the easiest and most common ways in which computers are breached is social engineering. Before undertaking a lot of complicated steps to steal your information, an attacker may try to simply ask you for it. People are trusting by nature, and often naively believe that anyone who asks has a legitimate reason to do so. On the other side, malicious internal staff can be a serious threat. Disgruntled employees, especially those in the IT department, already have access to sensitive areas and information. If they have vengeance in mind, their goal may be disruption and destruction more than theft.
A starting point to security

Now that you have some idea of what you're up against, you can start thinking about how you want to approach the problems. The easiest thing to do is look over the preceding items and identify what your current configuration is weakest against. You'll also want to identify which points and data your organization considers most important to protect. Once that's done, it's a good idea to perform some sort of inventory in an attempt to discover sensitive points that may not have made the list for one reason or another. Sometimes, this can be done simply by asking questions such as, "What would the impact be if someone saw that file?"

At all times, it's important to remember that there is no way a system can be truly secured without making it completely inaccessible to anyone. If even one person can get into the system, it's also possible for someone else. Computer security is not a one-time event; it is an ongoing process of re-evaluation. It's also important to remember that computers are just machines. No matter how advanced the hardware and software are, the computer does not think. If an instruction makes it all the way to the CPU, it won't stop to ponder whether the user or program that submitted it should be allowed to do so. It won't consider the moral implications of carrying out the instruction. It will simply do as it's told. Security is a human endeavor.

This book advocates both taking specific steps to secure specific systems and a defense in depth approach. The defense in depth style recognizes that not all attacks can be known or planned for in advance, so it attempts to mitigate them by using a layered strategy. If the firewall is penetrated, an internal network access control list may halt a break-in. If that doesn't work, intrusion prevention software may stop the attack. If that also fails, a simple password challenge may keep the intruder out.
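The value of layering can be shown with simple probability arithmetic. The sketch below is a toy model, not material from the book: it assumes each layer independently stops some fraction of attacks (the rates are invented for illustration) and multiplies the per-layer failure rates together.

```python
# Toy model of defense in depth: an attack succeeds only if it slips past
# every layer. Per-layer stop rates below are invented for illustration.
layers = {
    "firewall": 0.90,                 # stops 90% of attacks on its own
    "network ACL": 0.60,
    "intrusion prevention": 0.70,
    "password challenge": 0.80,
}

breach_probability = 1.0
for name, stop_rate in layers.items():
    breach_probability *= (1.0 - stop_rate)  # attack must get past this layer too

print(f"Chance an attack penetrates all layers: {breach_probability:.4%}")
# Arithmetic: 0.1 * 0.4 * 0.3 * 0.2 = 0.0024, i.e. about 0.24%
```

No single layer here is especially strong, yet together they reduce the breach probability to a fraction of a percent, which is exactly the argument for not relying on the firewall alone.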
Hyper-V terminology

Before we can properly discuss how to secure Hyper-V, we must reach an agreement on the words that we use. Terminology is a common point of confusion when it comes to Hyper-V and related technologies. This section will provide a definitive explanation for these terms, not only as they are used within this book, but also how they are generally used in official documentation and by experts.

- Hyper-V: The lone word Hyper-V represents the type 1 hypervisor technology developed and provided by Microsoft. This term does not refer to any particular product. It appears as an installable feature in Windows Server beginning with version 2008, and in the Professional and Enterprise desktop Windows operating systems starting with version 8.
- Hyper-V Server: Hyper-V Server is a standalone product available directly from Microsoft. It is a no-cost distribution of the hypervisor that is packaged in a heavily modified version of Windows Server.
- Client Hyper-V: Client Hyper-V is the name given to Hyper-V as it appears in the desktop editions of Windows. The distinction is necessary as it has requirements and limitations that set it apart from Hyper-V as it exists in the server editions.
- Host: The physical computer system that runs Hyper-V is called the host.
- Guest: The term guest is often used interchangeably with "virtual machine." It is most commonly used to refer to the operating system inside the virtual machine.
- Management operating system: As a type 1 hypervisor, Hyper-V is in direct control of the host's hardware and has no interface of its own. A management operating system is a special virtual machine that can interact with the hypervisor to control it and the hardware. In other hypervisors, this is known as the parent partition.

The commonly used term Hyper-V Core and its variants have no official meaning. Core is a special mode of Windows Server that does not include a GUI. It is often used to refer to Hyper-V Server, as that product also has no GUI.
Crossing Hyper-V Server with the Core modifier should be avoided, as it leads to confusion.

Acquiring Hyper-V

This book expects that you have some familiarity with Hyper-V and will therefore not provide an installation walkthrough. The purpose of this section is to provide a basic comparison of the delivery methods for Hyper-V so that you can make an informed decision in light of the security concerns.

Hyper-V Server

Hyper-V Server is freely available from Microsoft. It is a complete product and installs directly to the host computer. You can download it from the evaluation center on TechNet at the following URL: http://www.microsoft.com/en-us/evalcenter/evaluate-hyper-v-server-2012-r2. Despite being listed alongside evaluation software, Hyper-V Server does not expire and does not require any product keys. Before installing, please read the system requirements, which are linked from the download page.

The reason Hyper-V Server is often (erroneously) referred to as Core is that it has no graphical interface of any kind. The only control options available at the console are the command line and PowerShell. This is not the same thing as a Core installation of Windows, as most of the Windows roles and features are not available. There are a number of benefits and disadvantages to using Hyper-V in this fashion. The primary benefit in the realm of security is that there are fewer components in the base installation image, and therefore fewer potential weak points for an attacker to compromise.

Windows Server

Windows Server is Microsoft's general-purpose server software. Out of the box, it contains a great many server technologies and can fit into just about any conceivable server role. Among those offerings, you'll find Hyper-V. Windows Server comes in two major editions with full Hyper-V support: Standard and Datacenter. The primary difference between these two is the licensing granted to guests that run Windows Server operating systems.
Please consult a Microsoft licensing expert for more information. Technologically, the two editions are nearly identical. The lone difference is the presence of Automatic Virtual Machine Activation in the Datacenter edition, which allows it to activate Windows Server guests using its own license. Windows Server can be installed in three separate modes: Core, Minimal Server Interface, and full GUI mode. Each of these modes affects the actions you must take to secure the system. Like Hyper-V Server, each has advantages and disadvantages.

Client Hyper-V

Client Hyper-V is only available in the Professional and higher desktop editions of Windows, but that's not all that makes it distinct from its cousin on the server platforms. It requires a processor that can perform Second Level Address Translation (SLAT). It also has a smaller feature set. Among the technologies not included are RemoteFX, Hyper-V Replica, and Live Migration. Client Hyper-V is also less inclined to consume all available host memory for the purpose of running guests. While Client Hyper-V is not the focus of this book, many of the same concepts still apply. A very common use for Client Hyper-V is application development. Most software development firms consider their in-development programs to be highly valuable assets, so they should be as protected as any server-based asset.

Summary

This article introduced you to the "whys" of Hyper-V security, provided a brief introduction to the overall risks that almost all security systems face, and discussed generic responses. It also covered Hyper-V terminology and the available installation modes for the hypervisor.

Resources for Article:

Further resources on this subject:
- Your first step towards Hyper-V Replica [Article]
- Insight into Hyper-V Storage [Article]
- Disaster Recovery for Hyper-V [Article]

Beagle Boards

Packt
23 Dec 2014
10 min read
In this article by Hunyue Yau, author of Learning BeagleBone, we will provide a background on the entire family of Beagle boards, with brief highlights of what is unique about each member, such as features that favor one member over another. This article will also help you identify Beagle boards that might have been mislabeled. The following topics will be covered here:

- What are Beagle boards
- How do they relate to other development boards
- BeagleBoard Classic
- BeagleBoard-xM
- BeagleBone White
- BeagleBone Black

(For more resources related to this topic, see here.)

The Beagle board family

The Beagle boards are a family of low-cost, open development boards that provide everyday students, developers, and other interested people with access to current mobile processor technology on a path toward developing ideas. Prior to the invention of the Beagle family of boards, the available options were primarily limited to either low-computing-power boards, such as the 8-bit microcontroller-based Arduino boards, or dead-end options, such as repurposing existing products. There were other options too, such as compromising on physical size or electrical power consumption by utilizing non-mobile-oriented technology, for example, embedding a small laptop or desktop into a project. The Beagle boards attempt to address these points and more.

The Beagle board family provides you with access to the technologies that were originally developed for mobile devices, such as phones and tablets, and lets you use them to develop projects and for educational purposes. By leveraging the same technology for education, students can be less reliant on obsolete technologies. All this access comes affordably. Before the Beagle boards became available, development boards of this class easily exceeded thousands of dollars. In contrast, the initial Beagle board offering was priced at a mere 150 dollars!

The Beagle boards

The Beagle family of boards began in late 2008 with the original Beagle board.
The original board has quite a few characteristics in common with all members of the Beagle board family. All the current boards are based on an ARM core and can be powered by a single 5V source or, to varying degrees, by a USB port. All boards have a USB port for expansion and provide direct access to the processor I/O for advanced interfacing and expansion. Examples of the processor I/O available for expansion include Serial Peripheral Interface (SPI), I2C, pulse width modulation (PWM), and general-purpose input/output (GPIO). The USB expansion path was introduced at an early stage, providing a cheap way to add features by leveraging existing desktop and laptop accessories.

All the boards are designed with the beginner in mind and, as such, are impossible to brick via software. To "brick" a board is a common slang term that refers to damaging a board beyond recovery, thus turning it from an embedded development system into something as useful for embedded development as a brick. This doesn't mean that the boards cannot be damaged electrically or physically.

For those who are interested, the design and manufacturing material is also available for all the boards. The bill of materials is designed to be available via distribution so that the boards themselves can be customized and manufactured, even in small quantities. This allows projects to be manufactured if desired.

Do not power up the board on any conductive surface or near conductive materials, such as metal tools or exposed wires. The board is fully exposed, and doing so can subject it to electrical damage. The only exception is a proper ESD mat designed for use with electronics. Proper ESD mats are designed to be only conductive enough to discharge static electricity without damaging circuits.

The following sections highlight the specifications of each member, presented in the order in which they were introduced. They are based on the latest revision of each board.
As these boards leverage mobile technology, availability changes and the designs are partly revised to accommodate the available parts. The design information for older versions is available at http://www.beagleboard.org/.

BeagleBoard Classic

The initial member of the Beagle board family is the BeagleBoard Classic (BBC), which features the following specs:

- OMAP3530 clocked up to 720 MHz, featuring an ARM Cortex-A8 core along with integrated 3D and video decoding accelerators
- 256 MB of LPDDR (low-power DDR) memory with 512 MB of integrated (NAND) flash memory on board; older revisions had less memory
- USB OTG (switchable between a USB device and a USB host) along with a pure USB high-speed host-only port
- A low-level debug port accessible using a common desktop DB-9 adapter
- Analog audio in and out
- DVI-D video output to connect to a desktop monitor or a digital TV
- A full-size SD card interface
- A 28-pin general expansion header along with two 20-pin headers for video expansion
- 1.8V I/O

Only a nominal 5V is available on the expansion connector. Expansion boards should have their own regulator.

At the original release of the BBC in 2008, the OMAP3530 was comparable to the processors in mobile phones of that time. The BBC is the only member to feature a full-size SD card interface. You can see the BeagleBoard Classic in the following image:

BeagleBoard-xM

As an upgrade to the BBC, the BeagleBoard-xM (BBX) was introduced later. It features the following specs:

- DM3730 clocked up to 1 GHz, featuring an ARM Cortex-A8 core along with integrated 3D and video decoding accelerators, compared to 720 MHz on the BBC.
- 512 MB of LPDDR but no onboard flash memory, compared to 256 MB of LPDDR with up to 512 MB of onboard flash memory.
- USB OTG (switchable between a USB device and a USB host) along with an onboard hub to provide four USB host ports and an onboard USB-connected Ethernet interface. The hub and Ethernet connect to the same port as the only high-speed port of the BBC.
The hub allows low-speed devices to work with the BBX.

- A low-level debug port accessible with a standard DB-9 serial cable. An adapter is no longer needed.
- Analog audio in and out. This is the same analog audio in and out as that of the BBC.
- DVI-D video output to connect to a desktop monitor or a digital TV. This is the same DVI-D video output as used in the BBC.
- A microSD interface. It replaces the full-size SD interface of the BBC. The difference is mainly the physical size.
- A 28-pin expansion interface and two 20-pin video expansion interfaces, along with an additional camera interface board. The 28-pin and two 20-pin interfaces are physically and electrically compatible with the BBC.
- 1.8V I/O.

Only a nominal 5V is available on the expansion connector. Expansion boards should have their own regulator.

The BBX has a faster processor and added capabilities when compared to the BBC. The camera interface is a unique feature of the BBX and provides a direct interface for raw camera sensors. The 28-pin interface, along with the two 20-pin video interfaces, is electrically and mechanically compatible with the BBC. The mechanical mounting holes were purposely made backward compatible. Beginning with the BBX, boards were shipped with a microSD card containing the Angström Linux distribution. The latest versions of the kernel and bootloader are shared between the BBX and BBC. The software can detect and utilize the features available on each board, as the DM3730 and OMAP3530 processors are internally very similar. You can see the BeagleBoard-xM in the following image:

BeagleBone

To simplify things and to bring in a lower entry cost, the BeagleBone subfamily of boards was introduced. While many concepts in this article apply to the entire Beagle family, this article will focus on this subfamily. All current members of the BeagleBone family can be purchased for less than 100 dollars.

BeagleBone White

The initial member of this subfamily is the BeagleBone White (BBW).
This new form factor has a footprint that allows the board to be stored inside an Altoids tin. Note that the Altoids tin is conductive and can electrically damage the board if an operational BeagleBone is placed inside it without additional protection. The BBW features the following specs:

- AM3358 clocked at up to 720 MHz, featuring an ARM Cortex-A8 core along with a 3D accelerator, an ARM Cortex-M3 for power management, and a unique feature: the Programmable Real-time Unit Subsystem (PRUSS)
- 256 MB of DDR2 memory
- Two USB ports, namely a dedicated USB host and a dedicated USB device
- An onboard JTAG debugger
- An onboard USB interface to access the low-level serial interfaces
- A 10/100 Mb Ethernet interface
- Two 46-pin expansion interfaces with up to eight channels of analog input
- A 10-pin power expansion interface
- A microSD interface
- 3.3V digital I/O
- 1.8V analog I/O

As with the BBX, the BBW ships with the Angström Linux distribution. You can see the BeagleBone White in the following image:

BeagleBone Black

Intended as a lower-cost version of the BeagleBone, the BeagleBone Black (BBB) features the following specs:

- AM3358 clocked at up to 1 GHz, featuring an ARM Cortex-A8 core along with a 3D accelerator, an ARM Cortex-M3 for power management, and a unique feature: the PRUSS. This is an improved revision of the same processor as in the BBW.
- 512 MB of DDR3 memory, compared to 256 MB of DDR2 memory on the BBW.
- 4 GB of onboard embedded MMC (eMMC) flash memory in the latest version, compared to a complete lack of onboard flash memory on the BBW.
- Two USB ports, namely a dedicated USB host and a dedicated USB device.
- A low-level serial interface available as a dedicated 6-pin header.
- A 10/100 Mb Ethernet interface.
- Two 46-pin expansion interfaces with up to eight channels of analog input.
- A microSD interface.
- A micro HDMI interface to connect to a digital monitor or a digital TV. Digital audio is available on the same interface. This is new to the BBB.
- 3.3V digital I/O.
- 1.8V analog I/O.
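The 3.3V digital I/O on the expansion headers is commonly driven from Linux through the kernel's sysfs GPIO interface. The sketch below is an assumption-laden illustration rather than material from the article: the pin number 60 is just an example (header-pin-to-GPIO mappings vary by board and pin-mux mode), while /sys/class/gpio is the standard Linux interface of this era and typically requires root access.

```python
import os

GPIO_ROOT = "/sys/class/gpio"  # standard Linux sysfs GPIO interface

def set_gpio(pin, high):
    """Export a GPIO pin and drive it high or low via sysfs."""
    pin_dir = os.path.join(GPIO_ROOT, "gpio%d" % pin)
    if not os.path.isdir(pin_dir):
        with open(os.path.join(GPIO_ROOT, "export"), "w") as f:
            f.write(str(pin))          # ask the kernel to expose the pin
    with open(os.path.join(pin_dir, "direction"), "w") as f:
        f.write("out")                 # configure the pin as an output
    with open(os.path.join(pin_dir, "value"), "w") as f:
        f.write("1" if high else "0")  # drive it high or low

if os.path.isdir(GPIO_ROOT):
    try:
        set_gpio(60, True)             # 60 is an example GPIO number only
    except OSError as exc:             # needs root and a real GPIO controller
        print("Could not drive GPIO:", exc)
else:
    print("No sysfs GPIO here; run this on the BeagleBone itself.")
```

Because the interface is just files, the same code works unchanged on the BBW and BBB, which is part of what makes these boards approachable for beginners.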
The overall mechanical form factor of the BBB is the same as that of the BBW. However, due to the added features, there are some slight electrical changes to the expansion interface. The power expansion header was removed to make room for the added features. Unlike the other boards, the BBB ships with a Linux distribution on its internal flash memory. Early revisions shipped with Angström Linux, and later revisions shipped with Debian Linux in an attempt to simplify things for new users. Unlike the BBW, the BBB does not provide an onboard JTAG debugger or an onboard USB-to-serial converter. Both of these features were provided by a single chip on the BBW and were removed from the BBB for cost reasons. JTAG debugging is still possible on the BBB by soldering a connector to the back of the board and using an external debugger. Serial port access on the BBB is provided by a serial header.

This article focuses solely on the BeagleBone subfamily (BBW and BBB). The differences between them are noted where applicable. It should be noted that for more advanced projects, the BBC/BBX should be considered, as they offer additional unique features that are not available on the BBW/BBB. Most concepts learned on the BBB/BBW boards are entirely applicable to the BBC/BBX boards. You can see the BeagleBone Black in the following image:

Summary

In this article, we looked at the Beagle board offerings and a few unique features of each offering. Then, we went through the process of setting up a BeagleBone board and understood the basics of accessing it from a laptop/desktop.

Resources for Article:

Further resources on this subject:
- Protecting GPG Keys in BeagleBone [article]
- Making the Unit Very Mobile - Controlling Legged Movement [article]
- Pulse width modulator [article]
Packt
23 Dec 2014
44 min read

Learning the QGIS Python API

In this article, we will take a closer look at the Python libraries available to the QGIS Python developer, and also look at the various ways in which we can use these libraries to perform useful tasks within QGIS. In particular, you will learn:

- How the QGIS Python libraries are based on the underlying C++ APIs
- How to use the C++ API documentation as a reference to work with the Python APIs
- How the PyQGIS libraries are organized
- The most important concepts and classes within the PyQGIS libraries and how to use them
- Some practical examples of performing useful tasks using PyQGIS

About the QGIS Python APIs

The QGIS system itself is written in C++ and has its own set of APIs, also written in C++. The Python APIs are implemented as wrappers around these C++ APIs. For example, there is a Python class named QgisInterface that acts as a wrapper around a C++ class of the same name. All the methods, class variables, and the like that are implemented by the C++ version of QgisInterface are made available through the Python wrapper. What this means is that when you access the Python QGIS APIs, you aren't accessing the API directly. Instead, the wrapper connects your code to the underlying C++ objects and methods, as follows:

Fortunately, in most cases, the QGIS Python wrappers simply hide away the complexity of the underlying C++ code, so the PyQGIS libraries work as you would expect them to. There are some gotchas, however, and we will cover these as they come up.

Deciphering the C++ documentation

As QGIS is implemented in C++, the documentation for the QGIS APIs is all based on C++. This can make it difficult for Python developers to understand and work with the QGIS APIs. For example, consider the API documentation for the QgisInterface.zoomToActiveLayer() method:

If you're not familiar with C++, this can be quite confusing. Fortunately, as a Python programmer, you can skip over much of this complexity because it doesn't apply to you.
In particular:

- The virtual keyword is an implementation detail you don't need to worry about
- The word void indicates that the method doesn't return a value
- The double colons in QgisInterface::zoomToActiveLayer are simply a C++ convention for separating the class name from the method name

Just like in Python, the parentheses show that the method doesn't take any parameters. So if you have an instance of QgisInterface (for example, as the standard iface variable available in the Python Console), you can call this method simply by typing the following:

iface.zoomToActiveLayer()

Now, let's take a look at a slightly more complex example: the C++ documentation for the QgisInterface.addVectorLayer() method looks like the following:

Notice how the virtual keyword is followed by QgsVectorLayer* instead of void. This is the return value for this method; it returns a QgsVectorLayer object. Technically speaking, * means that the method returns a pointer to an object of type QgsVectorLayer. Fortunately, the Python wrappers automatically handle pointers, so you don't need to worry about this.

Notice the brief description at the bottom of the documentation for this method; while many of the C++ methods have very little, if any, additional information, other methods have quite extensive descriptions. Obviously, you should read these descriptions carefully, as they tell you more about what the method does. Even without any description, the C++ documentation is still useful, as it tells you what the method is called, what parameters it accepts, and what type of data is returned. In the preceding method, you can see that there are three parameters listed between the parentheses. As C++ is a strongly typed language, you have to define the type of each parameter when you define a function. This is helpful for Python programmers, as it tells you what type of value to supply.
Apart from QGIS objects, you might also encounter the following data types in the C++ documentation:

- int: A standard Python integer value
- long: A standard Python long integer value
- float: A standard Python floating point (real) number
- bool: A Boolean value (true or false)
- QString: A string value. Note that the QGIS Python wrappers automatically convert Python strings to C++ strings, so you don't need to deal with QString objects directly
- QList: This object is used to encapsulate a list of other objects. For example, QList<QString*> represents a list of strings

Just as in Python, a method can have default values for each parameter. For example, the QgisInterface.newProject() method looks like the following:

In this case, the thePromptToSaveFlag parameter has a default value, and this default value will be used if no value is supplied. In Python, classes are initialized using the __init__ method. In C++, this is called a constructor. For example, the constructor for the QgsLabel class looks like the following:

Just as in Python, C++ classes inherit the methods defined in their superclass. Fortunately, QGIS doesn't have an extensive class hierarchy, so most of the classes don't have a superclass. However, don't forget to check for a superclass if you can't find the method you're looking for in the documentation for the class itself. Finally, be aware that C++ supports the concept of method overloading. A single method can be defined more than once, where each version accepts a different set of parameters. For example, take a look at the constructor for the QgsRectangle class; you will see that there are four different versions of this method.
The first version accepts four coordinates as floating point numbers. The second version constructs a rectangle using two QgsPoint objects. The third version copies the coordinates from a QRectF (which is a Qt data type) into a QgsRectangle object. The final version copies the coordinates from another QgsRectangle object. The C++ compiler chooses the correct method to use based on the parameters that have been supplied. Python has no concept of method overloading; just call the method with the parameters you want to supply, and the QGIS Python wrappers will automatically choose the correct version for you.

If you keep these guidelines in mind, deciphering the C++ documentation for QGIS isn't all that hard. It just looks more complicated than it really is, thanks to all the complexity specific to C++. However, it won't take long for your brain to start filtering out the C++, and you'll be using the QGIS reference documentation almost as easily as if it were written for Python rather than C++.

Organization of the QGIS Python libraries

Now that we can understand the C++-oriented documentation, let's see how the PyQGIS libraries are structured. All of the PyQGIS libraries are organized under a package named qgis. You wouldn't normally import qgis directly, however, as all the interesting libraries are subpackages within this main package. Here are the five packages that make up the PyQGIS library:

- qgis.core: This provides access to the core GIS functionality used throughout QGIS
- qgis.gui: This defines a range of GUI widgets that you can include in your own programs
- qgis.analysis: This provides spatial analysis tools to analyze vector and raster format data
- qgis.networkanalysis: This provides tools to build and analyze topologies
- qgis.utils: This implements miscellaneous functions that allow you to work with the QGIS application using Python
The first two packages (qgis.core and qgis.gui) implement the most important parts of the PyQGIS library, and it's worth spending some time becoming more familiar with the concepts and classes they define. Now let's take a closer look at these two packages.

The qgis.core package

The qgis.core package defines fundamental classes used throughout the QGIS system. A large part of this package is dedicated to working with vector and raster format geospatial data, and to displaying these types of data within a map. Let's take a look at how this is done.

Maps and map layers

A map consists of multiple layers drawn one on top of the other. There are three types of map layers supported by QGIS:

- Vector layer: This layer draws geospatial features such as points, lines, and polygons
- Raster layer: This layer draws raster (bitmapped) data onto a map
- Plugin layer: This layer allows a plugin to draw directly onto a map

Each of these types of map layers has a corresponding class within the qgis.core library. For example, a vector map layer is represented by an object of type qgis.core.QgsVectorLayer. We will take a closer look at vector and raster map layers shortly. Before we do this, though, we need to learn how geospatial data (both vector and raster) is positioned on a map.

Coordinate reference systems

Since the Earth is a three-dimensional object and maps represent the Earth's surface as a two-dimensional plane, there has to be a way of translating points on the Earth's surface into (x,y) coordinates within a map. This is done using a Coordinate Reference System (CRS).

Globe image courtesy Wikimedia (http://commons.wikimedia.org/wiki/File:Rotating_globe.gif)

A CRS has two parts: an ellipsoid, which is a mathematical model of the Earth's surface, and a projection, which is a formula that converts points on the surface of the ellipsoid into (x,y) coordinates on a map. Generally, you won't need to worry about all these details.
You can simply select the appropriate CRS that matches the CRS of the data you are using. However, as many different coordinate reference systems have been devised over the years, it is vital that you use the correct CRS when plotting your geospatial data. If you don't, your features will be displayed in the wrong place or have the wrong shape. The majority of geospatial data available today uses the EPSG 4326 coordinate reference system (sometimes also referred to as WGS84). This CRS defines coordinates as latitude and longitude values, and is the default CRS used for new data imported into QGIS. However, if your data uses a different coordinate reference system, you will need to create and use a different CRS for your map layer. The qgis.core.QgsCoordinateReferenceSystem class represents a CRS. Once you create your coordinate reference system, you can tell your map layer to use that CRS when accessing the underlying data. For example:

crs = QgsCoordinateReferenceSystem(4326,
          QgsCoordinateReferenceSystem.EpsgCrsId)
layer.setCrs(crs)

Note that different map layers can use different coordinate reference systems. Each layer will use its CRS when drawing the contents of the layer onto the map.

Vector layers

A vector layer draws geospatial data onto a map in the form of points, lines, polygons, and so on. Vector-format geospatial data is typically loaded from a vector data source such as a shapefile or a database. Other vector data sources can hold vector data in memory, or load it from a web service across the Internet. A vector-format data source contains a number of features, where each feature represents a single record within the data source. The qgis.core.QgsFeature class represents a feature within a data source. Each feature has the following components:

- ID: This is the feature's unique identifier within the data source.
- Geometry: This is the underlying point, line, polygon, and so on, which represents the feature on the map.
For example, a data source representing cities would have one feature for each city, where the geometry would typically be either a point representing the center of the city, or a polygon (or multipolygon) representing the city's outline.

- Attributes: These are key/value pairs that provide additional information about the feature. For example, a city data source might have attributes such as total_area, population, elevation, and so on. Attribute values can be strings, integers, or floating point numbers.

In QGIS, a data provider allows the vector layer to access the features within the data source. The data provider, an instance of qgis.core.QgsVectorDataProvider, includes:

- The geometry type that is stored in the data source
- A list of fields that provide information about the attributes stored for each feature
- The ability to search through the features within the data source, using the getFeatures() method and the QgsFeatureRequest class

You can access the various vector (and also raster) data providers by using the qgis.core.QgsProviderRegistry class. The vector layer itself is represented by a qgis.core.QgsVectorLayer object. Each vector layer includes:

- Data provider: This is the connection to the underlying file or database that holds the geospatial information to be displayed
- Coordinate reference system: This indicates which CRS the geospatial data uses
- Renderer: This chooses how the features are to be displayed

Let's take a closer look at the concept of a renderer and how features are displayed within a vector map layer.

Displaying vector data

The features within a vector map layer are displayed using a combination of renderer and symbol objects. The renderer chooses the symbol to use for a given feature, and the symbol does the actual drawing.
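This division of labour can be pictured in plain Python. The sketch below (no QGIS required; the feature dictionaries and symbol names are invented for illustration) models a renderer as a function that maps a feature to the symbol used to draw it, which is conceptually what QGIS renderers do:

```python
def make_categorized_renderer(attribute, mapping, default_symbol):
    """Return a renderer: given a feature, pick the symbol to draw it with."""
    def choose_symbol(feature):
        return mapping.get(feature[attribute], default_symbol)
    return choose_symbol

# Features here are plain dicts standing in for QgsFeature attribute maps
renderer = make_categorized_renderer(
    "GENDER", {"M": "blue marker", "F": "red marker"}, "grey marker")

print(renderer({"GENDER": "F"}))   # red marker
print(renderer({"GENDER": "X"}))   # grey marker (fallback)
```

The real QGIS classes add a great deal of machinery on top of this idea, but the core contract is the same: the renderer picks, the symbol draws.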
There are three basic types of symbols defined by QGIS:

- Marker symbol: This displays a point as a filled circle
- Line symbol: This draws a line using a given line width and color
- Fill symbol: This draws the interior of a polygon with a given color

These three types of symbols are implemented as subclasses of the qgis.core.QgsSymbolV2 class:

- qgis.core.QgsMarkerSymbolV2
- qgis.core.QgsLineSymbolV2
- qgis.core.QgsFillSymbolV2

Internally, symbols are rather complex, as "symbol layers" allow multiple elements to be drawn on top of each other. In most cases, however, you can make use of the "simple" version of a symbol. This makes it easier to create a new symbol without having to deal with the internal complexity of symbol layers. For example:

symbol = QgsMarkerSymbolV2.createSimple({'width' : 1.0,
                                         'color' : "255,0,0"})

While symbols draw the features onto the map, a renderer is used to choose which symbol to use to draw a particular feature. In the simplest case, the same symbol is used for every feature within a layer. This is called a single symbol renderer, and is represented by the qgis.core.QgsSingleSymbolRendererV2 class. Other possibilities include:

- Categorized symbol renderer (qgis.core.QgsCategorizedSymbolRendererV2): This renderer chooses a symbol based on the value of an attribute. The categorized symbol renderer has a mapping from attribute values to symbols.
- Graduated symbol renderer (qgis.core.QgsGraduatedSymbolRendererV2): This type of renderer has a set of attribute value ranges, and maps each range to an appropriate symbol.

Using a single symbol renderer is very straightforward:

symbol = ...
renderer = QgsSingleSymbolRendererV2(symbol)
layer.setRendererV2(renderer)

To use a categorized symbol renderer, you first define a list of qgis.core.QgsRendererCategoryV2 objects, and then use that to create the renderer. For example:

symbol_male = ...
symbol_female = ...
categories = []
categories.append(QgsRendererCategoryV2("M", symbol_male, "Male"))
categories.append(QgsRendererCategoryV2("F", symbol_female, "Female"))
renderer = QgsCategorizedSymbolRendererV2("", categories)
renderer.setClassAttribute("GENDER")
layer.setRendererV2(renderer)

Notice that the QgsRendererCategoryV2 constructor takes three parameters: the desired value, the symbol to use, and a label that describes the category. Finally, to use a graduated symbol renderer, you define a list of qgis.core.QgsRendererRangeV2 objects and then use that to create your renderer. For example:

symbol1 = ...
symbol2 = ...

ranges = []
ranges.append(QgsRendererRangeV2(0, 10, symbol1, "Range 1"))
ranges.append(QgsRendererRangeV2(11, 20, symbol2, "Range 2"))

renderer = QgsGraduatedSymbolRendererV2("", ranges)
renderer.setClassAttribute("FIELD")
layer.setRendererV2(renderer)

Accessing vector data

In addition to displaying the contents of a vector layer within a map, you can use Python to directly access the underlying data. This can be done using the data provider's getFeatures() method. For example, to iterate over all the features within the layer, you can do the following:

provider = layer.dataProvider()
for feature in provider.getFeatures(QgsFeatureRequest()):
    ...

If you want to search for features based on some criteria, you can use the QgsFeatureRequest object's setFilterExpression() method, as follows:

provider = layer.dataProvider()
request = QgsFeatureRequest()
request.setFilterExpression('"GENDER" = "M"')
for feature in provider.getFeatures(request):
    ...

Once you have the features, it's easy to access each feature's geometry, ID, and attributes. For example:

geometry = feature.geometry()
id = feature.id()
name = feature.attribute("NAME")

The object returned by the feature.geometry() call, which will be an instance of qgis.core.QgsGeometry, represents the feature's geometry.
This object has a number of methods you can use to extract the underlying data and perform various geospatial calculations.

Spatial indexes

In the previous section, we searched for features based on their attribute values. There are times, though, when you might want to find features based on their position in space. For example, you might want to find all features that lie within a certain distance of a given point. To do this, you can use a spatial index, which indexes features according to their location and extent. Spatial indexes are represented in QGIS by the QgsSpatialIndex class. For performance reasons, a spatial index is not created automatically for each vector layer. However, it's easy to create one when you need it:

provider = layer.dataProvider()
index = QgsSpatialIndex()
for feature in provider.getFeatures(QgsFeatureRequest()):
    index.insertFeature(feature)

Don't forget that you can use the QgsFeatureRequest.setFilterExpression() method to limit the set of features that get added to the index. Once you have the spatial index, you can use it to perform queries based on the position of the features. In particular:

- You can find one or more features that are closest to a given point using the nearestNeighbor() method. For example:

  features = index.nearestNeighbor(QgsPoint(long, lat), 5)

  Note that this method takes two parameters: the desired point as a QgsPoint object and the number of features to return.

- You can find all features that intersect with a given rectangular area by using the intersects() method, as follows:

  features = index.intersects(QgsRectangle(left, bottom, right, top))

Raster layers

Raster-format geospatial data is essentially a bitmapped image, where each pixel or "cell" in the image corresponds to a particular part of the Earth's surface. Raster data is often organized into bands, where each band represents a different piece of information.
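A band can be thought of as one two-dimensional grid of values, with one value per pixel position. As a plain-Python illustration (no QGIS involved; the tiny grids below are invented data), a three-band raster is just three parallel grids, and reading a pixel means combining the value from each band:

```python
# Three bands for a 2x2 raster: red, green, and blue components
red   = [[255,   0],
         [ 10,  20]]
green = [[  0, 255],
         [ 10,  20]]
blue  = [[  0,   0],
         [255,  20]]

def pixel_color(x, y):
    """Combine the per-band values at (x, y) into an (r, g, b) tuple."""
    return (red[y][x], green[y][x], blue[y][x])

print(pixel_color(0, 0))   # (255, 0, 0): a pure red pixel
```

Real raster providers expose the same idea through band numbers rather than separate Python lists, as the following sections show.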
A common use for bands is to store the red, green, and blue components of a pixel's color in separate bands. Bands might also represent other types of information, such as moisture level, elevation, or soil type. There are many ways in which raster information can be displayed. For example:

- If the raster data has only one band, the pixel value can be used as an index into a palette. The palette maps each pixel value to a particular color.
- If the raster data has only one band but no palette is provided, the pixel values can be used directly as grayscale values; that is, larger numbers are lighter and smaller numbers are darker. Alternatively, the pixel values can be passed through a pseudocolor algorithm to calculate the color to be displayed.
- If the raster data has multiple bands, then typically the bands will be combined to generate the desired color. For example, one band might represent the red component of the color, another band the green component, and yet another the blue component.
- Alternatively, a multiband raster data source might be drawn using a palette, or as a grayscale or pseudocolor image, by selecting a particular band to use for the color calculation.

Let's take a closer look at how raster data can be drawn onto the map.

Displaying raster data

The drawing style associated with the raster band controls how the raster data will be displayed. The following drawing styles are currently supported:

- PalettedColor: For a single-band raster data source, a palette maps each raster value to a color.
- SingleBandGray: For a single-band raster data source, the raster value is used directly as a grayscale value.
- SingleBandPseudoColor: For a single-band raster data source, the raster value is used to calculate a pseudocolor.
- PalettedSingleBandGray: For a single-band raster data source that has a palette, this drawing style tells QGIS to ignore the palette and use the raster value directly as a grayscale value.
- PalettedSingleBandPseudoColor: For a single-band raster data source that has a palette, this drawing style tells QGIS to ignore the palette and use the raster value to calculate a pseudocolor.
- MultiBandColor: For multiband raster data sources, use a separate band for each of the red, green, and blue color components. For this drawing style, the setRedBand(), setGreenBand(), and setBlueBand() methods can be used to choose which band to use for each color component.
- MultiBandSingleBandGray: For multiband raster data sources, choose a single band to use as the grayscale color value. For this drawing style, use the setGrayBand() method to specify the band to use.
- MultiBandSingleBandPseudoColor: For multiband raster data sources, choose a single band to use to calculate a pseudocolor. For this drawing style, use the setGrayBand() method to specify the band to use.

To set the drawing style, use the layer.setDrawingStyle() method, passing in a string that contains the name of the desired drawing style. You will also need to call the various setXXXBand() methods, as described in the preceding list, to tell the raster layer which bands contain the value(s) used to draw each pixel. Note that QGIS doesn't automatically update the map when you call the preceding functions to change the way the raster data is displayed. To have your changes displayed right away, you'll need to do the following:

- Turn off raster image caching by calling layer.setImageCache(None)
- Tell the raster layer to redraw itself by calling layer.triggerRepaint()

Accessing raster data

As with vector-format data, you can access the underlying raster data via the data provider's identify() method.
The easiest way to do this is to pass in a single coordinate and retrieve the value or values at that coordinate. For example:

provider = layer.dataProvider()
values = provider.identify(QgsPoint(x, y),
             QgsRaster.IdentifyFormatValue)
if values.isValid():
    for band, value in values.results().items():
        ...

As you can see, you need to check whether the given coordinate exists within the raster data (using the isValid() call). The values.results() method returns a dictionary that maps band numbers to values. Using this technique, you can extract all the underlying data associated with a given coordinate within the raster layer. You can also use the provider.block() method to retrieve the band data for a large number of coordinates all at once. We will look at how to do this later in this article.

Other useful qgis.core classes

Apart from all the classes and functionality involved in working with data sources and map layers, the qgis.core library also defines a number of other classes that you might find useful:

- QgsProject: This represents the current QGIS project. Note that this is a singleton object, as only one project can be open at a time. The QgsProject class is responsible for loading and storing properties, which can be useful for plugins.
- QGis: This class defines various constants, data types, and functions used throughout the QGIS system.
- QgsPoint: This is a generic class that stores the coordinates for a point within a two-dimensional plane.
- QgsRectangle: This is a generic class that stores the coordinates for a rectangular area within a two-dimensional plane.
- QgsRasterInterface: This is the base class to use for processing raster data. It can be used to reproject a set of raster data into a new coordinate system, to apply filters to change the brightness or color of the raster data, to resample the raster data, and to generate new raster data by rendering the existing data in various ways.
- QgsDistanceArea: This class can be used to calculate distances and areas for a given geometry, automatically converting from the source coordinate reference system into meters.
- QgsMapLayerRegistry: This class provides access to all the registered map layers in the current project.
- QgsMessageLog: This class provides general logging features within a QGIS program. It lets you send debugging messages, warnings, and errors to the QGIS "Log Messages" panel.

The qgis.gui package

The qgis.gui package defines a number of user-interface widgets that you can include in your programs. Let's start by looking at the most important qgis.gui classes, and follow this up with a brief look at some of the other classes that you might find useful.

The QgisInterface class

QgisInterface represents the QGIS system's user interface. It allows programmatic access to the map canvas, the menu bar, and other parts of the QGIS application. When running Python code within a script or a plugin, or directly from the QGIS Python console, a reference to QgisInterface is typically available through the iface global variable. The QgisInterface object is only available when running the QGIS application itself; if you are running an external application and import the PyQGIS library into your application, QgisInterface won't be available. Some of the more important things you can do with the QgisInterface object are:

- Get a reference to the list of layers within the current QGIS project via the legendInterface() method.
- Get a reference to the map canvas displayed within the main application window, using the mapCanvas() method.
- Retrieve the currently active layer within the project, using the activeLayer() method, and set the currently active layer by using the setActiveLayer() method.
- Get a reference to the application's main window by calling the mainWindow() method. This can be useful if you want to create additional Qt windows or dialogs that use the main window as their parent.
- Get a reference to the QGIS system's message bar by calling the messageBar() method. This allows you to display messages to the user directly within the QGIS main window.

The QgsMapCanvas class

The map canvas is responsible for drawing the various map layers into a window. The QgsMapCanvas class represents a map canvas. This class includes:

- A list of the currently shown map layers, which can be accessed using the layers() method. Note that there is a subtle difference between the list of map layers available within the map canvas and the list of map layers included in the QgisInterface.legendInterface() method. The map canvas's list only includes the layers that are currently visible, while QgisInterface.legendInterface() returns all the map layers, including those that are currently hidden.
- The map units used by this map (meters, feet, degrees, and so on). The map's units can be retrieved by calling the mapUnits() method.
- An extent, which is the area of the map currently shown within the canvas. The map's extent will change as the user zooms in and out, and pans across the map. The current map extent can be obtained by calling the extent() method.
- A current map tool that controls the user's interaction with the contents of the map canvas. The current map tool can be set using the setMapTool() method, and you can retrieve the current map tool (if any) by calling the mapTool() method.
- A background color used to draw the background behind all the map layers. You can change the map's background color by calling the canvasColor() method.
- A coordinate transform that converts from map coordinates (that is, coordinates in the data source's coordinate reference system) to pixels within the window. You can retrieve the current coordinate transform by calling the getCoordinateTransform() method.

The QgsMapCanvasItem class

A map canvas item is an item drawn on top of the map canvas. The map canvas item will appear in front of the map layers.
While you can create your own subclass of QgsMapCanvasItem if you want to draw custom items on top of the map canvas, it is generally more useful to make use of an existing subclass that does the work for you. There are currently three subclasses of QgsMapCanvasItem that you might find useful:

- QgsVertexMarker: This draws an icon (an "X", a "+", or a small box) centered on a given point on the map.
- QgsRubberBand: This draws an arbitrary polygon or polyline onto the map. It is intended to provide visual feedback as the user draws a polygon onto the map.
- QgsAnnotationItem: This is used to display additional information about a feature, in the form of a balloon that is connected to the feature. The QgsAnnotationItem class has various subclasses that allow you to customize the way the information is displayed.

The QgsMapTool class

A map tool allows the user to interact with and manipulate the map canvas, capturing mouse events and responding appropriately. A number of QgsMapTool subclasses provide standard map interaction behavior, such as clicking to zoom in, dragging to pan the map, and clicking on a feature to identify it. You can also create your own custom map tools by subclassing QgsMapTool and implementing the various methods that respond to user-interface events, such as pressing down the mouse button, dragging the canvas, and so on. Once you have created a map tool, you can allow the user to activate it by associating the map tool with a toolbar button. Alternatively, you can activate it from within your Python code by calling the mapCanvas.setMapTool(...) method.
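The event-hook structure of such a tool can be sketched without QGIS. In this plain-Python illustration (the class and the event dictionary are invented stand-ins; a real tool would subclass QgsMapTool and receive Qt mouse events), the canvas calls the tool's event methods as the user interacts with the map:

```python
class ClickLoggerTool:
    """Sketch of a map tool: records where the user clicks on the canvas."""
    def __init__(self):
        self.clicks = []

    def canvasPressEvent(self, event):
        # In QGIS, the event carries canvas coordinates; here it's a dict
        self.clicks.append((event["x"], event["y"]))

tool = ClickLoggerTool()
tool.canvasPressEvent({"x": 10, "y": 20})   # the canvas would invoke this
print(tool.clicks)                          # [(10, 20)]
```

The key design point is inversion of control: your tool never polls for input; the canvas pushes events into whichever tool is currently active.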
We will look at the process of creating your own custom map tool later in this article, in the Using the PyQGIS library section.

Other useful qgis.gui classes

While the qgis.gui package defines a large number of classes, the ones you are most likely to find useful are the following:

- QgsLegendInterface: This provides access to the map legend, that is, the list of map layers within the current project. Note that map layers can be grouped, hidden, and shown within the map legend.
- QgsMapTip: This displays a tip on a map canvas when the user holds the mouse over a feature. The map tip shows the display field for the feature; you can set this by calling layer.setDisplayField("FIELD").
- QgsColorDialog: This is a dialog box that allows the user to select a color.
- QgsDialog: This is a generic dialog with a vertical box layout and a button box, making it easy to add content and standard buttons to your dialog.
- QgsMessageBar: This is a user-interface widget for displaying non-blocking messages to the user. We looked at the message bar class in the previous article.
- QgsMessageViewer: This is a generic class that displays long messages to the user within a modal dialog.
- QgsBlendModeComboBox, QgsBrushStyleComboBox, QgsColorRampComboBox, QgsPenCapStyleComboBox, QgsPenJoinStyleComboBox, QgsScaleComboBox: These QComboBox user-interface widgets allow you to prompt the user for various drawing options. With the exception of QgsScaleComboBox, which lets the user choose a map scale, these subclasses let the user choose various Qt drawing options.

Using the PyQGIS library

In the previous sections, we looked at a number of classes provided by the PyQGIS library. Let's make use of these classes to perform some real-world geospatial development tasks.

Analyzing raster data

We're going to start by writing a program to load in some raster-format data and analyze its contents.
To make this more interesting, we'll use a Digital Elevation Model (DEM) file, which is a raster-format data file that contains elevation data. The Global Land One-Kilometer Base Elevation Project (GLOBE) provides free DEM data for the world, where each pixel represents one square kilometer of the Earth's surface. GLOBE data can be downloaded from http://www.ngdc.noaa.gov/mgg/topo/gltiles.html. Download the E tile, which includes the western half of the USA. The resulting file, which is named e10g, contains the height information you need. You'll also need to download the e10g.hdr header file so that QGIS can read the file; you can download this from http://www.ngdc.noaa.gov/mgg/topo/elev/esri/hdr. Once you've downloaded these two files, put them together into a convenient directory.

You can now load the DEM data into QGIS using the following code:

registry = QgsProviderRegistry.instance()
provider = registry.provider("gdal", "/path/to/e10g")

Unfortunately, there is a slight complexity here. Since QGIS doesn't know which coordinate reference system is used for the data, it displays a dialog box that asks you to choose the CRS. Since the GLOBE DEM data is in the WGS84 CRS, which QGIS uses by default, this dialog box is redundant. To disable it, you need to add the following to the top of your program:

from PyQt4.QtCore import QSettings
QSettings().setValue("/Projections/defaultBehaviour", "useGlobal")

Now that we've loaded our raster DEM data into QGIS, we can analyze it. There are lots of things we can do with DEM data, so let's calculate how often each unique elevation value occurs within the data. Notice that we're loading the DEM data directly using a QgsRasterDataProvider. We don't want to display this information on a map, so we don't want (or need) to load it into a QgsRasterLayer.

Since the DEM data is in a raster format, you need to iterate over the individual pixels or cells to get each height value.
The provider.xSize() and provider.ySize() methods tell us how many cells are in the DEM, while the provider.extent() method gives us the area of the Earth's surface covered by the DEM. Using this information, we can extract the individual elevation values from the contents of the DEM in the following way:

raster_extent = provider.extent()
raster_width = provider.xSize()
raster_height = provider.ySize()
block = provider.block(1, raster_extent, raster_width, raster_height)

The returned block variable is an object of type QgsRasterBlock, which is essentially a two-dimensional array of values. Let's iterate over the raster and extract the individual elevation values:

for x in range(raster_width):
    for y in range(raster_height):
        elevation = block.value(x, y)
        ...

Now that we've loaded the individual elevation values, it's easy to build a histogram out of those values. Here is the entire program to load the DEM data into memory, and to calculate and display the histogram:

from PyQt4.QtCore import QSettings
QSettings().setValue("/Projections/defaultBehaviour", "useGlobal")

registry = QgsProviderRegistry.instance()
provider = registry.provider("gdal", "/path/to/e10g")

raster_extent = provider.extent()
raster_width = provider.xSize()
raster_height = provider.ySize()
no_data_value = provider.srcNoDataValue(1)

histogram = {}  # Maps elevation to number of occurrences.

block = provider.block(1, raster_extent, raster_width, raster_height)
if block.isValid():
    for x in range(raster_width):
        for y in range(raster_height):
            elevation = block.value(x, y)
            if elevation != no_data_value:
                try:
                    histogram[elevation] += 1
                except KeyError:
                    histogram[elevation] = 1

for height in sorted(histogram.keys()):
    print height, histogram[height]

Note that we've added a no data value check to the code. Raster data often includes pixels that have no value associated with them.
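The no-data-aware counting at the heart of this program doesn't depend on QGIS at all. Here is a minimal sketch of the same histogram logic using plain Python lists in place of a QgsRasterBlock; the tiny sample grid and the no-data value of -500 are invented for illustration:

```python
from collections import Counter

def elevation_histogram(rows, no_data_value):
    """Count how often each elevation occurs, skipping no-data cells."""
    histogram = Counter()
    for row in rows:
        for elevation in row:
            if elevation != no_data_value:
                histogram[elevation] += 1
    return histogram

# A tiny 3x3 "raster"; -500 marks cells with no elevation data.
grid = [[100, 200, -500],
        [100, 100, 200],
        [-500, 300, 100]]

histogram = elevation_histogram(grid, -500)
for height in sorted(histogram):
    print("%s %s" % (height, histogram[height]))
```

Here, Counter plays the role of the try/except KeyError idiom in the QGIS version: a missing key simply starts counting from zero.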
In the case of a DEM, elevation data is only provided for areas of land; pixels over the sea have no elevation, and we have to exclude them, or our histogram will be inaccurate.

Manipulating vector data and saving it to a shapefile

Let's create a program that takes two vector data sources, subtracts one set of vectors from the other, and saves the resulting geometries into a new shapefile. Along the way, we'll learn a few important things about the PyQGIS library.

We'll be making use of the QgsGeometry.difference() function, which performs a geometrical subtraction of one geometry from another.

Let's start by asking the user to select the first shapefile and open up a vector data provider for that file:

filename_1 = QFileDialog.getOpenFileName(iface.mainWindow(),
                                         "First Shapefile",
                                         "~", "*.shp")
if not filename_1:
    return

registry = QgsProviderRegistry.instance()
provider_1 = registry.provider("ogr", filename_1)

We can then read the geometries from that file into memory:

geometries_1 = []
for feature in provider_1.getFeatures(QgsFeatureRequest()):
    geometries_1.append(QgsGeometry(feature.geometry()))

The last line of code does something very important that may not be obvious at first. Notice that we use the following:

QgsGeometry(feature.geometry())

We use the preceding line instead of the following:

feature.geometry()

This creates a new instance of the QgsGeometry object, copying the geometry into a new object, rather than just adding the existing geometry object to the list. We have to do this because of a limitation of the way the QGIS Python wrappers work: the feature.geometry() method returns a reference to the geometry, but the C++ code doesn't know that you are storing this reference away in your Python code. So, when the feature is no longer needed, the memory used by the feature's geometry is also released. If you then try to access that geometry later on, the entire QGIS system will crash.
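The distinction between storing a reference and storing a copy is easy to demonstrate in plain Python. The crash described above is specific to the C++ wrappers releasing memory, but the copying idiom is the same; this sketch uses a made-up Geometry class purely for illustration:

```python
import copy

class Geometry:
    """A stand-in for QgsGeometry, holding a list of coordinate points."""
    def __init__(self, points):
        self.points = points

original = Geometry([(0, 0), (1, 1)])

reference = original                  # Both names refer to one object.
duplicate = copy.deepcopy(original)   # An independent copy.

original.points.append((2, 2))        # Mutate the original.

print(len(reference.points))   # The reference sees the change.
print(len(duplicate.points))   # The copy is unaffected.
```

Calling QgsGeometry(feature.geometry()) plays the role of deepcopy() here: the copy remains valid no matter what happens to the original object.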
To get around this, we make a copy of the geometry so that we can refer to it even after the feature's memory has been released.

Now that we've loaded our first set of geometries into memory, let's do the same for the second shapefile:

filename_2 = QFileDialog.getOpenFileName(iface.mainWindow(),
                                         "Second Shapefile",
                                         "~", "*.shp")
if not filename_2:
    return

provider_2 = registry.provider("ogr", filename_2)

geometries_2 = []
for feature in provider_2.getFeatures(QgsFeatureRequest()):
    geometries_2.append(QgsGeometry(feature.geometry()))

With the two sets of geometries loaded into memory, we're ready to start subtracting one from the other. However, to make this process more efficient, we will first combine the geometries from the second shapefile into one large geometry, which we can then subtract all at once, rather than subtracting one at a time. This will make the subtraction process much faster:

combined_geometry = None
for geometry in geometries_2:
    if combined_geometry == None:
        combined_geometry = geometry
    else:
        combined_geometry = combined_geometry.combine(geometry)

We can now calculate the new set of geometries by subtracting one from the other:

dst_geometries = []
for geometry in geometries_1:
    dst_geometry = geometry.difference(combined_geometry)
    if not dst_geometry.isGeosValid(): continue
    if dst_geometry.isGeosEmpty(): continue
    dst_geometries.append(dst_geometry)

Notice that we check to ensure that the destination geometry is mathematically valid and isn't empty. Invalid geometries are a common problem when manipulating complex shapes. There are options for fixing them, such as splitting apart multi-geometries and performing a buffer operation.

Our last task is to save the resulting geometries into a new shapefile.
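As an aside, the benefit of combining before subtracting can be seen with plain Python sets standing in for geometries, where set difference plays the role of QgsGeometry.difference(); this is only a sketch of the pattern, not PyQGIS code:

```python
# Each "geometry" is modeled as the set of grid cells it covers.
geometries_1 = [{1, 2, 3}, {3, 4, 5}]
geometries_2 = [{2}, {4}]

# Naive approach: one subtraction per pair of geometries.
naive = []
for geometry in geometries_1:
    result = geometry
    for other in geometries_2:
        result = result - other
    naive.append(result)

# Optimized approach: combine once, then perform a single subtraction
# per source geometry.
combined = set()
for other in geometries_2:
    combined = combined | other

optimized = [geometry - combined for geometry in geometries_1]

print(naive == optimized)  # Same result, fewer difference operations.
```

Geometric difference operations are far more expensive than set difference, so reducing the number of them matters much more with real geometries than it does in this toy example.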
We'll first ask the user for the name of the destination shapefile:

dst_filename = QFileDialog.getSaveFileName(iface.mainWindow(),
                                           "Save results to:",
                                           "~", "*.shp")
if not dst_filename:
    return

We'll make use of a vector file writer to save the geometries into a shapefile. Let's start by initializing the file writer object:

fields = QgsFields()
writer = QgsVectorFileWriter(dst_filename, "ASCII", fields,
                             dst_geometries[0].wkbType(),
                             None, "ESRI Shapefile")
if writer.hasError() != QgsVectorFileWriter.NoError:
    print "Error!"
    return

We don't have any attributes in our shapefile, so the fields list is empty. Now that the writer has been set up, we can save the geometries into the file:

for geometry in dst_geometries:
    feature = QgsFeature()
    feature.setGeometry(geometry)
    writer.addFeature(feature)
del writer  # Deleting the writer flushes the remaining features to disk.

Now that all the data has been written to the disk, let's display a message box that informs the user that we've finished:

QMessageBox.information(iface.mainWindow(), "",
                        "Subtracted features saved to disk.")

As you can see, creating a new shapefile is very straightforward in PyQGIS, and it's easy to manipulate geometries using Python, just so long as you copy any QgsGeometry objects you want to keep around. If your Python code starts to crash while manipulating geometries, this is probably the first thing you should look for.

Using different symbols for different features within a map

Let's use the World Borders Dataset that you downloaded in the previous article to draw a world map, using different symbols for different continents. This is a good example of using a categorized symbol renderer, though we'll combine it into a script that loads the shapefile into a map layer, and sets up the symbols and map renderer to display the map exactly as you want it. We'll then save this map as an image.
Let's start by creating a map layer to display the contents of the World Borders Dataset shapefile:

layer = iface.addVectorLayer("/path/to/TM_WORLD_BORDERS-0.3.shp",
                             "continents", "ogr")

Each unique region code in the World Borders Dataset shapefile corresponds to a continent. We want to define the name and color to use for each of these regions, and use this information to set up the various categories to use when displaying the map:

from PyQt4.QtGui import QColor

categories = []
for value, color, label in [(0,   "#660000", "Antarctica"),
                            (2,   "#006600", "Africa"),
                            (9,   "#000066", "Oceania"),
                            (19,  "#660066", "The Americas"),
                            (142, "#666600", "Asia"),
                            (150, "#006666", "Europe")]:
    symbol = QgsSymbolV2.defaultSymbol(layer.geometryType())
    symbol.setColor(QColor(color))
    categories.append(QgsRendererCategoryV2(value, symbol, label))

With these categories set up, we simply update the map layer to use a categorized renderer based on the value of the region attribute, and then redraw the map:

layer.setRendererV2(QgsCategorizedSymbolRendererV2("region", categories))
layer.triggerRepaint()

There's only one more thing to do; since this is a script that can be run multiple times, let's have our script automatically remove the existing continents layer, if it exists, before adding a new one. To do this, we can add the following to the start of our script:

layer_registry = QgsMapLayerRegistry.instance()
for layer in layer_registry.mapLayersByName("continents"):
    layer_registry.removeMapLayer(layer.id())

When our script is run, it will create one (and only one) layer that shows the various continents in different colors.
These will appear as different shades of gray in the printed article, but the colors will be visible on the computer screen.

Now, let's use the same dataset to color each country based on its relative population. We'll start by removing the existing population layer, if it exists:

layer_registry = QgsMapLayerRegistry.instance()
for layer in layer_registry.mapLayersByName("population"):
    layer_registry.removeMapLayer(layer.id())

Next, we open the World Borders Dataset into a new layer called "population":

layer = iface.addVectorLayer("/path/to/TM_WORLD_BORDERS-0.3.shp",
                             "population", "ogr")

We then need to set up our various population ranges:

from PyQt4.QtGui import QColor

ranges = []
for min_pop, max_pop, color in [(0,        99999,     "#332828"),
                                (100000,   999999,    "#4c3535"),
                                (1000000,  4999999,   "#663d3d"),
                                (5000000,  9999999,   "#804040"),
                                (10000000, 19999999,  "#993d3d"),
                                (20000000, 49999999,  "#b33535"),
                                (50000000, 999999999, "#cc2828")]:
    symbol = QgsSymbolV2.defaultSymbol(layer.geometryType())
    symbol.setColor(QColor(color))
    ranges.append(QgsRendererRangeV2(min_pop, max_pop, symbol, ""))

Now that we have our population ranges and their associated colors, we simply set up a graduated symbol renderer to choose a symbol based on the value of the pop2005 attribute, and tell the map to redraw itself:

layer.setRendererV2(QgsGraduatedSymbolRendererV2("pop2005", ranges))
layer.triggerRepaint()

The result will be a map layer that shades each country according to its population.

Calculating the distance between two user-defined points

In our final example of using the PyQGIS library, we'll write some code that, when run, starts listening for mouse events from the user.
If the user clicks on a point, drags the mouse, and then releases the mouse button again, we will display the distance between those two points. This is a good example of how to add your own map interaction logic to QGIS, using the QgsMapTool class.

This is the basic structure for our QgsMapTool subclass:

class DistanceCalculator(QgsMapTool):
    def __init__(self, iface):
        QgsMapTool.__init__(self, iface.mapCanvas())
        self.iface = iface

    def canvasPressEvent(self, event):
        ...

    def canvasReleaseEvent(self, event):
        ...

To make this map tool active, we'll create a new instance of it and pass it to the mapCanvas.setMapTool() method. Once this is done, our canvasPressEvent() and canvasReleaseEvent() methods will be called whenever the user clicks or releases the mouse button over the map canvas.

Let's start with the code that handles the user clicking on the canvas. In this method, we're going to convert from the pixel coordinates that the user clicked on to the map coordinates (that is, a latitude and longitude value). We'll then remember these coordinates so that we can refer to them later. Here is the necessary code:

def canvasPressEvent(self, event):
    transform = self.iface.mapCanvas().getCoordinateTransform()
    self._startPt = transform.toMapCoordinates(event.pos().x(),
                                               event.pos().y())

When the canvasReleaseEvent() method is called, we'll want to do the same with the point at which the user released the mouse button:

def canvasReleaseEvent(self, event):
    transform = self.iface.mapCanvas().getCoordinateTransform()
    endPt = transform.toMapCoordinates(event.pos().x(),
                                       event.pos().y())

Now that we have the two desired coordinates, we'll want to calculate the distance between them.
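Conceptually, what we need is a great-circle distance between two latitude/longitude points. The following plain-Python haversine function sketches the idea on a sphere of radius 6371 km; QGIS performs a more accurate ellipsoidal version of this calculation for us:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometers,
    assuming a spherical Earth of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2 +
         math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

# One degree of latitude is roughly 111 km.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0)))  # → 111
```

We won't use this function directly; it is only here to show what the distance calculation involves before we hand the work over to QGIS.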
We can do this using a QgsDistanceArea object:

crs = self.iface.mapCanvas().mapRenderer().destinationCrs()
distance_calc = QgsDistanceArea()
distance_calc.setSourceCrs(crs)
distance_calc.setEllipsoid(crs.ellipsoidAcronym())
distance_calc.setEllipsoidalMode(crs.geographicFlag())
distance = distance_calc.measureLine([self._startPt, endPt]) / 1000

Notice that we divide the resulting value by 1000. This is because the QgsDistanceArea object returns the distance in meters, and we want to display the distance in kilometers.

Finally, we'll display the calculated distance in the QGIS message bar:

messageBar = self.iface.messageBar()
messageBar.pushMessage("Distance = %d km" % distance,
                       level=QgsMessageBar.INFO,
                       duration=2)

Now that we've created our map tool, we need to activate it. We can do this by adding the following to the end of our script:

calculator = DistanceCalculator(iface)
iface.mapCanvas().setMapTool(calculator)

With the map tool activated, the user can click and drag on the map. When the mouse button is released, the distance (in kilometers) between the two points will be displayed in the message bar.

Summary

In this article, we took an in-depth look at the PyQGIS libraries and how you can use them in your own programs. We learned that the QGIS Python libraries are implemented as wrappers around the QGIS APIs implemented in C++. We saw how Python programmers can understand and work with the QGIS reference documentation, even though it is written for C++ developers. We also looked at the way the PyQGIS libraries are organized into different packages, and learned about the most important classes defined in the qgis.core and qgis.gui packages.

We then saw how a coordinate reference system (CRS) is used to translate from points on the three-dimensional surface of the Earth to coordinates within a two-dimensional map plane.
We learned that vector-format data is made up of features, where each feature has an ID, a geometry, and a set of attributes, and that symbols are used to draw vector geometries onto a map layer, while renderers are used to choose which symbol to use for a given feature. We learned how a spatial index can be used to speed up access to vector features.

Next, we saw how raster-format data is organized into bands that represent information such as color, elevation, and so on, and looked at the various ways in which a raster data source can be displayed within a map layer. Along the way, we learned how to access the contents of a raster data source.

Finally, we looked at various techniques for performing useful tasks using the PyQGIS library.

In the next article, we will learn more about QGIS Python plugins, and then go on to use the plugin architecture as a way of implementing a useful feature within a mapping application.