Modernizing our Spring Boot app

Packt
26 Nov 2014
15 min read
In this article by Greg L. Turnquist, the author of the book Learning Spring Boot, we will discuss modernizing our Spring Boot app with JavaScript and adding production-ready support features.

Modernizing our app with JavaScript

We just saw that, with a single @Grab statement, Spring Boot automatically configured the Thymeleaf template engine and some specialized view resolvers. We took advantage of Spring MVC's ability to pass attributes to the template through ModelAndView. Instead of figuring out the details of view resolvers, we channeled our efforts into building a handy template to render data fetched from the server. We didn't have to dig through reference docs, Google, and Stack Overflow to figure out how to configure and integrate Spring MVC with Thymeleaf. We let Spring Boot do the heavy lifting.

But that's not enough, right? Any real application is going to have some JavaScript as well. Love it or hate it, JavaScript is the engine of frontend web development. See how the following code lets us make things more modern by creating modern.groovy:

@Grab("org.webjars:jquery:2.1.1")
@Grab("thymeleaf-spring4")
@Controller
class ModernApp {
    def chapters = ["Quick Start With Groovy",
                    "Quick Start With Java",
                    "Debugging and Managing Your App",
                    "Data Access with Spring Boot",
                    "Securing Your App"]

    @RequestMapping("/")
    def home(@RequestParam(value="name", defaultValue="World") String n) {
        new ModelAndView("modern")
            .addObject("name", n)
            .addObject("chapters", chapters)
    }
}

A single @Grab statement pulls in jQuery 2.1.1. The rest of our server-side Groovy code is the same as before. There are multiple ways to use JavaScript libraries. For Java developers, it's especially convenient to use the WebJars project (http://webjars.org), where lots of handy JavaScript libraries are wrapped up with Maven coordinates. Every library is found on the /webjars/<library>/<version>/<module> path. To top it off, Spring Boot comes with prebuilt support. Perhaps you noticed this buried in earlier console outputs:

2014-05-20 08:33:09.062 ... : Mapped URL path [/webjars/**] onto handler of [...]

With jQuery added to our application, we can amp up our template (templates/modern.html) like this:

<html>
<head>
    <title>Learning Spring Boot - Chapter 1</title>
    <script src="webjars/jquery/2.1.1/jquery.min.js"></script>
    <script>
        $(document).ready(function() {
            $('p').animate({fontSize: '48px'}, "slow");
        });
    </script>
</head>
<body>
    <p th:text="'Hello, ' + ${name}"></p>
    <ol>
        <li th:each="chapter : ${chapters}" th:text="${chapter}"></li>
    </ol>
</body>
</html>

What's different between this template and the previous one? It has a couple of extra <script> tags in the head section:

The first one loads jQuery from /webjars/jquery/2.1.1/jquery.min.js (implying that we can also grab jquery.js if we want to debug jQuery).
The second script looks for the <p> element containing our Hello, world! message and then performs an animation that increases the font size to 48 pixels after the DOM is fully loaded into the browser.

If we run spring run modern.groovy and visit http://localhost:8080, we can see this simple but stylish animation. It shows us that all of jQuery is available for us to work with in our application.

Using Bower instead of WebJars

WebJars isn't the only option when it comes to adding JavaScript to our app. More sophisticated UI developers might use Bower (http://bower.io), a popular JavaScript library management tool.
WebJars are useful for Java developers, but not every library has been bundled as a WebJar. There is also a huge community of frontend developers who are more familiar with Bower and NodeJS and will probably prefer using their standard tool chain to do their jobs. We'll see how to plug that into our app.

First, it's important to know some basic options. Spring Boot supports serving up static web resources from the following paths:

/META-INF/resources/
/resources/
/static/
/public/

To craft a Bower-based app with Spring Boot, we first need to create a .bowerrc file in the same folder where we plan to create our Spring Boot CLI application. Let's pick public/ as the folder of choice for JavaScript modules and put it in this file, as shown in the following code:

{"directory": "public/"}

Do I have to use public? No. Again, you can pick any of the folders listed previously and Spring Boot will serve up the code. It's a matter of taste and semantics.

Our first step towards a Bower-based app is to define our project by answering a series of questions (this only has to be done once):

$ bower init
[?] name: app_with_bower
[?] version: 0.1.0
[?] description: Learning Spring Boot - bower sample
[?] main file:
[?] what types of modules does this package expose? amd
[?] keywords:
[?] authors: Greg Turnquist <gturnquist@pivotal.io>
[?] license: ASL
[?] homepage: http://blog.greglturnquist.com/category/learning-springboot
[?] set currently installed components as dependencies? No
[?] add commonly ignored files to ignore list? Yes
[?] would you like to mark this package as private which prevents it from being accidentally published to the registry? Yes
...
[?] Looks good? Yes

Now that we have set up our project, let's do something simple such as installing jQuery with the following command:

$ bower install jquery --save
bower jquery#*    cached git://github.com/jquery/jquery.git#2.1.1
bower jquery#*    validate 2.1.1 against git://github.com/jquery/jquery.git#*

These two commands will have created the following bower.json file:

{
  "name": "app_with_bower",
  "version": "0.1.0",
  "authors": ["Greg Turnquist <gturnquist@pivotal.io>"],
  "description": "Learning Spring Boot - bower sample",
  "license": "ASL",
  "homepage": "http://blog.greglturnquist.com/category/learningspring-boot",
  "private": true,
  "ignore": [
    "**/.*",
    "node_modules",
    "bower_components",
    "public/",
    "test",
    "tests"
  ],
  "dependencies": {
    "jquery": "~2.1.1"
  }
}

It will also have installed jQuery 2.1.1 into our app with the following directory structure:

public
└── jquery
    ├── MIT-LICENSE.txt
    ├── bower.json
    └── dist
        ├── jquery.js
        └── jquery.min.js

We must include --save (two dashes) whenever we install a module. This ensures that our bower.json file is updated at the same time, allowing us to rebuild things if needed.

The altered version of our app with WebJars removed should now look like this:

@Grab("thymeleaf-spring4")
@Controller
class ModernApp {
    def chapters = ["Quick Start With Groovy",
                    "Quick Start With Java",
                    "Debugging and Managing Your App",
                    "Data Access with Spring Boot",
                    "Securing Your App"]

    @RequestMapping("/")
    def home(@RequestParam(value="name", defaultValue="World") String n) {
        new ModelAndView("modern_with_bower")
            .addObject("name", n)
            .addObject("chapters", chapters)
    }
}

The view name has been changed to modern_with_bower, so it doesn't collide with the previous template if found in the same folder.
This version of the template, templates/modern_with_bower.html, should look like this:

<html>
<head>
    <title>Learning Spring Boot - Chapter 1</title>
    <script src="jquery/dist/jquery.min.js"></script>
    <script>
        $(document).ready(function() {
            $('p').animate({fontSize: '48px'}, "slow");
        });
    </script>
</head>
<body>
    <p th:text="'Hello, ' + ${name}"></p>
    <ol>
        <li th:each="chapter : ${chapters}" th:text="${chapter}"></li>
    </ol>
</body>
</html>

The path to jQuery is now jquery/dist/jquery.min.js. The rest is the same as the WebJars example. We just launch the app with spring run modern_with_bower.groovy and navigate to http://localhost:8080. (We might need to refresh the page to ensure the latest HTML is loaded.) The animation should work just the same.

The options shown in this section give us a quick taste of how easy it is to use popular JavaScript tools with Spring Boot. We don't have to fiddle with messy tool chains to achieve a smooth integration. Instead, we can use them the way they are meant to be used.

What about an app that is all frontend with no backend? Perhaps we're building an app that gets all its data from a remote backend. In this age of RESTful backends, it's not uncommon to build a single-page frontend that is fed data updates via AJAX. Spring Boot's Groovy support provides the perfect and arguably smallest way to get started. We do so by creating pure_javascript.groovy, as shown in the following code:

@Controller
class JsApp { }

That doesn't look like much, but it accomplishes a lot. Let's see what this tiny fragment of code actually does for us:

The @Controller annotation, like @RestController, causes Spring Boot to auto-configure Spring MVC.
Spring Boot will launch an embedded Apache Tomcat server.
Spring Boot will serve up static content from resources, static, and public.
Since there are no Spring MVC routes in this tiny fragment of code, things will fall back to resource resolution.

Next, we can create a static/index.html page as follows:

<html>Greetings from pure HTML which can, in turn, load JavaScript!</html>

Run spring run pure_javascript.groovy and navigate to http://localhost:8080. We will see the preceding plain text shown in our browser, as expected. There is nothing here but pure HTML being served up by our embedded Apache Tomcat server. This is arguably the lightest way to serve up static content. Use spring jar and it's possible to easily bundle up our client-side app to be installed anywhere.

Spring Boot's support for static HTML, JavaScript, and CSS opens the door to many options. We can add WebJar annotations to JsApp or use Bower to introduce third-party JavaScript libraries in addition to any custom client-side code. We might just manually download the JavaScript and CSS. No matter which option we choose, Spring Boot CLI certainly provides a super simple way to add rich-client power for app development. To top it off, RESTful backends that are decoupled from the frontend can have different iteration cycles as well as different development teams. You might need to configure CORS (http://spring.io/understanding/CORS) to properly handle remote calls that don't go back to the original server.

Adding production-ready support features

So far, we have created a Spring MVC app with minimal code. We added views and JavaScript. We are on the verge of a production release.
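Before we do, a quick aside on the spring jar command mentioned above: packaging the pure-frontend app is a one-liner. Here is a hedged sketch (the jar name is ours; the CLI's default include patterns pick up folders such as static/):

$ spring jar app.jar pure_javascript.groovy
$ java -jar app.jar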
Before deploying our rapidly built and modernized web application, we might want to think about potential issues that could arise in production:

What do we do when the system administrator wants to configure his monitoring software to ping our app to see if it's up?
What happens when our manager wants to know the metrics of people hitting our app?
What are we going to do when the Ops center supervisor calls us at 2:00 a.m. and we have to figure out what went wrong?

The last feature we are going to introduce in this article is Spring Boot's Actuator module and CRaSH remote shell support (http://www.crashub.org). These two modules provide some super slick, Ops-oriented features that are incredibly valuable in a production environment. We first need to update our previous code (we'll call it ops.groovy), as shown in the following code:

@Grab("spring-boot-actuator")
@Grab("spring-boot-starter-remote-shell")
@Grab("org.webjars:jquery:2.1.1")
@Grab("thymeleaf-spring4")
@Controller
class OpsReadyApp {
    @RequestMapping("/")
    def home(@RequestParam(value="name", defaultValue="World") String n) {
        new ModelAndView("modern").addObject("name", n)
    }
}

This app is exactly like the WebJars example, with two key differences: it adds @Grab("spring-boot-actuator") and @Grab("spring-boot-starter-remote-shell"). When you run this version of our app, the same business functionality is available that we saw earlier, but there are additional HTTP endpoints available:

Actuator endpoint | Description
/autoconfig | This reports what Spring Boot did and didn't auto-configure and why
/beans | This reports all the beans configured in the application context (including ours as well as the ones auto-configured by Boot)
/configprops | This exposes all configuration properties
/dump | This creates a thread dump report
/env | This reports on the current system environment
/health | This is a simple endpoint to check the life of the app
/info | This serves up custom content from the app
/metrics | This shows counters and gauges on web usage
/mappings | This gives us details about all Spring MVC routes
/trace | This shows details about past requests

Pinging our app for general health

Each of these endpoints can be visited using our browser or other tools such as curl. For example, let's assume we ran spring run ops.groovy and then opened up another shell. From the second shell, let's run the following curl command:

$ curl localhost:8080/health
{"status":"UP"}

This immediately solves our first need listed previously. We can inform the system administrator that he or she can write a management script to interrogate our app's health.

Gathering metrics

Be warned that each of these endpoints serves up a compact JSON document. Generally speaking, command-line curl probably isn't the best option. While it's convenient on *nix and Mac systems, the content is dense and hard to read. It's more practical to have:

A JSON plugin installed in our browser (such as JSONView at http://jsonview.com)
A script that uses a JSON parsing library if we're writing a management script (such as Groovy's JsonSlurper at http://groovy.codehaus.org/gapi/groovy/json/JsonSlurper.html or JSONPath at https://code.google.com/p/json-path)

Assuming we have JSONView installed, the metrics report lists counters for each HTTP endpoint. According to this, /metrics has been visited four times with a successful 200 status code. Someone tried to access /foo, but it failed with a 404 error code.
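Such a management script can be tiny. Here is a minimal sketch in Groovy — an illustration of ours, not the book's code — assuming the default host, port, and /health payload shown above:

import groovy.json.JsonSlurper

// Parse the Actuator /health endpoint and exit non-zero if the app is down,
// so a monitoring tool can alert on the exit code.
def health = new JsonSlurper().parse(new URL("http://localhost:8080/health"))
if (health.status != "UP") {
    println "ALERT: application reported status ${health.status}"
    System.exit(1)
}
println "Application is healthy"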
The metrics report also lists gauges for each endpoint, reporting the last response time. In this case, /metrics took 2 milliseconds. Also included are some memory stats as well as the total CPUs available. It's important to realize that the metrics start at 0. To generate some numbers, you might want to first click on some links before visiting /metrics.

The trace report shows the entire web request and response for curl localhost:8080/health. This provides a basic framework of metrics to satisfy our manager's needs. It's important to understand that metrics gathered by Spring Boot Actuator aren't persistent across application restarts. So, to gather long-term data, we have to collect them and write them elsewhere. With these options, we can perform the following:

Write a script that gathers metrics every hour and appends them to a running spreadsheet somewhere else in the filesystem, such as a shared drive. This might be simple, but probably also crude.
To step it up, we can dump the data into a Hadoop filesystem for raw collection and configure Spring XD (http://projects.spring.io/spring-xd/) to consume it. Spring XD stands for Spring eXtreme Data. It is an open source product that makes it incredibly easy to chain together sources and sinks comprised of many components, such as HTTP endpoints, Hadoop filesystems, Redis metrics, and RabbitMQ messaging. Unfortunately, there is no space to dive into this subject here.

With any monitoring, it's important to check that we aren't taxing the system too heavily. The same container responding to business-related web requests is also serving metrics data, so it is wise to engage profilers periodically to ensure that the whole system is performing as expected.

Detailed management with CRaSH

So what can we do when we receive that 2:00 a.m. phone call from the Ops center? After either coming in or logging in remotely, we can access the convenient CRaSH shell we configured. Every time the app launches, it generates a random password for SSH access and prints this to the local console:

2014-06-11 23:00:18.822 ... : Configuring property ssh.port=2000 from properties
2014-06-11 23:00:18.823 ... : Configuring property ssh.authtimeout=600000 fro...
2014-06-11 23:00:18.824 ... : Configuring property ssh.idletimeout=600000 fro...
2014-06-11 23:00:18.824 ... : Configuring property auth=simple from properties
2014-06-11 23:00:18.824 ... : Configuring property auth.simple.username=user f...
2014-06-11 23:00:18.824 ... : Configuring property auth.simple.password=bdbe4a...

We can easily see that there's SSH access on port 2000 via the user named user. Using this information, we can log in:

$ ssh -p 2000 user@localhost
Password authentication
Password:
  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::  (v1.1.6.RELEASE) on retina
>

There's a fistful of commands:

help: This gets a listing of available commands
dashboard: This gets a graphic, text-based display of all the threads, environment properties, memory, and other things
autoconfig: This prints out a report of which Spring Boot auto-configuration rules were applied and which were skipped (and why)

All of the previous commands have man pages:

> man autoconfig
NAME
       autoconfig - Display auto configuration report from ApplicationContext
SYNOPSIS
       autoconfig [-h | --help]
STREAM
       autoconfig <java.lang.Void, java.lang.Object>
PARAMETERS
       [-h | --help]
              Display this help message
...

There are many more commands available to help manage our application. More details are available at http://www.crashub.org/1.3/reference.html.

Summary

In this article, we learned about modernizing our Spring Boot app with JavaScript and adding production-ready support features. We plugged in Spring Boot's Actuator module as well as the CRaSH remote shell, configuring our app with metrics, health, and management features so that we can monitor it in production by merely adding two lines of extra code.

Ansible – An Introduction

Packt
26 Nov 2014
26 min read
In this article, Madhurranjan Mohan and Ramesh Raithatha, the authors of the book Learning Ansible, give an overview of the basic features of Ansible, from installation through to deployment.

What is Ansible?

Ansible is an orchestration engine for IT, which can be used for several use cases, such as configuration management, orchestration, provisioning, and deployment. Compared to other automation tools, Ansible gives you an easy way to configure your orchestration engine without the overhead of a client or central server setup. That's right! No central server! It comes preloaded with a wide range of modules that make your life simpler.

Ansible is an open source tool (with enterprise editions available) developed using Python, and it runs on Windows, Mac, and Unix-like systems. You can use Ansible for configuration management, orchestration, provisioning, and deployments, which covers many of the problems that are solved under the broad umbrella of DevOps. We won't be talking about culture here, as that's a book by itself! You could refer to the book Continuous Delivery and DevOps – A Quickstart Guide by Packt Publishing for more information at https://www.packtpub.com/virtualization-and-cloud/continuous-delivery-and-devops-quickstart-guide.

Let's try to answer some questions that you may have right away.

Can I use Ansible if I am starting afresh, have no automation in my system, and would like to introduce that (and, as a result, increase my bonus for the next year)? A short answer to this question is that Ansible is probably perfect for you. The learning curve with Ansible is way shorter than with most other tools currently on the market.

I have other tools in my environment. Can I still use Ansible? Yes, again! If you already have other tools in your environment, you can still augment those with Ansible, as it solves many problems in an elegant way. A case in point is a Puppet shop that uses Ansible for orchestration and provisioning of new systems but continues to use Puppet for configuration management.

I don't have Python in my environment and introducing Ansible would bring in Python. What do I do? You need to remember that, on most Linux systems, a version of Python is present at boot time, and you don't have to explicitly install Python. You should still go ahead with Ansible if it solves particular problems for you. Always question what problems you are trying to solve and then check whether a tool such as Ansible would solve that use case.

I have no configuration management at present. Can I start today? The answer is yes!

In many of the conferences we presented at, the preceding four questions popped up most frequently. Now that these questions are answered, let's dig deeper.

The architecture of Ansible is agentless. Yes, you heard that right; you don't have to install any client-side software. It works purely over SSH connections, in which case you could consider SSH your agent, making our previous statement that Ansible is agentless not entirely right. However, SSH can be assumed to be running on virtually every server, so it is rarely counted as a separate agent; hence, Ansible is called agentless. So, if you have a well-oiled SSH setup, then you're ready to roll Ansible into your environment in no time. This also means that you can install it on only one system (either a Linux or Mac machine) and control your entire infrastructure from that machine.
Yes, we understand that you must be thinking about what happens if this machine goes down. You would probably have multiple such machines in production, but this was just an example to elucidate the simplicity of Ansible. As Ansible works over SSH connections, it can be slower; to speed up the default SSH connections, you can enable ControlPersist and pipelining, which makes Ansible faster and more secure.

Ansible works like any other Unix command and doesn't require any daemon process to be running all the time. So you would either run it via a cron job, on demand from a single node, or at startup. Ansible can be push or pull based, and you can utilize whichever suits you.

When you start with something new, the first aspect you need to pay attention to is the nomenclature. The faster you're able to pick up the terms associated with a tool, the faster you're comfortable with it. So, to deploy, let's say, a package on one or more machines with Ansible, you would write a playbook that has a single task, which in turn uses the package module, which would then go ahead and install the package based on an inventory file that contains a list of these machines. If you feel overwhelmed by the nomenclature, don't worry, you'll soon get used to it. Similar to the package module, Ansible comes loaded with more than 200 modules, purely written in Python. We will talk about modules in detail later.

It is now time to install Ansible and start trying out various fun examples.

Installing Ansible

Installing Ansible is quick and simple. You can directly use the source code by cloning it from the GitHub project (https://github.com/ansible/ansible), install it using your system's package manager, or use Python's package management tool (pip). You can use Ansible on any Windows, Mac, or Unix-like system. Ansible doesn't require any database and doesn't need any daemons running. This makes it easier to maintain Ansible versions and upgrade without any breaks.

We'd like to call the machine where we will install Ansible our command center. Many people also refer to it as the Ansible workstation. Note that, as Ansible is developed using Python, you will need Python Version 2.4 or higher installed. Python is preinstalled, as specified earlier, on the majority of operating systems. If this is not the case for you, refer to https://wiki.python.org/moin/BeginnersGuide/Download to download/upgrade Python.

Installing Ansible from source

Installing from source is as easy as cloning a repository. You don't require any root permissions while installing from source. Let's clone the repository and activate virtualenv, which is an isolated environment in Python where you can install packages without interfering with the system's Python packages. The commands are as follows:

$ git clone git://github.com/ansible/ansible.git
$ cd ansible/
$ source ./hacking/env-setup

Ansible needs a couple of Python packages, which you can install using pip. If you don't have pip installed on your system, install it using the following command:

$ sudo easy_install pip

Once you have installed pip, install the paramiko, PyYAML, jinja2, and httplib2 packages using the following command line:

$ sudo pip install paramiko PyYAML jinja2 httplib2

By default, Ansible will be running against the development branch. You might want to check out the latest stable branch. Check what the latest stable version is using the following command line:

$ git branch -a

Copy the latest version you want to use.
Version 1.7.1 was the latest version available at the time of writing. Check out the version you would like to use with the following command line:

$ git checkout release1.7.1

You now have a working setup of Ansible ready. One of the benefits of running Ansible from source is that you can enjoy new features immediately, without waiting for your package manager to make them available to you.

Installing Ansible using the system's package manager

Ansible also provides a way to install itself using the system's package manager. We will look into installing Ansible via Yum, Apt, Homebrew, and pip.

Installing via Yum

If you are running a Fedora system, you can install Ansible directly. For CentOS- or RHEL-based systems, you should add the EPEL repository first, then install as follows:

$ sudo yum install ansible

On CentOS 6 or RHEL 6, you can install EPEL by running rpm -Uvh against http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm.

You can also install Ansible from an RPM file. You need to use the make rpm command against the git checkout of Ansible, as follows:

$ git clone git://github.com/ansible/ansible.git
$ cd ./ansible
$ make rpm
$ sudo rpm -Uvh ~/rpmbuild/ansible-*.noarch.rpm

You should have rpm-build, make, and python2-devel installed on your system to build an rpm.

Installing via Apt

Ansible is available for Ubuntu in a Personal Package Archive (PPA). To configure the PPA and install Ansible on your Ubuntu system, use the following command lines:

$ sudo apt-get install apt-add-repository
$ sudo apt-add-repository ppa:rquillo/ansible
$ sudo apt-get update
$ sudo apt-get install ansible

You can also compile a deb file for Debian and Ubuntu systems, using the following command line:

$ make deb

Installing via Homebrew

You can install Ansible on Mac OS X using Homebrew, as follows:

$ brew update
$ brew install ansible

Installing via pip

You can install Ansible via Python's package manager, pip. If you don't have pip installed on your system, install it first. You can use pip to install Ansible on Windows too, using the following command line:

$ sudo easy_install pip

You can now install Ansible using pip, as follows:

$ sudo pip install ansible

Once you're done installing Ansible, run ansible --version to verify that it has been installed:

$ ansible --version

You will get the following as the output of the preceding command line:

ansible 1.7.1

Hello Ansible

Let's start by checking whether two remote machines are reachable; in other words, let's start by pinging two machines, after which we'll echo hello ansible on the two remote machines. The following are the steps that need to be performed:

Create an Ansible inventory file. This can contain one or more groups. Each group is defined within square brackets. This example has one group called servers:

$ cat inventory
[servers]
machine1
machine2

Now, we have to ping the two machines.
In order to do that, first run ansible --help to view the available options, as shown below (copying only the subset that we need for this example):

ansible --help
Usage: ansible <host-pattern> [options]

Options:
  -a MODULE_ARGS, --args=MODULE_ARGS
          module arguments
  -i INVENTORY, --inventory-file=INVENTORY
          specify inventory host file
          (default=/etc/ansible/hosts)
  -m MODULE_NAME, --module-name=MODULE_NAME
          module name to execute
          (default=command)

We'll now ping the two servers by running the ping module against our inventory:

$ ansible servers -i inventory -m ping

Now that we can ping these two servers, let's echo hello ansible!, using the following command:

$ ansible servers -i inventory -a '/bin/echo hello ansible!'

The preceding command line is the same as the following one:

$ ansible servers -i inventory -m command -a '/bin/echo hello ansible!'

If you move the inventory file to /etc/ansible/hosts, the Ansible command becomes even simpler, as follows:

$ ansible servers -a '/bin/echo hello ansible!'

There you go. The Hello Ansible program works! Time to tweet!

You can also specify the inventory file by exporting it in a variable named ANSIBLE_HOSTS. The preceding command, without the -i option, will work even in that situation.

Developing a playbook

In Ansible, except for ad hoc tasks that are run using the ansible command, we need to make sure we have playbooks for every other repeatable task. In order to do that, it is important to have a local development environment, especially when a larger team is involved, where people can develop their playbooks and test them before checking them into Git.

A very popular tool that currently fits this bill is Vagrant. Vagrant's aim is to help users create and configure lightweight, reproducible, and portable development environments. By default, Vagrant works on VirtualBox, which can run on a local laptop or desktop. To elaborate further, it can be used for the following use cases:

Vagrant can be used when creating development environments to constantly check new builds of software, especially when you have several other dependent components. For example, if I am developing service A and it depends on two other services, B and C, and also a database, then the fastest way to test the service locally is to make sure the dependent services and the database are set up (especially if you're testing multiple versions), and every time you compile the service locally, you deploy the module against these services and test them out.

Testing teams can use Vagrant to deploy the versions of code they want to test and work with them, and each person in the testing team can have local environments on their laptop or desktop rather than common environments that are shared between teams.

If your software is developed for cloud-based platforms and needs to be deployed on AWS and Rackspace (for example), apart from testing it locally on VMware Fusion or VirtualBox, Vagrant is perfect for this purpose. In Vagrant's terms, you can deploy your software on multiple providers with a configuration file that differs only for a provider. Comparing the VirtualBox configuration for a simple machine with the equivalent AWS configuration, the provider configuration changes but the rest of the configuration remains more or less the same.
(A private IP is VirtualBox-specific, but it is ignored when run using the AWS provider.)

Vagrant also provides provisioners, giving users multiple options to configure new machines as they come up. It supports shell scripts, tools such as Chef, Puppet, Salt, and Docker, as well as Ansible. By using Ansible with Vagrant, you can develop your Ansible scripts locally, deploy and redeploy them as many times as needed to get them right, and then check them in. The advantage, from an infrastructure point of view, is that the same code can also be used by other developers and testers when they spawn their own Vagrant environments for testing (the software would be configured to work in the expected manner by the Ansible playbooks). The checked-in Ansible code will then flow like the rest of your software, through testing and staging environments, before it is finally deployed into production.

Roles

When you start thinking about your infrastructure, you will soon look at the purposes each node in your infrastructure serves, and you will be able to categorize them. You will also start to abstract out information regarding nodes and think at a higher level. For example, if you're running a web application, you'll be able to categorize nodes broadly as db_servers, app_servers, web_servers, and load_balancers. If you then talk to your provisioning team, they will tell you which base packages need to be installed on each machine, either for the sake of compliance, to manage the machines remotely after choosing the OS distribution, or for security purposes. Simple examples are packages such as bind, ntp, collectd, psacct, and so on. Soon you will add all these packages under a category named common or base. As you dig deeper, you might find further dependencies. For example, if your application is written in Java, having some version of the JDK is a dependency. So, for what we've discussed so far, we have the following categories:

db_servers
app_servers
web_servers
load_balancers
common
jdk

We've taken a top-down approach to come up with the categories listed. Now, depending on the size of your infrastructure, you will slowly start identifying reusable components, and these can be as simple as ntp or collectd. These categories, in Ansible's terms, are called Roles. If you're familiar with Chef, the concept is very similar.

Callback plugins

One of the features that Ansible provides is a callback mechanism. You can configure as many callback plugins as required. These plugins can intercept events and trigger certain actions. Let's look at a simple example where we print the run results at the end of the playbook run as part of a callback, and then take a brief look at how to configure a callback. We first grep for the location of callback_plugins in ansible.cfg, as follows:

$ grep callback ansible.cfg
callback_plugins   = /usr/share/ansible_plugins/callback_plugins

We then create our callback plugin in this location.
$ ls -l /usr/share/ansible_plugins/callback_plugins
callback_sample.py

Let's now look at the contents of the callback_sample file:

$ cat /usr/share/ansible_plugins/callback_plugins/callback_sample.py
class CallbackModule(object):
    def on_any(self, *args, **kwargs):
        pass

    def runner_on_failed(self, host, res, ignore_errors=False):
        pass

    def runner_on_ok(self, host, res):
        pass

    def runner_on_error(self, host, msg):
        pass

    def runner_on_skipped(self, host, item=None):
        pass

    def runner_on_unreachable(self, host, res):
        pass

    def runner_on_no_hosts(self):
        pass

    def runner_on_async_poll(self, host, res, jid, clock):
        pass

    def runner_on_async_ok(self, host, res, jid):
        pass

    def runner_on_async_failed(self, host, res, jid):
        pass

    def playbook_on_start(self):
        pass

    def playbook_on_notify(self, host, handler):
        pass

    def playbook_on_no_hosts_matched(self):
        pass

    def playbook_on_no_hosts_remaining(self):
        pass

    def playbook_on_task_start(self, name, is_conditional):
        pass

    def playbook_on_vars_prompt(self, varname, private=True, prompt=None,
            encrypt=None, confirm=False, salt_size=None, salt=None,
            default=None):
        pass

    def playbook_on_setup(self):
        pass

    def playbook_on_import_for_host(self, host, imported_file):
        pass

    def playbook_on_not_import_for_host(self, host, missing_file):
        pass

    def playbook_on_play_start(self, pattern):
        pass

    def playbook_on_stats(self, stats):
        results = dict([(h, stats.summarize(h)) for h in stats.processed])
        print "Run Results: %s" % results

As you can see, the callback class, CallbackModule, contains several methods. These methods are called with the Ansible run parameters passed in as arguments. Playbook activities can be intercepted by overriding these methods, and relevant actions can be taken based on that. The relevant method is called based on the event; for example, we've overridden the playbook_on_stats method to display statistics regarding the playbook run.

If we run a basic playbook with this callback plugin in place, the output ends with a Run Results line consisting of a dictionary (or hash) of the actual run results, printed by our custom code. This is just an example of how you can intercept methods and use them to your advantage; you can utilize this information in any number of ways. Isn't this powerful? Are you getting ideas about how you could utilize a feature such as this? Do write them down before reading further. If you're able to write down even two use cases that we've not covered here and that are relevant to your infrastructure, give yourself a pat on the back!

Modules

Ansible allows you to extend functionality using custom modules. Arguments, as you have seen, can be passed to modules. The arguments that you pass, provided they are in a key-value format, will be forwarded in a separate file along with the module. Ansible expects at least two variables in your module's output — the result of the module run (whether it passed or failed) and a message for the user — and they have to be in JSON format. If you adhere to this simple rule, you can customize as much as you want, and the module can be written in any language.

Using Bash modules

Bash modules in Ansible are no different from any other bash scripts, except in the way they print data on stdout. Bash modules can be as simple as checking whether a process is running on the remote host, or can run more complex commands.
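Here is a minimal sketch of such a module — our own illustrative reconstruction, not the book's exact code — that force-kills all Java processes belonging to the service named by a service_name argument; the walkthrough below explains each step:

#!/bin/bash
# kill_java_service: an illustrative bash module sketch. Ansible passes the
# path of an arguments file as $1; sourcing it sets the service_name
# variable from a key=value argument.
source $1

# Collect the PIDs of all Java processes that belong to this service.
pids=$(ps aux | grep java | grep "$service_name" | awk '{print $2}')

if [ -n "$pids" ]; then
    for pid in $pids; do
        kill -9 "$pid"   # forcefully kill each matching Java process
    done
fi

# Always report success in JSON, since finding nothing to kill should not
# terminate the whole Ansible run.
echo "{\"failed\": false, \"msg\": \"killed java processes for ${service_name}: ${pids}\"}"
exit 0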
We recommend that you use bash over other languages, such as Python and Ruby, only when you're performing simple tasks; in other cases, you should use languages that provide better error handling. Let's walk through the preceding bash module. It takes the service_name argument and forcefully kills all of the Java processes that belong to that service. As you know, Ansible passes the arguments file to the module. We then source the arguments file using source $1. This actually sets an environment variable with the name service_name, which we then access using $service_name.

We then check whether we obtained any PIDs for the service and loop over them to forcefully kill all of the Java processes that match service_name. Once they're killed, we exit the module with failed=False, a message, and an exit code of 0, because terminating the Ansible run might not make sense.

Provisioning a machine in the cloud

With that, let's jump to the first topic. Teams managing infrastructure have a lot of choices today for running their builds, tests, and deployments. Providers such as Amazon, Rackspace, and DigitalOcean primarily provide Infrastructure as a Service (IaaS). They expose an API via SDKs, which you can invoke in order to create new machines, or you can use their GUI to set things up. We're more interested in using their SDKs, as they will play an important part in our automation effort.

Setting up new servers and provisioning them is interesting at first, but at some stage it can become boring, as it's quite repetitive in nature. Each provisioning run involves several similar steps to get a machine up and running. Imagine that one fine morning you receive an e-mail asking for three new customer setups, where each customer setup has three to four instances and a bunch of services and dependencies. This might be an easy task for you, but it would require running the same set of repetitive commands multiple times, followed by monitoring the servers once they come up to confirm that everything went fine. In addition, anything you do manually has a chance of introducing bugs. What if two of the customer setups come up correctly but, due to fatigue, you miss a step for the third customer and hence introduce a bug? To deal with such situations, there exists automation.

Cloud provisioning automation makes it easy for an engineer to bring up a new server as quickly as possible, allowing him or her to concentrate on other priorities. Using Ansible, you can easily perform these actions and automate cloud provisioning with minimal effort. Ansible gives you the power to automate various cloud platforms, such as Amazon, DigitalOcean, Google Cloud, Rackspace, and so on, with modules for different services available in the Ansible core.

Docker provisioning

Docker is perhaps the most popular open source tool to have been released in the last year. The following quote can be seen on the Docker website:

Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications, whether on laptops, data center VMs, or the cloud.

Increasingly, more and more individuals and companies are adopting Docker. The tagline for Docker is "Containerization is the new virtualization". At a high level, all Docker allows you to do is prepare lightweight containers using instructions from a Dockerfile and run the containers.
The same container can be shared or shipped across environments, thereby making sure you run the exact same image and reducing the chance of errors. The Docker image that you build is cached by default; thus, the next time you have similar instructions, the time taken to bring up a similar container is reduced to almost nothing.

Let's now look at how Ansible can be used with Docker to make this a powerful working combination. You can use Ansible with Docker to perform the following:

Installing Docker on hosts
Deploying new Docker images
Building or provisioning new Docker images

Deployment strategies with RPM

In most cases, we already have a certain version of the application deployed, and now, either to introduce a new feature or to fix bugs, a new version has to be deployed. We'd like to discuss this scenario in greater detail. At a high level, whenever we deploy an application, there are three kinds of changes to take care of:

Code changes
Config changes
Database changes

The first two types are ideally handled by the RPM, unless you have very specific values that need to be set up at runtime. Files with passwords can be checked in, but they should be encrypted with Ansible Vault or dropped into files as templates at runtime, just as we did with database.yml. With templates, if the configs are just name-value pairs that can be handled in a Jinja template, you should be good; but if you have other lines in the configuration that do not change, then it's better that those configuration files are checked in and appear as part of the RPM. Many teams we've worked with check in environment-specific folders that contain all the configuration parameters; while starting the application, we provide the environment in which the application should be started. Another way is to deploy the RPM with default values for all configuration properties, while writing a different folder with name-value pairs that override the parameters in the default configuration that is part of the RPM.

The database changes should be handled carefully. Ideally, you want them to be idempotent for a particular version of the RPM so that, even if someone pushes the database changes multiple times, the database is not really affected. For example, in the preceding case, we run rake db:migrate, which is idempotent in nature; even if you run the same command from multiple machines, you shouldn't face issues. The Rails framework does this by storing the database migration version in a separate table.

Having looked at the three types of changes, we can now examine how to perform RPM deployments for each release. When you're pushing new changes, the current version or service is already running. It's recommended that you take the server out of service before performing upgrades. For example, if it's part of a load balancer, make sure it's out of the load balancer and not serving any traffic before performing the upgrades. Primarily, there are the following two ways:

Deploying the newer version of the RPM in the same directory
Deploying the RPM into a version-specific directory

Canary deployment

The name Canary refers to the canaries that were often sent into coal mines to alert miners about toxic gases reaching dangerous levels. Canary deployment is a technique used to reduce the risk of releasing a new version of software by first releasing it to a small subset of users (by demographics, location, and so on), gauging their reaction, and then slowly releasing it to the rest of the users.
Whenever possible, keep the first set of users internal, since this reduces the impact on actual customers. Canary deployment is especially useful when you introduce a new feature that you want to test with a small subset of users before releasing it to the rest of your customer base. If you're running, let's say, an application across multiple data centers and you're sure that certain users only contact specific data centers when they access your site, you could definitely run a Canary deployment.

Deploying Ansible pull

The last topic we'd like to cover in this section is Ansible pull. If you have a large number of hosts on which you'd like to release software simultaneously, you will be limited by the number of parallel SSH connections that can be run. At scale, the pull model is preferred to the push model. Ansible supports what is called Ansible pull. Ansible pull works individually on each node. The prerequisite is that it points to a repository from which it can invoke a special file called localhost.yml or <hostname>.yml. Typically, the ansible-pull option is run either as a cron job or is triggered remotely by some other means.

We're going to use our Tomcat example again, with the only difference being that the structure of the repository has been changed slightly. The git repository that works for Ansible pull has localhost.yml at the top level and a roles folder containing the tomcat folder, under which is a tasks folder with the main.yml task file:

.
├── localhost.yml
└── roles
    └── tomcat
        └── tasks
            └── main.yml

Let's now run the playbook using ansible-pull:

$ ansible-pull -o -C master -U <repository-url> -i localhost

Let's look at the preceding run in detail:

The ansible-pull command: We invoke ansible-pull with the following options:
-o: This option means that the Ansible run takes place only if the remote repository has changes.
-C master: This option indicates which branch to refer to in the git repository.
-U <repository-url>: This option indicates the repository that needs to be checked out.
-i localhost: This option indicates the inventory that needs to be considered. In this case, since we're only concerned about one tomcat group, we use -i localhost. However, when there are many more inventory groups, make sure you use an inventory file with the -i option.
The localhost | success JSON: This output checks whether the repository has changed and lists the latest commit.
The actual Ansible playbook run: The Ansible playbook run is just like before. At the end of the run, we will have the WAR file up and running.

Summary

In this article, we got an overview of Ansible: an introduction to Ansible, the various ways to install it, our very first playbook, ways to use the Ansible command line, how to debug playbooks, and how to develop a playbook on our own. We also looked into various other aspects of Ansible, such as Roles, callback plugins, modules and the Bash module, provisioning a machine in the cloud, Docker provisioning, deployment strategies with RPM, Canary deployment, and deployment using Ansible pull.

Concurrency in Practice

Packt
26 Nov 2014
25 min read
This article, written by Aleksandar Prokopec, the author of Learning Concurrent Programming in Scala, helps you develop the skills necessary to write correct and efficient concurrent programs. It teaches you about concurrency in Scala through a sequence of programs.

"The best theory is inspired by practice."
                                          - Donald Knuth

We have studied a plethora of different concurrency facilities in this article. By now, you will have learned about dozens of different ways of starting concurrent computations and accessing shared data. Knowing how to use different styles of concurrency is useful, but it might not yet be obvious when to use which. The goal of this article is to introduce the big picture of concurrent programming. We will study the use cases for various concurrency abstractions, see how to debug concurrent programs, and learn how to integrate different concurrency libraries in larger applications. In this article, we perform the following tasks:

Investigate how to deal with various kinds of bugs appearing in concurrent applications
Learn how to identify and resolve performance bottlenecks
Apply the previous knowledge about concurrency to implement a larger concurrent application, namely, a remote file browser

We start with an overview of the important concurrency frameworks that we learned about in this article, and a summary of when to use each of them.

Choosing the right tools for the job

In this section, we present an overview of the different concurrency libraries that we learned about. We take a step back and look at the differences between these libraries, and what they have in common. This summary will give us an insight into what the different concurrency abstractions are useful for. A concurrency framework usually needs to address several concerns:

It must provide a way to declare data that is shared between concurrent executions
It must provide constructs for reading and modifying program data
It must be able to express conditional execution, triggered when a certain set of conditions are fulfilled
It must define a way to start concurrent executions

Some of the frameworks from this article address all of these concerns; others address only a subset, and transfer part of the responsibility to another framework. Typically, in a concurrent programming model, we express concurrently shared data differently from data intended to be accessed only from a single thread. This allows the JVM runtime to optimize sequential parts of the program more effectively. So far, we've seen a lot of different ways to express concurrently shared data, ranging from low-level facilities to advanced high-level abstractions. We summarize the different data abstractions in the following table:

Data abstraction | Datatype or annotation | Description
Volatile variables (JDK) | @volatile | Ensure visibility and the happens-before relationship on class fields and local variables that are captured in closures.
Atomic variables (JDK) | AtomicReference[T], AtomicInteger, AtomicLong | Provide basic composite atomic operations, such as compareAndSet and incrementAndGet.
Futures and promises (scala.concurrent) | Future[T], Promise[T] | Sometimes called single-assignment variables, these express values that might not be computed yet, but will eventually become available.
Observables and subjects (Rx) | Observable[T], Subject[T] | Also known as first-class event streams, these describe many different values that arrive one after another in time.
Transactional references (Scala Software Transactional Memory (STM)) | Ref[T] | These describe memory locations that can only be accessed from within memory transactions. Their modifications only become visible after the transaction successfully commits.

The next important concern is providing access to shared data, which includes reading and modifying shared memory locations. Usually, a concurrent program uses special constructs to express such accesses. We summarize the different data access constructs in the following table:

Data abstraction | Data access constructs | Description
Arbitrary data (JDK) | synchronized | Uses intrinsic object locks to exclude access to arbitrary shared data.
Atomic variables and classes (JDK) | compareAndSet | Atomically exchanges the value of a single memory location. It allows implementing lock-free programs.
Futures and promises (scala.concurrent) | value, tryComplete | Used to assign a value to a promise, or to check the value of the corresponding future. The value method is not a preferred way to interact with a future.
Transactional references (ScalaSTM) | atomic, orAtomic, single | Atomically modify the values of a set of memory locations. These reduce the risk of deadlocks, but disallow side effects inside the transactional block.

Concurrent data access is not the only concern of a concurrency framework. Concurrent computations sometimes need to proceed only after a certain condition is met. In the following table, we summarize different constructs that enable this:

Concurrency framework | Conditional execution constructs | Description
JVM concurrency | wait, notify, notifyAll | Used to suspend the execution of a thread until some other thread notifies that the conditions are met.
Futures and promises | onComplete, Await.ready | Conditionally schedules an asynchronous computation. The Await.ready method suspends the thread until the future completes.
Reactive extensions | subscribe | Asynchronously or synchronously executes a computation when an event arrives.
Software transactional memory | retry, retryFor, withRetryTimeout | Retries the current memory transaction when some of the relevant memory locations change.
Actors | receive | Executes the actor's receive block when a message arrives.

Finally, a concurrency model must define a way to start a concurrent execution. We summarize different concurrency constructs in the following table:

Concurrency framework | Concurrency constructs | Description
JVM concurrency | Thread.start | Starts a new thread of execution.
Execution contexts | execute | Schedules a block of code for execution on a thread pool.
Futures and promises | Future.apply | Schedules a block of code for execution, and returns the future value with the result of the execution.
Parallel collections | par | Allows invoking data-parallel versions of collection methods.
Reactive extensions | Observable.create, observeOn | The create method defines an event source. The observeOn method schedules the handling of events on different threads.
Actors | actorOf | Schedules a new actor object for execution.

This breakdown shows us that different concurrency libraries focus on different tasks. For example, parallel collections do not have conditional waiting constructs, because a data-parallel operation proceeds on separate elements independently. Similarly, software transactional memory does not come with a construct to express concurrent computations, and focuses only on protecting access to shared data.
Actors do not have special constructs for modeling shared data and protecting access to it, because data is encapsulated within separate actors and accessed serially, only by the actor that owns it. Having classified concurrency libraries according to how they model shared data and express concurrency, we present a summary of what the different concurrency libraries are good for:

The classical JVM concurrency model uses threads, the synchronized statement, volatile variables, and atomic primitives for low-level tasks. Uses include implementing a custom concurrency utility, a concurrent data structure, or a concurrency framework optimized for specific tasks.

Futures and promises are best suited for referring to concurrent computations that produce a single result value. Futures model latency in the program, and allow composing values that become available later during the execution of the program. Uses include performing remote network requests and waiting for replies, referring to the result of an asynchronous long-running computation, or reacting to the completion of an I/O operation. Futures are usually the glue of a concurrent application, binding the different parts of a concurrent program together. We often use futures to convert single-event callback APIs into a standardized representation based on the Future type.

Parallel collections are best suited for efficiently executing data-parallel operations on large datasets. Uses include file searching, text processing, linear algebra applications, numerical computations, and simulations. Long-running Scala collection operations are usually good candidates for parallelization.

Reactive extensions are used to express asynchronous event-based programs. Unlike parallel collections, in reactive extensions, data elements are not available when the operation starts, but arrive while the application is running. Uses include converting callback-based APIs, modeling events in user interfaces, modeling events external to the application, manipulating program events with collection-style combinators, streaming data from input devices or remote locations, or incrementally propagating changes in the data model throughout the program.

Use STM to protect program data from getting corrupted by concurrent accesses. An STM allows building complex data models and accessing them with a reduced risk of deadlocks and race conditions. A typical use is to protect concurrently accessible data while retaining good scalability between threads whose accesses to the data do not overlap.

Actors are suitable for encapsulating concurrently accessible data and seamlessly building distributed systems. Actor frameworks provide a natural way to express concurrent tasks that communicate by explicitly sending messages. Uses include serializing concurrent access to data to prevent corruption, expressing stateful concurrency units in the system, and building distributed applications like trading systems, P2P networks, communication hubs, or data mining frameworks.

Advocates of specific programming languages, libraries, or frameworks might try to convince you that their technology is the best for any task and any situation, often with the intent of selling it. Richard Stallman once said that computer science is the only industry more fashion-driven than women's fashion. As engineers, we need to know better than to succumb to programming fashion and marketing propaganda.
Different frameworks are tailored towards specific use cases, and the correct way to choose a technology is to carefully weigh its advantages and disadvantages when applied to a specific situation. There is no one-size-fits-all technology. Use your own best judgment when deciding which concurrency framework to use for a specific programming task. Sometimes, choosing the best-suited concurrency utility is easier said than done; it takes a great deal of experience to choose the correct technology, and in many cases, we do not even know enough about the requirements of the system to make an informed decision. Regardless, a good rule of thumb is to apply several concurrency frameworks to different parts of the same application, each best suited for a specific task. Often, the real power of different concurrency frameworks becomes apparent when they are used together. This is the topic of the next section.

Putting it all together – a remote file browser

In this section, we use our knowledge about different concurrency frameworks to build a remote file browser. This larger application example illustrates how different concurrency libraries work together, and how to apply them to different situations. We will name our remote file browser ScalaFTP. The ScalaFTP browser is divided into two main components: the server and the client process. The server process will run on the machine whose filesystem we want to manipulate. The client will run on our own computer, and will comprise a graphical user interface used to navigate the remote filesystem. To keep things simple, the protocol that the client and the server use to communicate will not really be FTP, but a custom communication protocol. By choosing the correct concurrency libraries to implement different parts of ScalaFTP, we will ensure that the complete ScalaFTP implementation fits inside just 500 lines of code. Specifically, the ScalaFTP browser will implement the following features:

- Displaying the names of the files and the directories in a remote filesystem, and allowing navigation through the directory structure
- Copying files between directories in a remote filesystem
- Deleting files in a remote filesystem

To implement the separate pieces of this functionality, we will divide the ScalaFTP server and client programs into layers. The task of the server program is to answer incoming copy and delete requests, and to answer queries about the contents of specific directories. To make sure that its view of the filesystem is consistent, the server will cache the directory structure of the filesystem. We divide the server program into two layers: the filesystem API and the server interface. The filesystem API will expose the data model of the server program, and define useful utility methods to manipulate the filesystem. The server interface will receive requests and send responses back to the client. Since the server interface requires communicating with the remote client, we decide to use the Akka actor framework, which comes with remote communication facilities. The contents of the filesystem, that is, its state, will change over time. We are therefore interested in choosing proper constructs for data access. In the filesystem API, we could use object monitors and locking to synchronize access to shared state, but we will avoid these due to the risk of deadlocks. We similarly avoid using atomic variables, because they are prone to race conditions.
We could encapsulate the filesystem state within an actor, but note that this can lead to a scalability bottleneck: an actor would serialize all accesses to the filesystem state. Therefore, we decide to use the ScalaSTM framework to model the filesystem contents. An STM avoids the risk of deadlocks and race conditions, and ensures good horizontal scalability. The task of the client program will be to graphically present the contents of the remote filesystem, and to communicate with the server. We divide the client program into three layers of functionality. The GUI layer will render the contents of the remote filesystem and register user requests such as button clicks. The client API will replicate the server interface on the client side and communicate with the server. We will use Akka to communicate with the server, but expose the results of remote operations as futures. Finally, the client logic will be a gluing layer, which binds the GUI and the client API together. The architecture of the ScalaFTP browser is illustrated in the following diagram, in which we indicate which concurrency libraries will be used by the separate layers. The dashed line represents the communication path between the client and the server:

We now start by implementing the ScalaFTP server, relying on the bottom-up design approach. In the next section, we will describe the internals of the filesystem API.

Modeling the filesystem

We used atomic variables and concurrent collections to implement a non-blocking, thread-safe filesystem API, which allowed copying files and retrieving snapshots of the filesystem. In this section, we repeat this task using STM. We will see that it is much more intuitive and less error-prone to use an STM. We start by defining the different states that a file can be in. The file can be currently created, in the idle state, being copied, or being deleted. We model this with a sealed State trait and its four cases:

sealed trait State
case object Created extends State
case object Idle extends State
case class Copying(n: Int) extends State
case object Deleted extends State

A file can only be deleted if it is in the idle state, and it can only be copied if it is in the idle state or in the copying state. Since a file can be copied to multiple destinations at a time, the Copying state encodes how many copies are currently under way. We add the methods inc and dec to the State trait, which return a new state with one more or one fewer copy, respectively. For example, the implementation of inc and dec for the Copying state is as follows:

def inc: State = Copying(n + 1)
def dec: State = if (n > 1) Copying(n - 1) else Idle

Similar to the File class in the java.io package, we represent both files and directories with the same entity, and refer to them more generally as files. Each file is represented by the FileInfo class, which encodes the file's path, its name, its parent directory, the date of the last modification, a Boolean value denoting whether the file is a directory, the size of the file, and its State object. The FileInfo class is immutable, and updating the state of the file requires creating a fresh FileInfo object:

case class FileInfo(path: String, name: String,
  parent: String, modified: String, isDir: Boolean,
  size: Long, state: State)

We separately define the factory methods apply and creating that take a File object and return a FileInfo object in the Idle or Created state, respectively.
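Only the Copying implementations of inc and dec are shown above. A plausible completion of the State trait, under the assumption that copying an Idle file starts a first copy and that every other transition is an error, could look like this sketch:

sealed trait State {
  def inc: State = sys.error(s"Cannot copy a file in state $this")
  def dec: State = sys.error(s"No copy in progress in state $this")
}
case object Created extends State
case object Idle extends State {
  override def inc: State = Copying(1)
}
case class Copying(n: Int) extends State {
  override def inc: State = Copying(n + 1)
  override def dec: State = if (n > 1) Copying(n - 1) else Idle
}
case object Deleted extends State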
Depending on where the server is started, the root of the ScalaFTP directory structure is a different subdirectory in the actual filesystem. A FileSystem object tracks the files in the given rootpath directory, using a transactional map called files:

class FileSystem(val rootpath: String) {
  val files = TMap[String, FileInfo]()
}

We introduce a separate init method to initialize the FileSystem object. The init method starts a transaction, clears the contents of the files map, and traverses the files and directories under rootpath using the Apache Commons IO library. For each file and directory, the init method creates a FileInfo object and adds it to the files map, using its path as the key:

def init() = atomic { implicit txn =>
  files.clear()
  val rootDir = new File(rootpath)
  val all = TrueFileFilter.INSTANCE
  val fileIterator =
    FileUtils.iterateFilesAndDirs(rootDir, all, all).asScala
  for (file <- fileIterator) {
    val info = FileInfo(file)
    files(info.path) = info
  }
}

Recall that the ScalaFTP browser must display the contents of the remote filesystem. To enable directory queries, we first add the getFileList method to the FileSystem class, which retrieves the files in the specified dir directory. The getFileList method starts a transaction and filters the files whose direct parent is equal to dir:

def getFileList(dir: String): Map[String, FileInfo] =
  atomic { implicit txn =>
    files.filter(_._2.parent == dir)
  }

We implement the copying logic in the filesystem API with the copyFile method. This method takes a path to the src source file and the dest destination file, and starts a transaction. After checking whether the dest destination file exists, the copyFile method inspects the state of the source file entry, and fails unless the state is Idle or Copying. It then calls inc to create a new state with the increased copy count, and updates the source file entry in the files map with the new state. Similarly, the copyFile method creates a new entry for the destination file in the files map. Finally, the copyFile method installs an afterCommit handler to physically copy the file to disk after the transaction completes. Recall that it is not legal to execute side-effecting operations from within the transaction body, so the private copyOnDisk method is called only after the transaction commits:

def copyFile(src: String, dest: String) = atomic { implicit txn =>
  val srcfile = new File(src)
  val destfile = new File(dest)
  val info = files(src)
  if (files.contains(dest)) sys.error(s"Destination exists.")
  info.state match {
    case Idle | Copying(_) =>
      files(src) = info.copy(state = info.state.inc)
      files(dest) = FileInfo.creating(destfile, info.size)
      Txn.afterCommit { _ => copyOnDisk(srcfile, destfile) }
      src
  }
}

The copyOnDisk method calls the copyFile method on the FileUtils class from the Apache Commons IO library. After the file transfer completes, the copyOnDisk method starts another transaction, in which it decreases the copy count of the source file and sets the state of the destination file to Idle:

private def copyOnDisk(srcfile: File, destfile: File) = {
  FileUtils.copyFile(srcfile, destfile)
  atomic { implicit txn =>
    val ninfo = files(srcfile.getPath)
    files(srcfile.getPath) = ninfo.copy(state = ninfo.state.dec)
    files(destfile.getPath) = FileInfo(destfile)
  }
}
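Before moving on, the filesystem API can be exercised on its own, with no actors involved. The following sketch is illustrative rather than book code: it assumes it runs from the project root, that a file named README.txt exists there, and it waits crudely for the afterCommit disk copy to finish:

object FileSystemTest extends App {
  val fileSystem = new FileSystem(".")
  fileSystem.init()

  // List the entries whose direct parent is the root directory.
  for ((path, info) <- fileSystem.getFileList("."))
    println(s"$path (directory: ${info.isDir}, state: ${info.state})")

  // Copy a file; the physical copy happens after the transaction commits.
  fileSystem.copyFile("README.txt", "README-copy.txt")
  Thread.sleep(1000) // crude wait for the afterCommit disk copy
}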
The deleteFile method deletes a file in a similar way. It changes the file state to Deleted, deletes the file, and starts another transaction to remove the file entry:

def deleteFile(srcpath: String): String = atomic { implicit txn =>
  val info = files(srcpath)
  info.state match {
    case Idle =>
      files(srcpath) = info.copy(state = Deleted)
      Txn.afterCommit { _ =>
        FileUtils.forceDelete(info.toFile)
        files.single.remove(srcpath)
      }
      srcpath
  }
}

Modeling the server data model with the STM allows seamlessly adding different concurrent computations to the server program. In the next section, we will implement a server actor that uses the filesystem API to execute filesystem operations.

Use STM to model concurrently accessible data, as an STM works transparently with most concurrency frameworks.

Having completed the filesystem API, we now proceed to the server interface layer of the ScalaFTP browser.

The server interface

The server interface comprises a single actor called FTPServerActor. This actor will receive client requests and respond to them serially. If it turns out that the server actor is the sequential bottleneck of the system, we can simply add additional server interface actors to improve horizontal scalability. We start by defining the different types of messages that the server actor can receive. We follow the convention of defining them inside the companion object of the FTPServerActor class:

object FTPServerActor {
  sealed trait Command
  case class GetFileList(dir: String) extends Command
  case class CopyFile(src: String, dest: String) extends Command
  case class DeleteFile(path: String) extends Command
  def apply(fs: FileSystem) = Props(classOf[FTPServerActor], fs)
}

The actor template of the server actor takes a FileSystem object as a parameter. It reacts to the GetFileList, CopyFile, and DeleteFile messages by calling the appropriate methods from the filesystem API:

class FTPServerActor(fileSystem: FileSystem) extends Actor {
  val log = Logging(context.system, this)
  def receive = {
    case GetFileList(dir) =>
      val filesMap = fileSystem.getFileList(dir)
      val files = filesMap.map(_._2).to[Seq]
      sender ! files
    case CopyFile(srcpath, destpath) =>
      Future {
        Try(fileSystem.copyFile(srcpath, destpath))
      } pipeTo sender
    case DeleteFile(path) =>
      Future {
        Try(fileSystem.deleteFile(path))
      } pipeTo sender
  }
}

When the server receives a GetFileList message, it calls the getFileList method with the specified dir directory, and sends a sequence collection with the FileInfo objects back to the client. Since FileInfo is a case class, it extends the Serializable interface, and its instances can be sent over the network. When the server receives a CopyFile or DeleteFile message, it calls the appropriate filesystem method asynchronously. The methods in the filesystem API throw exceptions when something goes wrong, so we need to wrap calls to them in Try objects. After the asynchronous file operations complete, the resulting Try objects are piped back as messages to the sender actor, using the Akka pipeTo method.
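Before adding remoting, the server actor can be tried locally in the same process. This is a test sketch, not book code; it assumes the FileSystem and FTPServerActor definitions above are in scope, and it blocks on the reply, which is acceptable in a throwaway test but not in production code:

import akka.actor.ActorSystem
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

object ServerActorTest extends App {
  implicit val timeout = Timeout(4.seconds)
  val system = ActorSystem("TestSystem")
  val fileSystem = new FileSystem(".")
  fileSystem.init()
  val server = system.actorOf(FTPServerActor(fileSystem), "server")

  // Ask for the root directory listing and wait for the reply.
  val reply = Await.result(server ? FTPServerActor.GetFileList("."), 4.seconds)
  println(reply)
  system.shutdown()
}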
To start the ScalaFTP server, we need to instantiate and initialize a FileSystem object, and start the server actor. We parse the network port command-line argument, and use it to create an actor system that is capable of remote communication. For this, we use the remotingSystem factory method that we introduced earlier. The remoting actor system then creates an instance of the FTPServerActor. This is shown in the following program:

object FTPServer extends App {
  val fileSystem = new FileSystem(".")
  fileSystem.init()
  val port = args(0).toInt
  val actorSystem = ch8.remotingSystem("FTPServerSystem", port)
  actorSystem.actorOf(FTPServerActor(fileSystem), "server")
}

The ScalaFTP server actor can run inside the same process as the client application, in another process on the same machine, or on a different machine connected over a network. The advantage of the actor model is that we usually need not worry about where the actor runs until we integrate it into the entire application.

When you need to implement a distributed application that runs on different machines, use an actor framework.

Our server program is now complete, and we can run it with the run command from SBT, setting the actor system to use the port 12345:

run 12345

In the next section, we will implement the file navigation API for the ScalaFTP client, which will communicate with the server interface over the network.

Client navigation API

The client API exposes the server interface to the client program through asynchronous methods that return future objects. Unlike the server's filesystem API, which runs locally, the client API methods execute remote network requests. Futures are a natural way to model latency in the client API methods, and to avoid blocking during the network requests. Internally, the client API maintains an actor instance that communicates with the server actor. The client actor does not know the actor reference of the server actor when it is created. For this reason, the client actor starts in an unconnected state. When it receives the Start message with the URL of the server actor system, the client constructs an actor path to the server actor, sends out an Identify message, and switches to the connecting state. If the actor system is able to find the server actor, the client actor eventually receives the ActorIdentity message with the server actor reference. In this case, the client actor switches to the connected state, and is able to forward commands to the server. Otherwise, the connection fails and the client actor reverts to the unconnected state. The state diagram of the client actor is shown in the following figure:

We define the Start message in the client actor's companion object:

object FTPClientActor {
  case class Start(host: String)
}

We then define the FTPClientActor class and give it an implicit Timeout parameter. The Timeout parameter will be used later in the Akka ask pattern, when forwarding client requests to the server actor. The stub of the FTPClientActor class is as follows:

class FTPClientActor(implicit val timeout: Timeout)
extends Actor

Before defining the receive method, we define behaviors corresponding to the different actor states. Once the client actor in the unconnected state receives the Start message with the host string, it constructs an actor path to the server, and creates an actor selection object. The client actor then sends the Identify message to the actor selection, and switches its behavior to connecting. This is shown in the following behavior method, named unconnected:

def unconnected: Actor.Receive = {
  case Start(host) =>
    val serverActorPath =
      s"akka.tcp://FTPServerSystem@$host/user/server"
    val serverActorSel = context.actorSelection(serverActorPath)
    serverActorSel ! Identify(())
    context.become(connecting(sender))
}
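The akka.tcp actor path above assumes that both actor systems listen over Akka's TCP transport, which is set up by the ch8.remotingSystem helper used on both sides. That helper comes from the book's utilities and is not shown in this excerpt; a minimal sketch of what such a factory might look like follows, using standard Akka remoting settings, though the book's actual helper may differ:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object RemotingUtils {
  def remotingSystem(name: String, port: Int): ActorSystem = {
    // Enable the remote transport and bind it to the given port;
    // port 0 asks the OS for any free port.
    val config = ConfigFactory.parseString(s"""
      akka.actor.provider = "akka.remote.RemoteActorRefProvider"
      akka.remote.enabled-transports = ["akka.remote.netty.tcp"]
      akka.remote.netty.tcp.hostname = "127.0.0.1"
      akka.remote.netty.tcp.port = $port
    """)
    ActorSystem(name, config.withFallback(ConfigFactory.load()))
  }
}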
The connecting method creates a behavior given an actor reference to the sender of the Start message. We call this actor reference clientApp, because the ScalaFTP client application will send the Start message to the client actor. Once the client actor receives an ActorIdentity message with the ref reference to the server actor, it can send true back to the clientApp reference, indicating that the connection was successful. In this case, the client actor switches to the connected behavior. Otherwise, if the client actor receives an ActorIdentity message without the server reference, the client actor sends false back to the application, and reverts to the unconnected state:

def connecting(clientApp: ActorRef): Actor.Receive = {
  case ActorIdentity(_, Some(ref)) =>
    clientApp ! true
    context.become(connected(ref))
  case ActorIdentity(_, None) =>
    clientApp ! false
    context.become(unconnected)
}

The connected state uses the serverActor server actor reference to forward Command messages. To do so, the client actor uses the Akka ask pattern, which returns a future object with the server's response. The contents of the future are piped back to the original sender of the Command message. In this way, the client actor serves as an intermediary between the application, which is the sender, and the server actor. The connected method is shown in the following code snippet:

def connected(serverActor: ActorRef): Actor.Receive = {
  case command: Command =>
    (serverActor ? command).pipeTo(sender)
}

Finally, the receive method returns the unconnected behavior, in which the client actor is created:

def receive = unconnected

Having implemented the client actor, we can proceed to the client API layer. We model it as a trait with a connected value, the concrete methods getFileList, copyFile, and deleteFile, and an abstract host method. The client API creates a private remoting actor system and a client actor. It then instantiates the connected future, which computes the connection status by sending a Start message to the client actor. The methods getFileList, copyFile, and deleteFile are similar. They use the ask pattern on the client actor to obtain a future with the response. Recall that actor messages are not typed, and the ask pattern returns a Future[Any] object. For this reason, each method in the client API uses the mapTo future combinator to restore the type of the message:

trait FTPClientApi {
  implicit val timeout: Timeout = Timeout(4 seconds)
  private val props = Props(classOf[FTPClientActor], timeout)
  private val system = ch8.remotingSystem("FTPClientSystem", 0)
  private val clientActor = system.actorOf(props)
  def host: String

  val connected: Future[Boolean] = {
    val f = clientActor ? FTPClientActor.Start(host)
    f.mapTo[Boolean]
  }

  def getFileList(d: String): Future[(String, Seq[FileInfo])] = {
    val f = clientActor ? FTPServerActor.GetFileList(d)
    f.mapTo[Seq[FileInfo]].map(fs => (d, fs))
  }

  def copyFile(src: String, dest: String): Future[String] = {
    val f = clientActor ? FTPServerActor.CopyFile(src, dest)
    f.mapTo[Try[String]].map(_.get)
  }

  def deleteFile(srcpath: String): Future[String] = {
    val f = clientActor ? FTPServerActor.DeleteFile(srcpath)
    f.mapTo[Try[String]].map(_.get)
  }
}

Note that the client API does not expose the fact that it uses actors for remote communication. Moreover, the client API is similar to the server API, but the return types of the methods are futures instead of normal values. Futures encode the latency of a method without exposing the cause of the latency, so we often find them at the boundaries between different APIs.
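From a caller's perspective, the client API is nothing but futures. The following usage sketch is illustrative rather than book code; the host value and the for-comprehension over the returned futures are assumptions:

import scala.concurrent.ExecutionContext.Implicits.global

object ClientApiTest extends App with FTPClientApi {
  def host = "127.0.0.1:12345"

  // Connect first, then list the remote root directory.
  val listing = for {
    isConnected <- connected
    if isConnected
    (dir, files) <- getFileList(".")
  } yield s"$dir contains ${files.map(_.name).mkString(", ")}"

  listing.foreach(println)
}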
We could internally replace the actor communication between the client and the server with remote Observable objects, but that would not change the client API.

In a concurrent application, use futures at the boundaries of the layers to express latency.

Now that we can programmatically communicate with the remote ScalaFTP server, we turn our attention to the user interface of the client program.

Summary

This article summarized the different concurrency libraries covered earlier, and showed how to choose the correct concurrent abstraction to solve a given problem. We also learned how to combine different concurrency abstractions when designing larger concurrent applications.
High Availability Scenarios

Packt
26 Nov 2014
14 min read
"Live Migration between hosts in a Hyper-V cluster is very straightforward and requires no specific configuration, apart from type and amount of simultaneous Live Migrations. If you add multiple clusters and standalone Hyper-V hosts into the mix, I strongly advise you to configure Kerberos Constrained Delegation for all hosts and clusters involved." Hans Vredevoort – MVP Hyper-V This article written by Benedict Berger, the author of Hyper-V Best Practices, will guide you through the installation of Hyper-V clusters and their best practice configuration. After installing the first Hyper-V host, it may be necessary to add another layer of availability to your virtualization services. With Failover Clusters, you get independence from hardware failures and are protected from planned or unplanned service outages. This article includes prerequirements and implementation of Failover Clusters. (For more resources related to this topic, see here.) Preparing for High Availability Like every project, a High Availability (HA) scenario starts with a planning phase. Virtualization projects are often turning up the question for additional availability for the first time in an environment. In traditional data centers with physical server systems and local storage systems, an outage of a hardware component will only affect one server hosting one service. The source of the outage can be localized very fast and the affected parts can be replaced in a short amount of time. Server virtualization comes with great benefits, such as improved operating efficiency and reduced hardware dependencies. However, a single component failure can impact a lot of virtualized systems at once. By adding redundant systems, these single points of failure can be avoided. Planning a HA environment The most important factor in the decision whether you need a HA environment is your business requirements. You need to find out how often and how long an IT-related production service can be interrupted unplanned, or planned, without causing a serious problem to your business. Those requirements are defined in a central IT strategy of a business as well as in process definitions that are IT-driven. They include Service Level Agreements of critical business services run in the various departments of your company. If those definitions do not exist or are unavailable, talk to the process owners to find out the level of availability needed. High Availability is structured in different classes, measured by the total uptime in a defined timespan, that is 99.999 percent in a year. Every nine in this figure adds a huge amount of complexity and money needed to ensure this availability, so take time to find out the real availability needed by your services and resist the temptation to plan running every service on multi-redundant, geo-spread cluster systems, as it may not fit in the budget. Be sure to plan for additional capacity in a HA environment, so you can lose hardware components without the need to sacrifice application performance. Overview of the Failover Cluster A Hyper-V Failover Cluster consists of two or more Hyper-V Server compute nodes. Technically, it's possible to use a Failover Cluster with just one computing node; however, it will not provide any availability advantages over a standalone host and is typically only used for migration scenarios. A Failover Cluster is hosting roles such as Hyper-V virtual machines on its computing nodes. 
If one node fails due to a hardware problem, it will no longer answer cluster heartbeat communication, so the service interruption is detected almost instantly. The virtual machines running on that node are powered off immediately due to the hardware failure on their computing node. The remaining cluster nodes then immediately take over these VMs in an unplanned failover process and start them on their own hardware. The virtual machines will be back up and running after a successful boot of their operating systems and applications, in just a few minutes. Hyper-V Failover Clusters work under the condition that all compute nodes have access to a shared storage instance holding the virtual machine configuration data and virtual hard disks. In case of a planned failover, that is, for patching compute nodes, it's possible to move running virtual machines from one cluster node to another without interrupting the VM. All cluster nodes can run virtual machines at the same time, as long as there is enough failover capacity to run all services when a node goes down. Even though a Hyper-V cluster is still called a Failover Cluster (utilizing the Windows Server Failover Clustering feature), it is indeed capable of running as an Active/Active cluster. Ensuring that all these capabilities of a Failover Cluster actually work demands an accurate planning and implementation process.

Failover Cluster prerequisites

To successfully implement a Hyper-V Failover Cluster, we need suitable hardware, software, permissions, and network and storage infrastructure, as outlined in the following sections.

Hardware

The hardware used in a Failover Cluster environment needs to be validated against the Windows Server Catalog. Microsoft will only support Hyper-V clusters when all components are certified for Windows Server 2012 R2. The servers used to run our HA virtual machines should ideally consist of identical hardware models with identical components. It is possible, and supported, to run servers with different hardware components in the same cluster, that is, different sizes of RAM; however, due to the higher level of complexity, this is not best practice. Special planning considerations are needed to address the CPU requirements of a cluster. To ensure maximum compatibility, all CPUs in a cluster should be exactly the same model. While it's technically possible to mix CPUs from Intel and AMD in the same cluster despite their different architectures, you will lose core cluster capabilities such as Live Migration. Choosing a single vendor for your CPUs is not enough; even when using different CPU models from the same vendor, your cluster nodes may support different sets of CPU instruction set extensions. With different instruction sets, Live Migrations won't work either. There is a compatibility mode that disables most of the instruction set extensions on all CPUs on all cluster nodes; however, this has a negative impact on performance and should be avoided. A better approach to this problem would be creating another cluster from the legacy CPUs, running smaller or non-production workloads without affecting your high-performance production workloads. If you want to extend your cluster after some time, you will find yourself with the problem of not having the exact same hardware available to purchase.
Choose the current revision of the model or product line you are already using in your cluster, and manually compare the CPU instruction sets at http://ark.intel.com/ and http://products.amd.com/, respectively. Choose the current CPU model that best fits the original CPU features of your cluster, and have this design validated by your hardware partner. Ensure that your servers are equipped with compatible CPUs, the same amount of RAM, and the same network cards and storage controllers.

The network design

Mixing different vendors of network cards in a single server is fine and best practice for availability, but make sure all your Hyper-V hosts are using an identical hardware setup. A network adapter should be used exclusively for either LAN traffic or storage traffic. Do not mix these two types of communication in any basic scenario. There are some more advanced scenarios involving converged networking that can enable mixed traffic, but in most cases, this is not a good idea. A Hyper-V Failover Cluster requires multiple layers of communication between its nodes and storage systems. Hyper-V networking and storage options have changed dramatically through the different releases of Hyper-V. With Windows Server 2012 R2, the network design options are endless. In this article, we will work with a typical basic network design. We have at least six Network Interface Cards (NICs) available in our servers, with a bandwidth of 1 Gb/s. If you have more than five interface cards available per server, use NIC Teaming to ensure the availability of the network, or even use converged networking. Converged networking will also be your choice if you have fewer than five network adapters available.

- The first NIC will be used exclusively for host communication with our Hyper-V host and will not be involved in VM network traffic or cluster communication at any time. It will carry Active Directory and management traffic to our Management OS.
- The second NIC will ensure Live Migration of virtual machines between our cluster nodes.
- The third NIC will be used for VM traffic. Our virtual machines will be connected to the various production and lab networks through this NIC.
- The fourth NIC will be used for internal cluster communication.

The first four NICs can either be teamed through Windows Server NIC Teaming, or abstracted from the physical hardware through Windows Server network virtualization and a converged fabric design.

- The fifth NIC will be reserved for storage communication. As advised, we will be isolating storage and production LAN communication from each other. If you do not use iSCSI or SMB3 storage communication, this NIC will not be necessary. If you use Fibre Channel SAN technology, use an FC HBA instead. If you leverage Direct Attached Storage (DAS), use the appropriate connector for storage communication.
- The sixth NIC will also be used for storage communication, as a redundancy. The redundancy will be established via MPIO and not via NIC Teaming.

There is no need for a dedicated heartbeat network as in older revisions of Windows Server with Hyper-V. All cluster networks will automatically be used for sending heartbeat signals to the other cluster members. If you don't have 1 Gb/s interfaces available, or if you use 10 GbE adapters, it's best practice to implement a converged networking solution.
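As a quick illustration of the teaming option just mentioned, the NIC Teaming cmdlets built into Windows Server 2012 R2 can combine two adapters into one team. The team and member names in this sketch are placeholders, not the adapter names used later in this article:

# Create a switch-independent team from two physical adapters;
# "NIC1" and "NIC2" are placeholder adapter names.
New-NetLbfoTeam -Name "VMTeam" -TeamMembers "NIC1","NIC2" `
    -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic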
Storage design

All cluster nodes must have access to the virtual machines residing on a centrally shared storage medium. This could be a classic setup with a SAN or a NAS, or a more modern concept with Windows Scale-Out File Servers hosting virtual machine files on SMB3 file shares. In this article, we will use a NetApp SAN system that's capable of providing a classic SAN approach with LUNs mapped to our hosts, as well as utilizing SMB3 file shares, but any other Windows Server 2012 R2 validated SAN will fulfill the requirements. In our first setup, we will utilize Cluster Shared Volumes (CSVs) to store several virtual machines on the same storage volume. Creating a single volume per virtual machine is no longer a good idea, due to the massive management overhead it creates. It's a good rule of thumb to create one CSV per cluster node; in larger environments with more than eight hosts, one CSV per two to four cluster nodes. To utilize CSVs, follow these steps:

1. Ensure that all components (SAN, firmware, HBAs, and so on) are validated for Windows Server 2012 R2 and are up to date.
2. Connect your SAN physically to all your Hyper-V hosts via iSCSI or Fibre Channel connections.
3. Create two LUNs on your SAN for hosting virtual machines. Activate Hyper-V performance options for these LUNs if possible (that is, on a NetApp, by setting the LUN type to Hyper-V). Size the LUNs with enough capacity to host all your virtual hard disks, and label them CSV01 and CSV02 with appropriate LUN IDs.
4. Create another small LUN, 1 GB in size, and label it Quorum.
5. Make the LUNs available to all Hyper-V hosts in this specific cluster by mapping them on the storage device. Do not make these LUNs available to any other hosts or clusters.
6. Prepare storage DSMs and drivers (that is, MPIO) for Hyper-V host installation.
7. Refresh the disk configuration on the hosts, install drivers and DSMs, and format the volumes as NTFS (quick).
8. Install Microsoft Multipath I/O when using redundant storage paths:

Install-WindowsFeature -Name Multipath-IO -ComputerName ElanityHV01, ElanityHV02

In this example, I added the MPIO feature to two Hyper-V hosts with the computer names ElanityHV01 and ElanityHV02. SANs are typically equipped with two storage controllers for redundancy reasons. Make sure to disperse your workloads over both controllers for optimal availability and performance. If you leverage file servers providing SMB3 shares, the preceding steps do not apply to you. Perform the following steps instead:

1. Create a storage space with the desired disk types, using storage tiering if possible.
2. Create a new SMB3 file share for applications.
3. Customize the permissions to include all Hyper-V servers from the planned clusters, as well as the Hyper-V cluster object itself, with full control.

Server and software requirements

To create a Failover Cluster, you need to install a second Hyper-V host. Use the same unattended file, but change the IP address and the hostname. Join both Hyper-V hosts to your Active Directory domain if you have not done so already. Hyper-V can be clustered without leveraging Active Directory, but such a cluster lacks several key components, such as Live Migration, so this shouldn't be done on purpose. The ability to successfully boot a domain-joined Hyper-V cluster without any Active Directory domain controller present at boot time is the major benefit of the Active Directory independence of Failover Clusters. Ensure that you create a Hyper-V virtual switch as shown earlier, with the same name on both hosts (a scripted sketch follows below), to ensure cluster compatibility, and make sure that both nodes are installed with all updates.
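Here is one way to script that identically named virtual switch on both nodes. The switch is bound to the adapter renamed to VMs later in this article, and Invoke-Command is just one way to run the command on both hosts; adjust the names to your environment:

# Create a virtual switch named "VMs" on both hosts, bound to the
# NIC reserved for VM traffic; the management OS does not use this NIC.
Invoke-Command -ComputerName ElanityHV01, ElanityHV02 -ScriptBlock {
    New-VMSwitch -Name "VMs" -NetAdapterName "VMs" -AllowManagementOS $false
}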
If you have System Center 2012 R2 in place, use System Center Virtual Machine Manager to create a Hyper-V cluster.

Implementing Failover Clusters

After preparing our Hyper-V hosts, we will now create a Failover Cluster using PowerShell. I'm assuming your hosts are installed, storage and network connections are prepared, and the Hyper-V role is already active, utilizing up-to-date drivers and firmware on your hardware. First, we need to ensure that the server name, date, and time of our hosts are correct. Time and time zone configuration should occur via Group Policy. For automatic network configuration later on, it's important to rename the network connections from their defaults to their designated roles using PowerShell, as seen in the following commands:

Rename-NetAdapter -Name "Ethernet" -NewName "Host"
Rename-NetAdapter -Name "Ethernet 2" -NewName "LiveMig"
Rename-NetAdapter -Name "Ethernet 3" -NewName "VMs"
Rename-NetAdapter -Name "Ethernet 4" -NewName "Cluster"
Rename-NetAdapter -Name "Ethernet 5" -NewName "Storage"

The Network Connections window should look like the following screenshot:

Hyper-V host Network Connections

Next comes the IP configuration of the network adapters. If you are not using DHCP for your servers, manually set the IP configuration (different subnets) of the specified network cards. Here is a great blog post on how to automate this step: http://bit.ly/Upa5bJ

Next, we need to activate the necessary Failover Clustering features on both of our Hyper-V hosts:

Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools -ComputerName ElanityHV01, ElanityHV02

Before actually creating the cluster, we launch the cluster validation cmdlet via PowerShell:

Test-Cluster ElanityHV01, ElanityHV02

Test-Cluster cmdlet

Open the generated .mht file for more details, as shown in the following screenshot:

Cluster validation

As you can see, there are some warnings that should be investigated. However, as long as there are no errors, the configuration is ready for clustering and fully supported by Microsoft. Still, check the warnings to be sure you won't run into problems in the long run. After you have fixed potential errors and warnings listed in the Cluster Validation Report, you can finally create the cluster as follows:

New-Cluster -Name CN=ElanityClu1,OU=Servers,DC=cloud,DC=local -Node ElanityHV01, ElanityHV02 -StaticAddress 192.168.1.49

This will create a new cluster named ElanityClu1, consisting of the nodes ElanityHV01 and ElanityHV02 and using the cluster IP address 192.168.1.49. This cmdlet will create the cluster and the corresponding Active Directory object in the specified OU. Moving the cluster object to a different OU later on is no problem at all; even renaming is possible when done the right way. After creating the cluster, when you open the Failover Cluster Management console, you should be able to connect to your cluster:

Failover Cluster Manager

You will see that all your cluster nodes and Cluster Core Resources are online. Rerun the Validation Report and copy the generated .mht files to a secure location if you need them for support queries. Keep in mind that you have to rerun this wizard if any hardware or configuration changes occur to the cluster components, including any of its nodes. The initial cluster setup is now complete and we can continue with post-creation tasks.
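One typical post-creation task is bringing the prepared LUNs into the cluster and converting them into Cluster Shared Volumes. A short sketch with the built-in Failover Clustering cmdlets follows; the disk name is whatever the cluster reports for your CSV01 LUN, so check it with Get-ClusterResource first:

# Add all disks visible to the cluster, then promote one to a CSV.
Get-ClusterAvailableDisk -Cluster ElanityClu1 | Add-ClusterDisk
Add-ClusterSharedVolume -Cluster ElanityClu1 -Name "Cluster Disk 1"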
Summary

With the knowledge from this article, you are now able to design and implement Hyper-V Failover Clusters as well as guest clusters. You are aware of the basic concepts of High Availability and the storage and networking options necessary to achieve it, and you have seen real-world proven configurations that ensure a stable operating environment.
Detecting Beacons – Showing an Advert

Packt
25 Nov 2014
26 min read
In this article, by Craig Gilchrist, author of the book Learning iBeacon, we're going to deepen our understanding of the broadcasting triplet and expand on some of the important classes within the Core Location framework. To help demonstrate the more in-depth concepts, we'll build an app that shows different advertisements depending on the major and minor values of the beacon that it detects. We'll be using the context of an imaginary department store called Matey's. Matey's is currently running iBeacon trials in its flagship London store and at the moment is giving offers on its different-themed restaurants and on its ladies' clothing to users of its branded app.

Uses of the UUID/major/minor broadcasting triplet

In the last article, we covered the reasons behind the broadcasting triplet; now we're going to use the triplet in a more realistic scenario. Let's go over the three values again in some more detail.

UUID – Universally Unique Identifier

The UUID is meant to be unique to your app. It can be spoofed, but generally, your app would be the only app looking for that UUID. The UUID identifies a region, which is the maximum broadcast range of a beacon from its center point. Think of a region as a circle of broadcast with the beacon in the middle. If lots of beacons with the same UUID have overlapping broadcasting ranges, then the region is represented by the broadcasting range of all the beacons combined, as shown in the following figure. The combined range of all the beacons with the same UUID becomes the region.

Broadcast range

More specifically, the region is represented by an instance of the CLBeaconRegion class, which we'll cover in more detail later in this article. The following code shows how to configure CLBeaconRegion:

NSString * uuidString = @"78BC6634-A424-4E05-A2AE-A59A25CAC4A9";

NSUUID * regionUUID;
regionUUID = [[NSUUID alloc] initWithUUIDString:uuidString];

CLBeaconRegion * region;
region = [[CLBeaconRegion alloc] initWithProximityUUID:regionUUID identifier:@"My Region"];

Generally, most apps will be monitoring only for one region. This is normally sufficient, since the major and minor values are 16-bit unsigned integers, which means that each value can be a number up to 65,535, giving 4,294,836,225 unique beacon combinations per UUID. Since the major and minor values are used to represent a subsection of the use case, there may be a time when 65,535 combinations of a major value are not enough, and this would be the rare time that your app monitors multiple regions with different UUIDs. Another, more likely, example is that your app has multiple use cases that are more logically split by UUID. An example where an app has multiple use cases would be a loyalty app that has offers for many different retailers when the app is within the vicinity of the retail stores. Here you can have a different UUID for every retailer.

Major

The major value further identifies your use case. The major value should separate your use case along logical categories. These could be sections in a shopping mall or exhibits in a museum. In our example, the major value represents the different types of service within a department store. In some cases, you may wish to separate logical categories into more than one major value. This would only be if each category has more than 65,535 beacons.

Minor

The minor value ultimately identifies the beacon itself.
If you consider the major value as the category, then the minor value is the beacon within that category.

Example of a use case

The example laid out in this article uses the following UUID/major/minor values to broadcast different adverts for Matey's. Both departments share the UUID 8F0C1DDC-11E5-4A07-8910-425941B072F9:

- Food (major value 1): minor value 1 gives 30 percent off on sushi at The Japanese Kitchen; minor value 2 gives buy one get one free at Tucci's Pizza.
- Women's clothing (major value 2): minor value 1 gives 50 percent off on all ladies' clothing; there is no second offer.

Understanding Core Location

The Core Location framework lets you determine the current location or heading associated with the device. The framework has been around since 2008 and was present in iOS 2.0. Up until the release of iOS 7, the framework was only used for geolocation based on GPS coordinates, and so was suitable only for outdoor location. The framework got a new set of classes, and new methods were added to the existing classes, to accommodate the beacon-based location functionality. Let's explore a few of these classes in more detail.

The CLBeaconRegion class

Geofencing is a feature in a software program that uses the Global Positioning System (GPS) or radio frequency identification (RFID) to define geographical boundaries. A geofence is a virtual barrier. The CLBeaconRegion class defines a geofenced boundary identified by a UUID and the collective range of all physical beacons with the same UUID. When a beacon matching the CLBeaconRegion UUID comes into range, the region triggers the delivery of an appropriate notification. CLBeaconRegion inherits CLRegion, which also serves as the superclass of CLCircularRegion. The CLCircularRegion class defines the location and boundaries for a circular geographic region. You can use instances of this class to define geofences for a specific location, but it shouldn't be confused with CLBeaconRegion. The CLCircularRegion class shares many of the same methods, but is specifically related to a geographic location based on the GPS coordinates of the device. The following figure shows the CLRegion class and its descendants.

The CLRegion class hierarchy

The CLLocationManager class

The CLLocationManager class defines the interface for configuring the delivery of location- and heading-related events to your application. You use an instance of this class to establish the parameters that determine when location and heading events should be delivered, and to start and stop the actual delivery of those events. You can also use a location manager object to retrieve the most recent location and heading data.

Creating a CLLocationManager class

The CLLocationManager class is used to track both geolocation and proximity based on beacons. To start tracking beacon regions using the CLLocationManager class, we need to do the following:

1. Create an instance of CLLocationManager.
2. Assign an object conforming to the CLLocationManagerDelegate protocol to the delegate property.
3. Call the appropriate start method to begin the delivery of events.

All location- and heading-related updates are delivered to the associated delegate object, which is a custom object that you provide.
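Before walking through the delegate wiring, it is worth noting that CLBeaconRegion can also be scoped more narrowly than a bare UUID, using the initializers that take major and minor values. The following sketch applies them to the Matey's values from the earlier table:

NSUUID * mateysUUID = [[NSUUID alloc] initWithUUIDString:@"8F0C1DDC-11E5-4A07-8910-425941B072F9"];

// A region covering every beacon in the food department (major 1).
CLBeaconRegion * foodRegion = [[CLBeaconRegion alloc] initWithProximityUUID:mateysUUID major:1 identifier:@"Mateys Food"];

// A region covering exactly one beacon: the sushi offer (major 1, minor 1).
CLBeaconRegion * sushiRegion = [[CLBeaconRegion alloc] initWithProximityUUID:mateysUUID major:1 minor:1 identifier:@"Mateys Sushi"];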
Defining a CLLocationManager class line by line

Consider the following steps to define a CLLocationManager class line by line:

1. Every class that needs to be notified about CLLocationManager events needs to first import the Core Location framework (usually in the header file), as shown:

#import <CoreLocation/CoreLocation.h>

2. Then, once the framework is imported, the class needs to declare itself as implementing the CLLocationManagerDelegate protocol, like the following view controller does:

@interface MyViewController : UIViewController<CLLocationManagerDelegate>

3. Next, you need to create an instance of CLLocationManager and set your class as the instance delegate of CLLocationManager, as shown:

CLLocationManager * locationManager = [[CLLocationManager alloc] init];
locationManager.delegate = self;

4. You then need a region for your location manager to work with:

// Create a unique ID to identify our region.
NSUUID * regionId = [[NSUUID alloc] initWithUUIDString:@"AD32373E-9969-4889-9507-C89FCD44F94E"];

// Create a region to monitor.
CLBeaconRegion * beaconRegion = [[CLBeaconRegion alloc] initWithProximityUUID:regionId identifier:@"My Region"];

5. Finally, you need to call the appropriate start method using the beacon region. Each start method has a different purpose, which we'll explain shortly:

// Start monitoring and ranging beacons.
[locationManager startMonitoringForRegion:beaconRegion];
[locationManager startRangingBeaconsInRegion:beaconRegion];

Once the class is set up, you need to implement the methods of the CLLocationManagerDelegate protocol. Some of the most important delegate methods are explained shortly. This isn't an exhaustive list of the methods, but it does include all of the important methods we'll be using in this article.

locationManager:didEnterRegion

Whenever you enter a region that your location manager has been instructed to look for (by calling startMonitoringForRegion), the locationManager:didEnterRegion delegate method is called. This method gives you an opportunity to do something with the region, such as start ranging specific beacons, shown as follows:

-(void)locationManager:(CLLocationManager *)manager
  didEnterRegion:(CLRegion *)region {
    // Do something when we enter a region.
}

locationManager:didExitRegion

Similarly, when you exit the region, the locationManager:didExitRegion delegate method is called. Here you can do things like stop ranging specific beacons, shown as follows:

-(void)locationManager:(CLLocationManager *)manager
  didExitRegion:(CLRegion *)region {
    // Do something when we exit a region.
}

When testing your region monitoring code on a device, realize that region events may not happen immediately after a region boundary is crossed. To prevent spurious notifications, iOS does not deliver region notifications until certain threshold conditions are met. Specifically, the user's location must cross the region boundary, move away from that boundary by a minimum distance, and remain at that minimum distance for at least 20 seconds before the notifications are reported.

locationManager:didRangeBeacons:inRegion

The locationManager:didRangeBeacons:inRegion method is called whenever a beacon (or a number of beacons) changes distance from the device.
For now, it's enough to know that each beacon that's returned in this array has a property called proximity, which returns a CLProximity enum value (CLProximityUnknown, CLProximityFar, CLProximityNear, and CLProximityImmediate), shown as follows:

-(void)locationManager:(CLLocationManager *)manager
  didRangeBeacons:(NSArray *)beacons
  inRegion:(CLBeaconRegion *)region {
    // Do something with the array of beacons.
}

locationManager:didChangeAuthorizationStatus

Finally, there's one more delegate method to cover. Whenever the users grant or deny authorization to use their location, locationManager:didChangeAuthorizationStatus is called. This method is passed a CLAuthorizationStatus enum (kCLAuthorizationStatusNotDetermined, kCLAuthorizationStatusRestricted, kCLAuthorizationStatusDenied, and kCLAuthorizationStatusAuthorized), shown as follows:

-(void)locationManager:(CLLocationManager *)manager
  didChangeAuthorizationStatus:(CLAuthorizationStatus)status {
    // Do something with the new authorization status.
}

Understanding iBeacon permissions

It's important to understand that apps using the Core Location framework are essentially monitoring location, and therefore, they have to ask the user for their permission. The authorization status of a given application is managed by the system and determined by several factors. Applications must be explicitly authorized to use location services by the user, and location services must themselves be enabled for the system. A request for user authorization is displayed automatically when your application first attempts to use location services. Requesting the location can be a fine balancing act. Asking for permission at a point in the app when your user wouldn't think it was relevant makes it more likely that they will decline it. It makes more sense to tell the users why you're requesting their location and why it benefits them before requesting it, so as not to scare away your more squeamish users. Building those kinds of information views isn't covered in this book, but to demonstrate the way a user is asked for permission, our app should show an alert like this:

Requesting location permission

If your user taps Don't Allow, then location can't be enabled through the app unless it's deleted and reinstalled. The only way to allow location after denying it is through the settings.

Location permissions in iOS 8

Since iOS 8.0, additional steps are required to obtain location permissions. In order to request location in iOS 8.0, you must now provide a friendly message in the app's plist by using the NSLocationAlwaysUsageDescription key, and also make a call to the CLLocationManager class' requestAlwaysAuthorization method. The NSLocationAlwaysUsageDescription key describes the reason the app accesses the user's location information. Include this key when your app uses location services in a potentially nonobvious way while running in the foreground or the background. There are two types of location permission requests as of iOS 8, as specified by the following plist keys:

NSLocationWhenInUseUsageDescription: This plist key is required when you use the requestWhenInUseAuthorization method of the CLLocationManager class to request authorization for location services. If this key is not present and you call the requestWhenInUseAuthorization method, the system ignores your request and prevents your app from using location services.
NSLocationAlwaysUsageDescription: This key is required when you use the requestAlwaysAuthorization method of the CLLocationManager class to request authorization for location services. If the key is not present when you call the requestAlwaysAuthorization method, the system ignores your request. Since iBeacon requires location services in the background, we will only ever use the NSLocationAlwaysUsageDescription key, together with a call to the CLLocationManager class' requestAlwaysAuthorization method.

Enabling location after denying it

If a user denies enabling location services, you can follow the given steps to enable the service again on iOS 7:

1. Open the iOS device settings and tap on Privacy.
2. Go to the Location Services section.
3. Turn location services on for your app by flicking the switch next to your app name.

When your device is running iOS 8, you need to follow these steps:

1. Open the iOS device settings.
2. Go to your app in the Settings menu.
3. Tap on Privacy.
4. Tap on Location Services.
5. Set Allow Location Access to Always.
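Because the same code base may run on both iOS 7 and iOS 8, the authorization request is commonly guarded so that it is only sent where the selector exists; on iOS 7, the permission prompt appears automatically instead. A small sketch of that pattern, assuming self.locationManager has already been created:

// Request always-on authorization, but only on iOS 8 and later,
// where the requestAlwaysAuthorization selector exists.
if ([self.locationManager respondsToSelector:@selector(requestAlwaysAuthorization)]) {
    [self.locationManager requestAlwaysAuthorization];
}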
Under the Custom iOS Target Properties dictionary, add the NSLocationAlwaysUsageDescription key with the value. This app needs your location to give you wonderful offers. Adding some controls The offer view controller needs two controls to show the offer the view is representing, an image view and a label. Consider the following steps to add some controls to the view controller: Open the LIOfferViewController.h file and add the following properties to the header: @property (nonatomic, strong) UILabel * offerLabel; @property (nonatomic, strong) UIImageView * offerImageView; Now, we need to create them. Open the LIOfferViewController.m file and first, let's synthesize the controls. Add the following code just below the @implementation LIOfferViewController line: @synthesize offerLabel; @synthesize offerImageView; We've declared the controls; now, we need to actually create them. Within the viewDidLoad method, we need to create the label and image view. We don't need to set the actual values or images of our controls. This will be done by LIViewController when it encounters a beacon. Create the label by adding the following code below the call to [super viewDidLoad]. This will instantiate the label making it 300 points wide and appear 10 points from the left and top: UILabel * label = [[UILabel alloc]   initWithFrame:CGRectMake(10, 10, 300, 100)]; Now, we need to set some properties to style the label. We want our label to be center aligned, white in color, and with bold text. We also want it to auto wrap when it's too wide to fit the 300 point width. Add the following code: label setTextAlignment:NSTextAlignmentCenter]; [label setTextColor:[UIColor whiteColor]]; [label setFont:[UIFont boldSystemFontOfSize:22.f]]; label.numberOfLines = 0; // Allow the label to auto wrap. Now, we need to add our new label to the view and assign it to our property: [self.view addSubview:label]; self.offerLabel = label; Next, we need to create an image. Our image needs a nice border; so to do this, we need to add the QuartzCore framework. Add the QuartzCore framework like we did with CoreLocation in the previous article, and come to mention it, we'll need CoreLocation; so, add that too. Once that's done add #import <QuartzCore/QuartzCore.h> to the top of the LIOfferViewController.m file. Now, add the following code to instantiate the image view and add it to our view: UIImageView * imageView = [[UIImageView alloc]   initWithFrame:CGRectMake(10, 120, 300, 300)]; [imageView.layer setBorderColor:[[UIColor   whiteColor] CGColor]]; [imageView.layer setBorderWidth:2.f]; imageView.contentMode = UIViewContentModeScaleToFill; [self.view addSubview:imageView]; self.offerImageView = imageView; Setting up our root view controller Let's jump to LIViewController now and start looking for beacons. We'll start by telling LIViewController that LIOfferViewController exists and also that the view controller should act as a location manager delegate. 
Consider the following steps:

Open LIViewController.h and add an import to the top of the file:

#import <CoreLocation/CoreLocation.h>
#import "LIOfferViewController.h"

Now, add the CLLocationManagerDelegate protocol to the declaration:

@interface LIViewController :
  UIViewController<CLLocationManagerDelegate>

LIViewController also needs three things to manage its role:

A reference to the current offer on display so that we know to show only one offer at a time
An instance of CLLocationManager for monitoring beacons
A list of offers seen so that we only show each offer once

Let's add these three things to the interface in the LIViewController.m file (as they're private instances). Change the LIViewController interface to look like this:

@interface LIViewController ()
   @property (nonatomic, strong) CLLocationManager *
      locationManager;
   @property (nonatomic, strong) NSMutableDictionary *
      offersSeen;
   @property (nonatomic, strong) LIOfferViewController *
      currentOffer;
@end

Configuring our location manager

Our location manager needs to be configured when the root view controller is first created, and also when the app becomes active. It makes sense therefore that we put this logic into a method. Our reset beacon method needs to do the following things:

Clear down our list of offers seen
Request permission to use the user's location
Create the location manager and set our LIViewController instance as its delegate
Create a beacon region and tell CLLocationManager to start monitoring and ranging beacons

Let's add the code to do this now:

-(void)resetBeacons {
   // Initialize the location manager.
   self.locationManager = [[CLLocationManager alloc] init];
   self.locationManager.delegate = self;

   // Request permission.
   [self.locationManager requestAlwaysAuthorization];

   // Clear the offers seen.
   self.offersSeen = [[NSMutableDictionary alloc]
     initWithCapacity:3];

   // Create a region.
   NSUUID * regionId = [[NSUUID alloc] initWithUUIDString:
     @"8F0C1DDC-11E5-4A07-8910-425941B072F9"];

   CLBeaconRegion * beaconRegion = [[CLBeaconRegion alloc]
     initWithProximityUUID:regionId identifier:@"Mateys"];

   // Start monitoring and ranging beacons.
   [self.locationManager stopRangingBeaconsInRegion:beaconRegion];
   [self.locationManager startMonitoringForRegion:beaconRegion];
   [self.locationManager startRangingBeaconsInRegion:beaconRegion];
}

Now, add the two calls to the reset beacon method to ensure that the location manager is reset when the app is first started and then every time the app becomes active. Let's add this code now by changing the viewDidLoad method and adding the applicationDidBecomeActive method:

-(void)viewDidLoad {
   [super viewDidLoad];
   [self resetBeacons];
}

- (void)applicationDidBecomeActive:(UIApplication *)application {
   [self resetBeacons];
}

Wiring up CLLocationManagerDelegate

Now, we need to wire up the delegate methods of the CLLocationManagerDelegate protocol so that LIViewController can show the offer view when the beacons come into proximity. The first thing we need to do is to set the background color of the view to show whether or not our app has been authorized to use the device location. If the authorization has not yet been determined, we'll use orange. If the app has been authorized, we'll use green. Finally, if the app has been denied, we'll use red. We'll be using the locationManager:didChangeAuthorizationStatus delegate method to do this.
Let's add the code now:

-(void)locationManager:(CLLocationManager *)manager
  didChangeAuthorizationStatus:(CLAuthorizationStatus) status {
   switch (status) {
       case kCLAuthorizationStatusNotDetermined:
       {
           // Set a lovely orange background
           [self.view setBackgroundColor:[UIColor
              colorWithRed:255.f/255.f green:147.f/255.f
              blue:61.f/255.f alpha:1.f]];
           break;
       }
       case kCLAuthorizationStatusAuthorized:
       {
           // Set a lovely green background.
           [self.view setBackgroundColor:[UIColor
              colorWithRed:99.f/255.f green:185.f/255.f
              blue:89.f/255.f alpha:1.f]];
           break;
       }
       default:
       {
           // Set a dark red background.
           [self.view setBackgroundColor:[UIColor
              colorWithRed:188.f/255.f green:88.f/255.f
              blue:88.f/255.f alpha:1.f]];
           break;
       }
   }
}

The next thing we need to do is to save battery life by only ranging beacons while we're within the region (except for when the app first starts). We do this by calling the startRangingBeaconsInRegion method within the locationManager:didEnterRegion delegate method and calling the stopRangingBeaconsInRegion method within the locationManager:didExitRegion delegate method. Add the following code to do what we've just described:

-(void)locationManager:(CLLocationManager *)manager
  didEnterRegion:(CLRegion *)region {
   [self.locationManager
      startRangingBeaconsInRegion:(CLBeaconRegion*)region];
}
-(void)locationManager:(CLLocationManager *)manager
  didExitRegion:(CLRegion *)region {
   [self.locationManager
      stopRangingBeaconsInRegion:(CLBeaconRegion*)region];
}

Showing the advert

To actually show the advert, we need to capture when a beacon is ranged by adding the locationManager:didRangeBeacons:inRegion delegate method to LIViewController. This method will be called every time the distance changes from an already discovered beacon in our region or when a new beacon is found for the region. The implementation is quite long, so I'm going to explain each part of the method as we write it. Start by creating the method implementation as follows:

-(void)locationManager:(CLLocationManager *)manager
  didRangeBeacons:(NSArray *)beacons inRegion:
  (CLBeaconRegion *)region {

}

We only want to show an offer associated with the beacon if we've not seen it before and there isn't a current offer being shown. We do this by checking the currentOffer property. If this property isn't nil, it means an offer is already being displayed and so, we need to return from the method.

The locationManager:didRangeBeacons:inRegion method gets called by the location manager and gets passed the region instance and an array of beacons that are currently in range. We only want to see each advert once in a session and so need to loop through each of the beacons to determine if we've seen it before. Let's add a for loop to iterate through the beacons, and in the beacon loop, do an initial check to see if there's an offer already showing:

for (CLBeacon * beacon in beacons) {
   if (self.currentOffer) return;
}

Our offersSeen property is an NSMutableDictionary containing all the beacons (and subsequently offers) that we've already seen. The key consists of the major and minor values of the beacon in the format {major|minor}; for example, a beacon with a major value of 1 and a minor value of 2 is stored under the key 1|2.
Let's create a string using the major and minor values and check whether this string exists in our offersSeen property by adding the following code to the loop:

NSString * majorMinorValue = [NSString stringWithFormat:
  @"%@|%@", beacon.major, beacon.minor];
if ([self.offersSeen objectForKey:majorMinorValue]) continue;

If offersSeen contains the key, then we continue looping. If the offer hasn't been seen, then we need to add it to the offers seen, before presenting the offer. Let's start by adding the key to our offers seen dictionary and then preparing an instance of LIOfferViewController:

[self.offersSeen setObject:[NSNumber numberWithBool:YES]
  forKey:majorMinorValue];
LIOfferViewController * offerVc = [[LIOfferViewController alloc]
  init];
offerVc.modalPresentationStyle = UIModalPresentationFullScreen;

Now, we're going to prepare some variables to configure the offer view controller. Food offers show with a blue background while clothing offers show with a red background. We use the major value of the beacon to determine the color and then find out the image and label based on the minor value:

UIColor * backgroundColor;
NSString * labelValue;
UIImage * productImage;

// Major value 1 is food, 2 is clothing.
if ([beacon.major intValue] == 1) {
   // Blue signifies food.
   backgroundColor = [UIColor colorWithRed:89.f/255.f
      green:159.f/255.f blue:208.f/255.f alpha:1.f];

   if ([beacon.minor intValue] == 1) {
       labelValue = @"30% off sushi at the Japanese Kitchen.";
       productImage = [UIImage imageNamed:@"sushi.jpg"];
   }
   else {
       labelValue = @"Buy one get one free at Tucci's Pizza.";
       productImage = [UIImage imageNamed:@"pizza.jpg"];
   }
}
else {
   // Red signifies clothing.
   backgroundColor = [UIColor colorWithRed:188.f/255.f
      green:88.f/255.f blue:88.f/255.f alpha:1.f];
   labelValue = @"50% off all ladies clothing.";
   productImage = [UIImage imageNamed:@"ladiesclothing.jpg"];
}

Finally, we need to set these values on the view controller and present it modally. We also need to set our currentOffer property to the view controller so that we don't show more than one offer at the same time:

[offerVc.view setBackgroundColor:backgroundColor];
[offerVc.offerLabel setText:labelValue];
[offerVc.offerImageView setImage:productImage];
[self presentViewController:offerVc animated:YES
  completion:nil];
self.currentOffer = offerVc;

Dismissing the offer

Since LIOfferViewController is a modal view, we're going to need a dismiss button; however, we also need some way of telling our root view controller (LIViewController) about it. Consider the following steps:

Add the following code to the LIViewController.h interface to declare a public method:

-(void)offerDismissed;

Now, add the implementation to LIViewController.m. This method simply clears the currentOffer property as the actual dismiss is handled by the offer view controller:

-(void)offerDismissed {
   self.currentOffer = nil;
}

Now, let's jump back to LIOfferViewController.
Add the following code to the end of the viewDidLoad method of LIOfferViewController to create a dismiss button:

UIButton * dismissButton = [[UIButton alloc]
  initWithFrame:CGRectMake(60.f, 440.f, 200.f, 44.f)];
[self.view addSubview:dismissButton];
[dismissButton setTitle:@"Dismiss"
  forState:UIControlStateNormal];
[dismissButton setTitleColor:[UIColor whiteColor]
  forState:UIControlStateNormal];
[dismissButton addTarget:self
  action:@selector(dismissTapped:)
  forControlEvents:UIControlEventTouchUpInside];

As you can see, the touch up event calls @selector(dismissTapped:), which doesn't exist yet. We can get a handle on LIViewController through the app delegate (which is an instance of LIAppDelegate). In order to use this, we need to import it and LIViewController. Add the following imports to the top of LIOfferViewController.m:

#import "LIViewController.h"
#import "LIAppDelegate.h"

Finally, let's complete the tutorial by adding the dismissTapped method:

-(void)dismissTapped:(UIButton*)sender {
   [self dismissViewControllerAnimated:YES completion:^{
       LIAppDelegate * delegate =
          (LIAppDelegate*)[UIApplication
          sharedApplication].delegate;
       LIViewController * rootVc =
          (LIViewController*)delegate.window.rootViewController;
       [rootVc offerDismissed];
   }];
}

Now, let's run our app. You should be presented with the location permission request as shown in the Requesting location permission figure, from the Understanding iBeacon permissions section. Tap on OK and then fire up the companion app. Play around with the Chapter 2 beacon configurations by turning them on and off. What you should see is something like the following figure:

Our app working with the companion OS X app

Remember that your app should only show one offer at a time and should only show each offer once per session.

Summary

Well done on completing your first real iBeacon powered app, which actually differentiates between beacons. In this article, we covered the real usage of UUID, major, and minor values. We also got introduced to the Core Location framework including the CLLocationManager class and its important delegate methods. We introduced the CLRegion class and discussed the permissions required when using CLLocationManager.

Resources for Article:

Further resources on this subject:
Interacting with the User [Article]
Physics with UIKit Dynamics [Article]
BSD Socket Library [Article]

Machine Learning Examples Applicable to Businesses

Packt
25 Nov 2014
7 min read
The purpose of this article by Michele Usuelli, author of the book R Machine Learning Essentials, is to show how machine learning helps in solving a business problem. (For more resources related to this topic, see here.)

Predicting the output

The past marketing campaign targeted part of the customer base. Among another 1,000 clients, how do we identify the 100 that are keener to subscribe? We can build a model that learns from the data and estimates which clients are more similar to the ones that subscribed in the previous campaign. For each client, the model estimates a score that is higher if the client is more likely to subscribe. There are different machine learning models determining the scores and we use two well-performing techniques, as follows:

Logistic regression: This is a variation of the linear regression to predict a binary output
Random forest: This is an ensemble of decision trees that works well in the presence of many features

In the end, we need to choose one of the two techniques. There are cross-validation methods that allow us to estimate model accuracy. Starting from that, we can measure the accuracy of both the options and pick the one performing better. After choosing the most proper machine learning algorithm, we can optimize it using cross validation. However, in order to avoid overcomplicating the model building, we don't perform any feature selection or parameter optimization. These are the steps to build and evaluate the models:

Load the randomForest package containing the random forest algorithm:

library('randomForest')

Define the formula defining the output and the variable names. The formula is in the format output ~ feature1 + feature2 + ...:

arrayFeatures <- names(dtBank)
arrayFeatures <- arrayFeatures[arrayFeatures != 'output']
formulaAll <- paste('output', '~')
formulaAll <- paste(formulaAll, arrayFeatures[1])
for(nameFeature in arrayFeatures[-1]){
  formulaAll <- paste(formulaAll, '+', nameFeature)
}
formulaAll <- formula(formulaAll)

Initialize the table containing all the testing sets:

dtTestBinded <- data.table()

Define the number of iterations:

nIter <- 10

Start a for loop:

for(iIter in 1:nIter){

Define the training and the test datasets:

indexTrain <- sample(x = c(TRUE, FALSE),
  size = nrow(dtBank),
  replace = T,
  prob = c(0.8, 0.2))
dtTrain <- dtBank[indexTrain]
dtTest <- dtBank[!indexTrain]

Select a subset from the test set in such a way that we have the same number of output == 0 and output == 1. First, we split dtTest in two parts (dtTest0 and dtTest1) on the basis of the output and we count the number of rows of each part (n0 and n1). Then, as dtTest0 has more rows, we randomly select n1 rows. In the end, we redefine dtTest binding dtTest0 and dtTest1, as follows:

dtTest1 <- dtTest[output == 1]
dtTest0 <- dtTest[output == 0]
n0 <- nrow(dtTest0)
n1 <- nrow(dtTest1)
dtTest0 <- dtTest0[sample(x = 1:n0, size = n1)]
dtTest <- rbind(dtTest0, dtTest1)

Build the random forest model using randomForest. The formula argument defines the relationship between variables and the data argument defines the training dataset. In order to avoid overcomplicating the model, all the other parameters are left as their defaults:

modelRf <- randomForest(formula = formulaAll,
  data = dtTrain)

Build the logistic regression model using glm, which is a function used to build Generalized Linear Models (GLM). GLMs are a generalization of the linear regression and they allow us to define a link function that connects the linear predictor with the outputs.
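To make the link function concrete, the logit link used here connects a probability p to the linear predictor. This is the standard formulation, with b0, b1, ..., bk as the fitted coefficients rather than anything specific to this dataset:

log(p / (1 - p)) = b0 + b1·x1 + ... + bk·xk

Equivalently, p = 1 / (1 + exp(-(b0 + b1·x1 + ... + bk·xk))), which is why the scores predicted later lie between 0 and 1 and can be read as subscription probabilities.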
The input is the same as the random forest (including the training dataset), with the addition of family = binomial(logit) defining that the regression is logistic:

modelLr <- glm(formula = formulaAll,
  data = dtTrain,
  family = binomial(logit))

Predict the output of the random forest. The function is predict and its main arguments are object defining the model and newdata defining the test set, as follows:

dtTest[, outputRf := predict(object = modelRf,
  newdata = dtTest, type='response')]

Predict the output of the logistic regression, using predict similar to the random forest. The other argument is type='response' and it is necessary in the case of the logistic regression:

dtTest[, outputLr := predict(object = modelLr,
  newdata = dtTest, type='response')]

Add the new test set to dtTestBinded:

dtTestBinded <- rbind(dtTestBinded, dtTest)

End the for loop:

}

We built dtTestBinded containing the output column that defines which clients subscribed and the scores estimated by the models. Comparing the scores with the real output, we can validate the model performances.

In order to explore dtTestBinded, we can build a chart showing how the scores of the non-subscribing clients are distributed. Then, we add the distribution of the subscribing clients to the chart and compare them. In this way, we can see the difference between the scores of the two groups. Since we use the same chart for the random forest and for the logistic regression, we define a function building the chart by following the given steps:

Define the function and its input that includes the data table and the name of the score column:

plotDistributions <- function(dtTestBinded, colPred){

Compute the distribution density for the clients that didn't subscribe. With output == 0, we extract the clients not subscribing, and using density, we define a density object. The adjust parameter defines the smoothing bandwidth, which is a parameter of the way we build the curve starting from the data. The bandwidth can be interpreted as the level of detail:

densityLr0 <- dtTestBinded[
  output == 0,
  density(get(colPred), adjust = 0.5)
  ]

Compute the distribution density for the clients that subscribed:

densityLr1 <- dtTestBinded[
  output == 1,
  density(get(colPred), adjust = 0.5)
  ]

Define the colors in the chart using rgb. The colors are transparent red and transparent blue:

col0 <- rgb(1, 0, 0, 0.3)
col1 <- rgb(0, 0, 1, 0.3)

Build the plot with the density of the clients not subscribing. Here, polygon is a function that adds the area to the chart:

plot(densityLr0, xlim = c(0, 1), main = 'density')
polygon(densityLr0, col = col0, border = 'black')

Add the clients that subscribed to the chart:

polygon(densityLr1, col = col1, border = 'black')

Add the legend:

legend(
  'top',
  c('0', '1'),
  pch = 16,
  col = c(col0, col1))

End the function:

return()
}

Now, we can use plotDistributions on the random forest output:

par(mfrow = c(1, 1))
plotDistributions(dtTestBinded, 'outputRf')

The histogram obtained is as follows:

The x-axis represents the score and the y-axis represents the density that is proportional to the number of clients that subscribed for similar scores. Since we don't have a client for each possible score, assuming a level of detail of 0.01, the density curve is smoothed in the sense that the density of each score is averaged over the data with similar scores.

The red and blue areas represent the non-subscribing and subscribing clients respectively. As can be easily noticed, the violet area comes from the overlapping of the two curves.
For each score, we can identify which density is higher. If the highest curve is blue, a client with that score will be more likely to subscribe, and vice versa. For the random forest, most of the non-subscribing client scores are between 0 and 0.2 and the density peak is around 0.05. The subscribing clients have a more spread score, although higher, and their peak is around 0.1. The two distributions overlap a lot, so it's not easy to identify which clients will subscribe starting from their scores. However, if the marketing campaign targets all customers with a score higher than 0.3, they will likely belong to the blue cluster. In conclusion, using random forest, we are able to identify a small set of customers that will very likely subscribe.

Summary

In this article, you learned how to predict your output using proper machine learning techniques.

Resources for Article:

Further resources on this subject:
Using R for Statistics, Research, and Graphics [article]
Machine Learning in Bioinformatics [article]
Learning Data Analytics with R and Hadoop [article]
No to nodistinct

Packt
25 Nov 2014
4 min read
This article is written by Stephen Redmond, the author of Mastering QlikView.

There is a great skill in creating the right expression to calculate the right answer. Being able to do this in all circumstances relies on having a good knowledge of creating advanced expressions. Of course, the best path to mastery in this subject is actually getting out and doing it, but there is a great argument here for regularly practicing with dummy or test datasets. (For more resources related to this topic, see here.)

When presented with a problem that needs to be solved, all the QlikView masters will not necessarily know immediately how to answer it. What they will have though is a very good idea of where to start, that is, what to try and what not to try. This is what I hope to impart to you here. Knowing how to create many advanced expressions will arm you to know where to apply them—and where not to apply them. This is one area of QlikView that is alien to many people. For some reason, they fear the whole idea of concepts such as Aggr. However, the reality is that these concepts are actually very simple and supremely logical. Once you get your head around them, you will wonder what all the fuss was about.

No to nodistinct

The Aggr function has an optional clause: the possibility of stating that the aggregation will be either distinct or nodistinct. The default option is distinct, and as such, is rarely ever stated. In this default operation, the aggregation will only produce distinct results for every combination of dimensions—just as you would expect from a normal chart or straight table.

The nodistinct option only makes sense within a chart, one that has more dimensions than are in the Aggr statement. In this case, the granularity of the chart is lower than the granularity of Aggr, and therefore, QlikView will only calculate that Aggr for the first occurrence of lower granularity dimensions and will return null for the other rows. If we specify nodistinct, the same result will be calculated across all of the lower granularity dimensions.

This can be difficult to understand without seeing an example, so let's look at a common use case for this option. We will start with a dataset:

ProductSales:
Load * Inline [
Product, Territory, Year, Sales
Product A, Territory A, 2013, 100
Product B, Territory A, 2013, 110
Product A, Territory B, 2013, 120
Product B, Territory B, 2013, 130
Product A, Territory A, 2014, 140
Product B, Territory A, 2014, 150
Product A, Territory B, 2014, 160
Product B, Territory B, 2014, 170
];

We will build a report from this data using a pivot table:

Now, we want to bring the value in the Total column into a new column under each year, perhaps to calculate a percentage for each year.
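Since the pivot table screenshot isn't reproduced here, this is roughly what it shows: Sales summed by Product and Territory, with Year across the top and a Total column (the values are computed from the inline table above):

Product    | Territory   | 2013 | 2014 | Total
Product A  | Territory A | 100  | 140  | 240
Product A  | Territory B | 120  | 160  | 280
Product B  | Territory A | 110  | 150  | 260
Product B  | Territory B | 130  | 170  | 300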
We might think that, because the total is the sum for each Product and Territory, we could use an Aggr in the following manner:

Sum(Aggr(Sum(Sales), Product, Territory))

However, as stated previously, because the chart includes an additional dimension (Year) that is not in the Aggr, the expression will only be calculated for the first occurrence of each of the lower granularity dimensions (in this case, for Year = 2013):

The commonly suggested fix for this is to use Aggr without Sum and with nodistinct as shown:

Aggr(NoDistinct Sum(Sales), Product, Territory)

This will allow the Aggr expression to be calculated across all the Year dimension values, and at first, it will appear to solve the problem:

The problem occurs when we decide to have a total row on this chart:

As there is no aggregation function surrounding Aggr, it does not total correctly at the Product or Territory dimensions. We can't add an aggregation function, such as Sum, because it will break one of the other totals. However, there is something different that we can do; something that doesn't involve Aggr at all! We can use our old friend Total:

Sum(Total<Product, Territory> Sales)

This will calculate correctly at all the levels:

There might be other use cases for using a nodistinct clause in Aggr, but they should be reviewed to see whether a simpler Total will work instead.

Summary

We discussed an important function, the Aggr function. We now know that the Aggr function is extremely useful, but we don't need to apply it in all circumstances where we have vertical calculations.

Resources for Article:

Further resources on this subject:
Common QlikView script errors [article]
Introducing QlikView elements [article]
Creating sheet objects and starting new list using Qlikview 11 [article]

Creating an Apache JMeter™ test workbench

Packt
25 Nov 2014
7 min read
This article is written by Colin Henderson, the author of Mastering GeoServer. This article will give you a brief introduction about how to create an Apache JMeter™ test workbench. (For more resources related to this topic, see here.)

Before we can get into the nitty-gritty of creating a test workbench for Apache JMeter™, we must download and install it. Apache JMeter™ is a 100 percent Java application, which means that it will run on any platform provided there is a Java 6 or higher runtime environment present. The binaries can be downloaded from http://jmeter.apache.org/download_jmeter.cgi, and at the time of writing, the latest version is 2.11. No installation is required; just download the ZIP file and decompress it to a location you can access from a command-line prompt or shell environment.

To launch JMeter on Linux, simply open a shell and enter the following commands:

$ cd <path_to_jmeter>/bin
$ ./jmeter

To launch JMeter on Windows, simply open a command prompt and enter the following commands:

C:> cd <path_to_jmeter>\bin
C:> jmeter

After a short time, the JMeter GUI should appear, where we can construct our test plan.

For ease and convenience, consider adding the location of the JMeter bin directory to your system's PATH environment variable. In future, you will be able to launch JMeter from the command line without having to cd first.

The JMeter workbench will open with an empty configuration ready for us to construct our test strategy:

The first thing we need to do is give our test plan a name; for now, let's call it GeoServer Stress Test. We can also provide some comments, which is good practice as it will help us remember in future for what reason we devised the test plan.

To demonstrate the use of JMeter, we will create a very simple test plan. In this test plan, we will simulate a certain number of users hitting our GeoServer concurrently and requesting maps. To set this up, we first need to add a Thread Group to our test plan. In a JMeter test, a thread is equivalent to a user:

In the left-hand side menu, we need to right-click on the GeoServer Stress Test node and choose the Add | Threads (Users) | Thread Group menu option. This will add a child node to the test plan that we right-clicked on.

The right-hand side panel provides options that we can set for the thread group to control how the user requests are executed. For example, we can name it something meaningful, such as Web Map Requests. In this test, we will simulate 30 users, making map requests over a total duration of 10 minutes, with a 10-second delay between each user starting.

The number of users is set by entering a value for Number of Threads; in this case, 30. The Ramp-Up Period option controls the delay in starting each user by specifying the duration in which all the threads must start. So, in our case, we enter a duration of 300 seconds, which means all 30 users will be started by the end of 300 seconds. This equates to a 10-second delay between starting threads (300 / 30 = 10). Finally, we will set a duration for the test to run over by ticking the box for Scheduler, and then specifying a value of 600 seconds for Duration. By specifying a duration value, we override the End Time setting.

Next, we need to provide some basic configuration elements for our test. First, we need to set the default parameters for all web requests. Right-click on the Web Map Requests thread group node that we just created, and then navigate to Add | Config Element | User Defined Variables. This will add a new node in which we can specify the default HTTP request parameters for our test:
This will add a new node in which we can specify the default HTTP request parameters for our test: In the right-hand side panel, we can specify any number of variables. We can use these as replacement tokens later when we configure the web requests that will be sent during our test run. In this panel, we specify all the standard WMS query parameters that we don't anticipate changing across requests. Taking this approach is a good practice as it means that we can create a mix of tests using the same values, so if we change one, we don't have to change all the different test elements. To execute requests, we need to add Logic Controller. JMeter contains a lot of different logic controllers, but in this instance, we will use Simple Controller to execute a request. To add the controller, right-click on the Web Map Requests node and navigate to Add | Logic Controller | Simple Controller. A simple controller does not require any configuration; it is merely a container for activities we want to execute. In our case, we want the controller to read some data from our CSV file, and then execute an HTTP request to WMS. To do this, we need to add a CSV dataset configuration. Right-click on the Simple Controller node and navigate to Add | Config Element | CSV Data Set Config. The settings for the CSV data are pretty straightforward. The filename is set to the file that we generated previously, containing the random WMS request properties. The path can be specified as relative or absolute. The Variable Names property is where we specify the structure of the CSV file. The Recycle on EOF option is important as it means that the CSV file will be re-read when the end of the file is reached. Finally, we need to set Sharing mode to All threads to ensure the data can be used across threads. Next, we need to add a delay to our requests to simulate user activity; in this case, we will introduce a small delay of 5 seconds to simulate a user performing a map-pan operation. Right-click on the Simple Controller node, and then navigate to Add | Timer | Constant Timer: Simply specify the value we want the thread to be paused for in milliseconds. Finally, we need to add a JMeter sampler, which is the unit that will actually perform the HTTP request. Right-click on the Simple Controller node and navigate to Add | Sampler | HTTP Request. This will add an HTTP Request sampler to the test plan: There is a lot of information that goes into this panel; however, all it does is construct an HTTP request that the thread will execute. We specify the server name or IP address along with the HTTP method to use. The important part of this panel is the Parameters tab, which is where we need to specify all the WMS request parameters. Notice that we used the tokens that we specified in the CSV Data Set Config and WMS Request Defaults configuration components. We use the ${token_name} token, and JMeter replaces the token with the appropriate value of the referenced variable. We configured our test plan, but before we execute it, we need to add some listeners to the plan. A JMeter listener is the component that will gather the information from all of the test runs that occur. We add listeners by right-clicking on the thread group node and then navigating to the Add | Listeners menu option. A list of available listeners is displayed, and we can select the one we want to add. For our purposes, we will add the Graph Results, Generate Summary Results, Summary Report, and Response Time Graph listeners. 
Each listener can have its output saved to a datafile for later review. When completed, our test plan structure should look like the following:

Before executing the plan, we should save it for use later.

Summary

In this article, we looked at how Apache JMeter™ can be used to construct and execute test plans to place loads on our servers so that we can analyze the results and gain an understanding of how well our servers perform.

Resources for Article:

Further resources on this subject:
Geo-Spatial Data in Python: Working with Geometry [article]
Working with Geo-Spatial Data in Python [article]
Getting Started with GeoServer [article]

Audio Processing and Generation in Max/MSP

Packt
25 Nov 2014
19 min read
In this article by Patrik Lechner, the author of Multimedia Programming Using Max/MSP and TouchDesigner, we focus on the audio-specific examples. We will take a look at the following audio processing and generation techniques:

Additive synthesis
Subtractive synthesis
Sampling
Wave shaping

Nearly every example provided here might be understood very intuitively or taken apart in hours of math and calculation. It's up to you how deep you want to go, but in order to develop some intuition, we'll have to be using some amount of Digital Signal Processing (DSP) theory. We will briefly cover the DSP theory, but it is highly recommended that you study its fundamentals deeper to clearly understand this scientific topic in case you are not familiar with it already. (For more resources related to this topic, see here.)

Basic audio principles

We already saw and stated that it's important to know, see, and hear what's happening along the signal path. If we work in the realm of audio, there are four most important ways to measure a signal, which are conceptually partly very different and offer a very broad perspective on audio signals if we always have all of them in the back of our heads. These are the following important ways:

Numbers (actual sample values)
Levels (such as RMS, LUFS, and dB FS)
Transversal waves (waveform displays, so oscilloscopes)
Spectra (an analysis of frequency components)

There are many more ways to think about audio or signals in general, but these are the most common and important ones. Let's use them inside Max right away to observe their different behavior. We'll feed some very basic signals into them: DC offset, a sinusoid, and noise. The one that might surprise you the most and get you thinking is the constant signal or DC offset (if it's digital-analog converted). In the following screenshot, you can see how the different displays react:

In general, one might think, we don't want any constant signals at all; we don't want any DC offset. However, we will use audio signals a lot to control things later, say, an LFO or sequencers that should run with great timing accuracy. Also, sometimes, we just add a DC offset to our audio streams by accident. You can see in the preceding screenshot that a very slowly moving or constant signal can be observed best by looking at its value directly, for example, using the [number~] object. In a level display, the [meter~] or [levelmeter~] objects will seem to imply that the incoming signal is very loud; in fact, it should be at -6 dB Full Scale (FS). Although it is that loud, we just can't hear anything since the frequency is infinitely low. This is reflected by the spectrum display too; we see a very low frequency at -6 dB. In theory, we should just see an infinitely thin spike at 0 Hz, so everything else can be considered an (inevitable but reducible) measuring error.

Audio synthesis

Awareness of these possibilities of viewing a signal and their constraints, and knowing how they actually work, will greatly increase our productivity. So let's get to actually synthesizing some waveforms. A good example of different views of a signal operation is Amplitude Modulation (AM); we will also try to formulate some other general principles using the example of AM.

Amplitude modulation

Amplitude modulation means the multiplication of a signal with an oscillator. This provides a method of generating sidebands, that is, partials, in a very easy, intuitive, and CPU-efficient way.
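In symbols, this multiplication of two sine oscillators is simply the following (a generic textbook formulation, with fc and fm standing for the two oscillator frequencies, not anything specific to the patches here):

y(t) = sin(2π·fc·t) · sin(2π·fm·t)

The next paragraphs unpack what this product actually does to the signal and its spectrum.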
Amplitude modulation seems like a word that has a very broad meaning and can be used as soon as we change a signal's amplitude by another signal. While this might be true, in the context of audio synthesis, it very specifically means the multiplication of two (most often sine) oscillators. Moreover, there is a distinction between AM and Ring Modulation. But before we get to this distinction, let's look at the following simple multiplication of two sine waves, and we are first going to look at the result in an oscilloscope as a wave:

So in the preceding screenshot, we can see the two sine waves and their product. If we imagine every pair of samples being multiplied, the operation seems pretty intuitive as the result is what we would expect. But what does this resulting wave really mean besides looking like a product of two sine waves? What does it sound like? The wave seems to have stayed in there certainly, right? Well, viewing the product as a wave and looking at the whole process in the time domain rather than the frequency domain is helpful but slightly misleading. So let's jump over to the frequency domain and look at what's happening with the spectrum:

So we can observe here that if we multiply a sine wave a with a sine wave b, a having a frequency of 1000 Hz and b a frequency of 100 Hz, we end up with two sine waves, one at 900 Hz and another at 1100 Hz. The original sine waves have disappeared. In general, we can say that the result of multiplying a and b contains the sum and the difference of their frequencies. This is shown in the Equivalence to Sum and difference subpatcher (in the following screenshot, the two inlets to the spectrum display overlap completely, which might be hard to see):

So in the preceding screenshot, you see a basic AM patcher that produces sidebands that we can predict quite easily. Multiplication is commutative; you will say, 1000 + 100 = 1100, 1000 - 100 = 900; that's alright, but what about 100 - 1000 and 100 + 1000? We get -900 and 1100 once again? It still works out, and the fact that it does has to do with negative frequencies, or the symmetry of a real frequency spectrum around 0. So you can see that the two ways of looking at our signal and thinking about AM offer different opportunities and pitfalls.

Here is another way to think about AM: it's the convolution of the two spectra. We didn't talk about convolution yet; we will at a later point. But keep it in mind or do a little research on your own; this aspect of AM is yet another interesting one.

Ring modulation versus amplitude modulation

The difference between ring modulation and what we call AM in this context is that the former one uses a bipolar modulator and the latter one uses a unipolar one. So actually, this is just about scaling and offsetting one of the factors. The difference in the outcome is yet a big one; if we keep one oscillator unipolar, the other one will be present in the outcome. If we do so, it starts making sense to call one oscillator the carrier and the other (unipolar) one the modulator. Also, it therefore introduces a modulation depth that controls the amplitude of the sidebands.

In the following screenshot, you can see the resulting spectrum; we have the original signal, so the carrier, plus two sidebands, which are the original signals shifted up and down:

Therefore, you can see that AM has a possibility to roughen up our spectrum, which means we can use it to let through an original spectrum and add sidebands.
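To see in symbols where the carrier plus two sidebands come from, we can expand the unipolar case with the product-to-sum identity that also turns up in the next section (m stands for the modulation depth; this is a standard derivation rather than something taken from the patches):

sin(2π·fc·t) · (1 + m·sin(2π·fm·t))
= sin(2π·fc·t) + (m/2)·cos(2π·(fc - fm)·t) - (m/2)·cos(2π·(fc + fm)·t)

The carrier survives unchanged, and the two sidebands at fc - fm and fc + fm scale with the modulation depth.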
Tremolo

Tremolo (from the Latin word tremare, to shake or tremble) is a musical term, which means to change a sound's amplitude in regular short intervals. Many people confuse it with vibrato, which is modulating pitch at regular intervals. AM is tremolo and FM is vibrato, and as a simple reminder, think that the V of vibrato is closer to the F of FM than to the A of AM.

So multiplying the two oscillators results in a different spectrum. But of course, we can also use multiplication to scale a signal and to change its amplitude. If we wanted to have a sine wave that has a tremolo, that is, an oscillating variation in amplitude, with, say, a frequency of 1 Hertz, we would again multiply two sine waves, one with 1000 Hz for example and another with a frequency of 0.5 Hz. Why 0.5 Hz? Think about a sine wave; it has two peaks per cycle, a positive one and a negative one.

We can visualize all that very well if we think about it in the time domain, looking at the result in an oscilloscope. But what about our view of the frequency domain? Well, let's go through it; when we multiply a sine with 1000 Hz and one with 0.5 Hz, we actually get two sine waves, one with 999.5 Hz and one with 1000.5 Hz. Frequencies that close create beatings, since once in a while, their positive and negative peaks overlap, canceling each other out. In general, the frequency of the beating is defined by the difference in frequency, which is 1 Hz in this case. So we see, if we look at it this way, we come to the same result again of course, but this time, we actually think of two frequencies instead of one being attenuated.

Lastly, we could have looked up trigonometric identities to anticipate what happens if we multiply two sine waves. We find the following product-to-sum identity:

sin(φ) · sin(θ) = 1/2·cos(φ - θ) - 1/2·cos(φ + θ)

Here, φ and θ are the two angular frequencies multiplied by the time in seconds, for example:

φ = 2π·1000·t

This is the equation for the 1000 Hz sine wave.

Feedback

Feedback always brings the complexity of a system to the next level. It can be used to stabilize a system, but can also make a given system unstable easily. In a strict sense, in the context of DSP, stability means that for a finite input to a system, we get finite output. Obviously, feedback can give us infinite output for a finite input. We can use attenuated feedback, for example, not only to make our AM patches recursive, adding more and more sidebands, but also to achieve some surprising results as we will see in a minute. Before we look at this application, let's quickly talk about feedback in general.

In the digital domain, feedback always demands some amount of delay. This is because the evaluation of the chain of operations would otherwise resemble an infinite amount of operations on one sample. This is true for both the Max message domain (we get a stack overflow error if we use feedback without delaying or breaking the chain of events) and the MSP domain; audio will just stop working if we try it. So the minimum network for a feedback chain as a block diagram looks something like this:

In the preceding screenshot, X is the input signal and x[n] is the current input sample; Y is the output signal and y[n] is the current output sample. The block marked Z-m is a delay of m samples (m being a constant). Denoting a delay with Z-m comes from a mathematical construct named the Z-transform. The a term is also a constant, used to attenuate the feedback loop. If no feedback is involved, it's sometimes helpful to think about block diagrams as processing whole signals.
For example, if you think of a block diagram that consists only of multiplication with a constant, it would make a lot of sense to think of its output signal as a scaled version of the input signal. We wouldn't think of the network's processing or its output sample by sample. However, as soon as feedback is involved, thinking sample by sample is, without calculation or testing, the way we should think about the network. Before we look at the Max version of things, let's look at the difference equation of the network to get a better feeling of the notation. Try to find it yourself before looking at it too closely! Here it is:

y[n] = x[n] + a·y[n - m]

In Max, or rather in MSP, we can introduce feedback as soon as we use a [tapin~] [tapout~] pair that introduces a delay. The minimum delay possible is the signal vector size. Another way is to simply use a [send~] and [receive~] pair in our loop. The [send~] and [receive~] pair will automatically introduce this minimum amount of delay if needed, so the delay will be introduced only if there is a feedback loop. If we need shorter delays and feedback, we have to go into the wonderful world of gen~. Here, our shortest delay time is one sample, and can be introduced via the [history] object.

In the Fbdiagram.maxpat patcher, you can find a Max version, an MSP version, and a [gen~] version of our diagram. For the time being, let's just pretend that the gen domain is just another subpatcher/abstraction system that allows shorter delays with feedback and has a more limited set of objects that more or less work the same as the MSP ones. In the following screenshot, you can see the difference between the output of the MSP and the [gen~] domain. Obviously, the length of the delay time has quite an impact on the output. Also, don't forget that the MSP version's output will vary greatly depending on our vector size settings.

Let's return to AM now. Feedback can, for example, be used to duplicate and shift our spectrum again and again. In the following screenshot, you can see a 1000 Hz sine wave that has been processed by a recursive AM to be duplicated and shifted up and down with a 100 Hz spacing:

A maybe surprising result that we can achieve with this technique is this: if the modulating oscillator and the carrier have the same frequency, we end up with something that almost sounds like a sawtooth wave.

Frequency modulation

Frequency modulation or FM is a technique that allows us to create a lot of frequency components out of just two oscillators, which is why it was used a lot back in the days when oscillators were a rare, expensive good, or CPU performance was low. Still, especially when dealing with real-time synthesis, efficiency is a crucial factor, and the huge variety of sounds that can be achieved with just two oscillators and very few parameters can be very useful for live performance and so on. The idea of FM is of course to modulate an oscillator's frequency. The basic, admittedly useless form is depicted in the following screenshot:

While trying to visualize what happens with the output in the time domain, we can imagine it as shown in the following screenshot. In the preceding screenshot, you can see the signal that is controlling the frequency. It is a sine wave with a frequency of 50 Hz, scaled and offset to range from -1000 to 5000, so the center or carrier frequency is 2000 Hz, which is modulated to an amount of 3000 Hz.
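Written out with the values just described, the frequency control signal is:

f(t) = 2000 + 3000·sin(2π·50·t)

Note that it swings between -1000 Hz and 5000 Hz, so the oscillator is periodically driven into negative frequencies, something we'll come back to shortly.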
You can see the output of the modulated oscillator in the following screenshot:

If we extend the upper patch slightly, we end up with this:

Although you can't see it in the screenshot, the sidebands are appearing with a 100 Hz spacing here, that is, with a spacing equal to the modulator's frequency. Pretty similar to AM, right? But depending on the modulation amount, we get more and more sidebands.

Controlling FM

If the ratio between F(c) and F(m) is an integer, we end up with a harmonic spectrum; therefore, it may be more useful to control F(m) indirectly via a ratio parameter, as it's done inside the SimpleRatioAndIndex subpatcher. Also, an Index parameter is typically introduced to make an FM patch even more controllable. The modulation index is defined as follows:

I = Am / fm

Here, I is the index, Am is the amplitude of the modulation, what we called amount before, and fm is the modulator's frequency. So finally, after adding these two controls, we might arrive here:

FM offers a wide range of possibilities; for example, the fact that we have a simple control for how harmonic/inharmonic our spectrum is can be useful to synthesize the mostly noisy attack phase of many instruments if we drive the ratio and index with an envelope, as it's done in the SimpleEnvelopeDriven subpatcher. However, it's also very easy to synthesize very artificial, strange sounds. This basically has the following two reasons:

Firstly, the partials appearing have amplitudes governed by Bessel functions that may seem quite unpredictable; the partials sometimes seem to have random amplitudes.

Secondly, negative frequencies and fold back. If we generate partials with frequencies below 0 Hz, it is equivalent to creating the same positive frequency. For frequencies greater than the sample rate/2 (sample rate/2 is what's called the Nyquist rate), the frequencies reflect back into the spectrum that can be described by our sampling rate (this is an effect also called aliasing). So at a sampling rate of 44,100 Hz, a partial with a frequency of -100 Hz will appear at 100 Hz, and a partial with a frequency of 43,100 Hz will appear at 1000 Hz, as shown in the following screenshot:

So, for frequencies between the Nyquist frequency and the sampling frequency, what we hear is described by this:

f0 = fs - fi

Here, fs is the sampling rate, f0 is the frequency we hear, and fi is the frequency we are trying to synthesize. Since FM leads to many partials, this effect can easily come up, and can both be used in an artistically interesting manner or sometimes appear as an unwanted error. In theory, an FM signal's partials extend to even infinity, but the amplitudes become negligibly small. If we want to reduce this behavior, the [poly~] object can be used to oversample the process, generating a bit more headroom for high frequencies. The phenomenon of aliasing can be understood by thinking of a real (in contrast to imaginary) digital signal as having a symmetrical and periodical spectrum; let's not go into too much detail here and look at it in the time domain:

In the previous screenshot, we again tried to synthesize a sine wave with 43100 Hz (the dotted line) at a sampling rate of 44100 Hz. What we actually get is the straight black line, a sine with 1000 Hz. Each big black dot represents an actual sample, and there is only one single band-limited signal connecting them: the 1000 Hz wave that is only partly visible here (about half its wavelength).

Feedback

It is very common to use feedback with FM.
We can even frequency modulate one oscillator with itself, making the algorithm even cheaper since we have only one table lookup. The idea of feedback FM quickly leads us to the idea of making networks of oscillators that can be modulated by each other, including feedback paths, but let's keep it simple for now. One might think that modulating one oscillator with itself should produce chaos; FM being a technique that is not the easiest to control, one shouldn't care for playing around with single operator feedback FM. But the opposite is the case. Single operator FM yields very predictable partials, as shown in the following screenshot, and in the Single OP FBFM subpatcher:

Again, we are using a gen~ patch, since we want to create a feedback loop and are heading for a short delay in the loop. Note that we are using the [param] object to pass a message into the gen~ object. What should catch your attention is that although the carrier frequency has been adjusted to 1000 Hz, the fundamental frequency in the spectrum is around 600 Hz. What can help us here is switching to phase modulation.

Phase modulation

If you look at the gen~ patch in the previous screenshot, you see that we are driving our sine oscillator with a phasor. The cycle object's phase inlet assumes an input that ranges from 0 to 1 instead of from 0 to 2π, as one might think. To drive a sine wave through one full cycle in math, we can use a variable ranging from 0 to 2π, so in the following formula, you can imagine t being provided by a phasor, which is the running phase. The 2π multiplication isn't necessary in Max since, if we are using [cycle~], we are actually reading out a wavetable instead of really computing the sine or cosine of the input:

y(t) = sin(2π·f0·t + φ)

This is the most common form of denoting a running sinusoid with frequency f0 and phase φ. Try to come up with a formula that describes frequency modulation! Simplifying the phase by setting it to zero, we can denote FM as follows:

y(t) = sin(2π·(f0 + A·sin(2π·fm·t))·t)

This can be shown to be nearly identical to the following formula:

y(t) = sin(2π·f0·t + A·sin(2π·fm·t))

Here, f0 is the frequency of the carrier, fm is the frequency of the modulator, and A is the modulation amount. Welcome to phase modulation. If you compare it, the previous formula actually just inserts a scaled sine wave where the phase φ used to be. So phase modulation is nearly identical to frequency modulation. Phase modulation has some advantages though, such as providing us with an easy method of synchronizing multiple oscillators. But let's go back to the Max side of things and look at a feedback phase modulation patch right away (ignoring simple phase modulation, since it really is so similar to FM):

This gen~ patcher resides inside the One OP FBPM subpatcher and implements phase modulation using one oscillator and feedback. Interestingly, the spectrum is very similar to the one of a sawtooth wave, with the feedback amount having a similar effect to a low-pass filter, controlling the number of partials. If you take a look at the subpatcher, you'll find the following three sound sources:

Our feedback FM gen~ patcher
A [saw~] object for comparison
A poly~ object

We have already mentioned the problem of aliasing, and the [poly~] object has already been proposed to treat the problem. However, it allows us to define the quality of parts of patches in general, so let's talk about the object a bit before moving on since we will make great use of it.
Before moving on, I would like to tell you that you can double-click on it to see what is loaded inside, and you will see that the subpatcher we just discussed contains a [poly~] object that contains yet another version of our gen~ patcher.

Summary

In this article, we've finally come to talking about audio. We've introduced some very common techniques and thought about refining them and getting things done properly and efficiently (think about poly~). By now, you should feel quite comfortable building synths that mix techniques such as FM, subtractive synthesis, and feedback, as well as using matrices for routing both audio and modulation signals where you need them.

Further resources on this subject:
Moodle for Online Communities [Article]
Techniques for Creating a Multimedia Database [Article]
Moodle 2.0 Multimedia: Working with 2D and 3D Maps [Article]

Components

Packt
25 Nov 2014
14 min read
This article by Timothy Moran, author of Mastering KnockoutJS, teaches you how to use the new Knockout components feature. (For more resources related to this topic, see here.)

In Version 3.2, Knockout added components using the combination of a template (view) with a viewmodel to create reusable, behavior-driven DOM objects. Knockout components are inspired by web components, a new (and experimental, at the time of writing this) set of standards that allow developers to define custom HTML elements paired with JavaScript that create packed controls. Like web components, Knockout allows the developer to use custom HTML tags to represent these components in the DOM. Knockout also allows components to be instantiated with a binding handler on standard HTML elements. Knockout binds components by injecting an HTML template, which is bound to its own viewmodel.

This is probably the single largest feature Knockout has ever added to the core library. The reason we started with RequireJS is that components can optionally be loaded and defined with module loaders, including their HTML templates! This means that our entire application (even the HTML) can be defined in independent modules, instead of as a single hierarchy, and loaded asynchronously.

The basic component registration

Unlike extenders and binding handlers, which are created by just adding an object to Knockout, components are created by calling the ko.components.register function:

ko.components.register('contact-list', {
   viewModel: function(params) { },
   template: //template string or object
});

This will create a new component named contact-list, which uses the object returned by the viewModel function as a binding context, and the template as its view. It is recommended that you use lowercase, dash-separated names for components so that they can easily be used as custom elements in your HTML.

To use this newly created component, you can use a custom element or the component binding. All the following three tags produce equivalent results:

<contact-list params="data: contacts"></contact-list>

<div data-bind="component: { name: 'contact-list', params: { data: contacts } }"></div>

<!-- ko component: { name: 'contact-list', params: { data: contacts } } --><!-- /ko -->

Obviously, the custom element syntax is much cleaner and easier to read. It is important to note that custom elements cannot be self-closing tags. This is a restriction of the HTML parser and cannot be controlled by Knockout.

There is one advantage of using the component binding: the name of the component can be an observable. If the name of the component changes, the previous component will be disposed (just like it would if a control flow binding removed it) and the new component will be initialized.

The params attribute of custom elements works in a manner that is similar to the data-bind attribute. Comma-separated key/value pairs are parsed to create a property bag, which is given to the component. The values can contain JavaScript literals, observable properties, or expressions. It is also possible to register a component without a viewmodel, in which case, the object created by params is directly used as the binding context.

To see this, we'll convert the list of contacts into a component:

<contact-list params="contacts: displayContacts, edit: editContact, delete: deleteContact">
</contact-list>

The HTML code for the list is replaced with a custom element with parameters for the list as well as callbacks for the two buttons, which are edit and delete:

ko.components.register('contact-list', {
   template: '<ul class="list-unstyled" data-bind="foreach: contacts">'
   +'<li>'
     +'<h3>'
       +'<span data-bind="text: displayName"></span> <small data-bind="text: phoneNumber"></small> '
       +'<button class="btn btn-sm btn-default" data-bind="click: $parent.edit">Edit</button> '
       +'<button class="btn btn-sm btn-danger" data-bind="click: $parent.delete">Delete</button>'
     +'</h3>'
   +'</li>'
   +'</ul>'
});

This component registration uses an inline template. Everything still looks and works the same, but the resulting HTML now includes our custom element.

Custom elements in IE 8 and higher

IE 9 and later versions as well as all other major browsers have no issue with seeing custom elements in the DOM before they have been registered. However, older versions of IE will remove the element if it hasn't been registered. The registration can be done either with Knockout, with ko.components.register('component-name'), or with the standard document.createElement('component-name') expression statement. One of these must come before the custom element, either by the script containing them being first in the DOM, or by the custom element being added at runtime. When using RequireJS, being in the DOM first won't help as the loading is asynchronous.

If you need to support older IE versions, it is recommended that you include a separate script to register the custom element names at the top of the body tag or in the head tag:

<!DOCTYPE html>
<html>
<body>
   <script>
     document.createElement('my-custom-element');
   </script>
   <script src='require.js' data-main='app/startup'></script>
   <my-custom-element></my-custom-element>
</body>
</html>

Once this has been done, components will work in IE 6 and higher even with custom elements.

Template registration

The template property of the configuration sent to register can take any of the following formats:

ko.components.register('component-name', {
   template: [OPTION]
});

The element ID

Consider the following code statement:

template: { element: 'component-template' }

If you specify the ID of an element in the DOM, the contents of that element will be used as the template for the component. Although it isn't supported in IE yet, the template element is a good candidate, as browsers do not visually render the contents of template elements.

The element instance

Consider the following code statement:

template: { element: instance }

You can pass a real DOM element to the template to be used. This might be useful in a scenario where the template was constructed programmatically.
To see components in action, we'll convert the list of contacts into a component:

<contact-list params="contacts: displayContacts,
    edit: editContact,
    delete: deleteContact">
</contact-list>

The HTML code for the list is replaced with a custom element with parameters for the list, as well as callbacks for the two buttons, edit and delete:

ko.components.register('contact-list', {
    template: '<ul class="list-unstyled" data-bind="foreach: contacts">'
        +'<li>'
            +'<h3>'
                +'<span data-bind="text: displayName"></span> '
                +'<small data-bind="text: phoneNumber"></small> '
                +'<button class="btn btn-sm btn-default" data-bind="click: $parent.edit">Edit</button> '
                +'<button class="btn btn-sm btn-danger" data-bind="click: $parent.delete">Delete</button>'
            +'</h3>'
        +'</li>'
    +'</ul>'
});

This component registration uses an inline template. Everything still looks and works the same, but the resulting HTML now includes our custom element.

Custom elements in IE 8 and higher

IE 9 and later versions, as well as all other major browsers, have no issue with seeing custom elements in the DOM before they have been registered. However, older versions of IE will remove the element if it hasn't been registered. The registration can be done either with Knockout, with ko.components.register('component-name'), or with the standard document.createElement('component-name') expression statement. One of these must come before the custom element, either by the script containing them being first in the DOM, or by the custom element being added at runtime. When using RequireJS, being in the DOM first won't help, as the loading is asynchronous. If you need to support older IE versions, it is recommended that you include a separate script to register the custom element names at the top of the body tag or in the head tag:

<!DOCTYPE html>
<html>
<body>
    <script>
        document.createElement('my-custom-element');
    </script>
    <script src='require.js' data-main='app/startup'></script>
    <my-custom-element></my-custom-element>
</body>
</html>

Once this has been done, components will work in IE 6 and higher, even with custom elements.

Template registration

The template property of the configuration sent to register can take any of the following formats:

ko.components.register('component-name', {
    template: [OPTION]
});

The element ID

Consider the following code statement:

template: { element: 'component-template' }

If you specify the ID of an element in the DOM, the contents of that element will be used as the template for the component. Although it isn't supported in IE yet, the template element is a good candidate, as browsers do not visually render the contents of template elements.

The element instance

Consider the following code statement:

template: { element: instance }

You can pass a real DOM element to the template to be used. This might be useful in a scenario where the template was constructed programmatically.
Like the element ID method, only the contents of the element will be used as the template:

var template = document.getElementById('contact-list-template');
ko.components.register('contact-list', {
    template: { element: template }
});

An array of DOM nodes

Consider the following code statement:

template: [nodes]

If you pass an array of DOM nodes to the template configuration, then the entire array will be used as a template and not just the descendants:

var template = document.getElementById('contact-list-template'),
    nodes = Array.prototype.slice.call(template.content.childNodes);
ko.components.register('contact-list', {
    template: nodes
});

Document fragments

Consider the following code statement:

template: documentFragmentInstance

If you pass a document fragment, the entire fragment will be used as a template instead of just the descendants:

var template = document.getElementById('contact-list-template');
ko.components.register('contact-list', {
    template: template.content
});

This example works because template elements wrap their contents in a document fragment in order to stop the normal rendering. Using the content is the same method that Knockout uses internally when a template element is supplied.

HTML strings

We already saw an example of an HTML string in the previous section. While using the value inline is probably uncommon, supplying a string would be an easy thing to do if your build system provided it for you.

Registering templates using the AMD module

Consider the following code statement:

template: { require: 'module/path' }

If a require property is passed to the configuration object of a template, the default module loader will load the module and use it as the template. The module can return any of the preceding formats. This is especially useful for the RequireJS text plugin:

ko.components.register('contact-list', {
    template: { require: 'text!contact-list.html' }
});

Using this method, we can extract the HTML template into its own file, drastically improving its organization. By itself, this is a huge benefit to development.

The viewmodel registration

Like template registration, viewmodels can be registered using several different formats. To demonstrate this, we'll use a simple viewmodel for our contact list component:

function ListViewmodel(params) {
    this.contacts = params.contacts;
    this.edit = params.edit;
    this.delete = function(contact) {
        console.log('Mock Deleting Contact', ko.toJS(contact));
    };
}

To verify that things are getting wired up properly, you'll want something interactive; hence the fake delete function.

The constructor function

Consider the following code statement:

viewModel: Constructor

If you supply a function to the viewModel property, it will be treated as a constructor. When the component is instantiated, new will be called on the function, with the params object as its first parameter:

ko.components.register('contact-list', {
    template: { require: 'text!contact-list.html' },
    viewModel: ListViewmodel // defined above
});

A singleton object

Consider the following code statement:

viewModel: { instance: singleton }

If you want all your component instances to be backed by a shared object (though this is not recommended), you can pass it as the instance property of a configuration object. Because the object is shared, parameters cannot be passed to the viewmodel using this method.
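To make the shared-state pitfall concrete, here's a small sketch (the component and property names are invented for illustration): every instance of the component below reads and writes the same observable, so clicking any one button updates all of them.

var counterState = {
    clicks: ko.observable(0)
};
counterState.increment = function() {
    // every component instance shares this single observable
    counterState.clicks(counterState.clicks() + 1);
};

ko.components.register('click-counter', {
    template: '<button data-bind="click: increment, text: clicks"></button>',
    viewModel: { instance: counterState }
});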
The factory function

Consider the following code statement:

viewModel: { createViewModel: function(params, componentInfo) {} }

This method is useful because the component's container element is supplied on the second parameter, as componentInfo.element. It also provides you with the opportunity to perform any other setup, such as modifying or extending the constructor parameters. The createViewModel function should return an instance of a viewmodel component:

ko.components.register('contact-list', {
    template: { require: 'text!contact-list.html' },
    viewModel: { createViewModel: function(params, componentInfo) {
        console.log('Initializing component for', componentInfo.element);
        return new ListViewmodel(params);
    }}
});

Registering viewmodels using an AMD module

Consider the following code statement:

viewModel: { require: 'module-path' }

Just like templates, viewmodels can be registered with an AMD module that returns any of the preceding formats.

Registering AMD

In addition to registering the template and the viewmodel as AMD modules individually, you can register the entire component with a require call:

ko.components.register('contact-list', { require: 'contact-list' });

The AMD module will return the entire component configuration:

define(['knockout', 'text!contact-list.html'],
    function(ko, templateString) {

    function ListViewmodel(params) {
        this.contacts = params.contacts;
        this.edit = params.edit;
        this.delete = function(contact) {
            console.log('Mock Deleting Contact', ko.toJS(contact));
        };
    }

    return { template: templateString, viewModel: ListViewmodel };
});

As the Knockout documentation points out, this method has several benefits:

The registration call is just a require path, which is easy to manage.
The component is composed of two parts: a JavaScript module and an HTML module. This provides both simple organization and clean separation.
The RequireJS optimizer, r.js, can use the text dependency on the HTML module to bundle the HTML code with the bundled output. This means your entire application, including the HTML templates, can be a single file in production (or a collection of bundles if you want to take advantage of lazy loading).

Observing changes in component parameters

Component parameters will be passed via the params object to the component's viewmodel in one of the following three ways:

No observable expression evaluation needs to occur, and the value is passed literally:

<component params="name: 'Timothy Moran'"></component>
<component params="name: nonObservableProperty"></component>
<component params="name: observableProperty"></component>
<component params="name: viewModel.observableSubProperty"></component>

In all of these cases, the value is passed directly to the component on the params object. This means that changes to these values will change the property on the instantiating viewmodel, except in the first case (literal values). Observable values can be subscribed to normally.

An observable expression needs to be evaluated, so it is wrapped in a computed observable:

<component params="name: name() + '!'"></component>

In this case, params.name is not the original property. Calling params.name() will evaluate the computed wrapper. Trying to modify the value will fail, as the computed value is not writable. The value can be subscribed to normally.

An observable expression evaluates an observable instance, so it is wrapped in an observable that unwraps the result of the expression:
<component params="name: isFormal() ? firstName : lastName"></component>

In this example, firstName and lastName are both observable properties. If calling params.name() returned the observable, you would need to call params.name()() to get the actual value, which is rather ugly. Instead, Knockout automatically unwraps the expression, so that calling params.name() returns the actual value of either firstName or lastName.

If you need to access the actual observable instances to, for example, write a value to them, trying to write to params.name will fail, as it is a computed observable. Instead, you can use the params.$raw object, which provides the unwrapped values. In this case, you can update the name by calling params.$raw.name('New'). In general, this situation should be avoided by removing the logic from the binding expression and placing it in a computed observable in the viewmodel.

The component's life cycle

When a component binding is applied, Knockout takes the following steps:

1. The component loader asynchronously creates the viewmodel factory and template. This result is cached so that it is only performed once per component.
2. The template is cloned and injected into the container (either the custom element or the element with the component binding).
3. If the component has a viewmodel, it is instantiated. This is done synchronously.
4. The component is bound to either the viewmodel or the params object.
5. The component is left active until it is disposed.
6. The component is disposed: if the viewmodel has a dispose method, it is called, and then the template is removed from the DOM.

The component's disposal

If the component is removed from the DOM by Knockout, either because the name of the component binding changed or because a control flow binding removed it (for example, if or foreach), the component will be disposed. If the component's viewmodel has a dispose function, it will be called. Normal Knockout bindings in the component's view will be automatically disposed, just as they would be in a normal control flow situation. However, anything set up by the viewmodel needs to be manually cleaned up. Some examples of viewmodel cleanup include the following:

setInterval callbacks can be removed with clearInterval.
Computed observables can be removed by calling their dispose method. Pure computed observables don't need to be disposed. Computed observables that are only used by bindings or other viewmodel properties also do not need to be disposed, as garbage collection will catch them.
Observable subscriptions can be disposed by calling their dispose method.
Event handlers can be created by components that are not part of a normal Knockout binding.
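For example, a viewmodel that sets up a timer and a manual subscription might clean up after itself like this (a sketch; the names and the polling behavior are invented for illustration):

function TickerViewmodel(params) {
    var self = this;
    this.price = params.price; // an observable supplied by the parent
    this.subscription = this.price.subscribe(function(newValue) {
        console.log('price changed to', newValue);
    });
    this.intervalId = setInterval(function() {
        console.log('polling for', self.price());
    }, 1000);
}

TickerViewmodel.prototype.dispose = function() {
    clearInterval(this.intervalId); // timers are not cleaned up automatically
    this.subscription.dispose();    // manual subscriptions must be disposed
};

Knockout will call dispose automatically when the component is removed from the DOM.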
Combining components with data bindings

There is only one restriction on data-bind attributes used on custom elements or on elements with the component binding: the binding handlers cannot use controlsDescendantBindings. This isn't a new restriction; two bindings that control descendants cannot be on a single element, and since components control descendant bindings, they cannot be combined with a binding handler that also controls descendants. It is worth remembering, though, as you might be inclined to place an if or foreach binding on a component; doing this will cause an error. Instead, wrap the component with an element or a containerless binding:

<ul data-bind='foreach: allProducts'>
    <product-details params='product: $data'></product-details>
</ul>

It's also worth noting that bindings such as text and html will replace the contents of the element they are on. When used with components, this will potentially result in the component being lost, so it's not a good idea.

Summary

In this article, we learned that the Knockout components feature gives you a powerful tool that will help you create reusable, behavior-driven DOM elements.

Resources for Article:

Further resources on this subject:
Deploying a Vert.x application [Article]
The Dialog Widget [Article]
Top features of KnockoutJS [Article]

Understanding the HBase Ecosystem
Packt
24 Nov 2014
11 min read

This article by Shashwat Shriparv, author of the book Learning HBase, will introduce you to the world of HBase. (For more resources related to this topic, see here.)

HBase is a horizontally scalable, distributed, open source, sorted map database. It runs on top of the Hadoop file system, that is, the Hadoop Distributed File System (HDFS). HBase is a NoSQL, nonrelational database that doesn't always require a predefined schema. It can be seen as a flexible, scalable, multidimensional spreadsheet in which any structure of data fits: new column fields can be added on the fly, rather than a fixed column structure having to be defined before data can be inserted or queried. In other words, HBase is a column-based database that runs on top of the Hadoop distributed file system and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and a more flexible schema.

HBase is modeled on, and was inspired by, Google BigTable, which is a compressed, high-performance, proprietary data store built on the Google filesystem. HBase was developed as a Hadoop subproject to support the storage of structured data, taking advantage of distributed file systems (typically, the Hadoop Distributed File System, known as HDFS).

The following list contains key information about HBase and its features:

Developed by: Apache
Written in: Java
Type: Column oriented
License: Apache License
Lacking features of relational databases: SQL support; relations; primary, foreign, and unique key constraints; normalization
Website: http://hbase.apache.org
Distributions: Apache, Cloudera
Download link: http://mirrors.advancedhosters.com/apache/hbase/
Mailing lists: the user list (user-subscribe@hbase.apache.org) and the developer list (dev-subscribe@hbase.apache.org)
Blog: http://blogs.apache.org/hbase/

HBase layout on top of Hadoop

The following figure represents the layout of HBase on top of Hadoop. There is more than one ZooKeeper in the setup, which provides high availability of master status; a RegionServer may contain multiple regions. The RegionServers run on the machines where DataNodes run, and there can be as many RegionServers as DataNodes. RegionServers can have multiple HRegions; one HRegion can have one HLog and multiple HFiles with their associated MemStores.

HBase can be seen as a master-slave database where the master is called HMaster, which is responsible for coordination between the client application and HRegionServer. It is also responsible for monitoring and recording metadata changes and management. The slaves are called HRegionServers, which serve the actual tables in the form of regions. These regions are the basic building blocks of HBase tables and contain the distribution of tables. So, HMaster and the RegionServers work in coordination to serve the HBase tables and the HBase cluster.

Usually, HMaster is co-hosted with the Hadoop NameNode daemon process on a server and communicates with the DataNode daemons for reading and writing data on HDFS. The RegionServers run on, or are co-hosted with, the Hadoop DataNodes.
Comparing architectural differences between RDBMSs and HBase

Let's list the major differences between relational databases and HBase:

Relational databases | HBase
Uses tables as databases | Uses regions as databases
Filesystems supported are FAT, NTFS, and EXT | Filesystem supported is HDFS
The technique used to store logs is commit logs | The technique used to store logs is Write-Ahead Logs (WAL)
The reference system used is a coordinate system | The reference system used is ZooKeeper
Uses the primary key | Uses the row key
Partitioning is supported | Sharding is supported
Use of rows, columns, and cells | Use of rows, column families, columns, and cells

HBase features

Let's see the major features of HBase that make it one of the most useful databases for current and future industry needs:

Automatic failover and load balancing: HBase runs on top of HDFS, which is internally distributed and automatically recovered using multiple block allocation and replication. It works with multiple HMasters and RegionServers. This failover is also facilitated by HBase and RegionServer replication.

Automatic sharding: An HBase table is made up of regions that are hosted by RegionServers, and these regions are distributed throughout the RegionServers on different DataNodes. HBase provides automatic and manual splitting of these regions into smaller subregions once a threshold size is reached, to reduce I/O time and overhead.

Hadoop/HDFS integration: It's important to note that HBase can run on top of other filesystems as well. However, HDFS is the most common choice, as it supports data distribution and high availability out of the box; we just need to set some configuration parameters to enable HBase to communicate with Hadoop.

Real-time, random big data access: HBase internally uses a log-structured merge-tree (LSM-tree) as its data storage architecture, which merges smaller files into larger files periodically to reduce disk seeks.

MapReduce: HBase has built-in support for the Hadoop MapReduce framework for fast and parallel processing of data stored in HBase. You can search for the package org.apache.hadoop.hbase.mapreduce for more details.

Java API for client access: HBase has solid Java API support (client/server) for easy development and programming.

Thrift and a RESTful web service: HBase provides not only Thrift and RESTful gateways but also web service gateways for integrating with and accessing HBase from languages other than Java (in addition to the HBase Java APIs).

Support for exporting metrics via the Hadoop metrics subsystem: HBase provides Java Management Extensions (JMX) and metrics export for monitoring purposes with tools such as Ganglia and Nagios.

Distributed: When used with HDFS, HBase coordinates with Hadoop so that distribution of tables, high availability, and consistency are all supported.

Linear scalability (scale out): Scaling HBase is not scale up but scale out, which means that we don't need to make servers more powerful; we add more machines to the cluster, and we can do so on the fly. As soon as a new RegionServer node is up, the cluster can begin rebalancing: the RegionServer starts on the new node and the cluster is scaled up. It is as simple as that.

Column oriented: HBase stores each column separately, in contrast to most relational databases, which use row-based storage. So, in HBase, columns are stored contiguously, not rows.
More about row- and column-oriented databases will follow.

HBase shell support: HBase provides a full-fledged command-line shell with which we can interact with HBase and perform operations such as creating tables, adding and removing data, scanning data, and running a few other administrative commands.

Sparse, multidimensional, sorted map database: HBase is a sparse, multidimensional, sorted map-based database, which supports multiple versions of the same record.

Snapshot support: HBase supports taking snapshots of metadata for recovering a previous, correct state of the data.

HBase in the Hadoop ecosystem

Let's see where HBase sits in the Hadoop ecosystem. In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem.

HBase can work as a separate entity on the local filesystem (which is not really effective, as no distribution is provided) as well as in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services: a distributed file system (HDFS) for storage and a MapReduce framework for processing in parallel mode. When there was a need to store structured data (data in the form of tables, rows, and columns), which most programmers are already familiar with, programmers found it difficult to process the data that was stored on HDFS as unstructured flat files. This led to the evolution of HBase, which provided a way to store data in a structured way.

Consider that we have a CSV file stored on HDFS and we need to query it. We would need to write Java code for this, which wouldn't be a good option. It would be better if we could specify a data key and fetch the data directly. So, what we can do here is create a schema or table with the same structure as the CSV file, store the CSV file's data in the HBase table, and query it by key using the HBase APIs or the HBase shell.

Data representation in HBase

Let's look at the representation of rows and columns in an HBase table. An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys that identify a row, column families are groups of columns, columns are fields of the table, and a cell contains the actual value, or the data.

So, we have been through the introduction of HBase; now, let's see what Hadoop and its components are, in brief. It is assumed here that you are already familiar with Hadoop; if not, the following brief introduction to Hadoop will help you understand it.

Hadoop

Hadoop is the underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework that supports the storage of large datasets. It provides a distributed file system and MapReduce, a distributed programming framework, giving us a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on systems with tens to tens of thousands of nodes. The underlying distributed file system provides large-scale storage and rapid data access. It has the following submodules:

Hadoop Common: This is the core component that supports the other Hadoop modules. It is like the master component, facilitating communication and coordination between different Hadoop modules.
Hadoop Distributed File System: This is the underlying distributed file system, abstracted on top of the local filesystem, which provides high throughput for read and write operations on data in Hadoop.

Hadoop YARN: This is the new framework shipped with newer releases of Hadoop. It provides job scheduling and job and resource management.

Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large data and datasets.

Other Hadoop subprojects include HBase, Hive, Ambari, Avro, Cassandra (Cassandra isn't a Hadoop subproject; it's a related project that solves similar problems in different ways), Mahout, Pig, Spark, and ZooKeeper (ZooKeeper isn't a Hadoop subproject either; it's a dependency shared by many distributed systems). All of these have different uses, and the combination of all these subprojects forms the Hadoop ecosystem.

Core daemons of Hadoop

The following are the core daemons of Hadoop:

NameNode: This stores and manages all metadata about the data present on the cluster, so it is the single point of contact with Hadoop. In newer releases of Hadoop, we have the option of more than one NameNode for high availability.
JobTracker: This runs on the NameNode and performs the MapReduce of the jobs submitted to the cluster.
SecondaryNameNode: This maintains a backup of the metadata present on the NameNode and also records filesystem changes.
DataNode: This contains the actual data.
TaskTracker: This performs tasks on the local data, as assigned by the JobTracker.

The preceding are the daemons in the case of Hadoop v1 or earlier. In newer versions of Hadoop, we have a ResourceManager instead of the JobTracker, NodeManagers instead of TaskTrackers, and the YARN framework instead of the simple MapReduce framework. The following is a comparison between the daemons in Hadoop 1 and Hadoop 2:

HDFS daemons:
Hadoop 1: NameNode, Secondary NameNode, DataNode
Hadoop 2: NameNode (more than one, active/standby), Checkpoint node, DataNode

Processing daemons:
Hadoop 1 (MapReduce v1): JobTracker, TaskTracker
Hadoop 2 (YARN, MRv2): ResourceManager, NodeManager, Application Master

Comparing HBase with Hadoop

As we now know what HBase and Hadoop are, let's compare HDFS and HBase for better understanding:

Hadoop/HDFS | HBase
Provides a filesystem for distributed storage | Provides tabular, column-oriented data storage
Optimized for storage of huge-sized files with no random read/write of these files | Optimized for tabular data with random read/write facilities
Uses flat files | Uses key-value pairs of data
The data model is not flexible | Provides a flexible data model
Uses a filesystem and processing framework | Uses tabular storage with built-in Hadoop MapReduce support
Mostly optimized for write-once, read-many | Optimized for both reads and writes

Summary

In this article, we discussed the introductory aspects of HBase and its features. We also discussed HBase's components and their place in the HBase ecosystem.

Resources for Article:

Further resources on this subject:
The HBase's Data Storage [Article]
HBase Administration, Performance Tuning [Article]
Comparative Study of NoSQL Products [Article]

Configuring vShield App
Packt
24 Nov 2014
10 min read

In this article by Mike Greer, the author of vSphere Security Cookbook, we will cover the following topics:

Installing vShield App
Configuring vShield App
Configuring vShield App Flow Monitoring

(For more resources related to this topic, see here.)

Introduction

Most modern operating systems offer the capability to install a firewall on the host itself. The rules configured in a host-based firewall manage traffic at the host level and provide an additional layer of defense alongside network firewalls and intrusion detection systems. Multiple layers of security provide a complete defense-in-depth architecture: the concept of defense-in-depth builds layers of security that provide protection should another layer fail or be compromised.

The second component of the vShield family to be configured, and the one we'll discuss here, is vShield App. vShield App is a host-based layer 2 firewall implemented at the vNIC level of the hypervisor. vShield App presents itself as a virtual appliance in the vCenter management tool. For each protected ESXi host, there is an associated vShield App virtual machine that runs on that host.

To protect the entire virtualization environment managed by vCenter, it is important to install vShield App on each ESXi host in the datacenter. Failure to protect each host creates the opportunity for virtual machines to be moved to an unprotected host, either by vMotion or manually. Where DRS is in use, it is very likely that virtual machines will be moved to an unprotected host, assuming it has resources available and there is a high load on adjacent hosts.

Installing vShield App

vShield App is required to provide host-level security and firewall services to each individual ESXi host. This process must be completed on each ESXi host individually.

Getting ready

A Core Infrastructure Suite (CIS) or vCloud Networking and Security (vCNS) license must be installed prior to installing vShield App and vShield Edge. vShield App is installed per ESXi host, and vShield Manager must have been installed previously as a prerequisite.

In order to proceed, we require access to the vShield Web Console. The client can be run on any modern Windows or Mac desktop or server operating system. The vShield Web Console requires Adobe Flash, which is not supported on Linux operating systems at this time. Ensure the account used for login has administrative rights to vShield Manager.

How to do it…

Perform the following steps:

1. Open the vShield Manager Web Console and log in with an administrative account.
2. Navigate to Datacenter | Lab Cluster | esx5501.training.lab within vShield Manager. Ensure the ESXi host targeted for installation is not hosting the VM running vCenter: a connection to vCenter is required, and the installation of vShield App will disrupt the network connection to the ESXi host.
3. Locate vShield App on the Summary tab.
4. Click on Install next to vShield App.
5. Select the Datastore to hold the vShield App service information. In our example, we'll use datastore1.
6. Select an available Management Port Group that can communicate with the vShield Manager installed previously. In our example, we'll use Internal Network.
7. Enter the vShield App IP address. In our example, we'll use 192.168.10.30. The IP address of the vShield App must be unique and not previously assigned.
8. Enter the Netmask. In our example, we'll use 255.255.255.0.
9. Enter the Default Gateway. In our example, we'll use 192.168.10.1.
10. Ensure the vShield Endpoint checkbox is cleared.
11. Click on the Install button. The status will be shown during the installation.
12. Verify the completion of the setup with no errors. If an error occurs, the details regarding the error will be highlighted in yellow and begin with "vShield App installation encountered error while installing service VM <error details>".
13. Repeat this process for additional ESXi hosts. If vCenter is running on an ESXi host, use vMotion to migrate that VM to another host prior to installation.

How it works…

vShield App provides firewall functionality to an ESXi host by installing a virtual appliance that is tied to the local host. The virtual appliance is stored on the datastore local to the host. Each firewall appliance is named with the host name included, for clarity. In our example, we installed on the ESXi host named esx5501.training.lab, and the corresponding firewall appliance is named vShield-FW-esx5501.training.lab.

One important point to consider: if the vShield App fails on a particular ESXi host and the default rule is set to deny, then all traffic to the host will be denied, which can make troubleshooting difficult. If a vShield App installation fails and requires manual removal, the removal process requires the ESXi host to be rebooted. As a result, all virtual machines running on the host will need to be migrated to other nodes of the cluster or powered down.

Configuring vShield App

There are several important configurations to set in the vShield App management Web Console. Configuring Fail Safe Mode sets the action that will be taken if the vShield App fails or is down; Fail Safe Mode can be set to either allow or block. Excluding virtual machines such as vCenter is key to proper functionality, since exclusion removes a virtual machine from all firewall rules.

Getting ready

In order to proceed, we require access to the vShield Web Console. The client can be run on any modern Windows or Mac desktop or server operating system. The vShield Web Console requires Adobe Flash, which is not supported on Linux operating systems at this time. Ensure the account used for login has administrative rights to vShield Manager.

How to do it…

Viewing the current status is the first step in assessing the state of the vShield App. Once the app is verified to be in a healthy state, additional configuration can be accomplished by performing the following steps:

1. Launch vSphere Client using an account with administrative rights.
2. Choose Home | Inventory | Hosts and Clusters from the menu bar.
3. Navigate to Datacenter | Lab Cluster | esx5501.training.lab.
4. Select the vShield tab.
5. Expand vShield-FW-esx5501.training.lab (192.168.10.30).
6. Note the Status: In Sync status; there are two options, to either Force Sync or Restart. Current Management Port information is displayed, including packet, byte, and error information. Syslog Servers can be added by IP address.

Configuring the Fail Safe Policy determines whether traffic flows or is blocked should the vShield App firewall be down or offline for any reason:

1. Navigate to Settings & Reports | vShield App within vShield Manager.
2. Click on Change under Fail Safe.
3. Click on Yes. Note that the Default Fail Safe Configuration, previously set to Block, is now changed to Allow.

In a production situation, there are few times when the fail safe setting should be changed to Allow.
In a small test environment, however, should the vShield App be unavailable, all connectivity to the ESXi host will be blocked by default.

Configuring the Exclusion List allows certain virtual machines to function without host-based firewall rules being applied to them. This can be done by performing the following steps:

1. Navigate to Settings & Reports | vShield App within vShield Manager.
2. Click on Add under Exclusion List.
3. Select a virtual machine to exclude from vShield App (in our case, a linked vCenter server).
4. Click on Add.
5. Click on OK.
6. Click on OK in the next dialog box to confirm. The selected virtual machine is now excluded from protection.

How it works…

The vShield App host firewall is installed per ESXi host and is automatically named by the installation program to include the name of the host. It is also important to note that when a host is put into maintenance mode, the vShield firewall must be shut down in order to let the host successfully reach maintenance mode.

Current Status provides a single view of the firewall associated with the host, including the traffic status displayed by packet count, link, and admin status. One important check is whether the firewall is in sync with vShield Manager. Should the firewall fall out of sync, it can be forced to sync; if that fails, the option to restart is also present on the status page.

The Fail Safe Policy is an important consideration should the vShield App virtual appliance fail for any reason. The default setting is to block all traffic, and this might seem like a good idea at the outset. However, careful consideration should be given to this setting, depending on what type of virtual machines are running on a specific host or cluster. Where mission-critical applications are running on virtual machines within an internal cluster or host, it can make sense to allow traffic if the vShield App goes down: the time it takes to identify and remediate the failure could cause a significant amount of lost business.

The Exclusion List, as the name implies, allows certain virtual machines to remain outside the protection of the vShield App firewall. Critical infrastructure such as DNS servers or Domain Controllers are good candidates for the exclusion list. vCenter servers should always be added to the exclusion list.

Configuring vShield App Flow Monitoring

vShield App Flow Monitoring is a traffic analysis tool that provides statistics and graphs of the traffic on the virtual network as it passes through a host running vShield App. The information collected and displayed by Flow Monitoring is detailed down to the protocol level and is very useful in spotting unwanted traffic flows.

Getting ready

In order to proceed, we require access to the vShield App through the vSphere Client plugin. The plugin can be enabled through the Plug-ins menu in the vSphere Client. This client can be run on any modern Windows desktop or server operating system. The vShield vSphere Client plugin requires Adobe Flash, which is not supported on Linux operating systems at this time. Ensure the vCenter account used for login has administrative rights to vShield Manager.

How to do it…

To view the current traffic flow:

1. Launch vSphere Client using an account with administrative rights.
2. Navigate to Home | Inventory | Hosts and Clusters from the menu bar.
3. Navigate to Datacenter and click on the vShield tab.
4. Select Flow Monitoring.
5. Note that the Summary information is displayed by default.
6. Click on Details to view detailed information. Allowed Flows and Blocked Flows are available to view.
7. Select DNS-UDP to identify the host lookup traffic on port 53.
8. Click on Add Rule for Rule Id 1002. By changing Action from Allow to Block, a firewall rule can be created to block the DNS traffic from the web server to the DNS server.
9. Click on Cancel.

How it works…

The Flow Monitoring component of vShield App, in addition to providing great detail, is able to create vShield App firewall rules on the fly. As shown in the preceding example, we were able to identify DNS traffic from our web server accessing an internal DNS server. Because our governance rules state that servers in the DMZ are not allowed to request DNS from internal servers, implementing a firewall rule adds a control enforcing this policy, and it is easily put in place once the administrator notices the request.

The ability to view traffic by Top Flows, Top Destinations, and Top Sources is very valuable when troubleshooting a problem, or when tracking down a virus or trojan that is attempting to send valuable information outside the organization during a breach.

Summary

This article has covered how to set up and configure vShield App.

Resources for Article:

Further resources on this subject:
Introduction to Veeam® Backup & Replication for VMware [article]
VMware vCenter Operations Manager Essentials - Introduction to vCenter Operations Manager [article]
Introduction to vSphere Distributed switches [article]

Decoupling Units with unittest.mock
Packt
24 Nov 2014
27 min read

In this article by Daniel Arbuckle, author of the book Learning Python Testing, you'll learn how, by using the unittest.mock package, you can easily perform the following:

Replace functions and objects in your own code or in external packages.
Control how replacement objects behave. You can control what return values they provide, whether they raise an exception, even whether they make any calls to other functions or create instances of other objects.
Check whether the replacement objects were used as you expected: whether functions or methods were called the correct number of times, whether the calls occurred in the correct order, and whether the passed parameters were correct.

(For more resources related to this topic, see here.)

Mock objects in general

All right, before we get down to the nuts and bolts of unittest.mock, let's spend a few moments talking about mock objects overall. Broadly speaking, mock objects are any objects that you can use as substitutes in your test code, to keep your tests from overlapping and your tested code from infiltrating the wrong tests. However, like most things in programming, the idea works better when it has been formalized into a well-designed library that you can call on when you need it. There are many such libraries available for most programming languages.

Over time, the authors of mock object libraries have developed two major design patterns for mock objects. In the first pattern, you create a mock object and perform all of the expected operations on it. The object records these operations; then you put the object into playback mode and pass it to your code. If your code fails to duplicate the expected operations, the mock object reports a failure. In the second pattern, you create a mock object, do the minimal configuration necessary to allow it to mimic the real object it replaces, and pass it to your code. It records how the code uses it, and then you can perform assertions after the fact to check whether your code used the object as expected. The second pattern is slightly more capable in terms of the tests that you can write using it but, overall, either pattern works well.

Mock objects according to unittest.mock

Python has several mock object libraries; as of Python 3.3, however, one of them has been crowned as a member of the standard library. Naturally, that's the one we're going to focus on. That library is, of course, unittest.mock. The unittest.mock library is of the second sort: a record-actual-use-and-then-assert library. The library contains several different kinds of mock objects that, between them, let you mock almost anything that exists in Python. Additionally, the library contains several useful helpers that simplify assorted tasks related to mock objects, such as temporarily replacing real objects with mocks.

Standard mock objects

The basic element of unittest.mock is the unittest.mock.Mock class. Even without being configured at all, Mock instances can do a pretty good job of pretending to be some other object, method, or function. (There are many mock object libraries for Python, so, strictly speaking, the phrase "mock object" could mean any object that was created by any of these libraries.)

Mock objects can pull off this impersonation because of a clever, somewhat recursive trick. When you access an unknown attribute of a mock object, instead of raising an AttributeError exception, the mock object creates a child mock object and returns that.
Since mock objects are pretty good at impersonating other objects, returning a mock object instead of the real value works, at least in the common case. Similarly, mock objects are callable; when you call a mock object as a function or method, it records the parameters of the call and then, by default, returns a child mock object.

A child mock object is a mock object in its own right, but it knows that it's connected to the mock object it came from: its parent. Anything you do to the child is also recorded in the parent's memory. When the time comes to check whether the mock objects were used correctly, you can use the parent object to check on all of its descendants.

Example: playing with mock objects in the interactive shell (try it for yourself!):

$ python3.4
Python 3.4.0 (default, Apr 2 2014, 08:10:08)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from unittest.mock import Mock, call
>>> mock = Mock()
>>> mock.x
<Mock name='mock.x' id='140145643647832'>
>>> mock.x
<Mock name='mock.x' id='140145643647832'>
>>> mock.x('Foo', 3, 14)
<Mock name='mock.x()' id='140145643690640'>
>>> mock.x('Foo', 3, 14)
<Mock name='mock.x()' id='140145643690640'>
>>> mock.x('Foo', 99, 12)
<Mock name='mock.x()' id='140145643690640'>
>>> mock.y(mock.x('Foo', 1, 1))
<Mock name='mock.y()' id='140145643534320'>
>>> mock.method_calls
[call.x('Foo', 3, 14), call.x('Foo', 3, 14), call.x('Foo', 99, 12), call.x('Foo', 1, 1), call.y(<Mock name='mock.x()' id='140145643690640'>)]
>>> mock.assert_has_calls([call.x('Foo', 1, 1)])
>>> mock.assert_has_calls([call.x('Foo', 1, 1), call.x('Foo', 99, 12)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/unittest/mock.py", line 792, in assert_has_calls
    ) from cause
AssertionError: Calls not found.
Expected: [call.x('Foo', 1, 1), call.x('Foo', 99, 12)]
Actual: [call.x('Foo', 3, 14), call.x('Foo', 3, 14), call.x('Foo', 99, 12), call.x('Foo', 1, 1), call.y(<Mock name='mock.x()' id='140145643690640'>)]
>>> mock.assert_has_calls([call.x('Foo', 1, 1),
...     call.x('Foo', 99, 12)], any_order = True)
>>> mock.assert_has_calls([call.y(mock.x.return_value)])

There are several important things demonstrated in this interactive session. First, notice that the same mock object was returned each time we accessed mock.x. This always holds true: if you access the same attribute of a mock object, you'll get the same mock object back as the result.

The next thing to notice might seem more surprising. Whenever you call a mock object, you get the same mock object back as the return value. The returned mock isn't made new for every call, nor is it unique for each combination of parameters. We'll see how to override the return value shortly but, by default, you get the same mock object back every time you call a mock object. This mock object can be accessed using the return_value attribute name, as you might have noticed from the last statement of the example.

The unittest.mock package contains a call object that helps to make it easier to check whether the correct calls have been made. The call object is callable, and takes note of its parameters in a way similar to mock objects, making it easy to compare it to a mock object's call history. However, the call object really shines when you have to check for calls to descendant mock objects.
As you can see in the previous example, while call('Foo', 1, 1) will match a call to the parent mock object if the call used these parameters, call.x('Foo', 1, 1) matches a call to the child mock object named x. You can build up a long chain of lookups and invocations. For example:

>>> mock.z.hello(23).stuff.howdy('a', 'b', 'c')
<Mock name='mock.z.hello().stuff.howdy()' id='140145643535328'>
>>> mock.assert_has_calls([
...     call.z.hello().stuff.howdy('a', 'b', 'c')
... ])
>>>

Notice that the original invocation included hello(23), but the call specification wrote it simply as hello(). Each call specification is only concerned with the parameters of the object that was finally called, after all of the lookups. The parameters of intermediate calls are not considered. That's okay, because they always produce the same return value anyway (unless you've overridden that behavior, in which case they probably don't produce a mock object at all).

You might not have encountered an assertion before. Assertions have one job, and one job only: they raise an exception if something is not as expected. The assert_has_calls method, in particular, raises an exception if the mock object's history does not include the specified calls. In our example, the call history matches, so the assertion method doesn't do anything visible.

You can check whether the intermediate calls were made with the correct parameters, though, because the mock object recorded a call to mock.z.hello(23) immediately before it recorded a call to mock.z.hello().stuff.howdy('a', 'b', 'c'):

>>> mock.mock_calls.index(call.z.hello(23))
6
>>> mock.mock_calls.index(call.z.hello().stuff.howdy('a', 'b', 'c'))
7

This also points out the mock_calls attribute that all mock objects carry. If the various assertion functions don't quite do the trick for you, you can always write your own functions that inspect the mock_calls list and check whether things are or are not as they should be. We'll discuss the mock object assertion methods shortly.

Non-mock attributes

What if you want a mock object to give back something other than a child mock object when you look up an attribute? It's easy; just assign a value to that attribute:

>>> mock.q = 5
>>> mock.q
5

There's one other common case where mock objects' default behavior is wrong: what if accessing a particular attribute is supposed to raise an AttributeError? Fortunately, that's easy too:

>>> del mock.w
>>> mock.w
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/unittest/mock.py", line 563, in __getattr__
    raise AttributeError(name)
AttributeError: w
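As a side note, attributes don't have to be set one at a time: the Mock constructor and the configure_mock method, which are part of the same unittest.mock library although not covered in the text above, accept keyword arguments, including dotted names for configuring children:

>>> from unittest.mock import Mock
>>> mock = Mock(q=5)
>>> mock.q
5
>>> mock.configure_mock(**{'r': 10, 'method.return_value': 'done'})
>>> mock.r
10
>>> mock.method()
'done'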
Non-mock return values and raising exceptions

Sometimes, actually fairly often, you'll want mock objects posing as functions or methods to return a specific value, or a series of specific values, rather than returning another mock object. To make a mock object always return the same value, just change the return_value attribute:

>>> mock.o.return_value = 'Hi'
>>> mock.o()
'Hi'
>>> mock.o('Howdy')
'Hi'

If you want the mock object to return a different value each time it's called, you need to assign an iterable of return values to the side_effect attribute instead, as follows:

>>> mock.p.side_effect = [1, 2, 3]
>>> mock.p()
1
>>> mock.p()
2
>>> mock.p()
3
>>> mock.p()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/unittest/mock.py", line 885, in __call__
    return _mock_self._mock_call(*args, **kwargs)
  File "/usr/lib64/python3.4/unittest/mock.py", line 944, in _mock_call
    result = next(effect)
StopIteration

If you don't want your mock object to raise a StopIteration exception, you need to make sure to give it enough return values for all of the invocations in your test. If you don't know how many times it will be invoked, an infinite iterator such as itertools.count might be what you need. This is easily done:

>>> mock.p.side_effect = itertools.count()

If you want your mock to raise an exception instead of returning a value, just assign the exception object to side_effect, or put it into the iterable that you assign to side_effect:

>>> mock.e.side_effect = [1, ValueError('x')]
>>> mock.e()
1
>>> mock.e()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.4/unittest/mock.py", line 885, in __call__
    return _mock_self._mock_call(*args, **kwargs)
  File "/usr/lib64/python3.4/unittest/mock.py", line 946, in _mock_call
    raise result
ValueError: x

The side_effect attribute has another use as well, which we'll come to shortly.

Mocking class or function details

Sometimes, the generic behavior of mock objects isn't a close enough emulation of the object being replaced. This is particularly the case when it's important that they raise exceptions when used improperly, since mock objects are usually happy to accept any usage. The unittest.mock package addresses this problem using a technique called speccing. If you pass an object into unittest.mock.create_autospec, the returned value will be a mock object, but it will do its best to pretend that it's the same object you passed into create_autospec. This means that it will:

Raise an AttributeError if you attempt to access an attribute that the original object doesn't have, unless you first explicitly assign a value to that attribute
Raise a TypeError if you attempt to call the mock object when the original object wasn't callable
Raise a TypeError if you pass the wrong number of parameters, or pass a keyword parameter that isn't viable, where the original object was callable
Trick isinstance into thinking that the mock object is of the original object's type

Mock objects made by create_autospec share this trait with all of their children as well, which is usually what you want. If you really just want a specific mock to be specced, while its children are not, you can pass the template object into the Mock constructor using the spec keyword.
Here's a short demonstration of using create_autospec:

>>> from unittest.mock import create_autospec
>>> x = Exception('Bad', 'Wolf')
>>> y = create_autospec(x)
>>> isinstance(y, Exception)
True
>>> y
<NonCallableMagicMock spec='Exception' id='140440961099088'>

Mocking function or method side effects

Sometimes, for a mock object to successfully take the place of a function or method, the mock object has to actually perform calls to other functions, or set variable values, or generally do whatever a function can do. This need is less common than you might think, and it's also somewhat dangerous for testing purposes: when your mock objects can execute arbitrary code, there's a possibility that they stop being a simplifying tool for enforcing test isolation and become a complex part of the problem instead.

Having said that, there are still times when you need a mocked function to do something more complex than simply returning a value, and we can use the side_effect attribute of mock objects to achieve this. We've seen side_effect before, when we assigned an iterable of return values to it. If you assign a callable to side_effect, this callable will be called when the mock object is called, and passed the same parameters. If the side_effect function raises an exception, this is what the mock object does as well; otherwise, the side_effect return value is returned by the mock object.

In other words, if you assign a function to a mock object's side_effect attribute, the mock object in effect becomes that function, with the only important difference being that the mock object still records the details of how it's used. The code in a side_effect function should be minimal, and should not try to actually do the job of the code the mock object is replacing. All it should do is perform any expected externally visible operations and then return the expected result.

Mock object assertion methods

As we saw in the Standard mock objects section, you can always write code that checks the mock_calls attribute of mock objects to see whether or not things are behaving as they should. However, some particularly common checks have already been written for you, and are available as assertion methods of the mock objects themselves. As is normal for assertions, these assertion methods return None if they pass, and raise an AssertionError if they fail.

The assert_called_with method accepts an arbitrary collection of arguments and keyword arguments, and raises an AssertionError unless these parameters were passed to the mock the last time it was called.

The assert_called_once_with method behaves like assert_called_with, except that it also checks whether the mock was only called once, and raises an AssertionError if that is not true.

The assert_any_call method accepts arbitrary arguments and keyword arguments, and raises an AssertionError if the mock object has never been called with these parameters.

We've already seen the assert_has_calls method. This method accepts a list of call objects, checks whether they appear in the history in the same order, and raises an exception if they do not. Note that "in the same order" does not necessarily mean "next to each other": there can be other calls in between the listed calls, as long as all of the listed calls appear in the proper sequence. This behavior changes if you assign a true value to the any_order argument. In that case, assert_has_calls doesn't care about the order of the calls, and only checks whether they all appear in the history.

The assert_not_called method raises an exception if the mock has ever been called.
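To see these assertions in the setting they're designed for, here's a minimal sketch of a test that hands a mock collaborator to the code under test; the greet function and its mailer parameter are invented for this example:

>>> from unittest.mock import Mock
>>> def greet(mailer, name):
...     mailer.send('hello@example.com', 'Hello, ' + name)
...
>>> mailer = Mock()
>>> greet(mailer, 'Daniel')
>>> mailer.send.assert_called_once_with('hello@example.com', 'Hello, Daniel')
>>> mailer.send.assert_any_call('hello@example.com', 'Hello, Daniel')

Both assertions pass silently; if greet had built the message incorrectly, each of them would raise an AssertionError describing the expected and actual calls.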
Mocking containers and objects with a special behavior

One thing the Mock class does not handle is the so-called magic methods that underlie Python's special syntactic constructions: __getitem__, __add__, and so on. If you need your mock objects to record and respond to magic methods (in other words, if you want them to pretend to be container objects such as dictionaries or lists, respond to mathematical operators, act as context managers, or do any of the other things where syntactic sugar translates into a method call underneath), you're going to use unittest.mock.MagicMock to create your mock objects.

There are a few magic methods that are not supported even by MagicMock, due to details of how they (and mock objects) work: __getattr__, __setattr__, __init__, __new__, __prepare__, __instancecheck__, __subclasscheck__, and __del__.

Here's a simple example in which we use MagicMock to create a mock object supporting the in operator:

>>> from unittest.mock import MagicMock
>>> mock = MagicMock()
>>> 7 in mock
False
>>> mock.mock_calls
[call.__contains__(7)]
>>> mock.__contains__.return_value = True
>>> 8 in mock
True
>>> mock.mock_calls
[call.__contains__(7), call.__contains__(8)]

Things work similarly with the other magic methods. For example, addition:

>>> mock + 5
<MagicMock name='mock.__add__()' id='140017311217816'>
>>> mock.mock_calls
[call.__contains__(7), call.__contains__(8), call.__add__(5)]

Notice that the return value of the addition is a mock object, a child of the original mock object, but the in operator returned a Boolean value. Python ensures that some magic methods return a value of a particular type, and will raise an exception if that requirement is not fulfilled. In these cases, MagicMock's implementations of the methods return a best-guess value of the proper type, instead of a child mock object.

There's something you need to be careful of when it comes to the in-place mathematical operators, such as += (__iadd__) and |= (__ior__): MagicMock handles them somewhat strangely. What it does is still useful, but it might well catch you by surprise:

>>> mock += 10
>>> mock.mock_calls
[]

What was that? Did it erase our call history? Fortunately, no, it didn't. What it did was assign the child mock created by the addition operation to the variable called mock. That is entirely in accordance with how the in-place math operators are supposed to work. Unfortunately, it has still cost us our ability to access the call history, since we no longer have a variable pointing at the parent mock object.

Make sure that you have the parent mock object set aside in a variable that won't be reassigned, if you're going to be checking in-place math operators. Also, you should make sure that your mocked in-place operators return the result of the operation, even if that just means return self.return_value, because otherwise Python will assign None to the left-hand variable.

There's another detail of how in-place operators work that you should keep in mind:

>>> mock = MagicMock()
>>> x = mock
>>> x += 5
>>> x
<MagicMock name='mock.__iadd__()' id='139845830142216'>
>>> x += 10
>>> x
<MagicMock name='mock.__iadd__().__iadd__()' id='139845830154168'>
>>> mock.mock_calls
[call.__iadd__(5), call.__iadd__().__iadd__(10)]

Because the result of the operation is assigned to the original variable, a series of in-place math operations builds up a chain of child mock objects. If you think about it, that's the right thing to do, but it is rarely what people expect at first.
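MagicMock also implements __enter__ and __exit__, so it can stand in for a context manager. A brief sketch follows; the 'resource' value is just for illustration:

>>> from unittest.mock import MagicMock
>>> cm = MagicMock()
>>> cm.__enter__.return_value = 'resource'
>>> with cm as r:
...     print(r)
...
resource
>>> cm.mock_calls
[call.__enter__(), call.__exit__(None, None, None)]

As with the in operator earlier, the entry and exit are recorded in the call history, so you can assert that your code used the context manager correctly.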
If you think about it, that's the right thing to do, but it is rarely what people expect at first.

Mock objects for properties and descriptors

There's another category of things that basic Mock objects don't do a good job of emulating: descriptors. Descriptors are objects that allow you to interfere with the normal variable access mechanism. The most commonly used descriptors are created by Python's property built-in function, which simply allows you to write functions to control getting, setting, and deleting a variable.

To mock a property (or other descriptor), create a unittest.mock.PropertyMock instance and assign it to the property name. The only complication is that you can't assign a descriptor to an object instance; you have to assign it to the object's type, because descriptors are looked up in the type without first checking the instance. That's not hard to do with mock objects, fortunately:

>>> from unittest.mock import PropertyMock
>>> mock = Mock()
>>> prop = PropertyMock()
>>> type(mock).p = prop
>>> mock.p
<MagicMock name='mock()' id='139845830215328'>
>>> mock.mock_calls
[]
>>> prop.mock_calls
[call()]
>>> mock.p = 6
>>> prop.mock_calls
[call(), call(6)]

The thing to be mindful of here is that the property is not a child of the object named mock. Because of this, we have to keep it around in its own variable, because otherwise we'd have no way of accessing its history. The PropertyMock objects record variable lookup as a call with no parameters, and variable assignment as a call with the new value as a parameter.

You can use a PropertyMock object if you actually need to record variable accesses in your mock object history. Usually you don't need to do that, but the option exists.

Even though you set a property by assigning it to an attribute of a type, you don't have to worry about having your PropertyMock objects bleed over into other tests. Each Mock you create has its own type object, even though they all claim to be of the same class:

>>> type(Mock()) is type(Mock())
False

Thanks to this feature, any changes that you make to a mock object's type object are unique to that specific mock object.

Mocking file objects

It's likely that you'll occasionally need to replace a file object with a mock object. The unittest.mock library helps you with this by providing mock_open, which is a factory for fake open functions. These functions have the same interface as the real open function, but they return a mock object that's been configured to pretend that it's an open file object. This sounds more complicated than it is. See for yourself:

>>> from unittest.mock import mock_open
>>> open = mock_open(read_data = 'moose')
>>> with open('/fake/file/path.txt', 'r') as f:
...     print(f.read())
...
moose

If you pass a string value to the read_data parameter, the mock file object that eventually gets created will use that value as the data source when its read methods get called. As of Python 3.4.0, read_data only supports string objects, not bytes. If you don't pass read_data, read method calls will return an empty string.

The problem with the previous code is that it makes the real open function inaccessible, and leaves a mock object lying around where other tests might stumble over it. Read on to see how to fix these problems.

Replacing real code with mock objects

The unittest.mock library gives us a very nice tool for temporarily replacing objects with mock objects, and then undoing the change when our test is done. This tool is unittest.mock.patch.
There are a lot of different ways in which patch can be used: it works as a context manager, a function decorator, and a class decorator; additionally, it can create a mock object to use for the replacement, or it can use a replacement object that you specify. There are a number of other optional parameters that can further adjust the behavior of patch.

Basic usage is easy:

>>> from unittest.mock import patch, mock_open
>>> with patch('builtins.open', mock_open(read_data = 'moose')) as mock:
...     with open('/fake/file.txt', 'r') as f:
...         print(f.read())
...
moose
>>> open
<built-in function open>

As you can see, patch dropped the mock open function created by mock_open over the top of the real open function; then, when we left the context, it restored the original for us automatically.

The first parameter of patch is the only one that is required. It is a string describing the absolute path to the object to be replaced. The path can have any number of package and subpackage names, but it must include the module name and the name of the object inside the module that is being replaced. If the path is incorrect, patch will raise an ImportError, TypeError, or AttributeError, depending on what exactly is wrong with the path.

If you don't want to worry about making a mock object to be the replacement, you can just leave that parameter off:

>>> import io
>>> with patch('io.BytesIO'):
...     x = io.BytesIO(b'ascii data')
...     io.BytesIO.mock_calls
...
[call(b'ascii data')]

The patch function creates a new MagicMock for you if you don't tell it what to use for the replacement object. This usually works pretty well, but you can pass the new parameter (also the second parameter, as we used it in the first example of this section) to specify that the replacement should be a particular object; or you can pass the new_callable parameter to make patch use the value of that parameter to create the replacement object.

We can also force patch to use create_autospec to make the replacement object, by passing autospec=True:

>>> with patch('io.BytesIO', autospec = True):
...     io.BytesIO.melvin
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib64/python3.4/unittest/mock.py", line 557, in __getattr__
    raise AttributeError("Mock object has no attribute %r" % name)
AttributeError: Mock object has no attribute 'melvin'

The patch function will normally refuse to replace an object that does not exist; however, if you pass it create=True, it will happily drop a mock object wherever you like. Naturally, this is not compatible with autospec=True.

The patch function covers the most common cases. There are a few related functions that handle less common but still useful cases.

The patch.object function does the same thing as patch, except that, instead of taking the path string, it accepts an object and an attribute name as its first two parameters. Sometimes this is more convenient than figuring out the path to an object. Many objects don't even have valid paths (for example, objects that exist only in a function local scope), although the need to patch them is rarer than you might think.

The patch.dict function temporarily drops one or more objects into a dictionary under specific keys. The first parameter is the target dictionary; the second is a dictionary from which to get the key and value pairs to put into the target. If you pass clear=True, the target will be emptied before the new values are inserted. Notice that patch.dict doesn't create the replacement values for you. You'll need to make your own mock objects, if you want them.
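As a small sketch of patch.dict in action (the API_TOKEN variable is invented for this example, and we assume that it isn't already set in the environment):

>>> import os
>>> from unittest.mock import patch
>>> with patch.dict(os.environ, {'API_TOKEN': 'fake-token'}):
...     os.environ['API_TOKEN']
...
'fake-token'
>>> 'API_TOKEN' in os.environ
False

When the context ends, patch.dict restores the dictionary to its previous state, so the temporary key disappears again.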
Mock objects in action

That was a lot of theory, interspersed with unrealistic examples. Let's take what we've learned and apply it to get a more realistic view of how these tools can help us.

Better PID tests

The PID tests suffered mostly from having to do a lot of extra work to patch and unpatch time.time, and had some difficulty breaking the dependence on the constructor.

Patching time.time

Using patch, we can remove a lot of the repetitiveness of dealing with time.time; this means that it's less likely that we'll make a mistake somewhere, and saves us from spending time on something that's kind of boring and annoying. All of the tests can benefit from similar changes:

>>> from unittest.mock import Mock, patch
>>> with patch('time.time', Mock(side_effect = [1.0, 2.0, 3.0, 4.0, 5.0])):
...     import pid
...     controller = pid.PID(P = 0.5, I = 0.5, D = 0.5, setpoint = 0,
...                          initial = 12)
...     assert controller.gains == (0.5, 0.5, 0.5)
...     assert controller.setpoint == [0.0]
...     assert controller.previous_time == 1.0
...     assert controller.previous_error == -12.0
...     assert controller.integrated_error == 0.0

Apart from using patch to handle time.time, this test has been changed. We can now use assert to check whether things are correct, instead of having doctest compare the values directly. There's hardly any difference between the two approaches, except that we can place the assert statements inside the context managed by patch.

Decoupling from the constructor

Using mock objects, we can finally separate the tests for the PID methods from the constructor, so that mistakes in the constructor cannot affect the outcome:

>>> import imp
>>> with patch('time.time', Mock(side_effect = [2.0, 3.0, 4.0, 5.0])):
...     pid = imp.reload(pid)
...     mock = Mock()
...     mock.gains = (0.5, 0.5, 0.5)
...     mock.setpoint = [0.0]
...     mock.previous_time = 1.0
...     mock.previous_error = -12.0
...     mock.integrated_error = 0.0
...     assert pid.PID.calculate_response(mock, 6) == -3.0
...     assert pid.PID.calculate_response(mock, 3) == -4.5
...     assert pid.PID.calculate_response(mock, -1.5) == -0.75
...     assert pid.PID.calculate_response(mock, -2.25) == -1.125

What we've done here is set up a mock object with the proper attributes and pass it into calculate_response as the self parameter. We could do this because we didn't create a PID instance at all. Instead, we looked up the method's function inside the class and called it directly, allowing us to pass whatever we wanted as the self parameter, instead of having Python's automatic mechanisms handle it. Never invoking the constructor means that we're immune to any errors it might contain, and guarantees that the object state is exactly what we expect here in our calculate_response test.

Summary

In this article, we've learned about a family of objects that specialize in impersonating other classes, objects, methods, and functions. We've seen how to configure these objects to handle corner cases where their default behavior isn't sufficient, and we've learned how to examine the activity logs that these mock objects keep, so that we can decide whether the objects are being used properly or not.

Resources for Article:

Further resources on this subject:

Installing NumPy, SciPy, matplotlib, and IPython [Article]
Machine Learning in IPython with scikit-learn [Article]
Python 3: Designing a Tasklist Application [Article]
Managing Heroku from the Command Line

Packt
20 Nov 2014
27 min read
In this article by Mike Coutermarsh, author of Heroku Cookbook, we will cover the following topics:

• Viewing application logs
• Searching logs
• Installing add-ons
• Managing environment variables
• Enabling the maintenance page
• Managing releases and rolling back
• Running one-off tasks and dynos
• Managing SSH keys
• Sharing and collaboration
• Monitoring load average and memory usage

(For more resources related to this topic, see here.)

Heroku was built to be managed from its command-line interface. The better we learn it, the faster and more effective we will be in administering our application. The goal of this article is to get comfortable with using the CLI. We'll see that each Heroku command follows a common pattern. Once we learn a few of these commands, the rest will be relatively simple to master. In this article, we won't cover every command available in the CLI, but we will focus on the ones that we'll be using the most. As we learn each command, we will also learn a little more about what is happening behind the scenes, so that we get a better understanding of how Heroku works. The more we understand, the more we'll be able to take advantage of the platform.

Before we start, let's note that if we ever need to get a list of the available commands, we can run the following command:

$ heroku help

We can also quickly display the documentation for a single command:

$ heroku help command_name

Viewing application logs

Logging gets a little more complex for any application that is running multiple servers and several different types of processes. Having visibility into everything that is happening within our application is critical to maintaining it. Heroku handles this by combining and sending all of our logs to one place, the Logplex. The Logplex provides us with a single location to view a stream of our logs across our entire application. In this recipe, we'll learn how to view logs via the CLI, and how to quickly get visibility into what's happening within our application.

How to do it…

To start, let's open up a terminal, navigate to an existing Heroku application, and perform the following steps:

First, to view our application's logs, we can use the logs command:

$ heroku logs
2014-03-31T23:35:51.195150+00:00 app[web.1]: Rendered pages/about.html.slim within layouts/application (25.0ms)
2014-03-31T23:35:51.215591+00:00 app[web.1]: Rendered layouts/_navigation_links.html.erb (2.6ms)
2014-03-31T23:35:51.230010+00:00 app[web.1]: Rendered layouts/_messages.html.slim (13.0ms)
2014-03-31T23:35:51.215967+00:00 app[web.1]: Rendered layouts/_navigation.html.slim (10.3ms)
2014-03-31T23:35:51.231104+00:00 app[web.1]: Completed 200 OK in 109ms (Views: 65.4ms | ActiveRecord: 0.0ms)
2014-03-31T23:35:51.242960+00:00 heroku[router]: at=info method=GET path=

Heroku logs anything that our application sends to STDOUT or STDERR. If we're not seeing logs, it's very likely that our application is not configured correctly.

We can also watch our logs in real time. This is known as tailing:

$ heroku logs --tail

Instead of --tail, we can also use -t. We'll need to press Ctrl + C to end the command and stop tailing the logs.

If we want to see the 100 most recent lines, we can use -n:

$ heroku logs -n 100

The Logplex stores a maximum of 1500 lines. To view more lines, we'll have to set up log storage.

We can filter the logs to only show a specific process type. Here, we will only see logs from our web dynos:

$ heroku logs -p web

If we want, we can be as granular as showing the logs from an individual dyno.
This will show only the logs from the second web dyno:

$ heroku logs -p web.2

We can use this for any process type; we can try it for our workers if we'd like:

$ heroku logs -p worker

The Logplex contains more than just logs from our application. We can also view logs generated by Heroku or the API. Let's try changing the source to Heroku to only see the logs generated by Heroku. This will only show us logs related to the router and resource usage:

$ heroku logs --source heroku

To view logs for only our application, we can set the source to app:

$ heroku logs --source app

We can also view logs from the API. These logs will show any administrative actions we've taken, such as scaling dynos or changing configuration variables. This can be useful when multiple developers are working on an application:

$ heroku logs --source api

We can even combine the different flags. Let's try tailing the logs for only our web dynos:

$ heroku logs -p web --tail

That's it! Remember that if we ever need more information on how to view logs via the CLI, we can always use the help command:

$ heroku help logs

How it works…

Under the covers, the Heroku CLI simply passes our request to Heroku's API and then uses Ruby to parse and display our logs. If you're interested in exactly how it works, the code is open source on GitHub at https://github.com/heroku/heroku/blob/master/lib/heroku/command/logs.rb.

Viewing logs via the CLI is most useful in situations where we need to see exactly what our application is doing right now. We'll find that we use it a lot around deploys and when debugging issues. Since the Logplex has a limit of 1500 lines, it's not meant for viewing any historical data. For this, we'll need to set up log drains and enable a logging add-on.

Searching logs

Heroku does not have the built-in capability to search our logs from the command line. We can get around this limitation easily by making use of some other command-line tools. In this recipe, we will learn how to combine Heroku's logs with Grep, a command-line tool to search text. This will allow us to search our recent logs for keywords, helping us track down errors more quickly.

Getting ready

For this recipe, we'll need to have Grep installed. For OS X and Linux machines, it should already be installed. We can verify the installation using the following steps:

To check if we have Grep installed, let's open up a terminal and type the following:

$ grep
usage: grep [-abcDEFGHhIiJLlmnOoPqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
      [-e pattern] [-f file] [--binary-files=value] [--color=when]
      [--context[=num]] [--directories=action] [--label] [--line-buffered]
      [--null] [pattern] [file ...]

If we do not see usage instructions, we can visit http://www.gnu.org/software/grep/ for the download and installation instructions.

How to do it…

Let's start searching our logs by opening a terminal and navigating to one of our Heroku applications, using the following steps:

To search for a keyword in our logs, we need to pipe our logs into Grep. This simply means that we will be passing our logs into Grep and having Grep search them for us. Let's try this now. The following command will search the output of heroku logs for the word error:

$ heroku logs | grep error

Sometimes, we might want to search for a longer string that includes special characters. We can do this by surrounding it with quotes:

$ heroku logs | grep "path=/pages/about host"

It can be useful to also see the lines surrounding the line that matched our search. We can do this as well.
The next command will show us the line that contains an error, as well as the three lines above and below it:

$ heroku logs | grep error -C 3

We can even search with regular expressions. The next command will show us every line that matches a number ending with MB. So, for example, lines with 100MB, 25MB, or 3MB will all appear:

$ heroku logs | grep '[0-9]*MB'

To learn more about regular expressions, visit http://regex.learncodethehardway.org/.

How it works…

Like most Unix-based tools, Grep was built to accomplish a single task and to do it well. Global regular expression print (Grep) is built to search a set of files for a pattern and then print all of the matches. Grep can also search anything it receives through standard input; this is exactly how we used it in this recipe. By piping the output of our Heroku logs into Grep, we are passing our logs to Grep as standard input.

See also

To learn more about Grep, visit http://www.tutorialspoint.com/unix_commands/grep.htm

Installing add-ons

Our application needs some additional functionality provided by an outside service. What should we do? In the past, this would have involved creating accounts, managing credentials, and maybe even bringing up servers and installing software. This whole process has been simplified by the Heroku add-on marketplace. For any additional functionality that our application needs, our first stop should always be Heroku add-ons. Heroku has made attaching additional resources to our application a plug-and-play process. If we need an additional database, caching, or error logging, they can be set up with a single command. In this recipe, we will learn the ins and outs of using the Heroku CLI to install and manage our application's add-ons.

How to do it...

To begin, let's open a terminal and navigate to one of our Heroku applications, using the following steps:

Let's start by taking a look at all of the available Heroku add-ons. We can do this with the addons:list command:

$ heroku addons:list

There are so many add-ons that viewing them through the CLI is pretty difficult. For easier navigation and search, we should take a look at https://addons.heroku.com/.

If we want to see the currently installed add-ons for our application, we can simply type the following:

$ heroku addons
=== load-tester-rails Configured Add-ons
heroku-postgresql:dev       HEROKU_POSTGRESQL_MAROON
heroku-postgresql:hobby-dev HEROKU_POSTGRESQL_ONYX
librato:development
newrelic:stark

Remember that for any command, we can always add --app app_name to specify the application. Alternatively, our application's add-ons are also listed on the Heroku Dashboard, available at https://dashboard.heroku.com.

The installation of a new add-on is done with addons:add. Here, we are going to install the error logging service, Rollbar:

$ heroku addons:add rollbar
Adding rollbar on load-tester-rails... done, v22 (free)
Use `heroku addons:docs rollbar` to view documentation.

We can quickly open up the documentation for an add-on with addons:docs:

$ heroku addons:docs rollbar

Removing an add-on is just as simple. We'll need to type our application name to confirm. For this example, our application is called load-tester-rails:

$ heroku addons:remove rollbar
!   WARNING: Destructive Action
!   This command will affect the app: load-tester-rails
!   To proceed, type "load-tester-rails" or re-run this command with --confirm load-tester-rails
> load-tester-rails
Removing rollbar on load-tester-rails...
done, v23 (free)

Each add-on comes with different tiers of service. Let's try upgrading our rollbar add-on to the starter tier:

$ heroku addons:upgrade rollbar:starter
Upgrading to rollbar:starter on load-tester-rails... done, v26 ($12/mo)
Plan changed to starter
Use `heroku addons:docs rollbar` to view documentation.

Now, if we want, we can downgrade back to its original level with addons:downgrade:

$ heroku addons:downgrade rollbar
Downgrading to rollbar on load-tester-rails... done, v27 (free)
Plan changed to free
Use `heroku addons:docs rollbar` to view documentation.

If we ever forget any of the commands, we can always use help to quickly see the documentation:

$ heroku help addons

Some add-ons might charge us money. Before continuing, let's double-check that we only have the correct ones enabled, using the $ heroku addons command.

How it works…

Heroku has created a standardized process for all add-on providers to follow. This ensures a consistent experience when provisioning any add-on for our application. It starts when we request the creation of an add-on. Heroku sends an HTTP request to the provider, asking them to provision an instance of their service. The provider must then respond to Heroku with the connection details for their service in the form of environment variables. For example, if we were to provision Redis To Go, we would get back our connection details in a REDISTOGO_URL variable:

REDISTOGO_URL: redis://user:pass@server.redistogo.com:9652

Heroku adds these variables to our application and restarts it. On restart, the variables are available to our application, and we can connect to the service using them. The specifics of how to connect using the variables will be in the add-on's documentation. Installation will depend on the specific language or framework we're using.
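As an illustrative sketch that is not part of the original recipe, a Python application could consume the injected variable roughly like this, assuming the redis-py client library is installed:

import os
import redis

# REDISTOGO_URL is set by Heroku when the add-on is provisioned
connection = redis.from_url(os.environ['REDISTOGO_URL'])

# The connection behaves like any other Redis client
connection.set('greeting', 'hello')
print(connection.get('greeting'))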
See also

For details on creating our own add-ons, the process is well documented on Heroku's website at https://addons.heroku.com/provider

Check out Kensa, the CLI to create Heroku add-ons, at https://github.com/heroku/kensa

Managing environment variables

Our applications will often need access to various credentials in the form of API tokens, usernames, and passwords for integrations with third-party services. We can store this information in our Git repository, but then anyone with access to our code will also have a copy of our production credentials. We should instead use environment variables to store any configuration information for our application. Configuration information should be separate from our application's code and instead be tied to the specific deployment of the application.

Changing our application to use environment variables is simple. Let's look at an example in Ruby; let's assume that we currently have secret_api_token defined in our application's code:

secret_api_token = '123abc'

We can remove the token and replace it with an environment variable:

secret_api_token = ENV['SECRET_TOKEN']

In addition to protecting our credentials, using environment variables makes our application more configurable. We'll be able to quickly make configuration changes without having to change code and redeploy.

The terms "configuration variable" and "environment variable" are interchangeable. Heroku usually uses "configuration" due to how tightly the variables are coupled with the state of the application.

How to do it...

Heroku makes it easy to set our application's environment variables through the config command. Let's launch a terminal and navigate to an existing Heroku project to try it out, using the following steps:

We can use the config command to see a list of all our existing environment variables:

$ heroku config

To view only the value of a specific variable, we can use get:

$ heroku config:get DATABASE_URL

To set a new variable, we can use set:

$ heroku config:set VAR_NAME=var_value
Setting config vars and restarting load-tester-rails... done, v28
VAR_NAME: var_value

Each time we set a config variable, Heroku will restart our application. We can set multiple values at once to avoid multiple restarts:

$ heroku config:set SECRET=value SECRET2=value
Setting config vars and restarting load-tester-rails... done, v29
SECRET: value
SECRET2: value

To delete a variable, we use unset:

$ heroku config:unset SECRET
Unsetting SECRET and restarting load-tester-rails... done, v30

If we want, we can delete multiple variables with a single command:

$ heroku config:unset VAR_NAME SECRET2
Unsetting VAR_NAME and restarting load-tester-rails... done, v31
Unsetting SECRET2 and restarting load-tester-rails... done, v32

Heroku tracks each configuration change as a release. This makes it easy for us to roll back changes if we make a mistake.

How it works…

Environment variables are used on Unix-based operating systems to manage and share configuration information between applications. As they are so common, changing our application to use them does not lock us into deploying only to Heroku.

Heroku stores all of our configuration variables in one central location. Each change to these variables is tracked, and we can view the history by looking through our past releases. When Heroku spins up a new dyno, part of the process is taking all of our configuration settings and setting them as environment variables on the dyno. This is why, whenever we make a configuration change, Heroku restarts our dynos. As configuration variables are such a key part of our Heroku application, any change to them will also be included in our Heroku logs.

See also

Read about the Twelve-Factor app's rule on configuration at http://12factor.net/config

Enabling the maintenance page

Occasionally, we will need to make changes to our application that require downtime. The proper way to do this is to put up a maintenance page that displays a friendly message and responds to all incoming HTTP requests with a 503 Service Unavailable status. Doing this will keep our users informed and also avoid any negative SEO effects. Search engines understand that when they receive a 503 response, they should come back later to recrawl the site. If we didn't use a maintenance page and our application returned 404 or 500 errors instead, it's possible that a search engine crawler might remove the pages from its index.

How to do it...

Let's open up a terminal and navigate to one of our Heroku projects to begin, using the following steps:

We can check whether our application's maintenance page is currently enabled with the maintenance command:

$ heroku maintenance
off

Let's try turning it on. This will stop traffic from being routed to our dynos and show the maintenance page instead:

$ heroku maintenance:on
Enabling maintenance mode for load-tester-rails... done

Now, if we visit our application, we'll see the default Heroku maintenance page.

To disable the maintenance page and resume sending users to our application, we can use the maintenance:off command:

$ heroku maintenance:off
Disabling maintenance mode for load-tester-rails...
done

Managing releases and rolling back

What do we do if disaster strikes and our newly released code breaks our application? Luckily for us, Heroku keeps a copy of every deploy and configuration change to our application. This enables us to roll back to a previous version while we work to correct the errors in our latest release.

Heads up! Rolling back only affects application code and configuration variables. Add-ons and our database will not be affected by a rollback.

In this recipe, we will learn how to manage our releases and roll back code from the CLI.

How to do it...

In this recipe, we'll view and manage our releases from the Heroku CLI, using the releases command. Let's open up a terminal now and navigate to one of our Heroku projects by performing the following steps:

Heroku tracks every deploy and configuration change as a release. We can view all of our releases from both the CLI and the web dashboard with the releases command:

$ heroku releases
=== load-tester-rails Releases
v33 Add WEB_CON config vars    coutermarsh.mike@gmail.com 2014/03/30 11:18:49 (~ 5h ago)
v32 Remove SEC config vars     coutermarsh.mike@gmail.com 2014/03/29 19:38:06 (~ 21h ago)
v31 Remove VAR config vars     coutermarsh.mike@gmail.com 2014/03/29 19:38:05 (~ 21h ago)
v30 Remove config vars         coutermarsh.mike@gmail.com 2014/03/29 19:27:05 (~ 21h ago)
v29 Deploy 9218c1c             coutermarsh.mike@gmail.com 2014/03/29 19:24:29 (~ 21h ago)

Alternatively, we can view our releases through the Heroku dashboard. Visit https://dashboard.heroku.com, select one of our applications, and click on Activity.

We can view detailed information about each release using the info command. This shows us everything about the change and state of the application during this release:

$ heroku releases:info v33
=== Release v33
Addons: librato:development
        newrelic:stark
        rollbar:free
        sendgrid:starter
By:     coutermarsh.mike@gmail.com
Change: Add WEB_CONCURRENCY config vars
When:   2014/03/30 11:18:49 (~ 6h ago)
=== v33 Config Vars
WEB_CONCURRENCY: 3

We can revert to the previous version of our application with the rollback command:

$ heroku rollback
Rolling back load-tester-rails... done, v32
!   Warning: rollback affects code and config vars; it doesn't add or remove addons. To undo, run: heroku rollback v33

Rolling back creates a new version of our application in the release history. We can also specify a specific version to roll back to:

$ heroku rollback v30
Rolling back load-tester-rails... done, v30

The version we roll back to does not have to be an older version. Although it sounds contradictory, we can also roll back to newer versions of our application.

How it works…

Behind the scenes, each Heroku release is tied to a specific slug and set of configuration variables. As Heroku keeps a copy of each slug that we deploy, we're able to quickly roll back to previous versions of our code without having to rebuild our application.

Each deploy release includes a reference to the Git SHA that was pushed to master. The Git SHA is a reference to the last commit made to our repository before it was deployed. This is useful if we want to know exactly what code was pushed out in that release. On our local machine, we can run the $ git checkout git-sha-here command to view our application's code in the exact state it was in when deployed.

Running one-off tasks and dynos

In more traditional hosting environments, developers will often log in to servers to perform basic administrative tasks or debug an issue.
With Heroku, we can do this by launching one-off dynos. These are dynos that contain our application code but do not serve web requests. For a Ruby on Rails application, one-off dynos are often used to run database migrations or launch a Rails console.

How to do it...

In this recipe, we will learn how to execute commands on our Heroku applications with the heroku run command. Let's launch a terminal now to get started with the following steps:

To have Heroku start a one-off dyno and execute any single command, we will use heroku run. Here, we can try it out by running a simple command to print some text to the screen:

$ heroku run echo "hello heroku"
Running `echo "hello heroku"` attached to terminal... up, run.7702
"hello heroku"

One-off dynos are automatically shut down after the command has finished running.

We can see that Heroku is running this command on a dyno with our application's code. Let's run ls to see a listing of the files on the dyno. They should look familiar:

$ heroku run ls
Running `ls` attached to terminal... up, run.5518
app bin config config.ru db Gemfile Gemfile.lock lib log Procfile public Rakefile README README.md tmp

If we want to run multiple commands, we can start up a bash session. Type exit to close the session:

$ heroku run bash
Running `bash` attached to terminal... up, run.2331
~ $ ls
app bin config config.ru db Gemfile Gemfile.lock lib log Procfile public Rakefile README README.md tmp
~ $ echo "hello"
hello
~ $ exit
exit

We can run tasks in the background using the detached mode. The output of the command goes to our logs rather than the screen:

$ heroku run:detached echo "hello heroku"
Running `echo hello heroku` detached... up, run.4534
Use `heroku logs -p run.4534` to view the output.

If we need more power, we can adjust the size of the one-off dynos. This command will launch a bash session in a 2X dyno:

$ heroku run --size=2X bash

If we are running one-off dynos in the detached mode, we can view their status and stop them in the same way we would stop any other dyno:

$ heroku ps
=== run: one-off processes
run.5927 (1X): starting 2014/03/29 16:18:59 (~ 6s ago)

$ heroku ps:stop run.5927

How it works…

When we issue the heroku run command, Heroku spins up a new dyno with our latest slug and runs the command. Heroku does not start our application; the only command that runs is the command that we explicitly pass to it.

One-off dynos act a little differently than standard dynos. If we create one dyno in the detached mode, it will run until we stop it manually, or it will shut down automatically after 24 hours. It will not restart like a normal dyno will. If we run bash from a one-off dyno, it will run until we close the connection or reach an hour of inactivity.

Managing SSH keys

Heroku manages access to our application's Git repository with SSH keys. When we first set up the Heroku Toolbelt, we had to upload either a new or existing public key to Heroku's servers. This key allows us to access our Heroku Git repositories without entering our password each time. If we ever want to deploy our Heroku applications from another computer, we'll either need to have the same key on that computer or provide Heroku with an additional one. It's easy enough to do this via the CLI, which we'll learn in this recipe.

How to do it…

To get started, let's fire up a terminal.
We'll be using the keys command in this recipe by performing the following steps:

First, let's view all of the existing keys in our Heroku account:

$ heroku keys
=== coutermarsh.mike@gmail.com Keys
ssh-rsa AAAAB3NzaC...46hEzt1Q== coutermarsh.mike@gmail.com
ssh-rsa AAAAB3NzaC...6EU7Qr3S/v coutermarsh.mike@gmail.com
ssh-rsa AAAAB3NzaC...bqCJkM4w== coutermarsh.mike@gmail.com

To remove an existing key, we can use keys:remove. To the command, we need to pass a string that matches one of the keys:

$ heroku keys:remove "7Qr3S/v coutermarsh.mike@gmail.com"
Removing 7Qr3S/v coutermarsh.mike@gmail.com SSH key... done

To add our current user's public key, we can use keys:add. This will look on our machine for a public key (~/.ssh/id_rsa.pub) and upload it:

$ heroku keys:add
Found existing public key: /Users/mike/.ssh/id_rsa.pub
Uploading SSH public key /Users/mike/.ssh/id_rsa.pub… done

To create a new SSH key, we can run $ ssh-keygen -t rsa.

If we'd like, we can also specify where the key is located, if it is not in the default ~/.ssh/ directory:

$ heroku keys:add /path/to/key.pub

How it works…

SSH keys are the standard method for password-less authentication. There are two parts to each SSH key: a private key, which stays on our machine and should never be shared, and a public key, which we can freely upload and share.

Each key has its purpose. The public key is used to encrypt messages, and the private key is used to decrypt messages. When we try to connect to our Git repositories, Heroku's server uses our public key to create an encrypted message that can only be decrypted by our private key. The server then sends the message to our machine; our machine's SSH client decrypts it and sends the response to the server. Sending the correct response successfully authenticates us.

SSH keys are not used for authentication to the Heroku CLI. The CLI uses an authentication token that is stored in our ~/.netrc file.

Sharing and collaboration

We can invite collaborators through both the web dashboard and the CLI. In this recipe, we'll learn how to quickly invite collaborators through the CLI.

How to do it…

To start, let's open a terminal and navigate to the Heroku application that we would like to share, using the following steps:

To see the current users who have access to our application, we can use the sharing command:

$ heroku sharing
=== load-tester-rails Access List
coutermarsh.mike@gmail.com   owner
mike@form26.com              collaborator

To invite a collaborator, we can use sharing:add:

$ heroku sharing:add coutermarshmike@gmail.com
Adding coutermarshmike@gmail.com to load-tester-rails as collaborator... done

Heroku will send an e-mail to the user we're inviting, even if they do not already have a Heroku account.

If we'd like to revoke access to our application, we can do so with sharing:remove:

$ heroku sharing:remove coutermarshmike@gmail.com
Removing coutermarshmike@gmail.com from load-tester-rails collaborators... done

How it works…

When we add another collaborator to our Heroku application, they are granted the same abilities as us, except that they cannot manage paid add-ons or delete the application. Otherwise, they have full control to administrate the application. If they have an existing Heroku account, their SSH key will be immediately added to the application's Git repository.

See also

Interested in using multiple Heroku accounts on a single machine? Take a look at the heroku-accounts plugin at https://github.com/ddollar/heroku-accounts.
Monitoring load average and memory usage

We can monitor the resource usage of our dynos from the command line using the log-runtime-metrics plugin. This will give us visibility into the CPU and memory usage of our dynos. With this data, we'll be able to determine whether our dynos are correctly sized, detect problems earlier, and decide whether we need to scale our application.

How to do it…

Let's open up a terminal; we'll be completing this recipe with the CLI by performing the following steps:

First, we'll need to install the log-runtime-metrics plugin via the CLI. We can do this easily through Heroku Labs:

$ heroku labs:enable log-runtime-metrics

Now that the runtime metrics plugin is installed, we'll need to restart our dynos for it to take effect:

$ heroku restart

Now that the plugin is installed and running, our dynos' resource usage will be printed to our logs. Let's view them now:

$ heroku logs
heroku[web.1]: source=web.1 dyno=heroku.21 sample#load_avg_1m=0.00 sample#load_avg_5m=0.00
heroku[web.1]: source=web.1 dyno=heroku.21 sample#memory_total=105.28MB sample#memory_rss=105.28MB sample#memory_cache=0.00MB sample#memory_swap=0.00MB sample#memory_pgpgin=31927pages sample#memory_pgpgout=4975pages

From the logs, we can see that for this application, our load average is 0, and this dyno is using a total of 105 MB of RAM.

How it works…

Now that we have some insight into how our dynos are using resources, we need to learn how to interpret these numbers. Understanding the utilization of our dynos will be key if we ever need to diagnose a performance-related issue.

In our logs, we will now see load_avg_1m and load_avg_5m. This is our dynos' load average over a 1-minute and a 5-minute period. The two timeframes are helpful in determining whether we're experiencing a brief spike in activity or a more sustained load. Load average is the amount of total computational work that the CPU has to complete. The 1X and 2X dynos have access to four virtual cores. A load average of four means that the dyno's CPU is fully utilized. Any value above four is a warning sign that the dyno might be overloaded, and response times could begin to suffer. Web applications are typically not CPU-intensive, so seeing low load averages for web dynos should be expected. If we start seeing high load averages, we should consider either adding more dynos or using larger dynos to handle the load.

Our memory usage is also shown in the logs. The key value that we want to keep track of is memory_rss, which is the total amount of RAM being utilized by our application. It's best to keep this value no higher than 50 to 70 percent of the total RAM available on the dyno. For a 1X dyno with 512 MB of memory, this would mean keeping our memory usage no greater than 250 to 350 MB. This leaves our application room to grow under load and helps us avoid any memory swapping. Seeing values above 70 percent is an indication that we need to either reduce our application's memory usage or scale up.

Memory swap occurs when our dyno runs out of RAM. To compensate, our dyno will begin using its hard drive to store data that would normally be stored in RAM. For any web application, any swap should be considered evil; this value should always be zero. If our dyno starts swapping, we can expect it to significantly slow down our application's response times. Seeing any swap is an immediate indication that we must either reduce our application's memory consumption or start scaling.
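As a rough illustration (this watcher script is not part of the plugin, and the threshold simply mirrors the 70 percent guideline above), a few lines of Python could scan the metrics lines for memory pressure:

import re
import sys

DYNO_RAM_MB = 512                  # total RAM on a 1X dyno
THRESHOLD_MB = 0.70 * DYNO_RAM_MB  # 70 percent guideline

# Usage: heroku logs --tail | python check_memory.py
for line in sys.stdin:
    # Pull the resident memory sample out of a log-runtime-metrics line
    match = re.search(r'sample#memory_rss=([0-9.]+)MB', line)
    if match and float(match.group(1)) > THRESHOLD_MB:
        print('warning: memory_rss %sMB exceeds 70%% of %dMB'
              % (match.group(1), DYNO_RAM_MB))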
See also

Load average and memory usage are particularly useful to watch when performing application load tests.

Summary

In this article, we learned various commands for viewing application logs, searching logs, installing add-ons, managing environment variables, enabling the maintenance page, managing SSH keys, sharing and collaboration, and so on.

Resources for Article:

Further resources on this subject:

Securing vCloud Using the vCloud Networking and Security App Firewall [article]
vCloud Networks [article]
Apache CloudStack Architecture [article]
Amazon Web Services

Packt
20 Nov 2014
16 min read
In this article, by Prabhakaran Kuppusamy and Uchit Vyas, authors of AWS Development Essentials, you will learn about the different tools and methods available to perform the same operation with varying complexities. Various options are available, depending on the user's level of experience. In this article, we will start with an overview of each service, learn about the various tools available for programmer interaction, and finally see the troubleshooting and best practices to be followed while using these services. AWS provides a handful of services in every area. In this article, we will cover the following topics:

• Navigate through the AWS Management Console
• Describe the security measures that AWS provides
• AWS interaction through the SDK and IDE tools

(For more resources related to this topic, see here.)

Background of AWS and its needs

AWS is based on an idea presented by Chris Pinkham and Benjamin Black with a vision towards Amazon's retail computing infrastructure. The first Amazon offering was SQS, in the year 2004. Officially, AWS was launched and made available online in 2006, and within a year, 200,000 developers had signed up for these services. Later, due to a natural disaster (the June 29, 2012 storm in North Virginia, which brought down most of the servers residing at this location) and technical events, AWS faced a lot of challenges. A similar event happened in December 2012, after which AWS has been providing services as stated. AWS learned from these events and made sure that the same kind of outage didn't occur even if the same event occurred again.

AWS is an idea born in a single room, but the idea is now made available to and used by almost all cloud developers and IT giants. AWS is greatly loved by all kinds of technology admirers. Irrespective of the user's expertise, AWS has something for various types of users. For an expert programmer, AWS has SDKs for each service. Using these SDKs, the programmer can perform operations by entering commands in the command-line interface. However, an end user with limited knowledge of programming can still perform similar operations using the graphical user interface of the AWS Management Console, which is accessible through a web browser. If a programmer needs something between the low-level (SDK) and high-level (Management Console) interfaces, they can go for the integrated development environment (IDE) tools, for which AWS provides plugins and add-ons. One such commonly used IDE for which AWS has provided add-ons is the Eclipse IDE. For now, we will start with the AWS Management Console.

The AWS Management Console

The most popular method of accessing AWS is via the Management Console because of its simplicity of usage and power. Another reason why end users prefer the Management Console is that it doesn't require any software to start with; having an Internet connection and a browser is sufficient. As the name suggests, the Management Console is a place where administrative and advanced operations can be performed on your AWS account details or AWS services. The Management Console mainly focuses on the following features:

• One-click access to AWS services
• AWS account administration
• AWS management using handheld devices
• AWS infrastructure management across the globe

One-click access to the AWS services

To access the Management Console, all you need to do is first sign up with AWS. Once done, the Management Console will be available at https://console.aws.amazon.com/.
Once you have signed up, you will be directed to a page listing all of the services. Each icon on this page is an Amazon Web Service, and two or more services are grouped under a category. For example, in the Analytics category, you can see three services, namely Data Pipeline, Elastic MapReduce, and Kinesis. Starting with any of these services is very easy. Have a look at the description of the service at the bottom of the service icon. As soon as you click on the service icon, it will take you to the Getting started page of the corresponding service, where brief as well as detailed guidelines are available. In order to start with any of these services, only two things are required: the first is an AWS account, and the second is a supported browser. The Getting started section will usually have a video, which explains the specialty and use cases of the service that you selected. Once you finish reading the Getting started section, you can optionally go through the DOC files specific to the service to learn more about the syntax and usage of the service operations.

AWS account administration

Account administration is one of the most important things to make note of. To access it, click on your displayed name at the top of the page, and then click on the My Account option. At the beginning of every month, you don't want AWS to deduct all your salary by stating that you have used this many services costing this much money; hence, all this management information is available in the Management Console. Using the Management Console, you can infer the following information:

• The monthly billing, in brief as well as in a detailed manner (the cost split-up of each service), along with a provision to view VAT and tax exemption
• Account details, such as the display name and contact information
• A provision to close the AWS account

All the preceding operations and much more are possible.

AWS management using handheld devices

Managing and accessing the AWS services is possible through (but not limited to) a PC. AWS provides a handful of applications for almost all or most of the mobile platforms, such as Android, iOS, and so on. Using these applications, you can perform all the AWS operations on the move. You won't believe that having a 7-inch Android tablet with the AWS Console application installed from Google Play will enable you to ask for any Elastic Compute Cloud (EC2) instance from Amazon and control it (start, stop, and terminate) very easily. You can install an SSH client on the tablet and connect to the Linux terminal. However, if you wish to make use of a Windows instance from EC2, you might use the graphical user interface (GUI) more frequently than a command line. A few more sophisticated pieces of software and hardware might be needed; for example, you should have a VNC viewer or remote desktop connection software to get the GUI of the borrowed EC2 instance. As you are making use of the GUI in addition to the keyboard, you will need a pointer device, such as a mouse. As a result, you will almost get addicted to the concept of cloud computing going mobile.

AWS infrastructure management across the globe

At this point, you might be aware that you can get all of these AWS services from servers residing at any of several locations around the globe. To control these services used by you in different regions, you don't have to go anywhere else. You can control them right here in the same Management Console.
Using the same Management Console, just by clicking on N.Virginia (at the top of the Management Console) and choosing a location, you can make the service available in that region.

You can choose the server location at which you want the service (data and machine) to be made available based on the following two factors:

The first factor is the distance between the server's location and the client's location. For example, if you have deployed a web application for a client from North California at a Tokyo location, the latency will obviously be high while accessing the application. Therefore, choosing the optimum service location is the primary factor.

The second factor is the charge for the service in a specific location. AWS charges more for certain crowded servers. Just for illustration, assume that the servers for North California are used by many critical companies; creating your servers at North California might then cost you twice as much as at other locations. Hence, you should always consider the trade-off between the location and cost and then decide on the server location. Whenever you click on any of the services, AWS will always select the location that costs you less money as the default.

AWS security measures

Whenever you think of moving your data center to a public cloud, the first question that arises in your mind is about data security. In a public cloud, through virtualization technology, multiple users might be using the same hardware (server) on which your data is available. You will learn in detail how AWS ensures data security.

Instance isolation

Before learning about instance isolation, you must know how AWS EC2 provisions instances to the user. This service allows you to rent virtual machines (AWS calls them instances) with whatever configuration you ask for. Let's assume that you requested AWS to provision a 2 GB RAM, 100 GB HDD, Ubuntu instance. Within a minute, you will be given the instance's connection details (public DNS, private IP, and so on), and the instance starts running. Does this mean that AWS assembled a 2*1 GB RAM and a 100 GB HDD into a CPU cabinet, installed the Ubuntu OS on it, and gave you access? The answer is no. The provisioned instance is not a single PC (or bare metal) with an OS installed on it. The instance is the outcome of a virtual machine provisioned by Amazon's private cloud. Consider how a virtual machine is provisioned by a private cloud, from the bottom up.

First, there is the underlying Hardware/Host. The hardware is the server, which usually has a very high specification. Here, assume that your hardware has the configuration of a 99 GB RAM, a 450 TB HDD, and a few other elements, such as NICs, which you need not consider now. The next component is the Hypervisor. A hypervisor or virtual machine monitor (VMM) is used to create and run virtual machines on the hardware. In private cloud terms, whichever machine runs a hypervisor on it is called the host machine. Now, suppose three users each request an instance with a 33 GB RAM and 150 TB of HDD space. Each request goes to the hypervisor, which then starts creating those VMs. After creating the VMs, a notification about the connection parameters is sent to each user. The result is three virtual machines (VMs) created by the hypervisor, each potentially running a different operating system.
Even if all three virtual machines are used by different users, each user will feel that only he/she has access to the single piece of hardware; user 1 might not know that the same hardware is also being used by user 2, and so on. The process of creating a virtual version of a machine, storage, or network is called virtualization. The funny part is that none of the virtual machines knows that it is being virtualized (that is, all the VMs are created on the same host).

After getting this information about their instances, some users may feel deceived, and some will even be disappointed and cry out loud: has my instance been created on a shared disc or resource? Even though the disc (or hardware) is shared, one instance (or the owner of the instance) is isolated from the other instances on the same disc through a firewall. This concept is termed instance isolation.

This is how EC2 provides instances to every user. Even though all the instances lie on the same disc, they are isolated by the hypervisor. The hypervisor has a firewall that performs this isolation, so the physical interface will not interact with the underlying hardware (the machine or disc where the instances are available) or the virtual interface directly. All these interactions go through the hypervisor's firewall. This way, AWS ensures that no user can directly access the disc, and no instance can directly interact with another instance, even if both instances are running on the same hardware. In addition to the firewall, during the creation of an EC2 instance, the user can specify the permitted and denied security groups of the instance. These two ideologies provide instance isolation. What each customer sees is a virtualized disc, since the customer instances have no access to raw or actual disc devices. As an added security measure, the user can encrypt his/her disc so that other users cannot access the disc content (even if someone gets in contact with the disc).

Isolated GovCloud

Similar to North California or Asia Pacific, GovCloud is also a location where you can get your AWS services. This location is specifically designed only for government and agencies whose data is very confidential and valuable, and disclosing this data might result in disaster. By default, this location will not be available to the user. If you want access to this location, you need to raise a compliance request at http://aws.amazon.com/compliance/contact/ and submit the FedRAMP Package Request Form, downloadable at http://cloud.cio.gov/document/fedramp-package-request-form. From these two URLs, you can understand how secure this cloud location really is.

CloudTrail

CloudTrail is an AWS service that tracks user activity and changes. Enabling CloudTrail will log all the API request information into an S3 bucket that you have created solely for this purpose. CloudTrail also allows you to create an SNS topic notification as soon as a new logfile is created. CloudTrail, hand in hand with SNS, provides real-time user activity as messages to the user.

Password

This might sound funny, but if, after looking at CloudTrail, you feel that someone else is accessing your account, the best option is to change the password. Never let anyone look at your password, as this could easily compromise the entire account. Sharing the password is like leaving your treasury door open.
Multi-Factor Authentication

Until now, to access AWS through a browser, you had to log in at http://aws.amazon.com and enter your username and password. However, enabling Multi-Factor Authentication (MFA) will add another layer of security, asking you to provide an authentication code sent to the device configured with this account. On the security credentials page at https://console.aws.amazon.com/iam/home?#security_credential, there is a provision to enable MFA; clicking on Enable will display a window with the available MFA device options. Selecting the first option, A virtual MFA device, will not cost you money, but it requires a smartphone (with an Android OS), and you need to download an app from the App Store. After this, during every login, you need to look at your smartphone and enter the authentication token. More information is available at https://youtu.be/MWJtuthUs0w.

Access Keys (Access Key ID and Secret Access Key)

On the same security credentials page, next to MFA, the access keys will be made available. AWS will not allow you to have more than two access keys. However, you can delete and create as many access keys as you like. Each key is either active or inactive, and the Create New Access Key button is disabled once you already have the maximum number of allowed access keys.

The access key ID is used while accessing the service via the API and SDK. During this time, you must provide this ID; otherwise, you won't be able to perform any operation. To put it in other words, if someone else gets or knows this ID, they could pretend to be you through the SDK and API. It is a very good practice to delete a key and create a new key every month using the Delete command link, and to toggle the active keys every week (by making them active and inactive) by clicking on the Make Active or Make Inactive command links. Never let anyone see these IDs. If you are ever in doubt, delete the ID and create a new one. Clicking on the Create New Access Key button (assuming that you have fewer than two IDs) will display a window asking you to download the new access key ID as a CSV file.

The CloudFront key pairs

The CloudFront key pairs are very similar to the access key IDs. Without these keys, you will not be able to perform any operation on CloudFront. Unlike the access key ID (which has only the access key ID and secret access key), here you will have a private key and a public key along with the access key ID. If you lose these keys, you need to delete the key pair and create a new key pair. This is also an added security measure.

X.509 certificates

X.509 certificates are mandatory if you wish to make any SOAP requests on any AWS service. The security credentials page provides a Create new certificate option for generating one.

Account identifiers

There are two IDs that are used to identify ourselves when accessing a service via the API or SDK: the AWS account ID and the canonical user ID. These two IDs are unique. Just as with the preceding parameters, never share these IDs or let anyone see them. If someone has your access ID or key pair, the best option is to generate a new one. But it is not possible to generate a new account ID or canonical user ID.
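To make the access key discussion concrete, here is a minimal sketch (not from the original text) of how an SDK call presents these credentials, using Python's boto3 library; the key values are placeholders, and in practice you would normally let the SDK read the keys from the environment or an IAM role rather than hard-coding them:

import boto3

# Placeholder credentials -- never hard-code real keys in source code
session = boto3.session.Session(
    aws_access_key_id = 'AKIA_PLACEHOLDER',
    aws_secret_access_key = 'placeholder-secret-key',
    region_name = 'us-east-1',
)

# Every request made through this client is signed with the keys above
s3 = session.client('s3')
for bucket in s3.list_buckets()['Buckets']:
    print(bucket['Name'])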
Summary

In this article, you learned about the AWS Management Console and its commonly used SDKs and IDEs. You also learned how AWS secures your data, and you looked at the AWS plugin configuration for the Eclipse IDE. The first part made you familiar with the AWS Management Console. After that, you explored a few of the important security aspects of AWS and learned how AWS handles them. Finally, you learned about the different AWS tools available to the programmer to make development work easier, and examined the common SDKs and IDE tools of AWS.

Resources for Article:

Further resources on this subject:

Amazon DynamoDB - Modelling relationships, Error handling [article]
A New Way to Scale [article]
Deployment and Post Deployment [article]